Text clustering sounds simple on paper: take a collection of documents, group similar ones together, and use those groups to power product features. In practice, it is one of those engineering problems that becomes much harder the moment we move from experiments to real systems.
A clustering approach that looks elegant in a notebook can fail badly in production. It may merge unrelated items, split the same event into several clusters, struggle with streaming updates, or become too expensive to run at scale. The challenge becomes even sharper when the goal is not generic topic discovery, but something operationally useful — grouping related news stories, detecting trends, or consolidating signals from real-time systems.
In this post, I want to compare two very different but highly instructive approaches to text clustering:
- LDA-based story clustering — a classical probabilistic approach
- Embedding-based trend clustering with Agglomerative Clustering — a modern semantic pipeline
Both try to solve the same high-level problem: identifying groups of text items that belong together. But they rest on very different assumptions, different representations of text, and different production philosophies.
A brief history of clustering
Before we dive into the two approaches, it helps to understand where clustering comes from. The ideas underpinning modern text clustering did not appear overnight — they evolved across seven decades of computer science, statistics, and linguistics.