Abstract
The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.
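The two ideas the abstract combines, overlapping nearest-neighbor clusters and interpolation of document-based with cluster-based language models, can be illustrated with a toy sketch. This is not the article's exact formulation. The function names, the use of cosine similarity to find neighbors, the Jelinek-Mercer smoothing weight `mu`, the interpolation weight `lam`, and the plain averaging over a document's clusters are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-probability vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_clusters(models, k):
    """One cluster per document: the document plus its k nearest neighbors.

    Because every document seeds its own cluster, clusters overlap
    instead of partitioning the corpus.
    """
    clusters = []
    for i, m in enumerate(models):
        others = sorted(
            (j for j in range(len(models)) if j != i),
            key=lambda j: cosine(m, models[j]),
            reverse=True,
        )
        clusters.append([i] + others[:k])
    return clusters


def query_likelihood(query, model, background, mu=0.5):
    """Query likelihood, smoothed against the corpus model (assumed scheme)."""
    p = 1.0
    for w in query:
        p *= (1 - mu) * model.get(w, 0.0) + mu * background.get(w, 0.0)
    return p


def interpolated_scores(docs, query, k=1, lam=0.5):
    """Score each document as lam * document evidence + (1 - lam) * cluster evidence."""
    models = [unigram(d) for d in docs]
    background = unigram([w for d in docs for w in d])
    clusters = knn_clusters(models, k)
    cluster_models = [unigram([w for j in c for w in docs[j]]) for c in clusters]

    scores = []
    for i, m in enumerate(models):
        doc_part = query_likelihood(query, m, background)
        # A document belongs at least to the cluster it seeds, so this is never empty.
        member_of = [ci for ci, c in enumerate(clusters) if i in c]
        cluster_part = sum(
            query_likelihood(query, cluster_models[ci], background) for ci in member_of
        ) / len(member_of)
        scores.append(lam * doc_part + (1 - lam) * cluster_part)
    return scores


docs = [
    ["cluster", "retrieval", "language", "model"],
    ["cluster", "retrieval", "smoothing"],
    ["cooking", "pasta", "recipe"],
]
scores = interpolated_scores(docs, ["smoothing"], k=1, lam=0.5)
```

In this toy corpus the first document never mentions the query term, yet it should rank above the unrelated third document because it shares a cluster with the second one. Setting `lam=1.0` falls back to document-only scoring, under which the first and third documents tie.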
Original language | English |
---|---|
Article number | 13 |
Journal | ACM Transactions on Information Systems |
Volume | 27 |
Issue number | 3 |
State | Published - 1 May 2009 |
Keywords
- Aspect models
- Cluster hypothesis
- Cluster-based language models
- Clustering
- Interpolation model
- Language modeling
- Smoothing
ASJC Scopus subject areas
- Information Systems
- General Business, Management and Accounting
- Computer Science Applications