Clusters, language models, and ad hoc information retrieval

Oren Kurland, Lillian Lee

Research output: Contribution to journal › Article › peer-review

Abstract

The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.
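
The idea in the abstract, enhancing a document's language-model score with information from overlapping nearest-neighbor clusters, can be illustrated with a minimal sketch. This is not the authors' algorithm. The cosine similarity on term counts, the add-`mu` smoothing, the linear mixing weight `lam`, the averaging over clusters, and all function names are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, mu=1.0):
    """Add-mu smoothed unigram model over a fixed vocabulary (smoothing is an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def cosine(a, b):
    """Cosine similarity between two term-count vectors (similarity measure is an assumption)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_clusters(docs, k):
    """Build one cluster per document: the document plus its k-1 nearest neighbors.

    Because every document seeds its own cluster, clusters overlap
    instead of partitioning the corpus.
    """
    vecs = [Counter(d) for d in docs]
    clusters = []
    for i, v in enumerate(vecs):
        others = [j for j in range(len(docs)) if j != i]
        others.sort(key=lambda j: (-cosine(v, vecs[j]), j))
        clusters.append([i] + others[: k - 1])
    return clusters


def query_likelihood(lm, query):
    """Probability of the query under a unigram model; out-of-vocabulary terms are skipped."""
    p = 1.0
    for w in query:
        if w in lm:
            p *= lm[w]
    return p


def interpolated_scores(docs, query, k=2, lam=0.5):
    """Mix each document's query likelihood with the mean likelihood of its clusters.

    lam=1.0 reduces to the document-only baseline; lam and the averaging
    are hypothetical choices for this sketch.
    """
    vocab = sorted({w for d in docs for w in d})
    clusters = knn_clusters(docs, k)
    cluster_p = []
    for members in clusters:
        pooled = [w for j in members for w in docs[j]]
        cluster_p.append(query_likelihood(unigram_lm(pooled, vocab), query))
    scores = []
    for i, d in enumerate(docs):
        doc_p = query_likelihood(unigram_lm(d, vocab), query)
        containing = [cluster_p[c] for c, members in enumerate(clusters) if i in members]
        scores.append(lam * doc_p + (1 - lam) * sum(containing) / len(containing))
    return scores


# Toy corpus: two pet documents and two finance documents.
docs = [
    "cat sat mat".split(),
    "cat dog pet".split(),
    "stock market price".split(),
    "market trade stock".split(),
]
scores = interpolated_scores(docs, ["dog"], k=2, lam=0.5)
baseline = interpolated_scores(docs, ["dog"], k=2, lam=1.0)
```

In this toy corpus the document-only baseline cannot separate the first document from the finance documents, since none of them contains "dog". With the cluster term mixed in, the first document scores higher than the finance documents because it shares a cluster with the document that does contain the query term.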

Original language: English
Article number: 13
Journal: ACM Transactions on Information Systems
Volume: 27
Issue number: 3
State: Published - 1 May 2009

Keywords

  • Aspect models
  • Cluster hypothesis
  • Cluster-based language models
  • Clustering
  • Interpolation model
  • Language modeling
  • Smoothing

ASJC Scopus subject areas

  • Information Systems
  • General Business, Management and Accounting
  • Computer Science Applications
