Clustering small-sized collections of short texts

Lili Kotlerman, Ido Dagan, Oren Kurland

Research output: Contribution to journal › Article › peer-review

Abstract

The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) that result from the small corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach to projecting the texts onto the term clusters, and the way external knowledge resources are applied. Extensive empirical evaluation demonstrates the merits of our approach relative to applying clustering algorithms directly to the text corpus and to using state-of-the-art co-clustering and topic modeling methods.
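
The pipeline outlined in the abstract (term clusters built from external term relations, then texts projected onto those clusters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' method: the `RELATED` pairs are a hypothetical stand-in for an external knowledge resource, term clustering is done by simple connected components, and each text is assigned to the term cluster with which it shares the most tokens.

```python
from collections import defaultdict

# Hypothetical stand-in for an external knowledge resource: pairs of related terms.
RELATED = {
    ("car", "automobile"), ("car", "vehicle"),
    ("jaguar", "cat"), ("cat", "feline"),
}

def cluster_terms(vocab, related):
    """Group terms into connected components of the relation graph (union-find).
    Singleton components are dropped so every returned cluster links >= 2 terms."""
    parent = {t: t for t in vocab}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in related:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for t in vocab:
        clusters[find(t)].add(t)
    return [c for c in clusters.values() if len(c) > 1]

def project_texts(texts, term_clusters):
    """Label each text with the index of the term cluster sharing the most
    tokens with it, or None when no cluster overlaps the text."""
    labels = []
    for text in texts:
        tokens = set(text.lower().split())
        best, best_overlap = None, 0
        for idx, cluster in enumerate(term_clusters):
            overlap = len(tokens & cluster)
            if overlap > best_overlap:
                best, best_overlap = idx, overlap
        labels.append(best)
    return labels

texts = ["cheap car rental", "automobile insurance quote",
         "feline health tips", "cat food"]

# Corpus vocabulary, expanded with terms that appear only in the resource.
vocab = {tok for t in texts for tok in t.lower().split()}
vocab |= {term for pair in RELATED for term in pair}

term_clusters = cluster_terms(vocab, RELATED)
labels = project_texts(texts, term_clusters)
print(labels)
```

In this toy run the first two texts share no words, yet both land in the same cluster because "car" and "automobile" are linked in the resource; this is the vocabulary-mismatch problem the abstract refers to. The cluster indices themselves depend on iteration order, so only the grouping is meaningful.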

Original language: English
Pages (from-to): 273-306
Number of pages: 34
Journal: Information Retrieval Journal
Volume: 21
Issue number: 4
State: Published - 1 Aug 2018

Keywords

  • Clustering
  • Clustering short texts
  • Short text similarities

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
