TY - JOUR
T1 - Clustering small-sized collections of short texts
AU - Kotlerman, Lili
AU - Dagan, Ido
AU - Kurland, Oren
N1 - Publisher Copyright: © 2017, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2018/8/1
Y1 - 2018/8/1
N2 - The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) that result from the small corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach relative to applying clustering algorithms directly to the text corpus and to using state-of-the-art co-clustering and topic modeling methods.
AB - The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) that result from the small corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach relative to applying clustering algorithms directly to the text corpus and to using state-of-the-art co-clustering and topic modeling methods.
KW - Clustering
KW - Clustering short texts
KW - Short text similarities
UR - http://www.scopus.com/inward/record.url?scp=85035786981&partnerID=8YFLogxK
U2 - 10.1007/s10791-017-9324-8
DO - 10.1007/s10791-017-9324-8
M3 - Article
AN - SCOPUS:85035786981
SN - 1386-4564
VL - 21
SP - 273
EP - 306
JO - Information Retrieval Journal
JF - Information Retrieval Journal
IS - 4
ER -