Research output: Contribution to journal › Article › peer-review
Clustering Narrow-Domain Short Texts Using K-Means, Linguistic Patterns and LSI. / Popova, Svetlana; Danilova, Vera; Egorov, Artem.
In: Communications in Computer and Information Science, Vol. 436, 2014, p. 66-77.
TY - JOUR
T1 - Clustering Narrow-Domain Short Texts Using K-Means, Linguistic Patterns and LSI
AU - Popova, Svetlana
AU - Danilova, Vera
AU - Egorov, Artem
PY - 2014
Y1 - 2014
AB - In the present work we consider the problem of clustering narrow-domain short texts, such as academic abstracts. Our main objective is to check whether the quality of the k-means algorithm can be improved by expanding the feature space with a dictionary of word groups selected from the texts on the basis of a fixed set of patterns. We also check whether the quality of clustering can be increased by mapping the feature spaces to a semantic space of lower dimensionality using Latent Semantic Indexing (LSI). The results suggest that these modifications are feasible in practice, compared with using k-means in the feature space defined only by the main dictionary of the corpus.
M3 - Article
VL - 436
SP - 66
EP - 77
JO - Communications in Computer and Information Science
JF - Communications in Computer and Information Science
SN - 1865-0929
ER -
ID: 5746655