Topic detection based on sentence embeddings and agglomerative clustering with Markov moment

Standard

Topic detection based on sentence embeddings and agglomerative clustering with Markov moment. / Bodrunova, Svetlana S.; Orekhov, Andrey V.; Blekanov, Ivan S.; Lyudkevich, Nikolay S.; Tarasov, Nikita A.

In: Future Internet, Vol. 12, No. 9, 144, 09.2020, p. 1-17.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{67caebf229e94b17a1f199a058c6782a,

title = "Topic detection based on sentence embeddings and agglomerative clustering with {M}arkov moment",

abstract = "The paper addresses the problem of optimal text classification in the area of automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters has to be set by the researcher, and the optimal number of clusters, as well as the quality of the model that designates proximity of texts to each other, remain unresolved questions. We propose a novel approach to the automated definition of the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model that is based on the system of sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward{\textquoteright}s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, like bag-of-words and word2vec, to show that it provides much better quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.",

keywords = "Clustering of short texts, DBSCAN, Distributive semantics, Least squares method, Markov moment, Neural network algorithms, Sentence embeddings, Text classification, Text clustering",

author = "Bodrunova, {Svetlana S.} and Orekhov, {Andrey V.} and Blekanov, {Ivan S.} and Lyudkevich, {Nikolay S.} and Tarasov, {Nikita A.}",

note = "Funding Information: This work was supported in full by Russian Science Foundation, grant number 16-18-10125-P.",

year = "2020",

month = sep,

doi = "10.3390/fi12090144",

language = "English",

volume = "12",

pages = "1--17",

journal = "Future Internet",

issn = "1999-5903",

publisher = "MDPI AG",

number = "9",

}

RIS

TY - JOUR

T1 - Topic detection based on sentence embeddings and agglomerative clustering with Markov moment

AU - Bodrunova, Svetlana S.

AU - Orekhov, Andrey V.

AU - Blekanov, Ivan S.

AU - Lyudkevich, Nikolay S.

AU - Tarasov, Nikita A.

N1 - Funding Information: This work was supported in full by Russian Science Foundation, grant number 16-18-10125-P.

PY - 2020/9

Y1 - 2020/9

N2 - The paper addresses the problem of optimal text classification in the area of automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters has to be set by the researcher, and the optimal number of clusters, as well as the quality of the model that designates proximity of texts to each other, remain unresolved questions. We propose a novel approach to the automated definition of the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model that is based on the system of sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, like bag-of-words and word2vec, to show that it provides much better quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.

KW - Clustering of short texts

KW - DBSCAN

KW - Distributive semantics

KW - Least squares method

KW - Markov moment

KW - Neural network algorithms

KW - Sentence embeddings

KW - Text classification

KW - Text clustering

UR - http://www.scopus.com/inward/record.url?scp=85094836132&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/c21d0778-1a84-3e9c-8909-6392dae38f38/

U2 - 10.3390/fi12090144

DO - 10.3390/fi12090144

M3 - Article

VL - 12

SP - 1

EP - 17

JO - Future Internet

JF - Future Internet

SN - 1999-5903

IS - 9

M1 - 144

ER -

ID: 64769116