Automatic typology of text messages, built with cluster analysis, can be used to search for communities on the Internet. In this paper, we address one of the significant issues in text classification via cluster analysis: determining the number of clusters. For semantics-based clustering, text documents are typically represented as vectors in an n-dimensional linear space. We propose determining the number of clusters by agglomerative clustering of these vectors. In our work, statistical analysis is combined with neural network algorithms to obtain a more accurate semantic portrait of a text. Then, using the techniques of distributive semantics, the derived network structures are mapped into vector form. A statistical criterion for completing the clustering process is derived and defined as a Markov moment. Once the texts are automatically partitioned into clusters, the texts closest to the centroids can be compared with actual content samples or evaluated by experts. If the mapping of texts into vector form is adequate, all messages in a given cluster have the same meaning and the same emotional coloring. In addition, we discuss the possibility of using vector representations of texts for sentiment detection in short texts such as search engine queries or tweets.
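
The idea of stopping an agglomerative clustering early in order to fix the number of clusters can be illustrated with a minimal sketch. The abstract does not specify the Markov-moment criterion, so the stopping rule below is a deliberately simple stand-in (stop when the next merge distance jumps well above the average of the earlier ones), not the criterion derived in the paper. The function name `agglomerate`, the parameter `jump_ratio`, and the toy vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def agglomerate(vectors, jump_ratio=3.0):
    """Centroid-linkage agglomerative clustering with an early stop.

    Assumption: the stop rule is a placeholder heuristic, NOT the paper's
    Markov-moment criterion. Merging halts when the smallest available
    merge distance exceeds jump_ratio times the mean of previous merges.
    """
    clusters = [[v] for v in vectors]  # start with one cluster per vector
    merge_distances = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Placeholder stopping rule: a sharp jump in merge distance.
        if merge_distances:
            mean_prev = sum(merge_distances) / len(merge_distances)
            if mean_prev > 0 and d > jump_ratio * mean_prev:
                break
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy 2-D "text vectors" forming two well-separated groups.
toy_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
clusters = agglomerate(toy_vectors)
print(len(clusters))
```

On this toy input the merges inside each group happen at distances close to 1, while joining the two groups would require a much larger distance, so the sketch stops with two clusters. Real text vectors would come from whatever embedding is used, and the stopping rule would be replaced by the statistical criterion the paper derives.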

Original language: English
Title of host publication: Internet Science. 6th International Conference, INSCI 2019
Subtitle of host publication: Proceedings
Editors: Samira El Yacoubi, Franco Bagnoli, Giovanna Pacini
Place of Publication: Cham
Publisher: Springer Nature
Pages: 235-249
Number of pages: 15
ISBN (Electronic): 9783030347703
ISBN (Print): 9783030347697
State: Published - 1 Dec 2019
Event: 6th International Conference on Internet Science, INSCI 2019 - Perpignan, France
Duration: 2 Dec 2019 to 5 Dec 2019

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 11938
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 6th International Conference on Internet Science, INSCI 2019
Abbreviated title: INSCI'2019
Country/Territory: France
City: Perpignan
Period: 2/12/19 to 5/12/19

    Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

    Research areas

  • Cluster analysis, Distributive semantics, Least squares method, Markov moment, Neural network algorithms, Semantic network, Social network analysis

ID: 49785323