The paper deals with the development and application of an automatic word clustering (AWC) tool for processing Russian texts of various types. The tool is designed to be flexible and compatible with other linguistic resources. Building the AWC tool requires a computer implementation of latent semantic analysis (LSA) combined with clustering algorithms; to meet this need, Python-based software has been developed. The major procedures performed by the AWC tool are segmentation of input texts and context analysis, co-occurrence matrix construction, and agglomerative and K-means clustering. Special attention is given to experimental results on clustering words in raw texts under varying parameters.
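The procedures named in the abstract can be illustrated with a minimal sketch. This is not the authors' software: the function names (`tokenize`, `cooccurrence`, `cosine`, `agglomerative`), the window size, and the average-linkage criterion are all assumptions made for illustration. The sketch uses only the Python standard library, so the dimensionality reduction step of LSA and K-means clustering are omitted; it covers segmentation, co-occurrence counting, and agglomerative clustering only.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it on non-letter characters.

    A crude stand-in for real segmentation (assumption, not the paper's method).
    """
    words, current = [], []
    for ch in text.lower():
        if ch.isalpha():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def cooccurrence(tokens, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vocab = sorted(set(tokens))
    rows = {w: Counter() for w in vocab}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                rows[w][tokens[j]] += 1
    return vocab, rows


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vocab, rows, n_clusters):
    """Merge the two most similar clusters (average linkage) until n_clusters remain."""
    clusters = [[w] for w in vocab]

    def similarity(c1, c2):
        total = sum(cosine(rows[a], rows[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        _, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy example on a short Russian text (illustrative data only).
sample = "кошка ест рыбу. собака ест мясо. кошка пьёт молоко. собака пьёт воду."
vocab, rows = cooccurrence(tokenize(sample), window=2)
print(agglomerative(vocab, rows, n_clusters=3))
```

The context window size and the target number of clusters are exposed as arguments so that they can be varied between runs on the same raw text. A fuller version would apply a truncated singular value decomposition to the co-occurrence matrix before clustering, which is the step that distinguishes LSA from plain co-occurrence counting.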
Original language: English
Title of host publication: Text, Speech and Dialogue
Subtitle of host publication: 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007, Proceedings
Publisher: Springer Nature
Pages: 85-97
ISBN (Electronic): 9783540746287
ISBN (Print): 9783540746270
State: Published - 2007
Event: 10th International Conference - Pilsen, Czech Republic
Duration: 3 Sep 2007 → 7 Sep 2007

Publication series

Name: Lecture Notes in Computer Science
Volume: 4629

Conference

Conference: 10th International Conference
Abbreviated title: TSD 2007
Country/Territory: Czech Republic
City: Pilsen
Period: 3/09/07 → 7/09/07
