Frequency word lists and their variability (the case of Russian fiction in 1900-1930)

Title in Russian: Частотные списки слов и их вариативность (на примере русской прозы 1900-1930 гг.)

T. Sherstinova, A. Grebennikov, T. Skrebtsova, A. Guseva, M. Gukasian, I. Egoshina, M. Turygina

The lexical system is an essential component of any natural language. Frequency word lists are a convenient representation of the functional activity of words in a language as a whole or in a particular text. The parameters and properties of frequency word lists are at the center of attention of NLP experts, since they are used in numerous practical applications related to authorship attribution and automatic text clustering and classification. The article explores frequency word lists of Russian fiction from the period 1900-1930, which was marked by a series of dramatic historical events, and presents unique statistical data on the most frequent words, parts of speech, and keywords, and on their dynamics. Special attention is paid to the statistical consistency of frequency word list parameters, which becomes especially relevant when studying big text data. The study is based on fiction texts, which, owing to their variety of topics and their lexical and stylistic diversity, reflect the variability of linguistic forms better than other written text genres. In terms of the size and character of the text corpus, research of this kind is being carried out for the first time.
