The article reports the results of a study that evaluated how linguistic preprocessing influences the interpretability of topic models for literary texts. The study was carried out as part of a large project aimed at obtaining topic models of Russian short stories written in the first three decades of the 20th century, divided into three successive historical periods: 1) the beginning of the century before the First World War (1900-1913), 2) the time of acute social cataclysms, wars and revolutions (World War I, the February and October revolutions, and the Civil War) (1914-1922), and 3) the early Soviet period (1923-1930). The material comprised three samples of different sizes for each period, containing 100, 500 and 1000 short stories. Preprocessing included lemmatization with the spaCy library and four POS-filtering options: 1) nouns only, 2) nouns and verbs, 3) nouns, adjectives, adverbs and verbs, and 4) no filtering. Using latent Dirichlet allocation (LDA), 36 topic models were built (9 models for each preprocessing option). The research showed that, for literary texts, topic models built without any POS filter are the most interpretable. The study made it possible to obtain information about the topic diversity of Russian short stories, to assess the expert interpretability of the resulting models, and to offer some recommendations for optimizing topic modeling, which can be used in the development of artificial intelligence systems that process large volumes of literary texts.
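The experimental design summarized in the abstract (three periods, three sample sizes, four POS-filtering options) can be illustrated with a minimal sketch. This is not the authors' code. The filter names, the helper functions and the pre-tagged example sentence are hypothetical. The POS labels are the Universal POS tags that spaCy exposes as `token.pos_`, alongside `token.lemma_` for lemmas. Treating "nouns" as the `NOUN` tag only, without `PROPN`, is an assumption.

```python
from itertools import product

# Four POS-filtering options from the abstract, expressed as Universal POS tags.
# Assumption: "nouns" means NOUN only (proper nouns, PROPN, are not included).
POS_FILTERS = {
    "nouns": {"NOUN"},
    "nouns_verbs": {"NOUN", "VERB"},
    "content_words": {"NOUN", "ADJ", "ADV", "VERB"},
    "no_filter": None,  # None means: keep every lemma
}

PERIODS = ["1900-1913", "1914-1922", "1923-1930"]
SAMPLE_SIZES = [100, 500, 1000]


def filter_lemmas(tagged_tokens, allowed_pos):
    """Return lemmas whose POS tag is in allowed_pos; None keeps all lemmas."""
    return [
        lemma
        for lemma, pos in tagged_tokens
        if allowed_pos is None or pos in allowed_pos
    ]


def experiment_grid():
    """Enumerate every (period, sample size, filter name) combination."""
    return list(product(PERIODS, SAMPLE_SIZES, POS_FILTERS))


# Hypothetical pre-tagged sentence as (lemma, POS) pairs; in practice these
# pairs would come from a lemmatizer/tagger such as a spaCy pipeline.
example = [
    ("старый", "ADJ"),
    ("дом", "NOUN"),
    ("стоять", "VERB"),
    ("тихо", "ADV"),
    ("и", "CCONJ"),
]

if __name__ == "__main__":
    for name, allowed in POS_FILTERS.items():
        print(name, filter_lemmas(example, allowed))
    print(len(experiment_grid()), "model configurations")
```

The grid has 3 × 3 × 4 = 36 entries, nine per filtering option, which matches the model counts given in the abstract. Each filtered lemma list would then be passed to an LDA implementation of choice.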
Original language: English
Title of host publication: 2022 31st Conference of Open Innovations Association (FRUCT)
Pages: 305-312
Number of pages: 8
Volume: 2022-April
State: Published - 1 Jan 2022
Event: 2022 31st Conference of Open Innovations Association (FRUCT)
Duration: 27 Apr 2022 - 29 Apr 2022

Publication series

Name: CONFERENCE OF OPEN INNOVATIONS ASSOCIATION, FRUCT
Publisher: FRUCT Oy
Volume: 2022-April
ISSN (Print): 2305-7254

Conference

Conference: 2022 31st Conference of Open Innovations Association (FRUCT)
Period: 27/04/22 - 29/04/22

ID: 101663042