Topic Modeling of Literary Texts Using LDA: On the Influence of Linguistic Preprocessing on Model Interpretability

DOI

https://doi.org/10.23919/FRUCT54823.2022.9770887
Final published version

Tatiana Sherstinova
Anna Moskvina
Margarita Kirina
Irina Zavyalova
Asya Karysheva
Evgenia Kolpashchikova
Polina Maksimenko
Alena Moskalenko

The article reports a study that evaluated how linguistic preprocessing affects the interpretability of topic models for literary texts. The study was carried out as part of a larger project aimed at obtaining topic models of Russian short stories written in the first three decades of the 20th century, divided into three successive historical periods: 1) the beginning of the century before the First World War (1900-1913), 2) the time of acute social cataclysms, wars and revolutions (World War I, the February and October Revolutions, and the Civil War) (1914-1922), and 3) the early Soviet period (1923-1930). The material consisted of three samples of different sizes for each period, containing 100, 500 and 1000 short stories respectively. Preprocessing included lemmatization with the spaCy library and four POS-filtering options: 1) nouns only, 2) nouns and verbs, 3) nouns, adjectives, adverbs and verbs, and 4) no filtering. Using latent Dirichlet allocation (LDA), 36 topic models were built (9 models for each preprocessing option). The results showed that, for literary texts, topic models built without any POS filter are the most interpretable. The study made it possible to obtain information about the topic diversity of Russian short stories, to assess their expert interpretability, and to offer some recommendations for optimizing topic modeling, intended for use in the development of artificial intelligence systems that process large volumes of literary texts.
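The experimental design described in the abstract can be illustrated with a minimal Python sketch. It enumerates the 36 model configurations (4 POS-filtering options × 3 periods × 3 sample sizes) and applies each filter to a list of pre-tagged lemmas. Several details are assumptions rather than facts from the paper: the tag names follow the Universal POS tag set, the names `POS_FILTERS`, `filter_lemmas` and `EXPERIMENT_GRID` are hypothetical, and the toy English sentence stands in for real data. In an actual pipeline the (lemma, POS) pairs would presumably come from spaCy's `token.lemma_` and `token.pos_` attributes, and the filtered lemmas would be passed to an LDA implementation, which the abstract does not name.

```python
from itertools import product

# Periods and sample sizes as stated in the abstract.
PERIODS = ["1900-1913", "1914-1922", "1923-1930"]
SAMPLE_SIZES = [100, 500, 1000]

# The four POS-filtering options. Tag names follow the Universal POS
# tag set (an assumption: the abstract does not specify the tag inventory).
POS_FILTERS = {
    "nouns": {"NOUN"},
    "nouns_verbs": {"NOUN", "VERB"},
    "nouns_adj_adv_verbs": {"NOUN", "ADJ", "ADV", "VERB"},
    "no_filter": None,  # None means every lemma is kept
}


def filter_lemmas(tagged_tokens, allowed_pos):
    """Keep lemmas whose POS tag is allowed; keep all if allowed_pos is None."""
    return [
        lemma
        for lemma, pos in tagged_tokens
        if allowed_pos is None or pos in allowed_pos
    ]


# One topic model per (filter, period, sample size) combination.
EXPERIMENT_GRID = list(product(POS_FILTERS, PERIODS, SAMPLE_SIZES))

# Hypothetical pre-tagged sentence given as (lemma, POS) pairs.
example = [
    ("old", "ADJ"), ("soldier", "NOUN"), ("slowly", "ADV"),
    ("return", "VERB"), ("to", "ADP"), ("village", "NOUN"),
]

if __name__ == "__main__":
    print(len(EXPERIMENT_GRID), "model configurations")
    for name, allowed in POS_FILTERS.items():
        print(name, filter_lemmas(example, allowed))
```

Representing "no filtering" as `None` instead of a full tag set keeps the fourth option independent of whichever tag inventory the tagger uses.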

Original language	English
Title of host publication	2022 31st Conference of Open Innovations Association (FRUCT)
Pages	305-312
Number of pages	8
Volume	2022-April
DOI	https://doi.org/10.23919/FRUCT54823.2022.9770887
Publication status	Published - 1 Jan 2022
Event	2022 31st Conference of Open Innovations Association (FRUCT) - Duration: 27 Apr 2022 → 29 Apr 2022

Publication series

Name	CONFERENCE OF OPEN INNOVATIONS ASSOCIATION, FRUCT
Publisher	FRUCT Oy
Volume	2022-April
ISSN (Print)	2305-7254

Conference

Conference	2022 31st Conference of Open Innovations Association (FRUCT)
Period	27/04/22 → 29/04/22

ID: 101663042