Chinese-Russian Corpus of Political Texts: A Comparative Analysis of Probabilistic Topic Models

Research output: Scientific publications in periodicals › Article › Peer-reviewed

Department of Mathematical Linguistics

Links

DOI

https://doi.org/10.22162/2619-0990
Final published version
https://doi.org/10.22162/2619-0990-2025-77-1-247-271
Final published version

Hui Zhu
Olga Aleksandrovna Mitrofanova

Introduction. The article presents a comparative analysis of probabilistic topic models derived from a Chinese-Russian corpus of parallel and comparable political texts. The corpus compiled for the study comprises three sub-corpora: Reports on the Work of the Government in 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia in 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims to identify and describe topics common to the whole corpus as well as topics specific to individual texts. Linguistic interpretation was carried out with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared with expert-generated annotations and automatically extracted keyphrases. The probabilistic topic modelling involves the LDA algorithm in TMT (Topic Modeling Tool), as well as the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. These algorithms are used to identify keyphrases and to find similarities in topical words across sub-corpora and between the two languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The topic modelling results are compared with the automatically extracted keyphrases and show certain intersections for each sub-corpus. The study also provides a part-of-speech (POS) tagging analysis of topical words. The models are shown to reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some of the gaps in this field. © Zhu Hui, Mitrofanova O. A., 2025.
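
The comparison of topical words with extracted keyphrases mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract names the Orange library and TMT, whereas the code below uses only the Python standard library, a textbook TF-IDF weighting, and Jaccard overlap as one possible intersection measure. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each tokenized document.

    Assumption: tf = relative frequency, idf = ln(N / df), no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    top = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top.append(ranked[:k])
    return top


def jaccard(a, b):
    """Share of items common to both collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical, already tokenized toy "sub-corpora".
    docs = [
        ["economy", "reform", "economy"],
        ["reform", "security"],
    ]
    keyphrases = tfidf_top_terms(docs, k=1)
    # Hypothetical topical words, standing in for the output of a topic model.
    topic_words = ["economy", "growth"]
    print(keyphrases, jaccard(topic_words, keyphrases[0]))
```

A term that occurs in every document receives a weight of zero under this weighting, which is why "reform" does not surface as a keyphrase in the toy data.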

Translated title	Chinese-Russian Corpus of Political Texts: A Comparative Analysis of Probabilistic Topic Models
Original language	Russian
Pages (from-to)	247-271
Number of pages	25
Journal	Oriental Studies
Volume	18
Issue number	1
DOI	https://doi.org/10.22162/2619-0990 https://doi.org/10.22162/2619-0990-2025-77-1-247-271
Publication status	Published - 18 Jun 2025

Research areas

automatic keyphrase extraction, comparative corpus, parallel corpus, political texts, POS tagging, probabilistic topic modelling, text corpus

ID: 121623352
