Comparative Analysis of Probabilistic Topic Models of a Chinese-Russian Corpus of Political Texts

Standard

Comparative Analysis of Probabilistic Topic Models of a Chinese-Russian Corpus of Political Texts. / Zhu, Hui; Mitrofanova, Olga Aleksandrovna.

In: Oriental Studies, Vol. 18, No. 1, 18.06.2025, pp. 247-271.

Research output: Scientific publications in periodicals › article › Peer-reviewed

BibTeX

@article{43d38037b0144db19f595b341fc65068,

title = "Comparative Analysis of Probabilistic Topic Models of a Chinese-Russian Corpus of Political Texts",

abstract = "Introduction. The article introduces a comparative analysis of probabilistic topic models derived from a Chinese-Russian corpus of parallel and comparable political texts. The corpus developed for this study includes three sub-corpora: Reports on the Work of the Government in 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia in 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims at identifying and describing topics that are common across the corpus, as well as ones specific to individual texts. Linguistic interpretation was carried out with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared to expert-generated annotations and automatically extracted keyphrases. The probabilistic topic modelling involves the LDA algorithm in TMT (Topic Modeling Tool), as well as the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. The algorithms are intended to identify keyphrases and to find similarities in topical words across different sub-corpora and between the languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The outcomes of the topic modelling are compared to the automatically extracted keyphrases and reveal certain intersections for each sub-corpus. The study also provides a part-of-speech (POS) tagging analysis of topical words. As shown, the models reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some gaps in this field. {\textcopyright} Zhu Hui, Mitrofanova O. A., 2025.",

keywords = "automatic keyphrase extraction, comparative corpus, parallel corpus, political texts, POS tagging, probabilistic topic modelling, text corpus",

author = "Hui Zhu and Mitrofanova, {Olga Aleksandrovna}",

note = "Export Date: 19 February 2026; Cited By: 0",

year = "2025",

month = jun,

day = "18",

doi = "10.22162/2619-0990",

language = "Russian",

volume = "18",

pages = "247--271",

journal = "Oriental Studies",

issn = "2619-0990",

publisher = "Kalmyk Scientific Centre of Russian Academy of Sciences",

number = "1",

}

RIS

TY - JOUR

T1 - Comparative Analysis of Probabilistic Topic Models of a Chinese-Russian Corpus of Political Texts

AU - Zhu, Hui

AU - Mitrofanova, Olga Aleksandrovna

N1 - Export Date: 19 February 2026; Cited By: 0

PY - 2025/6/18

Y1 - 2025/6/18

N2 - Introduction. The article introduces a comparative analysis of probabilistic topic models derived from a Chinese-Russian corpus of parallel and comparable political texts. The corpus developed for this study includes three sub-corpora: Reports on the Work of the Government in 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia in 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims at identifying and describing topics that are common across the corpus, as well as ones specific to individual texts. Linguistic interpretation was carried out with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared to expert-generated annotations and automatically extracted keyphrases. The probabilistic topic modelling involves the LDA algorithm in TMT (Topic Modeling Tool), as well as the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. The algorithms are intended to identify keyphrases and to find similarities in topical words across different sub-corpora and between the languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The outcomes of the topic modelling are compared to the automatically extracted keyphrases and reveal certain intersections for each sub-corpus. The study also provides a part-of-speech (POS) tagging analysis of topical words. As shown, the models reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some gaps in this field. © Zhu Hui, Mitrofanova O. A., 2025.

AB - Introduction. The article introduces a comparative analysis of probabilistic topic models derived from a Chinese-Russian corpus of parallel and comparable political texts. The corpus developed for this study includes three sub-corpora: Reports on the Work of the Government in 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia in 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims at identifying and describing topics that are common across the corpus, as well as ones specific to individual texts. Linguistic interpretation was carried out with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared to expert-generated annotations and automatically extracted keyphrases. The probabilistic topic modelling involves the LDA algorithm in TMT (Topic Modeling Tool), as well as the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. The algorithms are intended to identify keyphrases and to find similarities in topical words across different sub-corpora and between the languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The outcomes of the topic modelling are compared to the automatically extracted keyphrases and reveal certain intersections for each sub-corpus. The study also provides a part-of-speech (POS) tagging analysis of topical words. As shown, the models reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some gaps in this field. © Zhu Hui, Mitrofanova O. A., 2025.

KW - automatic keyphrase extraction

KW - comparative corpus

KW - parallel corpus

KW - political texts

KW - POS tagging

KW - probabilistic topic modelling

KW - text corpus

U2 - 10.22162/2619-0990

DO - 10.22162/2619-0990

M3 - Article

VL - 18

SP - 247

EP - 271

JO - Oriental Studies

JF - Oriental Studies

SN - 2619-0990

IS - 1

ER -

ID: 121623352