Introduction. The article presents a comparative analysis of probabilistic topic models derived from a Chinese-Russian corpus of parallel and comparable political texts. The corpus compiled for the study comprises three sub-corpora: Reports on the Work of the Government in 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia in 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims to identify and describe topics that are common across the corpus, as well as those specific to individual texts. Topics were interpreted linguistically with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared with expert-generated annotations and automatically extracted keyphrases. Probabilistic topic modelling was performed with the LDA algorithm in TMT (Topic Modeling Tool), together with the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. These algorithms were used to extract keyphrases and to identify similarities in topical words across sub-corpora and between the two languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The topic modelling outcomes are compared with the automatically extracted keyphrases and show certain overlaps for each sub-corpus. The study also provides a part-of-speech (POS) tagging analysis of topical words. The models are shown to reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some of the gaps in this field. © Zhu Hui, Mitrofanova О. А., 2025.
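The comparison of topical words with automatically extracted keyphrases can be illustrated with a minimal sketch. It is not the authors' pipeline: it does not use TMT, Orange, YAKE, or mBERT, and instead assumes a plain-Python TF-IDF ranking and a Jaccard overlap measure. The token lists, the function names `tfidf_keywords` and `overlap`, and the topic word list are all invented for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


def overlap(topic_words, keywords):
    """Jaccard overlap between a topic's top words and extracted keywords."""
    a, b = set(topic_words), set(keywords)
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy, invented token lists standing in for documents of one sub-corpus.
docs = [
    ["economy", "growth", "reform", "economy", "market"],
    ["security", "defence", "reform", "security", "army"],
    ["education", "science", "reform", "education", "school"],
]
keywords = tfidf_keywords(docs, top_n=2)

# Hypothetical top words of one topic, as an LDA model might return them.
hypothetical_topic = ["economy", "market", "investment"]
score = overlap(hypothetical_topic, keywords[0])
```

In this toy setup a word that occurs in every document, such as "reform", receives a TF-IDF score of zero and never appears among the keywords, which is why the overlap is computed against the more document-specific terms.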
Translated title: Chinese-Russian Corpus of Political Texts: A Comparative Analysis of Probabilistic Topic Models
Original language: Russian
Pages (from-to): 247-271
Number of pages: 25
Journal: Oriental Studies
Volume: 18
Issue number: 1
DOI
Status: Published - 18 Jun 2025

    Research areas

  • automatic keyphrase extraction, comparative corpus, parallel corpus, political texts, POS tagging, probabilistic topic modelling, text corpus

ID: 121623352