Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
High-Frequency Multiword Units and the Typological Distribution of Multiword Units in Spoken Russian. / Богданова-Бегларян, Наталья Викторовна; Шерстинова, Татьяна Юрьевна; Блинова, Ольга Владимировна; Хохлова, Мария Владимировна; Попова, Татьяна Ивановна.
Speech and Computer. SPECOM 2025. Szeged, Hungary : Springer Nature, 2025. p. 257-270 (Lecture Notes in Computer Science; Vol. 16188).
TY - GEN
T1 - High-Frequency Multiword Units and the Typological Distribution of Multiword Units in Spoken Russian
AU - Богданова-Бегларян, Наталья Викторовна
AU - Шерстинова, Татьяна Юрьевна
AU - Блинова, Ольга Владимировна
AU - Хохлова, Мария Владимировна
AU - Попова, Татьяна Ивановна
N1 - Conference code: 27
PY - 2025/11/15
Y1 - 2025/11/15
N2 - Multiword units (MWUs) constitute a distinct class of linguistic phenomena located at the crossroads of lexis and syntax. Empirical data on their typology and frequency are essential for solving a wide range of applied problems in natural language processing. This paper presents a corpus-based study of MWUs in Russian everyday speech. Drawing on data from the ORD corpus, which comprises one million words of transcribed spontaneous discourse, the authors identified and annotated over 8,000 MWU instances. These MWUs are classified into eight main classes: non-phraseologized collocations, phraseologized collocations, occasional collocations, idiom forms, constructions, precedent texts and their elements, multiword pragmatic markers, and speech formulas. The paper presents a ranked list of the 50 most frequent MWUs in spoken Russian, along with the overall distribution of MWU types. The results indicate that pragmatic markers are the dominant category (comprising over 30% of all MWUs), followed by non-phraseologized collocations (26%) and speech formulas (21%). The paper also discusses the functional combinations of MWUs in spoken interaction and highlights precedent texts as one of the productive sources for MWU formation. The quantitative data obtained in this study contribute both to theoretical models of the lexical and grammatical description of Russian everyday speech and to practical tasks related to processing and generating spontaneous spoken language.
KW - modern Russian, everyday speech, oral discourse, multiword units, collocations, pragmatic markers, precedent texts, statistical analysis, speech corpus, corpus linguistics, speech technologies
UR - http://www.scopus.com/record/display.url?eid=2-s2.0-105020258744
M3 - Conference contribution
T3 - Lecture Notes in Computer Science
SP - 257
EP - 270
BT - Speech and Computer. SPECOM 2025
PB - Springer Nature
CY - Szeged, Hungary
T2 - 27th International Conference on Speech and Computer
Y2 - 13 October 2025 through 14 October 2025
ER -
ID: 144231378