Multiword units (MWUs) constitute a distinct class of linguistic phenomena at the intersection of lexis and syntax. Empirical data on their typology and frequency are essential for a wide range of applied problems in natural language processing. This paper presents a corpus-based study of MWUs in Russian everyday speech. Drawing on the ORD corpus, which comprises one million words of transcribed spontaneous discourse, the study identified and annotated over 8,000 MWU instances. These MWUs fall into eight main classes: non-phraseologized collocations, phraseologized collocations, occasional collocations, idiom forms, constructions, precedent texts and their elements, multiword pragmatic markers, and speech formulas. The paper presents a ranked list of the 50 most frequent MWUs in spoken Russian, along with the overall distribution of MWU types. The results indicate that pragmatic markers are the largest category (over 30% of all MWUs), followed by non-phraseologized collocations (26%) and speech formulas (21%). The article also discusses the functional combinations of MWUs in spoken interaction and highlights precedent texts as one productive source of MWU formation. The quantitative data obtained contribute both to theoretical models of the lexical and grammatical description of Russian everyday speech and to practical tasks in processing and generating spontaneous spoken language.
Translated title: High-Frequency Multiword Units and the Typological Distribution of Multiword Units in Spoken Russian
Original language: English
Title of host publication: Speech and Computer. SPECOM 2025
Place of publication: Szeged, Hungary
Publisher: Springer Nature
Pages: 257-270
Publication status: Published - 15 Nov 2025
Event: 27th International Conference on Speech and Computer - Szeged, Hungary
Duration: 13 Oct 2025 - 14 Oct 2025
Conference number: 27
https://specom.inf.u-szeged.hu/

Publication series

Name: Lecture Notes in Computer Science
Volume: 16188

Conference

Conference: 27th International Conference on Speech and Computer
Abbreviated title: Specom 2025
Country/Territory: Hungary
City: Szeged
Period: 13/10/25 - 14/10/25
Other: 27th International Conference on Speech and Computer (SPECOM 2025)

Research areas

  • modern Russian, everyday speech, oral discourse, multiword units, collocations, pragmatic markers, precedent texts, statistical analysis, speech corpus, corpus linguistics, speech technologies

ID: 144231378