Multiword units (MWUs) constitute a distinct class of linguistic phenomena located at the crossroads of lexis and syntax. Empirical data on their typology and frequency are essential for solving a wide range of applied problems in natural language processing. This paper presents a corpus-based study of MWUs in Russian everyday speech. Drawing on data from the ORD corpus comprising one million words of transcribed spontaneous discourse, the study identified and annotated over 8,000 MWU instances. These MWUs are classified into eight main classes: non-phraseologized collocations, phraseologized collocations, occasional collocations, idiom forms, constructions, precedent texts and their elements, multiword pragmatic markers, and speech formulas. The paper presents a ranked list of the 50 most frequent MWUs in spoken Russian, along with the overall distribution of MWU types. The results indicate that pragmatic markers are the dominant category (over 30% of all MWUs), followed by non-phraseologized collocations (26%) and speech formulas (21%). The article also discusses the functional combinations of MWUs in spoken interaction and highlights precedent texts as a productive source of MWU formation. The quantitative data obtained in this study contribute both to theoretical models of the lexical and grammatical description of Russian everyday speech and to practical tasks in processing and generating spontaneous spoken language.
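Translated title of the contribution: Высокочастотные неоднословные единицы и типологическое распределение неоднословных единиц в разговорном русском языке (High-frequency multiword units and the typological distribution of multiword units in spoken Russian)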
Original language: English
Title of host publication: Speech and Computer. SPECOM 2025
Place of Publication: Szeged, Hungary
Publisher: Springer Nature
Pages: 257-270
State: Published - 15 Nov 2025
Event: 27th International Conference on Speech and Computer - Szeged, Hungary
Duration: 13 Oct 2025 to 14 Oct 2025
Conference number: 27
https://specom.inf.u-szeged.hu/

Publication series

Name: Lecture Notes in Computer Science
Volume: 16188

Conference

Conference: 27th International Conference on Speech and Computer
Abbreviated title: SPECOM 2025
Country/Territory: Hungary
City: Szeged
Period: 13/10/25 to 14/10/25
Internet address: https://specom.inf.u-szeged.hu/

ID: 144231378