Automatic Annotation of Discourse and Speech Formulas in Internet Communication: A Telegram Comment Corpus

DOI

https://doi.org/10.1007/978-3-032-07956-5_20
Final published version

Татьяна Ивановна Попова
Александра Масленикова

This article presents a system for automatically processing user comments in order to annotate the speech and discourse formulas that are actively used in everyday interaction, including digital communication. A Python-based program using the Telegram API was developed to automate the collection, filtering, and annotation of empirical data. In addition to building a corpus of user comments, the study evaluated the results of the automatic processing. The source material was drawn from the Telegram news channel Fontanka SPB Online. Automatic processing extracted 70 speech and discourse formulas, which were grouped by source lexicon. The classification of these multiword units draws on the findings of two research projects: the construction of the Pragmaticon in Moscow and the annotation of stable multiword units in Saint Petersburg. Automatic annotation made it possible to identify formulas with a high pragmatic load and to capture their specific functions in internet communication. For example, semantic irony was observed in the use of formulas such as ‘khorosho’ (‘fine’) and ‘bez problem’ (‘no problem’), which traditionally indicate agreement. The most frequent types of user response expressed by the formulas were affirmation and negation. The results demonstrate the potential of the automatic approach for describing speech and discourse formulas in digital discourse and highlight the need to refine existing classifications of speech acts.
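The annotation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes comments have already been collected as plain strings, it uses simple whole-word regular-expression matching, and the lexicon labels (`lexicon_A`, `lexicon_B`) and all function names are hypothetical. Only the two formulas themselves ('khorosho' and 'bez problem', given here in Cyrillic) come from the abstract.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping a formula to a source-lexicon label.
# The two formulas appear in the abstract; the labels are placeholders
# and do not reflect how the study actually grouped them.
FORMULAS = {
    "хорошо": "lexicon_A",
    "без проблем": "lexicon_B",
}


def build_pattern(formulas):
    """Compile one case-insensitive pattern that matches any formula as a whole unit."""
    # Longest formulas first, so a multiword unit is preferred over a shorter overlap.
    ordered = sorted(formulas, key=len, reverse=True)
    alternation = "|".join(re.escape(f) for f in ordered)
    # The lookarounds reject matches inside longer words (e.g. "нехорошо").
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def annotate(comment, pattern, formulas):
    """Return one annotation record per formula occurrence in a comment."""
    return [
        {
            "formula": m.group(0).lower(),
            "source": formulas[m.group(0).lower()],
            "start": m.start(),
            "end": m.end(),
        }
        for m in pattern.finditer(comment)
    ]


def count_formulas(comments, formulas=FORMULAS):
    """Count formula occurrences across a list of comments."""
    pattern = build_pattern(formulas)
    counts = Counter()
    for comment in comments:
        for hit in annotate(comment, pattern, formulas):
            counts[hit["formula"]] += 1
    return counts
```

A matcher of this kind only locates candidate formulas. It cannot tell a literal use of 'khorosho' from an ironic one, which is why the automatic results still need the kind of evaluation the study reports.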

Original language	Russian
Title of host publication	Speech and Computer. SPECOM 2025
Place of Publication	Szeged, Hungary
Publisher	Springer Nature
Pages	278-292
Number of pages	15
ISBN (Print)	9783032079558
DOIs	https://doi.org/10.1007/978-3-032-07956-5_20
State	Published - 2026
Event	27th International Conference on Speech and Computer - Szeged, Hungary. Duration: 13 Oct 2025 → 14 Oct 2025. Conference number: 27. https://specom.inf.u-szeged.hu/

Publication series

Name	Lecture Notes in Computer Science
Number	16187

Conference

Conference	27th International Conference on Speech and Computer
Abbreviated title	SPECOM 2025
Country/Territory	Hungary
City	Szeged
Period	13/10/25 → 14/10/25
Internet address	https://specom.inf.u-szeged.hu/

Research areas

Automatic Annotation, Corpus Linguistics, Discourse Formulas, Internet Comment, Internet Discourse, Modern Russian, Speech Formulas, Statistical Analysis

ID: 144722668