О подготовке к веб-публикации корпуса повседневной русской речи «Один речевой день»: анонимизация текстов и выборочное кодирование лексики

Татьяна Юрьевна Шерстинова

The speech corpus “One Day of Speech” (ORD corpus) is the largest linguistic resource designed
for studies of everyday spoken Russian. Despite the high scientific potential of ORD data, its use is still significantly limited because the resource is not accessible to a
wide range of online users, owing to the private nature of most of its audio recordings.
The most suitable option appears to be the web publication of selected anonymized text transcripts.
The article outlines the main difficulties that arise in preparing ORD texts for web
publication, including text anonymization and “censorship” editing, and discusses ways to
solve these problems.

Translated title of the contribution	ON THE PREPARATION FOR WEB-PUBLICATION OF “ONE DAY OF SPEECH” CORPUS OF EVERYDAY SPOKEN RUSSIAN: TEXT ANONYMIZATION AND SELECTED WORDS ENCODING
Original language	Russian
Title of host publication	Труды международной конференции «Корпусная лингвистика-2019» (Proceedings of the International Conference “Corpus Linguistics 2019”)
Editors	В.П. Захаров
Place of Publication	St. Petersburg
Publisher	Издательство Санкт-Петербургского университета (St. Petersburg University Publishing House)
Pages	366–372
State	Published - 2019
Event	Корпусная лингвистика - 2019 (Corpus Linguistics 2019): international scientific conference, St. Petersburg, Russian Federation. Duration: 24 Jun 2019 → 28 Jun 2019. https://events.spbu.ru/events/corpora-2019

Conference

Conference	Корпусная лингвистика - 2019 (Corpus Linguistics 2019): international scientific conference
Abbreviated title	corpora-2019
Country/Territory	Russian Federation
City	St. Petersburg
Period	24/06/19 → 28/06/19
Internet address	https://events.spbu.ru/events/corpora-2019

Scopus subject areas

Computer Science Applications
Language and Linguistics

Research areas

Russian language, everyday speech, speech corpus, Internet resource, online publication, text anonymization, word coding
