The “One Day of Speech” speech corpus (ORD corpus) is the largest linguistic resource designed
for studies of everyday spoken Russian. Despite the high scientific potential of the ORD data, their effective use is still significantly limited because the resource is not accessible to a
wide range of online users, owing to the private nature of most of its audio recordings.
The most suitable option appears to be the web publication of selected anonymized text transcripts.
The article outlines the main difficulties that arise in preparing ORD texts for web
publication, including text anonymization and “censorship” editing, and discusses ways to
solve these problems.
Translated title of the contribution: ON THE PREPARATION FOR WEB-PUBLICATION OF “ONE DAY OF SPEECH” CORPUS OF EVERYDAY SPOKEN RUSSIAN: TEXTS ANONIMIZATION AND SELECTED WORDS ENCODING
Original language: Russian
Title of host publication: Труды международной конференции «Корпусная лингвистика-2019» [Proceedings of the International Conference “Corpus Linguistics 2019”]
Editors: V. P. Zakharov
Place of Publication: St. Petersburg
Publisher: St. Petersburg University Press
Pages: 366–372
State: Published - 2019
Event: Corpus Linguistics 2019: international scientific conference - St. Petersburg, Russian Federation
Duration: 24 Jun 2019 – 28 Jun 2019
https://events.spbu.ru/events/corpora-2019

Conference

Conference: Corpus Linguistics 2019: international scientific conference
Abbreviated title: corpora-2019
Country/Territory: Russian Federation
City: St. Petersburg
Period: 24/06/19 – 28/06/19

    Scopus subject areas

  • Computer Science Applications
  • Language and Linguistics

    Research areas

  • Russian language, everyday spoken speech, speech corpus, Internet resource, online publication, texts anonymization, word coding

ID: 51151736