Standard

Identifying errors in Russian web corpora. / Khokhlova, Maria.

In: Jazykovedny Casopis, Vol. 72, No. 4, 01.06.2022, p. 977-985.

Research output: Contribution to journal › Article › peer-review

Harvard

Khokhlova, M 2022, 'Identifying errors in Russian web corpora', Jazykovedny Casopis, vol. 72, no. 4, pp. 977-985.

APA

Khokhlova, M. (2022). Identifying errors in Russian web corpora. Jazykovedny Casopis, 72(4), 977-985.

Vancouver

Khokhlova M. Identifying errors in Russian web corpora. Jazykovedny Casopis. 2022 Jun 1;72(4):977-985.

Author

Khokhlova, Maria. / Identifying errors in Russian web corpora. In: Jazykovedny Casopis. 2022 ; Vol. 72, No. 4. pp. 977-985.

BibTeX

@article{f4a31ea74ba64465a1ba9d3718d539d7,
title = "Identifying errors in Russian web corpora",
abstract = "The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such {"}noisy{"} fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.",
keywords = "corpora, errors, orthography, Russian language, typography, typos, web texts",
author = "Maria Khokhlova",
note = "Publisher Copyright: {\textcopyright} 2022 Maria Khokhlova, published by Sciendo.",
year = "2022",
month = jun,
day = "1",
language = "English",
volume = "72",
pages = "977--985",
journal = "Jazykovedny Casopis",
issn = "0021-5597",
publisher = "De Gruyter",
number = "4",

}

RIS

TY - JOUR

T1 - Identifying errors in Russian web corpora

AU - Khokhlova, Maria

N1 - Publisher Copyright: © 2022 Maria Khokhlova, published by Sciendo.

PY - 2022/6/1

Y1 - 2022/6/1

N2 - The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such "noisy" fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.

AB - The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such "noisy" fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.

KW - corpora

KW - errors

KW - orthography

KW - Russian language

KW - typography

KW - typos

KW - web texts

UR - http://www.scopus.com/inward/record.url?scp=85138616209&partnerID=8YFLogxK

UR - https://www.ceeol.com/search/article-detail?id=1061814

M3 - Article

AN - SCOPUS:85138616209

VL - 72

SP - 977

EP - 985

JO - Jazykovedny Casopis

JF - Jazykovedny Casopis

SN - 0021-5597

IS - 4

ER -

ID: 99230450