Research output: Contribution to journal › Article › Peer-reviewed
Identifying errors in Russian web corpora. / Khokhlova, Maria.
In: Jazykovedny Casopis, Vol. 72, No. 4, 01.06.2022, pp. 977-985.
TY - JOUR
T1 - Identifying errors in Russian web corpora
AU - Khokhlova, Maria
N1 - Publisher Copyright: © 2022 Maria Khokhlova, published by Sciendo.
PY - 2022/6/1
Y1 - 2022/6/1
N2 - The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such "noisy" fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.
AB - The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such "noisy" fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.
KW - corpora
KW - errors
KW - orthography
KW - Russian language
KW - typography
KW - typos
KW - web texts
UR - http://www.scopus.com/inward/record.url?scp=85138616209&partnerID=8YFLogxK
UR - https://www.ceeol.com/search/article-detail?id=1061814
M3 - Article
AN - SCOPUS:85138616209
VL - 72
SP - 977
EP - 985
JO - Jazykovedny Casopis
JF - Jazykovedny Casopis
SN - 0021-5597
IS - 4
ER -
ID: 99230450