Very Large Russian Corpora: New Opportunities and New Challenges. / Benko, V.; Zakharov, V. P.
Computational Linguistics and Intellectual Technologies. Российский государственный гуманитарный университет, 2016. p. 79-93.
Research output: Chapter in Book/Report/Conference proceeding › Article in an anthology › Research
TY - CHAP
T1 - Very Large Russian Corpora: New Opportunities and New Challenges
AU - Benko, V.
AU - Zakharov, V. P.
PY - 2016
Y1 - 2016
N2 - Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. We address some problems of compiling and using such corpora, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle for both corpus creators and corpus users. We introduce the Aranea family of web corpora, describe the processing procedures used during its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. We also compare, from the user’s perspective, the contents of the various sizes of the Russian Aranea with one another and with other large Russian corpora (RNC, ruTenTen and GICR). Finally, we intend to demonstrate the advantage of a very large corpus in the linguistic analysis of low-frequency language phenomena, such as the usage of idioms and other types of fixed expressions.
AB - Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. We address some problems of compiling and using such corpora, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle for both corpus creators and corpus users. We introduce the Aranea family of web corpora, describe the processing procedures used during its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. We also compare, from the user’s perspective, the contents of the various sizes of the Russian Aranea with one another and with other large Russian corpora (RNC, ruTenTen and GICR). Finally, we intend to demonstrate the advantage of a very large corpus in the linguistic analysis of low-frequency language phenomena, such as the usage of idioms and other types of fixed expressions.
KW - web corpora
KW - WaC technology
KW - representativeness
KW - balance
KW - evaluation
M3 - Article in an anthology
SP - 79
EP - 93
BT - Computational Linguistics and Intellectual Technologies
PB - Российский государственный гуманитарный университет
ER -
ID: 7580628