Very Large Russian Corpora: New Opportunities and New Challenges

Research output: Chapter in Book/Report/Conference proceeding › Article in an anthology › Research

V. Benko
V. P. Zakharov

Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. Some problems of compiling and using such corpora are addressed, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle both for corpus creators and for corpus users. We introduce the Aranea family of web corpora, describe the various processing procedures used during its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. We also compare, from the user’s perspective, the contents of the differently sized versions of the Russian Aranea with one another and with other large Russian corpora (RNC, ruTenTen and GICR). Finally, we intend to demonstrate the advantage of a very large corpus in the linguistic analysis of low-frequency language phenomena, such as the usage of idioms and other types of fixed expressions.

Original language	English
Title of host publication	Computational Linguistics and Intellectual Technologies
Publisher	Российский государственный гуманитарный университет
Pages	79-93
State	Published - 2016
Externally published	Yes

Research areas

web corpora, WaC technology, representativeness, balance, evaluation

ID: 7580628