Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. We address some problems in the compilation and use of such corpora, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle for both corpus creators and corpus users. We introduce the Aranea family of web corpora, describe the processing procedures used in its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. From the user’s perspective, we compare the contents of the differently sized versions of the Russian Aranea with one another and with other large Russian corpora (RNC, ruTenTen and GICR). We also intend to demonstrate the advantage of a very large corpus for the linguistic analysis of low-frequency phenomena, such as the use of idioms and other types of fixed expressions.
Original language: English
Title of host publication: Computational Linguistics and Intellectual Technologies
Publisher: Russian State University for the Humanities (Российский государственный гуманитарный университет)
Publication status: Published - 2016
Externally published: Yes

