Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. We address some problems in the compilation and use of such corpora, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle for both corpus creators and corpus users. We introduce the Aranea family of web corpora, describe the processing procedures used during its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. We also compare, from the user’s perspective, the contents of the differently sized versions of the Russian Aranea with one another and with other large Russian corpora (RNC, ruTenTen and GICR). Finally, we intend to demonstrate the advantage of a very large corpus for the linguistic analysis of low-frequency language phenomena, such as the use of idioms and other types of fixed expressions.
Original language: English
Title of host publication: Computational Linguistics and Intellectual Technologies
Publisher: Российский государственный гуманитарный университет
Pages: 79-93
State: Published - 2016
Externally published: Yes

    Research areas

  • web corpora, WaC technology, representativeness, balance, evaluation

ID: 7580628