The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified, high-quality tagging to a single format (Universal Dependencies CoNLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, data and the GICR corpus with resolved homonymy, all of which differ in their tagsets, lemmatization rules, pipeline architecture, technical solutions and error systematicity. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data); the texts are available under the CC BY-NC-SA 3.0 license.
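
As a rough illustration of what converting a source annotation into the single CoNLL-U format involves, the following minimal Python sketch renders one token as a 10-column CoNLL-U line. The mapping table, the function name `to_conllu_line`, and the sample token are assumptions made for this example only; they are not the conversion rules or tooling used by the organizers.

```python
# Hypothetical mapping from source-corpus POS tags to UD UPOS tags.
# The tags are illustrative placeholders, not the actual conversion table.
SOURCE_TO_UPOS = {"S": "NOUN", "V": "VERB", "A": "ADJ"}


def to_conllu_line(index, form, lemma, source_pos, feats):
    """Render one token as a 10-column CoNLL-U line.

    Only the morphological columns are filled; XPOS and the syntactic
    columns (HEAD, DEPREL, DEPS, MISC) are left as "_".
    """
    # Fall back to X, the UD tag for "other", when the source tag is unmapped.
    upos = SOURCE_TO_UPOS.get(source_pos, "X")
    # Features are written as Name=Value pairs sorted by name, or "_" if empty.
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    feats_str = "|".join(f"{name}={value}" for name, value in pairs) or "_"
    columns = [str(index), form, lemma, upos, "_", feats_str, "_", "_", "_", "_"]
    return "\t".join(columns)


line = to_conllu_line(1, "книги", "книга", "S", {"Number": "Plur", "Case": "Nom"})
print(line)
```

In practice each source corpus would need its own mapping for tags, features and lemmatization conventions, which is where the differences mentioned above come into play.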

Original language: English
Pages (from-to): 258-267
Number of pages: 10
Journal: Jazykovedny Casopis
Issue number: 2
State: Published - Dec 2017

    Research areas

  • Morphological parsing, Morphological tagging, Russian corpora, Shared task, Text collection, Universal dependencies

    Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

ID: 61233855