The paper addresses collocability and collocations in Russian and surveys a wide range of dictionaries, both printed and online, that describe collocations. Our project builds a database that will include both dictionary and statistical collocations. The former are described in various lexicographic resources, whereas the latter can be extracted automatically from corpora. Dictionaries differ from one another and present information in different ways, which makes it hard for language learners and researchers to obtain data. A number of dictionaries were analyzed and processed to retrieve verified collocations; however, the overlap between the lists of collocations extracted from them is still rather small. This indicates a need for a unified resource that takes collocability into account and provides more examples. The proposed resource will also be useful for linguists and for studying Russian as a foreign language. The results may be important for machine learning and other NLP tasks, for instance automatic clustering of word combinations and disambiguation.

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Place of Publication: Paris
Publisher: European Language Resources Association (ELRA)
Pages: 3198-3206
Number of pages: 9
ISBN (Electronic): 9791095546344
ISBN (Print): 9791095546344
State: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation - Marseille, France
Duration: 11 May 2020 to 16 May 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation
Abbreviated title: LREC 2020
Country/Territory: France
City: Marseille
Period: 11/05/20 to 16/05/20

Scopus subject areas

• Education
• Library and Information Sciences
• Language and Linguistics
• Linguistics and Language

Research areas

• Collocations, Lexical database, Russian dictionaries, Parallel corpus, Low-resource language, Wolof, Neural machine translation, Word embeddings

ID: 61200560