Our main objectives are to construct a paraphrase corpus for Russian and to develop paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These sentence pairs are then annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, which is available at http://paraphraser.ru. The corpus currently contains 7480 annotated sentence pairs, and more are being added. The types and features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning) and non-paraphrases (conveying different meaning).
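
The candidate-extraction step described in the abstract (score a pair of headlines, keep it if the score satisfies a threshold) can be illustrated with a minimal sketch. The abstract does not define the matrix similarity metric or the threshold value, so everything here is an assumption made for illustration: `token_overlap` is a simple Jaccard stand-in rather than the authors' metric, `candidate_pairs` and the `0.5` threshold are hypothetical, and skipping pairs from the same agency is one reading of "headlines from different media agencies".

```python
from itertools import combinations


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens.

    Stand-in only: the paper's matrix similarity metric is not
    specified in the abstract and is NOT reproduced here.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_pairs(headlines, threshold=0.5, similarity=token_overlap):
    """Return (i, j, score) for headline pairs whose score meets the threshold.

    `headlines` is a list of (agency, text) tuples. Pairs from the same
    agency are skipped (an assumption about the corpus design).
    The threshold value is hypothetical.
    """
    pairs = []
    for (i, (src_a, text_a)), (j, (src_b, text_b)) in combinations(
        enumerate(headlines), 2
    ):
        if src_a == src_b:
            continue
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy input; agency names and headlines are invented for the example.
headlines = [
    ("agency_a", "Central bank raises key rate"),
    ("agency_b", "Central bank raises the key rate"),
    ("agency_c", "Football team wins national cup"),
]
pairs = candidate_pairs(headlines)
```

In this toy input only the first two headlines form a candidate pair. In the workflow the abstract describes, such candidates would then go to crowdsourced annotation, where each pair is labeled as a precise paraphrase, a loose paraphrase or a non-paraphrase.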

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers
Editors: Alexander Gelbukh
Publisher: Springer Nature
Pages: 573-587
Number of pages: 15
Volume: 9623 LNCS
ISBN (Print): 9783319754765
State: Published - 2018
Event: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 - Konya, Turkey
Duration: 2 Apr 2016 - 8 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9623 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
Country/Territory: Turkey
City: Konya
Period: 2/04/16 - 8/04/16

    Research areas

  • Lexical features, Low-level features, Matrix similarity metric, Paraphrase identification, Semantic features

    Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

ID: 7633707