A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models

Ekaterina Pronoza, Elena Yagunova

    Research output

    Abstract

    Our main objectives are to construct a paraphrase corpus for Russian and to develop paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These sentence pairs are then annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, available at http://paraphraser.ru. The corpus currently contains 7480 annotated sentence pairs, and more are being added. The types and features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning) and non-paraphrases (conveying different meaning).
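
    The threshold-based candidate extraction described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matrix similarity metric, so the sketch substitutes plain Jaccard token overlap as a stand-in. The function names, the (agency, headline) input format, and the threshold value 0.5 are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the paper's unspecified metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def extract_candidates(headlines, threshold=0.5):
    """Return headline pairs whose similarity reaches the threshold.

    `headlines` is a list of (agency, text) tuples. Pairs from the same
    agency are skipped, on the assumption that the corpus pairs headlines
    from different media agencies. The threshold is hypothetical.
    """
    candidates = []
    for (agency_a, text_a), (agency_b, text_b) in combinations(headlines, 2):
        if agency_a == agency_b:
            continue
        score = jaccard(text_a, text_b)
        if score >= threshold:
            candidates.append((text_a, text_b, score))
    return candidates
```

    In the workflow the abstract describes, pairs that pass such a filter would then be handed to crowdsourced annotators, who assign one of the three classes (precise paraphrase, loose paraphrase, non-paraphrase).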

    Original language: English
    Title of host publication: Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers
    Publisher: Springer Nature
    Pages: 573-587
    Number of pages: 15
    Volume: 9623 LNCS
    ISBN (Print): 9783319754765
    Publication status: Published - 2018
    Event: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 - Konya
    Duration: 2 Apr 2016 to 8 Apr 2016

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9623 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
    Country: Turkey
    City: Konya
    Period: 2/04/16 to 8/04/16

    Scopus subject areas

    • Theoretical Computer Science
    • Computer Science (all)
