This paper presents a crowdsourcing project to create a publicly available corpus of sentential paraphrases for Russian. Collected from news headlines, such a corpus could be applied to information extraction and text summarization. We collect news headlines from different agencies in real time; paraphrase candidates are extracted from the headlines using an unsupervised matrix similarity metric. We provide a user-friendly online interface for crowdsourced annotation, which is available at paraphraser.ru. There are 5181 annotated sentence pairs at the moment, with 4758 of them included in the corpus. The annotation process is ongoing, and the current version of the corpus is freely available at http://paraphraser.ru.

Original language: English
Title of host publication: INFORMATION RETRIEVAL, (RUSSIR 2015)
Editors: P Braslavski, Markov, P Pardalos, Y Volkovich, DI Ignatov, S Koltsov, O Koltsova
Publisher: Springer Nature
Pages: 146-157
Number of pages: 12
ISBN (Print): 978-3-319-41717-2
State: Published - 2016
Event: 9th Russian Summer School in Information Retrieval (RuSSIR) - St Petersburg
Duration: 24 Aug 2015 to 28 Aug 2015

Publication series

Name: Communications in Computer and Information Science
Publisher: SPRINGER INTERNATIONAL PUBLISHING AG
Volume: 573
ISSN (Print): 1865-0929

Conference

Conference: 9th Russian Summer School in Information Retrieval (RuSSIR)
City: St Petersburg
Period: 24/08/15 to 28/08/15

Research areas

• Russian paraphrase corpus, Lexical similarity metric, Unsupervised paraphrase extraction, Crowdsourcing

ID: 89669620