Construction of a Russian Paraphrase Corpus

DOI

https://doi.org/10.1007/978-3-319-41718-9_8
Final published version

This paper presents a crowdsourcing project to create a publicly available corpus of sentential paraphrases for Russian. Collected from news headlines, such a corpus could be applied to information extraction and text summarization. We collect news headlines from different agencies in real time; paraphrase candidates are extracted from the headlines using an unsupervised matrix similarity metric. We provide a user-friendly online interface for crowdsourced annotation, which is available at paraphraser.ru. At the time of writing, 5181 sentence pairs have been annotated, 4758 of which are included in the corpus. Annotation is ongoing, and the current version of the corpus is freely available at http://paraphraser.ru.

Original language	English
Title of host publication	INFORMATION RETRIEVAL (RUSSIR 2015)
Editors	P Braslavski, Markov, P Pardalos, Y Volkovich, DI Ignatov, S Koltsov, O Koltsova
Publisher	Springer Nature
Pages	146-157
Number of pages	12
ISBN (Print)	978-3-319-41717-2
DOIs	https://doi.org/10.1007/978-3-319-41718-9_8
State	Published - 2016
Event	9th Russian Summer School in Information Retrieval (RuSSIR), St Petersburg, 24 Aug 2015 → 28 Aug 2015

Publication series

Name	Communications in Computer and Information Science
Publisher	SPRINGER INTERNATIONAL PUBLISHING AG
Volume	573
ISSN (Print)	1865-0929

Conference

Conference	9th Russian Summer School in Information Retrieval (RuSSIR)
City	St Petersburg
Period	24/08/15 → 28/08/15

Research areas

Russian paraphrase corpus, Lexical similarity metric, Unsupervised paraphrase extraction, Crowdsourcing

ID: 89669620