RuOHQA: Creating QA Corpus in Russian Based on Oral History Archives

Links

https://link.springer.com/10.1007/978-981-96-0990-1_16

DOI

https://doi.org/10.1007/978-981-96-0990-1_16
Final published version

The work described in this paper aims to enable better analysis of oral history archives. Our objective was to turn the many facts and stories from Holocaust survivors into more accessible forms. We focused on Russian oral history archives, namely the spoken interviews collected by the Yad Vashem Foundation, and transcribed and summarized these valuable data sources both automatically and manually, with the manual work done by experts. We created a new Russian question–answer corpus (ruOHQA), a labeled data collection of oral history archives containing over 1,555 entries. The SQuAD structure was used as the basis for data organization. This paper discusses the creation process in detail, along with the linguistic characteristics, strengths, and weaknesses of the corpus. We compare ruOHQA with the SberQuAD dataset to demonstrate our contributions and the potential for further research in this area. Particular attention was paid to the potential of the corpus for training neural network models. Hence, we present an annotated, task-oriented corpus of Holocaust testimonies in Russian.
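
For readers unfamiliar with the SQuAD layout mentioned in the abstract, the following is a minimal sketch of one SQuAD v1.1-style record. The helper name make_squad_entry, the identifiers, and the English placeholder text are assumptions made for illustration; they are not taken from ruOHQA, and the corpus may differ in its details.

```python
import json


def make_squad_entry(title, context, question, answer_text, qa_id):
    """Build one SQuAD v1.1-style article holding a single paragraph and question.

    Hypothetical helper for illustration; not part of any ruOHQA tooling.
    """
    # SQuAD answers are verbatim spans of the context, located by character offset.
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError("answer must be a verbatim span of the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [
                            {"text": answer_text, "answer_start": answer_start}
                        ],
                    }
                ],
            }
        ],
    }


# Placeholder text invented for this sketch; it is not corpus data.
context = "The interview was recorded in 1996. The speaker describes her childhood."
entry = make_squad_entry(
    title="example-interview",
    context=context,
    question="When was the interview recorded?",
    answer_text="1996",
    qa_id="example-0001",
)

# Top-level SQuAD v1.1 wrapper: a version string plus a list of articles.
dataset = {"version": "1.1", "data": [entry]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Storing the answer as a character offset into the context, rather than as free text alone, is what lets extractive question-answering models be trained and scored on exact spans.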

Original language	English
Title of host publication	Literature, Language and Computing
Subtitle of host publication	Russian Contribution from the LiLaC-2023
Editors	Polina Eismont, Maria Khokhlova, Mikhail Koryshev
Publisher	Springer Nature
Chapter	16
Pages	183-191
Number of pages	9
ISBN (Electronic)	978-981-96-0990-1
ISBN (Print)	978-981-96-0989-5
DOIs	https://doi.org/10.1007/978-981-96-0990-1_16
State	Published - 27 Mar 2025

Research areas

Corpora, Question answering, Visual history archives

ID: 133544088