The work described in this paper aims to enable better analysis of oral history archives. Our objective was to turn the many facts and stories told by Holocaust survivors into more accessible forms. We focused on Russian oral history archives, namely the spoken interviews collected by the Yad Vashem Foundation. These valuable data sources were transcribed and summarized both automatically and manually by experts. We created a new Russian question–answer corpus (ruOHQA), a labeled data collection of oral history archives containing over 1,555 entries. The structure of SQuAD was used as the basis for data organization. This paper discusses the creation process in detail, along with the linguistic characteristics, strengths, and weaknesses of the corpus. We compare ruOHQA with the SberQuAD dataset to demonstrate our contributions and the potential for further research in this area. Particular attention was paid to the potential of the corpus for training neural network models. We thus present annotated task-oriented corpora of Holocaust testimonies in Russian.
Original language: English
Title of host publication: Literature, Language and Computing
Subtitle of host publication: Russian Contribution from the LiLaC-2023
Editors: Polina Eismont, Maria Khokhlova, Mikhail Koryshev
Publisher: Springer Nature
Chapter: 16
Pages: 183-191
Number of pages: 9
ISBN (Electronic): 978-981-96-0990-1
ISBN (Print): 978-981-96-0989-5
State: Published - 27 Mar 2025

Research areas: Corpora, Question answering, Visual history archives

ID: 133544088