Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review
RuOHQA: Creating QA Corpus in Russian Based on Oral History Archives. / Bukreeva, Liudmila; Guseva, Daria; Dolgushin, Mikhail; Evdokimova, Vera; Obotnina, Vasilisa.
Literature, Language and Computing: Russian Contribution from the LiLaC-2023. ed. / Polina Eismont; Maria Khokhlova; Mikhail Koryshev. Springer Nature, 2025. p. 183-191.
TY - CHAP
T1 - RuOHQA: Creating QA Corpus in Russian Based on Oral History Archives
AU - Bukreeva, Liudmila
AU - Guseva, Daria
AU - Dolgushin, Mikhail
AU - Evdokimova, Vera
AU - Obotnina, Vasilisa
PY - 2025/3/27
Y1 - 2025/3/27
N2 - The work described in this paper aims to enable better analysis of oral history archives. Our objective was to turn many facts and stories from Holocaust survivors into more accessible forms. We focused on Russian oral history archives, namely the spoken interviews collected by the Yad Vashem Foundation, and transcribed and summarized these valuable data sources both automatically and manually by experts. We created a new Russian question–answer corpus (ruOHQA), a labeled data collection of oral history archives containing over 1,555 entries. The structure of SQuAD was used as the basis for data organization. This paper discusses the detailed creation process, linguistic characteristics, strengths, and weaknesses of the corpus. We compare ruOHQA with the SberQuAD dataset in order to clearly demonstrate our contributions and the potential for further research in this area. Particular attention was paid to the potential of the corpus for training neural network models. Hence, we present the annotated task-oriented corpora of Holocaust testimonies in Russian.
AB - The work described in this paper aims to enable better analysis of oral history archives. Our objective was to turn many facts and stories from Holocaust survivors into more accessible forms. We focused on Russian oral history archives, namely the spoken interviews collected by the Yad Vashem Foundation, and transcribed and summarized these valuable data sources both automatically and manually by experts. We created a new Russian question–answer corpus (ruOHQA), a labeled data collection of oral history archives containing over 1,555 entries. The structure of SQuAD was used as the basis for data organization. This paper discusses the detailed creation process, linguistic characteristics, strengths, and weaknesses of the corpus. We compare ruOHQA with the SberQuAD dataset in order to clearly demonstrate our contributions and the potential for further research in this area. Particular attention was paid to the potential of the corpus for training neural network models. Hence, we present the annotated task-oriented corpora of Holocaust testimonies in Russian.
KW - Corpora
KW - Question answering
KW - Visual history archives
UR - https://www.mendeley.com/catalogue/da248539-0d9a-3218-9c51-6846943fae83/
U2 - 10.1007/978-981-96-0990-1_16
DO - 10.1007/978-981-96-0990-1_16
M3 - Chapter
SN - 978-981-96-0989-5
SP - 183
EP - 191
BT - Literature, Language and Computing
A2 - Eismont, Polina
A2 - Khokhlova, Maria
A2 - Koryshev, Mikhail
PB - Springer Nature
ER -
ID: 133544088