The purpose of this work was to test BERT transformer-based models for homograph disambiguation in Russian—a long-standing issue in Text-To-Speech systems. The paper presents different types of Russian homographs and offers an in-depth analysis of existing methods for their disambiguation. A dataset of contexts from the Russian National Corpus for 28 homograph pairs was created and manually annotated. Three BERT models for the Russian language were selected and tested in two experiments. The results show that these models can match and even outperform SOTA results in disambiguating homographs of all types on a relatively small training dataset. The pretrained models can also be used to disambiguate new pairs of intraparadigmatic homographs absent from the original dataset.
Original language: English
Title of host publication: Literature, Language and Computing (LiLaC): Russian Contribution from the LiLaC-2023
Publisher: Springer Nature
Pages: 73-83
Number of pages: 11
ISBN (print): 9789819609901
DOI
Status: Published - 2025
Event: Literature, Language and Computing Technologies (LiLaC) - St Petersburg State University
Duration: 9 Nov 2023 - 11 Nov 2023
Conference number: 2

Conference

Conference: Literature, Language and Computing Technologies (LiLaC)
Period: 9/11/23 - 11/11/23

ID: 113868018