The purpose of this work was to test BERT transformer-based models on homograph disambiguation in Russian, a long-standing problem in text-to-speech systems. The paper presents the different types of Russian homographs and offers an in-depth analysis of existing methods for their disambiguation. A dataset of contexts for 28 homograph pairs was drawn from the Russian National Corpus and manually annotated. Three BERT models for the Russian language were selected and tested in two experiments. The results show that these models can match and even outperform SOTA results in disambiguating homographs of all types on a relatively small training dataset. The trained models could also be used to disambiguate new pairs of intraparadigmatic homographs absent from the original dataset.
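The abstract describes training classifiers on manually annotated contexts so that a homograph's reading can be predicted from its surrounding words. The sketch below illustrates only that task setup with a toy cue-word baseline; the homograph pair, example sentences, and cue words are invented for illustration and are not the paper's dataset or its BERT-based method (which would instead fine-tune a Russian BERT on such (context, reading) pairs):

```python
# Toy illustration of the homograph-disambiguation task (not the paper's method).
# An intraparadigmatic homograph pair: "замок" can be read as
# за́мок ("castle") or замо́к ("lock") depending on context.
LABELED_CONTEXTS = [
    ("Старинный замок стоял на холме над рекой.", "за́мок"),
    ("Рыцари защищали замок от осады.", "за́мок"),
    ("Он повесил замок на дверь сарая.", "замо́к"),
    ("Ключ не подошёл, и замок пришлось сломать.", "замо́к"),
]

# Hand-picked cue words standing in for what a trained model would
# learn from annotated contexts (illustrative assumption).
CUES = {
    "за́мок": {"холме", "рыцари", "осады", "стоял"},
    "замо́к": {"дверь", "ключ", "повесил", "сломать"},
}

def disambiguate(context: str) -> str:
    """Pick the reading whose cue words overlap most with the context."""
    tokens = set(context.lower().strip(".").split())
    scores = {reading: len(tokens & cues) for reading, cues in CUES.items()}
    return max(scores, key=scores.get)
```

A BERT-based system replaces the hand-picked cues with contextual embeddings, which is what lets it generalize to homograph pairs absent from the training data.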
Original language: English
Title of host publication: Literature, Language and Computing (LiLaC): Russian Contribution from the LiLaC-2023
Publisher: Springer Nature
Pages: 73-83
Number of pages: 11
ISBN (Print): 9789819609901
DOIs
State: Published - 2025
Event: Literature, Language and Computing (LiLaC) - СПбГУ (St Petersburg State University)
Duration: 9 Nov 2023 - 11 Nov 2023
Conference number: 2

Conference

Conference: Literature, Language and Computing (LiLaC)
Period: 9/11/23 - 11/11/23

Research areas

  • BERT, Homograph disambiguation, Russian homographs, Text-to-Speech

ID: 113868018