Research output: Contributions to periodicals › Article › Peer review
Языковой перенос нейросетевого обучения для частеречной разметки Санкт-Петербургского корпуса агиографических текстов. / Гудков, Вадим Вадимович; Митренина, Ольга Владимировна; Соколов, Евгений Геннадьевич; Коваль, Ангелина Александровна.
In: ВЕСТНИК САНКТ-ПЕТЕРБУРГСКОГО УНИВЕРСИТЕТА. ЯЗЫК И ЛИТЕРАТУРА, Vol. 20, No. 2, 07.2023, pp. 268–282.
TY - JOUR
T1 - Языковой перенос нейросетевого обучения для частеречной разметки Санкт-Петербургского корпуса агиографических текстов
AU - Гудков, Вадим Вадимович
AU - Митренина, Ольга Владимировна
AU - Соколов, Евгений Геннадьевич
AU - Коваль, Ангелина Александровна
PY - 2023/7
Y1 - 2023/7
N2 - The article describes an experiment on training a part-of-speech tagger with artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th–18th centuries with about 190,000 word usages, four of which were labelled manually. The bi-LSTM, distilled RuBERTtiny2 and RuBERT models were used to train a POS tagger. All of them were trained on modern Russian corpora and further fine-tuned to label Old Russian texts using a technique called language transfer. To fine-tune the transformer-based language models, it was necessary to tokenize the texts using byte pair encoding and to map tokens from the original Russian-language tokenizer to the new one by their indices. The model was then fine-tuned for the token classification task. For fine-tuning, a tagged subcorpus of three hagiographical texts was used, comprising 35,603 tokens and 2,885 sentences. The experiment took into account only part-of-speech tags: the classification included seventeen tags, thirteen of which corresponded to parts of speech, while the remaining four marked punctuation. To evaluate the quality of the models, the standard F1 and Accuracy metrics were used. According to the automatic evaluation metrics, the RuBERT model showed the best result. Most of the errors were related either to incorrect generalization of linear position patterns or to the similarity of word forms in both the leftmost and rightmost positions.
KW - агиография
KW - корпус древнерусских текстов
KW - нейросетевая разметка
KW - языковой перенос нейросетевого обучения
KW - частеречная разметка
KW - corpus of Old Russian texts
KW - hagiography
KW - language-based transfer learning
KW - neural network tagging
KW - part-of-speech tagging
UR - https://www.mendeley.com/catalogue/64598fcf-665a-3381-b26b-fe8b0e2c02cc/
U2 - 10.21638/spbu09.2023.205
DO - 10.21638/spbu09.2023.205
M3 - Article
VL - 20
SP - 268
EP - 282
JO - ВЕСТНИК САНКТ-ПЕТЕРБУРГСКОГО УНИВЕРСИТЕТА. ЯЗЫК И ЛИТЕРАТУРА
JF - ВЕСТНИК САНКТ-ПЕТЕРБУРГСКОГО УНИВЕРСИТЕТА. ЯЗЫК И ЛИТЕРАТУРА
SN - 2541-9358
IS - 2
ER -