Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review
Using Kaldi for Phonetic Transcription: Evidence from the Corpus of Spoken Russian. / Riekhakaynen, Elena; Skorobagatko, Lada.
Proceedings of the Third International Conference on Advances in Computing Research (ACR’25). Springer Nature, 2025. p. 168-178 (Lecture Notes in Networks and Systems; Vol. 1346).
TY - GEN
T1 - Using Kaldi for Phonetic Transcription: Evidence from the Corpus of Spoken Russian
AU - Riekhakaynen, Elena
AU - Skorobagatko, Lada
N1 - Conference code: 3
PY - 2025/4/16
Y1 - 2025/4/16
AB - When creating a linguistically annotated speech corpus, it is useful to have a tool for automatic phonetic transcription. We used the Kaldi toolkit to transcribe recordings of radio interviews and talk shows from the Corpus of Spoken Russian. The training set included 2466 interpausal intervals (speech fragments between two pauses), and the test set included 617. We tested 15 models with monophone training and 15 models with triphone training, using a low-dimensional dictionary that contained only allophones. The error rates ranged from 39% to 44%. Triphone training performed better on the task than monophone training. Increasing the length of the N-grams improved the results: the error rate decreased to 36%. The frequency of occurrence of an allophone does not seem to affect the accuracy of its recognition. Vowels are recognized less accurately than consonants, which is consistent with what is known about how trained phoneticians transcribe spontaneous speech.
KW - Acoustic Transcription
KW - Automatic Speech Recognition
KW - Natural Language Processing
KW - Phonetic Transcription
KW - Russian Speech
UR - https://www.mendeley.com/catalogue/60277827-7b74-3cc3-8758-0432160c02be/
U2 - 10.1007/978-3-031-87647-9_15
DO - 10.1007/978-3-031-87647-9_15
M3 - Conference contribution
SN - 9783031876462
T3 - Lecture Notes in Networks and Systems
SP - 168
EP - 178
BT - Proceedings of the Third International Conference on Advances in Computing Research (ACR’25)
PB - Springer Nature
Y2 - 7 July 2025 through 9 July 2025
ER -
ID: 138031353