The paper examines practical issues in developing a speech-to-text system using deep neural networks. It describes the development of a Russian-language speech recognition system based on the DeepSpeech architecture. Mozilla's open source implementation of DeepSpeech for the English language was used as a starting point. The system was trained in a containerized environment using Docker. This made it possible to describe the entire process of assembling the components from source code, including a number of optimization techniques for CPU and GPU. Docker also makes it easy to reproduce the computation optimization tests on alternative infrastructures. We examined the use of TensorFlow XLA, which optimizes linear algebra computations during neural network training. The number of nodes in the internal layers of the neural network was optimized based on the word error rate (WER) obtained on a test data set, subject to GPU memory limitations. We studied probabilistic language models with various maximum word-sequence lengths and selected the model that gave the best WER. The study produced a Russian-language acoustic model trained on a data set comprising audio and subtitles from YouTube video clips. The language model was built from the subtitle texts and a publicly available Russian-language corpus of popular Wikipedia articles. The resulting system was tested on a data set of audio recordings of Russian literature available on voxforge.com; the best WER demonstrated by the system was 18%.
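For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is a plain whitespace split,
    with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, the 18% reported above corresponds to a value of 0.18, that is, roughly 18 word-level errors per 100 reference words.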
|Title of host publication||Distributed Computing and Grid-technologies in Science and Education 2018.|
|Status||Published - 30 Dec 2018|
|Name||CEUR Workshop Proceedings|
|Publisher||RWTH Aachen University|
|ISSN (Print)||1613-0073|