Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Detection of Toxic Language in Short Text Messages. / Makhnytkina, Olesia; Matveev, Anton; Bogoradnikova, Darya; Lizunova, Inna; Maltseva, Anna; Shilkina, Natalia.
Speech and Computer : 22nd International Conference, SPECOM 2020, Proceedings. ed. / Alexey Karpov; Rodmonga Potapova. Cham : Springer Nature, 2020. pp. 315-325 (Lecture Notes in Computer Science; Vol. 12335).
TY - GEN
T1 - Detection of Toxic Language in Short Text Messages
AU - Makhnytkina, Olesia
AU - Matveev, Anton
AU - Bogoradnikova, Darya
AU - Lizunova, Inna
AU - Maltseva, Anna
AU - Shilkina, Natalia
N1 - Makhnytkina O., Matveev A., Bogoradnikova D., Lizunova I., Maltseva A., Shilkina N. (2020) Detection of Toxic Language in Short Text Messages. In: Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_31
PY - 2020/10
Y1 - 2020/10
N2 - The ever-expanding online communication landscape brings people with significantly different views into contact in ways that were never possible before. This leads to a rise in toxicity in online comments and discussions and makes the development of means to detect such instances critically important. The toxic language detection problem is fairly well researched, and some solutions produce highly accurate predictions when sufficiently large datasets are available for training. However, such datasets are not available for every language. In this paper, we review different approaches to the problem that transfer knowledge from one language to another: machine translation, multi-lingual models, and domain adaptation. We also analyze word embedding methods such as Word2Vec, FastText, GloVe, and BERT, and methods for classifying toxic comments: Naïve Bayes, Random Forest, Logistic Regression, Support Vector Machine, Majority Vote, and Recurrent Neural Networks. We demonstrate that for small datasets in the Russian language, traditional machine-learning techniques produce highly competitive results on par with deep learning methods, and also that machine translation of the dataset into English produces more accurate results than multi-lingual models.
KW - Classification methods
KW - Toxic language
KW - Machine learning
KW - Natural language processing
KW - Multi-lingual models
KW - Domain adaptation
KW - Word embedding
KW - Machine translation
UR - http://www.scopus.com/inward/record.url?scp=85092889926&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/d27a9d86-6b08-38dc-b9fb-77fbc58630de/
U2 - 10.1007/978-3-030-60276-5_31
DO - 10.1007/978-3-030-60276-5_31
M3 - Conference contribution
AN - SCOPUS:85092889926
SN - 9783030602758
T3 - Lecture Notes in Computer Science
SP - 315
EP - 325
BT - Speech and Computer
A2 - Karpov, Alexey
A2 - Potapova, Rodmonga
PB - Springer Nature
CY - Cham
T2 - 22nd International Conference on Speech and Computer
Y2 - 7 October 2020 through 9 October 2020
ER -
ID: 70278848