The ever-expanding online communication landscape brings together people with significantly different views in ways that were never possible before. This leads to a rise of toxicity in online comments and discussions and makes the development of means to detect instances of this phenomenon critically important. The toxic language detection problem is fairly well researched, and some solutions produce highly accurate predictions when sufficiently large datasets are available for training. However, such datasets are not always available for every language. In this paper, we review different ways to approach the problem by transferring knowledge from one language to another: machine translation, multi-lingual models, and domain adaptation. We also analyze word embedding methods (Word2Vec, FastText, GloVe, and BERT) and methods for classifying toxic comments: Naïve Bayes, Random Forest, Logistic Regression, Support Vector Machine, Majority Vote, and Recurrent Neural Networks. We demonstrate that for small datasets in the Russian language, traditional machine-learning techniques produce highly competitive results on par with deep learning methods, and that machine translation of the dataset into English produces more accurate results than multi-lingual models.
Translated title: Определение токсичности в коротких текстовых сообщениях (Toxicity detection in short text messages)
Original language: English
Title of host publication: Speech and Computer
Subtitle of host publication: 22nd International Conference, SPECOM 2020, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Place of publication: Cham
Publisher: Springer Nature
Pages: 315-325
ISBN (print): 9783030602758
Status: Published - Oct 2020
Event: 22nd International Conference on Speech and Computer - St. Petersburg, Russia => Online, St. Petersburg, Russian Federation
Duration: 7 Oct 2020 to 9 Oct 2020
http://specom.nw.ru/2020/program/SPECOM-ICR2020-Conference-Program-06102020.pdf

Publication series

Name: Lecture Notes in Computer Science
Volume: 12335
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 22nd International Conference on Speech and Computer
Abbreviated title: SPECOM and ICR 2020
Country/Territory: Russian Federation
City: St. Petersburg
Period: 7/10/20 to 9/10/20

    Research areas

  • Classification methods, Toxic language, Machine learning

    Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

ID: 70278848