The ever-expanding online communication landscape lets people with significantly different views cross paths in ways that were never possible before. This leads to a rise in toxicity in online comments and discussions and makes the development of means to detect instances of this phenomenon critically important. The toxic language detection problem is fairly well researched, and some solutions produce highly accurate predictions when sufficiently large datasets are available for training. However, such datasets are not always available for every language. In this paper, we review different ways to approach the problem by transferring knowledge from one language to another: machine translation, multi-lingual models, and domain adaptation. We also analyze word embedding methods, such as Word2Vec, FastText, GloVe, and BERT, and methods for classifying toxic comments: Naïve Bayes, Random Forest, Logistic Regression, Support Vector Machine, Majority Vote, and Recurrent Neural Networks. We demonstrate that, for small datasets in the Russian language, traditional machine-learning techniques produce highly competitive results on par with deep learning methods, and that machine translation of the dataset into English produces more accurate results than multi-lingual models.
Translated title of the contribution: Определение токсичности в коротких текстовых сообщениях
Original language: English
Title of host publication: Speech and Computer
Subtitle of host publication: 22nd International Conference, SPECOM 2020, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Place of Publication: Cham
Publisher: Springer Nature
Pages: 315-325
ISBN (Print): 9783030602758
State: Published - Oct 2020
Event: 22nd International Conference on Speech and Computer, SPECOM 2020 - St. Petersburg, Russia => Online, St. Petersburg, Russian Federation
Duration: 7 Oct 2020 - 9 Oct 2020
http://specom.nw.ru/2020/program/SPECOM-ICR2020-Conference-Program-06102020.pdf

Publication series

Name: Lecture Notes in Computer Science
Volume: 12335
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Speech and Computer, SPECOM 2020
Abbreviated title: SPECOM 2020
Country/Territory: Russian Federation
City: St. Petersburg
Period: 7/10/20 - 9/10/20

    Research areas

  • Toxic language, Machine learning, Natural language processing, Classification methods, Multi-lingual models, Domain adaptation, Word embedding, Machine translation

    Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

ID: 70278848