Abstract

The ever-expanding landscape of online communication brings together people with significantly different views in ways that were never possible before. This leads to a rise in toxicity in online comments and discussions and makes the development of means to detect such toxicity critically important. Toxic language detection is a fairly well-researched problem, and some solutions produce highly accurate predictions when sufficiently large datasets are available for training. However, such datasets are not available for every language. In this paper, we review different approaches to transferring knowledge from one language to another: machine translation, multi-lingual models, and domain adaptation. We also analyze word embedding methods (Word2Vec, FastText, GloVe, and BERT) and methods for classifying toxic comments: Naïve Bayes, Random Forest, Logistic Regression, Support Vector Machine, Majority Vote, and Recurrent Neural Networks. We demonstrate that for small datasets in the Russian language, traditional machine-learning techniques produce highly competitive results on par with deep learning methods, and that machine translation of the dataset into English produces more accurate results than multi-lingual models.
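
As a rough illustration of the "traditional machine-learning" family the abstract refers to, the following is a minimal sketch of a multinomial Naïve Bayes classifier. It is not the authors' implementation: it assumes plain bag-of-words counts with add-one smoothing instead of the embeddings listed above, and the `NaiveBayes` class, the `tokenize` helper, and the toy English training sentences are all hypothetical and invented for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumption: a real system would use a proper tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for illustration only (not from the paper).
train_texts = [
    "you are an idiot",
    "shut up you fool",
    "what a stupid idiot",
    "thank you for the help",
    "have a nice day",
    "great work thank you",
]
train_labels = ["toxic", "toxic", "toxic", "ok", "ok", "ok"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction_toxic = model.predict("you stupid fool")
prediction_ok = model.predict("thank you have a nice day")
```

On a dataset this small the predictions only show the mechanics. Any comparison with deep learning methods of the kind the abstract reports would need a real labelled corpus and a held-out evaluation set.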
Translated title of the contribution: Определение токсичности в коротких текстовых сообщениях (Detection of toxicity in short text messages)
Original language: English
Title of host publication: Speech and Computer
Subtitle of host publication: 22nd International Conference, SPECOM 2020, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Place of Publication: Cham
Publisher: Springer Nature
Pages: 315-325
ISBN (Print): 9783030602758
Publication status: Published - Oct 2020
Event: 22nd International Conference on Speech and Computer - St. Petersburg, Russia => Online, St. Petersburg
Duration: 7 Oct 2020 to 9 Oct 2020
http://specom.nw.ru/2020/program/SPECOM-ICR2020-Conference-Program-06102020.pdf

Publication series

Name: Lecture Notes in Computer Science
Volume: 12335
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Speech and Computer
Abbreviated title: SPECOM 2020
Country: Russian Federation
City: St. Petersburg
Period: 7/10/20 to 9/10/20

Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
