Determining the author of a text through text analysis is a challenge that has engaged researchers since the last century. With the growth of social networks and of platforms for publishing web posts and articles on the Internet, the task of authorship identification has become even more pressing. Specialists in journalism and law are particularly interested in more accurate approaches for resolving disputes over texts of doubtful authorship. In this article, the authors compare the applicability of eight modern machine learning algorithms (Support Vector Machine, Naive Bayes, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Multilayer Perceptron, and Gradient Boosting Classifier) to the classification of a collection of Russian web posts. The best results were achieved with Logistic Regression, Multilayer Perceptron, and Support Vector Machine with a linear kernel, using a combination of part-of-speech and word n-grams as features.
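
As an illustration of the feature combination named in the abstract, the following is a minimal sketch of counting word n-grams and part-of-speech n-grams into a single feature dictionary. It is not the authors' implementation: the function names, the key prefixes, the choice of n = 1 and 2, and the sample tokens and tags are all assumptions made for this example, and the part-of-speech tags are assumed to come from some external tagger.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def combined_features(tokens, pos_tags, n_values=(1, 2)):
    """Count word n-grams and POS n-grams in one feature dictionary.

    Keys are prefixed ("w:" for words, "p:" for POS tags) so that the
    two feature families cannot collide. The prefixes and the default
    n values are illustrative assumptions, not taken from the paper.
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    features = Counter()
    for n in n_values:
        for gram in ngrams(tokens, n):
            features["w:" + " ".join(gram)] += 1
        for gram in ngrams(pos_tags, n):
            features["p:" + " ".join(gram)] += 1
    return features


# Hypothetical pre-tagged input; a real pipeline would obtain the tags
# from a part-of-speech tagger for Russian.
tokens = ["мама", "мыла", "раму"]
pos_tags = ["NOUN", "VERB", "NOUN"]
features = combined_features(tokens, pos_tags)
```

A dictionary of this kind, computed per post, could then be vectorized and passed to any of the classifiers listed above.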

Original language: English
Title of host publication: Databases and Information Systems - 13th International Baltic Conference, DB and IS 2018, Proceedings
Editors: Olegas Vasilecas, Gintautas Dzemyda, Audrone Lupeikiene
Publisher: Springer Nature
Pages: 314-327
Number of pages: 14
ISBN (Print): 9783319975702
State: Published - 1 Jan 2018
Event: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018 - Trakai, Lithuania
Duration: 1 Jul 2018 to 4 Jul 2018

Publication series

Name: Communications in Computer and Information Science
Volume: 838
ISSN (Print): 1865-0929

Conference

Conference: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018
Country/Territory: Lithuania
City: Trakai
Period: 1/07/18 to 4/07/18

    Research areas

  • Author attribution, Frequency author profile, Text classification

    Scopus subject areas

  • Computer Science(all)
  • Mathematics(all)

ID: 38400560