Automated authorship attribution is relevant for identifying the author of an anonymous text or of a text whose authorship is in doubt. It can be used in various applications, including author verification, plagiarism detection, and computer forensics. In this article, the authors investigate an approach based on the frequencies of letter combinations for classifying documents by authorship. This technique could also be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of the approach depends heavily on the choice of metric. The research examines and compares four distance measures between a text of unknown authorship and an author's profile: the L1 measure, the Kullback-Leibler divergence, the base metric of the Common N-gram (CNG) method, and a variation of the CNG dissimilarity measure. The comparison identifies cases in which one metric outperforms the others for a specific combination of parameters. Experiments are conducted on several Russian and English corpora.

Original language: English
Pages (from-to): 24-39
Number of pages: 16
Journal: International Journal of Embedded and Real-Time Communication Systems
Volume: 8
Issue number: 2
State: Published - 1 Jul 2017

    Scopus subject areas

  • Computer Science(all)

    Research areas

  • Authorship Attribution Problem, Common N-grams, Document Classification, N-grams

ID: 38401716