Standard

N-gram based approach for text authorship classification: Metric selection. / Mikhailova, Elena; Diurdeva, Polina; Shalymov, Dmitry.

In: International Journal of Embedded and Real-Time Communication Systems, Vol. 8, No. 2, 01.07.2017, p. 24-39.

Research output: Scientific publications in periodicals > Article > Peer-reviewed

Harvard

Mikhailova, E, Diurdeva, P & Shalymov, D 2017, 'N-gram based approach for text authorship classification: Metric selection', International Journal of Embedded and Real-Time Communication Systems, vol. 8, no. 2, pp. 24-39. https://doi.org/10.4018/IJERTCS.2017070102

APA

Mikhailova, E., Diurdeva, P., & Shalymov, D. (2017). N-gram based approach for text authorship classification: Metric selection. International Journal of Embedded and Real-Time Communication Systems, 8(2), 24-39. https://doi.org/10.4018/IJERTCS.2017070102

Vancouver

Mikhailova E, Diurdeva P, Shalymov D. N-gram based approach for text authorship classification: Metric selection. International Journal of Embedded and Real-Time Communication Systems. 2017 Jul 1;8(2):24-39. https://doi.org/10.4018/IJERTCS.2017070102

Author

Mikhailova, Elena ; Diurdeva, Polina ; Shalymov, Dmitry. / N-gram based approach for text authorship classification: Metric selection. In: International Journal of Embedded and Real-Time Communication Systems. 2017 ; Vol. 8, No. 2. pp. 24-39.

BibTeX

@article{103c8691f1564668802f03deaf7c0ce0,
title = "N-gram based approach for text authorship classification: Metric selection",
abstract = "Automated authorship attribution is relevant for identifying the author of anonymous texts or of texts whose authorship is in doubt. It can be used in various applications, including author verification, plagiarism detection, computer forensics and others. In this article, the authors investigate an approach based on the frequencies of letter combinations for classifying documents by authorship. This technique could be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of this approach is significantly determined by the choice of metric. The research examines and compares four different distance measures between a text of unknown authorship and an author's profile: the L1 measure, Kullback-Leibler divergence, the base metric of the Common N-gram (CNG) method, and a variation of the CNG dissimilarity measure. The comparison outlines cases in which one metric outperforms the others for a specific parameter combination. Experiments are conducted on different Russian and English corpora.",
keywords = "Authorship Attribution Problem, Common N-grams, Document Classification, N-grams",
author = "Elena Mikhailova and Polina Diurdeva and Dmitry Shalymov",
year = "2017",
month = jul,
day = "1",
doi = "10.4018/IJERTCS.2017070102",
language = "English",
volume = "8",
pages = "24--39",
journal = "International Journal of Embedded and Real-Time Communication Systems",
issn = "1947-3176",
publisher = "IGI Global",
number = "2",

}

RIS

TY - JOUR

T1 - N-gram based approach for text authorship classification

T2 - Metric selection

AU - Mikhailova, Elena

AU - Diurdeva, Polina

AU - Shalymov, Dmitry

PY - 2017/7/1

Y1 - 2017/7/1

AB - Automated authorship attribution is relevant for identifying the author of anonymous texts or of texts whose authorship is in doubt. It can be used in various applications, including author verification, plagiarism detection, computer forensics and others. In this article, the authors investigate an approach based on the frequencies of letter combinations for classifying documents by authorship. This technique could be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of this approach is significantly determined by the choice of metric. The research examines and compares four different distance measures between a text of unknown authorship and an author's profile: the L1 measure, Kullback-Leibler divergence, the base metric of the Common N-gram (CNG) method, and a variation of the CNG dissimilarity measure. The comparison outlines cases in which one metric outperforms the others for a specific parameter combination. Experiments are conducted on different Russian and English corpora.

KW - Authorship Attribution Problem

KW - Common N-grams

KW - Document Classification

KW - N-grams

UR - http://www.scopus.com/inward/record.url?scp=85028073743&partnerID=8YFLogxK

U2 - 10.4018/IJERTCS.2017070102

DO - 10.4018/IJERTCS.2017070102

M3 - Article

AN - SCOPUS:85028073743

VL - 8

SP - 24

EP - 39

JO - International Journal of Embedded and Real-Time Communication Systems

JF - International Journal of Embedded and Real-Time Communication Systems

SN - 1947-3176

IS - 2

ER -

ID: 38401716