Research output: Contribution to journal › Article › peer-review
N-gram based approach for text authorship classification: Metric selection. / Mikhailova, Elena; Diurdeva, Polina; Shalymov, Dmitry.
In: International Journal of Embedded and Real-Time Communication Systems, Vol. 8, No. 2, 01.07.2017, p. 24-39.
TY - JOUR
T1 - N-gram based approach for text authorship classification
T2 - Metric selection
AU - Mikhailova, Elena
AU - Diurdeva, Polina
AU - Shalymov, Dmitry
PY - 2017/7/1
Y1 - 2017/7/1
AB - Automated authorship attribution is relevant for identifying the author of anonymous texts or of texts whose authorship is in doubt. It can be used in various applications, including author verification, plagiarism detection, and computer forensics. In this article, the authors investigate an approach based on the frequencies of letter combinations (N-grams) for classifying documents by authorship. This technique could be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of the approach depends significantly on the choice of metric. The research examines and compares four distance measures between a text of unknown authorship and an author's profile: the L1 measure, Kullback-Leibler divergence, the base metric of the Common N-gram (CNG) method, and a variation of the CNG dissimilarity measure. The comparison identifies cases in which one metric outperforms the others for a specific combination of parameters. Experiments are conducted on several Russian and English corpora.
KW - Authorship Attribution Problem
KW - Common N-grams
KW - Document Classification
KW - N-grams
UR - http://www.scopus.com/inward/record.url?scp=85028073743&partnerID=8YFLogxK
U2 - 10.4018/IJERTCS.2017070102
DO - 10.4018/IJERTCS.2017070102
M3 - Article
AN - SCOPUS:85028073743
VL - 8
SP - 24
EP - 39
JO - International Journal of Embedded and Real-Time Communication Systems
JF - International Journal of Embedded and Real-Time Communication Systems
SN - 1947-3176
IS - 2
ER -
ID: 38401716
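
The abstract compares distance measures between the N-gram profile of an unattributed text and each author's profile. The following is a minimal Python sketch of that idea, not the authors' implementation. All function names, the default N-gram length, the optional profile size, and the additive smoothing used for the Kullback-Leibler divergence are assumptions made for illustration. The CNG measure follows the commonly cited relative-difference form of the Common N-gram method. The fourth measure (the variation of the CNG dissimilarity) is not specified in the abstract and is therefore left out.

```python
from collections import Counter
import math


def ngram_profile(text, n=3, size=None):
    """Relative frequencies of character n-grams in `text`.

    If `size` is given, only the `size` most frequent n-grams are kept
    (hypothetical parameter; the record does not state the profile size).
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}


def l1_distance(p, q):
    """Sum of absolute frequency differences over the union of n-grams."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def kl_divergence(p, q, eps=1e-6):
    """Kullback-Leibler divergence D(p || q).

    Assumption: additive smoothing with `eps`, followed by renormalization,
    so that n-grams missing from one profile do not produce log(0).
    """
    keys = set(p) | set(q)
    ps = {g: p.get(g, 0.0) + eps for g in keys}
    qs = {g: q.get(g, 0.0) + eps for g in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum((ps[g] / zp) * math.log((ps[g] / zp) / (qs[g] / zq)) for g in keys)


def cng_dissimilarity(p, q):
    """CNG base measure: squared difference relative to the mean frequency."""
    total = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        total += ((fp - fq) / ((fp + fq) / 2.0)) ** 2
    return total


def attribute(text, author_profiles, distance, n=3, size=None):
    """Return the author whose profile is closest to the profile of `text`."""
    profile = ngram_profile(text, n, size)
    return min(author_profiles, key=lambda a: distance(profile, author_profiles[a]))
```

A call such as `attribute(unknown_text, profiles, cng_dissimilarity)` returns the key of the closest profile, where `profiles` maps each candidate author to a profile built with `ngram_profile` from that author's known texts. Passing a different distance function swaps the metric without changing the rest of the pipeline, which is the kind of comparison the abstract describes.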