In this paper, we address the problem of literary writing style determination using a comparison of the randomness of two given texts. We attempt to comprehend if these texts are generated from distinct probability sources that can reveal a difference between the literary writing styles of the corresponding authors. We propose a new approach based on the incorporation of the known Friedman-Rafsky two-sample test into a multistage procedure with the aim of stabilizing the process. A sampling pro cedure constructed by applying the N-grams methodology is applied to simulate samples drawn from the pooled text with the aim of evaluating the null hypothesis distribution that appears after the writing styles coincide. Next, samples from different files are selected, and the p-values of the test statistics are calculated. An empirical distribution of these values is compared numerous times with the uniform one on the interval [0, 1], and the writing styles are recognized as different if the rejection fraction in this comparison's sequence is significantly greater than 0.5. The offered approach is language independent in the community of alphabetic languages and does not involve the use of linguistics. In comparison with most existing methods our approach does not deal with any authorship attribute determination. A text itself, more precisely speaking, the distribution of sequential text templates and their mutual occurrences essentially identifies the style. Experiments demonstrate the strong capability of the proposed method. (C) 2016 Elsevier Ltd. All rights reserved.
Original language | English |
---|---|
Pages (from-to) | 145-153 |
Number of pages | 9 |
Journal | Expert Systems with Applications |
Volume | 61 |
DOIs | |
State | Published - 1 Nov 2016 |
Externally published | Yes |
ID: 7568102