Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review
Validity of Lingvo-Statistical Parameters for the Corpus of Fiction. / Гребенников, Александр Олегович; Корышев, Михаил Витальевич; Иванова, Екатерина Павловна; Скребцова, Татьяна Георгиевна.
Literature, Language and Computing: Russian Contribution from the LiLaC-2023. Singapore : Springer Nature, 2025. p. 15-21.
TY - GEN
T1 - Validity of Lingvo-Statistical Parameters for the Corpus of Fiction
AU - Гребенников, Александр Олегович
AU - Корышев, Михаил Витальевич
AU - Иванова, Екатерина Павловна
AU - Скребцова, Татьяна Георгиевна
PY - 2025/3
Y1 - 2025/3
N2 - Finding variables and statistical metrics to describe rank distributions of lexemes is a relevant linguistic task. We analyze the validity of lingvo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. The comparative use of the Weibull and Haustein functions as approximating ones for the values of the parameters in question is also investigated. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930) (total volume is more than 1,000,000 tokens). The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is proved to be the preferable one for the approximation of the parameters’ growth.
AB - Finding variables and statistical metrics to describe rank distributions of lexemes is a relevant linguistic task. We analyze the validity of lingvo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. The comparative use of the Weibull and Haustein functions as approximating ones for the values of the parameters in question is also investigated. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930) (total volume is more than 1,000,000 tokens). The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is proved to be the preferable one for the approximation of the parameters’ growth.
KW - Authors’ lexicography
KW - Corpus
KW - Frequency dictionary
KW - Statistical modeling
KW - Stylometry
UR - https://www.mendeley.com/catalogue/4a314456-046d-32c5-a287-9aee46584cd6/
U2 - 10.1007/978-981-96-0990-1_2
DO - 10.1007/978-981-96-0990-1_2
M3 - Conference contribution
SN - 978-981-96-0989-5
SP - 15
EP - 21
BT - Literature, Language and Computing
PB - Springer Nature
CY - Singapore
Y2 - 9 November 2023 through 11 November 2023
ER -
ID: 133399730