Validity of Lingvo-Statistical Parameters for the Corpus of Fiction

Standard

Validity of Lingvo-Statistical Parameters for the Corpus of Fiction. / Гребенников, Александр Олегович ; Корышев, Михаил Витальевич ; Иванова, Екатерина Павловна ; Скребцова, Татьяна Георгиевна.

Literature, Language and Computing: Russian Contribution from the LiLaC-2023. Singapore : Springer Nature, 2025. p. 15-21.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Гребенников, АО, Корышев, МВ, Иванова, ЕП & Скребцова, ТГ 2025, Validity of Lingvo-Statistical Parameters for the Corpus of Fiction. in Literature, Language and Computing: Russian Contribution from the LiLaC-2023. Springer Nature, Singapore, pp. 15-21, International Conference "Literature, Language and Computer Technologies" (LiLaC: Literature, Language and Computing: Russian Contribution), Saint Petersburg, Russian Federation, 9/11/23. https://doi.org/10.1007/978-981-96-0990-1_2

BibTeX

@inproceedings{3c19f484b031460fafbe21aaeede8468,

title = "Validity of Lingvo-Statistical Parameters for the Corpus of Fiction",

abstract = "Finding variables and statistical metrics to describe rank distributions of lexemes is a relevant linguistic task. We analyze the validity of lingvo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. The comparative use of the Weibull and Haustein functions as approximating ones for the values of the parameters in question is also investigated. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930) (total volume is more than 1,000,000 tokens). The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is proved to be the preferable one for the approximation of the parameters{\textquoteright} growth.",

keywords = "Authors{\textquoteright} lexicography, Corpus, Frequency dictionary, Statistical modeling, Stylometry",

author = "Гребенников, {Александр Олегович} and Корышев, {Михаил Витальевич} and Иванова, {Екатерина Павловна} and Скребцова, {Татьяна Георгиевна}",

year = "2025",

month = mar,

doi = "10.1007/978-981-96-0990-1_2",

language = "English",

isbn = "978-981-96-0989-5",

pages = "15--21",

booktitle = "Literature, Language and Computing",

publisher = "Springer Nature",

address = "Singapore",

note = "Conference date: 09-11-2023 Through 11-11-2023",

url = "https://conference-spbu.ru/conference/49/",

}

RIS

TY - GEN

T1 - Validity of Lingvo-Statistical Parameters for the Corpus of Fiction

AU - Гребенников, Александр Олегович

AU - Корышев, Михаил Витальевич

AU - Иванова, Екатерина Павловна

AU - Скребцова, Татьяна Георгиевна

PY - 2025/3

Y1 - 2025/3

N2 - Finding variables and statistical metrics to describe rank distributions of lexemes is a relevant linguistic task. We analyze the validity of lingvo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. The comparative use of the Weibull and Haustein functions as approximating ones for the values of the parameters in question is also investigated. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930) (total volume is more than 1,000,000 tokens). The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is proved to be the preferable one for the approximation of the parameters’ growth.

AB - Finding variables and statistical metrics to describe rank distributions of lexemes is a relevant linguistic task. We analyze the validity of lingvo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. The comparative use of the Weibull and Haustein functions as approximating ones for the values of the parameters in question is also investigated. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930) (total volume is more than 1,000,000 tokens). The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is proved to be the preferable one for the approximation of the parameters’ growth.

KW - Authors’ lexicography

KW - Corpus

KW - Frequency dictionary

KW - Statistical modeling

KW - Stylometry

UR - https://www.mendeley.com/catalogue/4a314456-046d-32c5-a287-9aee46584cd6/

U2 - 10.1007/978-981-96-0990-1_2

DO - 10.1007/978-981-96-0990-1_2

M3 - Conference contribution

SN - 978-981-96-0989-5

SP - 15

EP - 21

BT - Literature, Language and Computing

PB - Springer Nature

CY - Singapore

Y2 - 9 November 2023 through 11 November 2023

ER -

ID: 133399730