Finding variables and statistical metrics that describe the rank distributions of lexemes is a relevant linguistic task. We analyze the validity of linguo-statistical parameters (the rank mean and entropy) for describing a frequency dictionary of fiction. We also compare the Weibull and Haustein functions as approximations of the values of these parameters. The research draws on a representative sample from the Corpus of the Russian Short Stories (1900–1930), with a total volume of more than 1,000,000 tokens. The rank mean is shown to be only a relatively valid parameter for describing a large-scale corpus of fiction, while the relative validity of entropy is greatly affected by the nature of the texts analyzed. The Weibull function is shown to be preferable for approximating the parameters’ growth.
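As an illustration of the two parameters named in the abstract, the following minimal Python sketch computes a rank mean and an entropy from a toy token list. The definitions used here are assumptions, since the abstract does not state them: the rank mean is taken as the expected rank (the sum of r · p_r over ranks r starting at 1) and the entropy as Shannon entropy in bits. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def rank_distribution(tokens):
    """Return relative frequencies ordered by rank (most frequent first)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [count / total for _, count in counts.most_common()]


def rank_mean(probs):
    # Assumed definition: expected rank, with ranks starting at 1.
    return sum(rank * p for rank, p in enumerate(probs, start=1))


def entropy(probs):
    # Assumed definition: Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Toy "frequency dictionary": four lexemes, eight tokens.
tokens = "a a a a b b c d".split()
probs = rank_distribution(tokens)
print(rank_mean(probs), entropy(probs))
```

Recomputing both values on samples of increasing size would give the growth curves to which an approximating function such as the Weibull function could then be fitted; that fitting step is not sketched here.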
Original language: English
Title of host publication: Literature, Language and Computing
Subtitle of host publication: Russian Contribution from the LiLaC-2023
Place of Publication: Singapore
Publisher: Springer Nature
Pages: 15-21
Number of pages: 7
ISBN (Electronic): 978-981-96-0990-1
ISBN (Print): 978-981-96-0989-5
State: Published - Mar 2025
Event: International Conference «Literature, Language and Computer Technologies» (LiLaC: Literature, Language and Computing: Russian Contribution)
- Saint Petersburg State University (СПбГУ), Saint Petersburg, Russian Federation
Duration: 9 Nov 2023 – 11 Nov 2023
https://conference-spbu.ru/conference/49/

Conference

Conference: International Conference «Literature, Language and Computer Technologies» (LiLaC: Literature, Language and Computing: Russian Contribution)
Abbreviated title: LiLaC 2023
Country/Territory: Russian Federation
City: Saint Petersburg
Period: 9/11/23 – 11/11/23

    Research areas

  • Authors’ lexicography, Corpus, Frequency dictionary, Statistical modeling, Stylometry

ID: 133399730