Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts

Standard

Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts. / Blinova , O. V ; Tarasov , N. A.; Modina , V. V.; Blekanov , I. S.

Computational Linguistics and Intellectual Technologies : Proceedings of the International Conference “Dialogue 2020”, Moscow, June 17–20, 2020. ed. / В.П. Селей. Vol. 19(26) М., 2020. p. 76-92 (Компьютерная лингвистика и интеллектуальные технологии).

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Research › peer-review

Harvard

Blinova, OV, Tarasov, NA, Modina, VV & Blekanov, IS 2020, Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts. in ВП Селей (ed.), Computational Linguistics and Intellectual Technologies : Proceedings of the International Conference “Dialogue 2020”, Moscow, June 17–20, 2020. vol. 19(26), Компьютерная лингвистика и интеллектуальные технологии, М., pp. 76-92, DIALOGUE-2020: компьютерная лингвистика в формате цифровых дискуссий, Москва, Russian Federation, 17/06/20.

APA

Blinova, O. V., Tarasov, N. A., Modina, V. V., & Blekanov, I. S. (2020). Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts. In В. П. Селей (Ed.), Computational Linguistics and Intellectual Technologies : Proceedings of the International Conference “Dialogue 2020”, Moscow, June 17–20, 2020 (Vol. 19(26), pp. 76-92). (Компьютерная лингвистика и интеллектуальные технологии).

Vancouver

Blinova OV, Tarasov NA, Modina VV, Blekanov IS. Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts. In Селей ВП, editor, Computational Linguistics and Intellectual Technologies : Proceedings of the International Conference “Dialogue 2020”, Moscow, June 17–20, 2020. Vol. 19(26). М. 2020. p. 76-92. (Компьютерная лингвистика и интеллектуальные технологии).

Author

Blinova, O. V.; Tarasov, N. A.; Modina, V. V.; Blekanov, I. S. / Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts. Computational Linguistics and Intellectual Technologies : Proceedings of the International Conference “Dialogue 2020”, Moscow, June 17–20, 2020. editor / В.П. Селей. Vol. 19(26) М., 2020. pp. 76-92 (Компьютерная лингвистика и интеллектуальные технологии).

BibTeX

@inbook{dce4e12b4f0c4c6891c9bcc2f802ed1b,

title = "Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts",

abstract = "The paper addresses the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for building a consolidated frequency list that can later be used to assess the lexical complexity of Russian texts. We compared 4 frequency lists derived from 4 corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, Taiga). First, we applied rank correlation analysis. Second, we used the measures “coverage” and “enrichment”. Third, we applied the measure “sum of minimal frequencies”. We found significant differences between the compared frequency lists in both ranking and relative frequencies. The “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list. For a more detailed comparison of the frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into 4 equal parts. Then 4 random samples (containing 20 lemmas from each quartile) were formed. Because the ipm (instances per million) measure takes a wide range of values, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with given frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf-values”, which makes the frequency data interpretable because the range of the measure values is small. The result of our work will be a publicly accessible reference resource called “Frequentator”, which will provide interpretable information about the frequency of Russian words.",

keywords = "linguistic corpora, lemma frequency lists, general-language frequency, frequency bands, low-frequency words, lexical complexity, Russian",

author = "Blinova, {O. V.} and Tarasov, {N. A.} and Modina, {V. V.} and Blekanov, {I. S.}",

note = "Blinova O. V., Tarasov N. A., Modina V. V., Blekanov I. S. Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts // Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020” (Moscow, June 17–20, 2020). (Komp'juternaja Lingvistika i Intellektual'nye Tehnologii 2020). 19 (26). P. 76-92; Conference date: 17-06-2020 Through 20-06-2020",

year = "2020",

language = "English",

volume = "19(26)",

series = "Компьютерная лингвистика и интеллектуальные технологии",

publisher = "Российский государственный гуманитарный университет",

pages = "76--92",

editor = "В.П. Селей",

booktitle = "Computational Linguistics and Intellectual Technologies",

}

RIS

TY - CHAP

T1 - Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts

AU - Blinova, O. V.

AU - Tarasov, N. A.

AU - Modina, V. V.

AU - Blekanov, I. S.

N1 - Blinova O. V., Tarasov N. A., Modina V. V., Blekanov I. S. Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts // Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020” (Moscow, June 17–20, 2020). (Komp'juternaja Lingvistika i Intellektual'nye Tehnologii 2020). 19 (26). P. 76-92

PY - 2020

Y1 - 2020

N2 - The paper addresses the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for building a consolidated frequency list that can later be used to assess the lexical complexity of Russian texts. We compared 4 frequency lists derived from 4 corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, Taiga). First, we applied rank correlation analysis. Second, we used the measures “coverage” and “enrichment”. Third, we applied the measure “sum of minimal frequencies”. We found significant differences between the compared frequency lists in both ranking and relative frequencies. The “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list. For a more detailed comparison of the frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into 4 equal parts. Then 4 random samples (containing 20 lemmas from each quartile) were formed. Because the ipm (instances per million) measure takes a wide range of values, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with given frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf-values”, which makes the frequency data interpretable because the range of the measure values is small. The result of our work will be a publicly accessible reference resource called “Frequentator”, which will provide interpretable information about the frequency of Russian words.

AB - The paper addresses the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for building a consolidated frequency list that can later be used to assess the lexical complexity of Russian texts. We compared 4 frequency lists derived from 4 corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, Taiga). First, we applied rank correlation analysis. Second, we used the measures “coverage” and “enrichment”. Third, we applied the measure “sum of minimal frequencies”. We found significant differences between the compared frequency lists in both ranking and relative frequencies. The “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list. For a more detailed comparison of the frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into 4 equal parts. Then 4 random samples (containing 20 lemmas from each quartile) were formed. Because the ipm (instances per million) measure takes a wide range of values, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with given frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf-values”, which makes the frequency data interpretable because the range of the measure values is small. The result of our work will be a publicly accessible reference resource called “Frequentator”, which will provide interpretable information about the frequency of Russian words.

KW - linguistic corpora

KW - lemma frequency lists

KW - general-language frequency

KW - frequency bands

KW - low-frequency words

KW - lexical complexity

KW - Russian

UR - http://www.dialog-21.ru/media/5074/blinovaovplusetal-137.pdf

UR - http://www.dialog-21.ru/en/digest/2020/articles/

M3 - Chapter

VL - 19(26)

T3 - Компьютерная лингвистика и интеллектуальные технологии

SP - 76

EP - 92

BT - Computational Linguistics and Intellectual Technologies

A2 - Селей, В.П.

CY - М.

Y2 - 17 June 2020 through 20 June 2020

ER -

ID: 61380005