The paper describes an experimental study of automatic keyphrase extraction techniques that relies on expert assessments. The study aims to confirm hypotheses about where keyphrases are located within a document and about how keyphrases differ across the algorithms applied and across text styles. Keyphrases are extracted automatically with nine algorithms of various types: statistical (Log-Likelihood, TF-IDF, Chi-square), hybrid linguistic-statistical (RAKE, YAKE, PullEnti, Topia), structural graph-based (TextRank), and machine learning (KeyBERT). In the course of the study, a mixed corpus of about 1 million tokens was prepared, comprising 50 social media texts (news reports with headlines), 50 scientific texts (articles on computational linguistics with headings, abstracts, and manually specified sets of key expressions), and 50 literary texts (chapters from prose works, accompanied by the author's description of the content). The evaluation procedure compares keyphrases selected by experts from the first segment of the texts with key expressions automatically extracted from the second segment. A quantitative assessment of the matches between expert and automatic markup confirmed the hypothesis that keyphrases are concentrated differently in the text segments being compared. A study of the lexico-grammatical and semantic features of keyphrases revealed features that are determined by text style. The results of the study may improve semantic compression procedures that rely on automatic keyphrase extraction methods.
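The comparison of expert and automatic markup described above can be illustrated with a minimal sketch. The abstract does not state which matching rule or metric the authors used, so everything here is an assumption made for illustration: exact matching after simple normalization, precision/recall/F1 as the scores, and the hypothetical names `normalize` and `match_scores`. The sample phrases are invented placeholders, not data from the study.

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivial spelling variants match."""
    return " ".join(phrase.lower().split())


def match_scores(expert: list[str], automatic: list[str]) -> dict[str, float]:
    """Score automatic keyphrases against expert keyphrases by exact match.

    Assumption: a match is an identical phrase after normalization.
    The study itself may have used a different matching criterion.
    """
    expert_set = {normalize(p) for p in expert}
    auto_set = {normalize(p) for p in automatic}
    matched = expert_set & auto_set

    # Share of automatic phrases that the experts also selected.
    precision = len(matched) / len(auto_set) if auto_set else 0.0
    # Share of expert phrases that the algorithm recovered.
    recall = len(matched) / len(expert_set) if expert_set else 0.0
    # Harmonic mean of the two; zero when nothing matched.
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented placeholder phrases, not taken from the corpus.
expert = ["keyphrase extraction", "text corpus", "expert annotation"]
automatic = ["Keyphrase  extraction", "text corpus",
             "functional styles", "semantic compression"]

scores = match_scores(expert, automatic)
print(scores)
```

Running such a comparison separately for each algorithm and each text style would give one way to see the kind of differences the study reports, although the authors' actual procedure may differ.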
Translated title of the contribution: EXPERIMENTS ON AUTOMATIC KEY EXPRESSION EXTRACTION IN STYLISTICALLY HETEROGENEOUS CORPUS OF RUSSIAN TEXTS
Original language: Russian
Number of pages: 31
Journal: Общество. Коммуникация. Образование
Volume: 13
Issue number: 4
State: Accepted/In press - 2022

    Research areas

  • semantic compression, automatic keyphrase extraction, expert annotation, text corpus, functional styles

ID: 100333101