NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS

Standard

NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS. / Kolesnikov, Alexey A.; Kikin, Pavel M.; Niko, Giovanni; Komissarova, Elena V.

In: InterCarto, InterGIS, Vol. 26, 2020, p. 375-384.

Research output: Contribution to journal › Conference article › peer-review

BibTeX

@article{abb904e834c64c0b873675ec00a7ca0f,

title = "NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS",

abstract = "Modern natural language processing technologies make it possible to work with texts without being a specialist in linguistics. Because linguistic models are developed and run on popular data processing platforms, they can be integrated into popular geographic information systems. This makes it possible to significantly expand the functionality and improve the accuracy of standard geocoding functions. The article compares the most popular methods, and the software built on them, using the task of extracting geographical names from plain text as an example. This task is an extended version of the geocoding operation: the result also includes the coordinates of the point features of interest, but the addresses or geographical names of the objects do not need to be extracted from the text separately in advance. In computational linguistics, this problem is solved by named entity recognition methods. Among the most modern approaches, the authors selected rule-based algorithms, maximum entropy models and convolutional neural networks for implementation. The selected algorithms and methods were evaluated not only by the accuracy with which they find geographical objects in the text, but also by how easily their basic rules or mathematical models can be refined using custom text corpora. Reports on technological violations, accidents and incidents at facilities of the heat and power complex of the Ministry of Energy of the Russian Federation were used as the source data for testing these methods and software solutions. The article also presents a study of a method for improving the quality of named entity recognition by additionally training a neural network model on a specialized text corpus.",

keywords = "DeepPavlov, Geographical name, Named entity recognition, Natural language processing, SpaCy",

author = "Kolesnikov, {Alexey A.} and Kikin, {Pavel M.} and Niko, Giovanni and Komissarova, {Elena V.}",

note = "Publisher Copyright: {\textcopyright} 2020 Lomonosov Moscow State University. All rights reserved. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.; 2020 International Conference on GI Support of Sustainable Development of Territories ; Conference date: 28-09-2020 Through 29-09-2020",

year = "2020",

doi = "10.35595/2414-9179-2020-1-26-375-384",

language = "Russian",

volume = "26",

pages = "375--384",

journal = "ИНТЕРКАРТО/ИНТЕРГИС",

issn = "2414-9179",

publisher = "Тикунов Владимир Сергеевич",

}

RIS

TY - JOUR

T1 - NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS

AU - Kolesnikov, Alexey A.

AU - Kikin, Pavel M.

AU - Niko, Giovanni

AU - Komissarova, Elena V.

PY - 2020

Y1 - 2020

AB - Modern natural language processing technologies make it possible to work with texts without being a specialist in linguistics. Because linguistic models are developed and run on popular data processing platforms, they can be integrated into popular geographic information systems. This makes it possible to significantly expand the functionality and improve the accuracy of standard geocoding functions. The article compares the most popular methods, and the software built on them, using the task of extracting geographical names from plain text as an example. This task is an extended version of the geocoding operation: the result also includes the coordinates of the point features of interest, but the addresses or geographical names of the objects do not need to be extracted from the text separately in advance. In computational linguistics, this problem is solved by named entity recognition methods. Among the most modern approaches, the authors selected rule-based algorithms, maximum entropy models and convolutional neural networks for implementation. The selected algorithms and methods were evaluated not only by the accuracy with which they find geographical objects in the text, but also by how easily their basic rules or mathematical models can be refined using custom text corpora. Reports on technological violations, accidents and incidents at facilities of the heat and power complex of the Ministry of Energy of the Russian Federation were used as the source data for testing these methods and software solutions. The article also presents a study of a method for improving the quality of named entity recognition by additionally training a neural network model on a specialized text corpus.

KW - DeepPavlov

KW - Geographical name

KW - Named entity recognition

KW - Natural language processing

KW - SpaCy

UR - http://www.scopus.com/inward/record.url?scp=85093864697&partnerID=8YFLogxK

U2 - 10.35595/2414-9179-2020-1-26-375-384

DO - 10.35595/2414-9179-2020-1-26-375-384

M3 - Conference article

AN - SCOPUS:85093864697

VL - 26

SP - 375

EP - 384

JO - ИНТЕРКАРТО/ИНТЕРГИС

JF - ИНТЕРКАРТО/ИНТЕРГИС

SN - 2414-9179

T2 - 2020 International Conference on GI Support of Sustainable Development of Territories

Y2 - 28 September 2020 through 29 September 2020

ER -