Research output: Contribution to journal › Conference article › peer-review
NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS. / Kolesnikov, Alexey A.; Kikin, Pavel M.; Niko, Giovanni; Komissarova, Elena V.
In: InterCarto, InterGIS, Vol. 26, 2020, p. 375-384.
TY - JOUR
T1 - NATURAL LANGUAGE PROCESSING SYSTEMS FOR DATA EXTRACTION AND MAPPING ON THE BASIS OF UNSTRUCTURED TEXT BLOCKS
AU - Kolesnikov, Alexey A.
AU - Kikin, Pavel M.
AU - Niko, Giovanni
AU - Komissarova, Elena V.
N1 - Publisher Copyright: © 2020 Lomonosov Moscow State University. All rights reserved. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020
Y1 - 2020
AB - Modern natural language processing technologies make it possible to work with texts without specialist training in linguistics. Because linguistic models can be developed and run on widely used data processing platforms, they can be integrated into popular geographic information systems. This significantly expands the functionality and improves the accuracy of standard geocoding functions. The article compares the most popular methods, and the software built on them, on the task of extracting geographical names from plain text. This task is an extended form of geocoding: the result still includes the coordinates of the point features of interest, but the addresses or geographical names do not have to be extracted from the text separately beforehand. In computational linguistics, this problem is solved by named entity recognition (NER) methods. From among the most recent approaches, the authors chose algorithms based on rules, maximum entropy models, and convolutional neural networks. The selected algorithms and methods were evaluated not only on the accuracy with which they find geographical objects in text, but also on how easily their base rules or mathematical models can be refined using the user's own text corpora. Reports on technological violations, accidents, and incidents at facilities of the heat and power complex of the Ministry of Energy of the Russian Federation served as the test data for these methods and software solutions. The article also presents a study of a method for improving the quality of named entity recognition by additionally training a neural network model on a specialized text corpus.
KW - DeepPavlov
KW - Geographical name
KW - Named entity recognition
KW - Natural language processing
KW - SpaCy
UR - http://www.scopus.com/inward/record.url?scp=85093864697&partnerID=8YFLogxK
U2 - 10.35595/2414-9179-2020-1-26-375-384
DO - 10.35595/2414-9179-2020-1-26-375-384
M3 - Conference article
AN - SCOPUS:85093864697
VL - 26
SP - 375
EP - 384
JO - ИНТЕРКАРТО/ИНТЕРГИС
JF - ИНТЕРКАРТО/ИНТЕРГИС
SN - 2414-9179
T2 - 2020 International Conference on GI Support of Sustainable Development of Territories
Y2 - 28 September 2020 through 29 September 2020
ER -
ID: 76310131