Almost every page on the website of a large organization contains markup, headers, footers, and menu items. For information retrieval, these parts of the page carry no semantic significance and can be treated as noise, which can degrade retrieval results. Eliminating this noise is therefore an important pre-processing step for subsequent analysis (clustering, classification, etc.). This paper discusses several methods for removing noise from Web pages belonging to a large collection. The first method uses the Boilerpipe library to detect and remove surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The second method is based on a headless browser. The third method uses HTML5 semantic markup (applicable only to browsers that support HTML5). An experiment assessing the quality and speed of the described methods is presented, and a comparative analysis is carried out.
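The idea behind the third method can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that text inside the HTML5 elements `header`, `footer`, `nav`, and `aside` (plus `script` and `style`) counts as noise, and it uses only Python's standard `html.parser` module. The names `NOISE_TAGS`, `SemanticTextExtractor`, `extract_main_text`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is treated as noise.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class SemanticTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0  # how many noise elements are currently open
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside every noise element.
        if self.noise_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with semantically marked noise removed."""
    parser = SemanticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page used only for illustration.
SAMPLE = """<html><body>
<header><h1>Site name</h1></header>
<nav><a href="/">Home</a></nav>
<main><article><h2>Title</h2><p>Main text.</p></article></main>
<footer>Copyright</footer>
</body></html>"""

print(extract_main_text(SAMPLE))
```

A filter of this kind only works on pages that actually use semantic elements; pages built entirely from `div` blocks would pass through unchanged, which is one reason a comparison with the other two methods is useful.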
Original language: English
Title of host publication: Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016)
Publisher: University of Aizu Press
Publication status: Published - 2016
Event: International Conference on Applications in Information Technology - University of Aizu, Aizu-Wakamatsu, Japan
Duration: 6 Oct 2016 to 8 Oct 2016
Conference number: 2
http://kspt.icc.spbstu.ru/conf/icait-2016/

Conference

Conference: International Conference on Applications in Information Technology
Abbreviated title: ICAIT-2016
Country/Territory: Japan
City: Aizu-Wakamatsu
Period: 6/10/16 to 8/10/16
Website

ID: 7604790