Almost every page on the website of a large organization contains a variety of markup, headers, footers and menu items. From an information retrieval perspective, these parts of the page carry no semantic significance and can be treated as noise. Furthermore, such noise can degrade information retrieval results. Eliminating it is therefore an important pre-processing step for subsequent analysis (clustering, classification, etc.). This paper discusses several methods for removing noise from Web pages that belong to a large collection. The first method uses the Boilerpipe library to detect and remove surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The second method is based on a headless browser. The third method relies on HTML5 semantic markup (applicable only for browsers that support HTML5). An experiment assessing the quality and speed of the described methods is presented, and a comparative analysis is carried out.
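To illustrate the idea behind the third method, the following is a minimal sketch, not the authors' implementation. It assumes that the main content sits inside `<main>` or `<article>` elements and that `<header>`, `<footer>`, `<nav>` and `<aside>` mark noise; the tag sets, the class name `SemanticExtractor` and the function `extract_main_text` are hypothetical choices made for this example. It uses only the Python standard library rather than a browser.

```python
from html.parser import HTMLParser

# Assumption: which HTML5 elements count as noise and which as main content.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
CONTENT_TAGS = {"main", "article"}


class SemanticExtractor(HTMLParser):
    """Collects text found inside content elements, skipping noise elements."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # nesting depth inside <main>/<article>
        self.noise_depth = 0    # nesting depth inside noise elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in CONTENT_TAGS:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside content and outside any noise element.
        if text and self.content_depth > 0 and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of the semantic main content, joined by spaces."""
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<html><body>
<header><h1>Site name</h1></header>
<nav><a href="/">Home</a></nav>
<main><article><h2>Title</h2><p>Body text.</p>
<aside>Related links</aside></article></main>
<footer>Copyright</footer>
</body></html>
"""
print(extract_main_text(sample))  # prints "Title Body text."
```

A page without any `<main>` or `<article>` element yields an empty string under this sketch, which reflects the limitation that the approach depends on pages actually using HTML5 semantic markup.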
Original language: English
Title of host publication: Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016)
Publisher: University of Aizu Press
State: Published - 2016
Event: International Conference on Applications in Information Technology - University of Aizu, Aizu-Wakamatsu, Japan
Duration: 6 Oct 2016 to 8 Oct 2016
Conference number: 2
http://kspt.icc.spbstu.ru/conf/icait-2016/

Conference

Conference: International Conference on Applications in Information Technology
Abbreviated title: ICAIT-2016
Country/Territory: Japan
City: Aizu-Wakamatsu
Period: 6/10/16 to 8/10/16

ID: 7604790