Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review
Noise Removal Methods from Web Pages. / Korelin, Vasilii; Blekanov, Ivan.
Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press, 2016.
TY - GEN
T1 - Noise Removal Methods from Web Pages
AU - Korelin, Vasilii
AU - Blekanov, Ivan
N1 - Conference code: 2
PY - 2016
Y1 - 2016
AB - Almost all pages on the websites of large organizations contain a variety of markup, headers, footers and menu items. From an information retrieval standpoint, these parts of a page are not semantically significant and can be considered noise. Furthermore, noise can negatively affect information retrieval results. Eliminating it is therefore an important pre-processing step for subsequent analysis (clustering, classification, etc.). This paper discusses several methods of removing noise from Web pages that belong to a large collection. The first method uses the Boilerpipe library to detect and remove surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The second method is based on a headless browser. The third method uses HTML5 semantic markup (applicable only to browsers that support HTML5). An experiment assessing the quality and speed of the described methods is presented, along with a comparative analysis.
M3 - Conference contribution
BT - Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016)
PB - University of Aizu Press
T2 - International Conference on Applications in Information Technology
Y2 - 6 October 2016 through 8 October 2016
ER -
ID: 7604790