Standard

Noise Removal Methods from Web Pages. / Korelin, Vasilii; Blekanov, Ivan.

Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press, 2016.

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Harvard

Korelin, V & Blekanov, I 2016, Noise Removal Methods from Web Pages. in Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press, International Conference on Applications in Information Technology, Aizu-Wakamatsu, Japan, 6 October 2016. <http://kspt.icc.spbstu.ru/conf/icait-2016/>

APA

Korelin, V., & Blekanov, I. (2016). Noise Removal Methods from Web Pages. In Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press. http://kspt.icc.spbstu.ru/conf/icait-2016/

Vancouver

Korelin V, Blekanov I. Noise Removal Methods from Web Pages. In Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press. 2016

Author

Korelin, Vasilii ; Blekanov, Ivan. / Noise Removal Methods from Web Pages. Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016). University of Aizu Press, 2016.

BibTeX

@inproceedings{8a69ce57b7ac40b598fb5a9a89d9e9e2,
title = "Noise Removal Methods from Web Pages",
abstract = "Almost all the pages of websites of large organizations have a variety of markups, headers, footers and menu items. In terms of information retrieval, this part of the page is not semantically significant for it and can be considered as noise. Furthermore, noise can negatively affect information retrieval results. Therefore, eliminating noisy information is an important step in pre-processing for subsequent analysis (clustering, classification, etc.). This paper discusses several methods of noise removal from Web pages belonging to a large collection. The first method is based on the use of Boilerpipe library to detect and remove surplus {"}clutter{"} (boilerplate, templates) around the main textual content of a web page. The second method is based on a headless browser. The third method involves the use of HTML5 semantic markup (only applicable for browsers that support HTML5). An experiment to assess the quality and performance speed of the methods described is presented. Comparative analysis is carried out.",
author = "Vasilii Korelin and Ivan Blekanov",
year = "2016",
language = "English",
booktitle = "Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016)",
publisher = "University of Aizu Press",
address = "Japan",
note = "International Conference on Applications in Information Technology, ICAIT-2016 ; Conference date: 06-10-2016 Through 08-10-2016",
url = "http://kspt.icc.spbstu.ru/conf/icait-2016/",

}

RIS

TY - GEN

T1 - Noise Removal Methods from Web Pages

AU - Korelin, Vasilii

AU - Blekanov, Ivan

N1 - Conference code: 2

PY - 2016

Y1 - 2016

AB - Almost all the pages of websites of large organizations have a variety of markups, headers, footers and menu items. In terms of information retrieval, this part of the page is not semantically significant for it and can be considered as noise. Furthermore, noise can negatively affect information retrieval results. Therefore, eliminating noisy information is an important step in pre-processing for subsequent analysis (clustering, classification, etc.). This paper discusses several methods of noise removal from Web pages belonging to a large collection. The first method is based on the use of Boilerpipe library to detect and remove surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The second method is based on a headless browser. The third method involves the use of HTML5 semantic markup (only applicable for browsers that support HTML5). An experiment to assess the quality and performance speed of the methods described is presented. Comparative analysis is carried out.

M3 - Conference contribution

BT - Proceedings of the 2nd International Conference on Applications in Information Technology (ICAIT-2016)

PB - University of Aizu Press

T2 - International Conference on Applications in Information Technology

Y2 - 6 October 2016 through 8 October 2016

ER -

ID: 7604790