An Approach to Improving the Classification of the New York Times Annotated Corpus

Standard

An Approach to Improving the Classification of the New York Times Annotated Corpus. / Mozzherina, E.

Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings. Springer Nature, 2013. p. 83-91.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research

Harvard

Mozzherina, E 2013, An Approach to Improving the Classification of the New York Times Annotated Corpus. in Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings. Springer Nature, pp. 83-91. https://doi.org/10.1007/978-3-642-41360-5_7

APA

Mozzherina, E. (2013). An Approach to Improving the Classification of the New York Times Annotated Corpus. In Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings (pp. 83-91). Springer Nature. https://doi.org/10.1007/978-3-642-41360-5_7

Vancouver

Mozzherina E. An Approach to Improving the Classification of the New York Times Annotated Corpus. In Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings. Springer Nature. 2013. p. 83-91. https://doi.org/10.1007/978-3-642-41360-5_7

Author

Mozzherina, E. / An Approach to Improving the Classification of the New York Times Annotated Corpus. Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings. Springer Nature, 2013. pp. 83-91

BibTeX

@inproceedings{52592a992aef465d8e8d7986d89f1ca5,

title = "An Approach to Improving the Classification of the New York Times Annotated Corpus",

abstract = "The New York Times Annotated Corpus contains over 1.5 million manually tagged articles, and could become a useful source for evaluation of algorithms for document clustering. Since documents were labeled over twenty years, it is argued that classification may contain errors due to possible dissent between experts, and the necessity to add tags over time. This paper presents an approach to improve classification quality by using assigned tags as a starting point. It is assumed that tags can be described by a set of features. These features are selected based on the value of mutual information between the tag and stems from documents with it. An algorithm for reassigning tags in case the document does not contain features of its labels is presented. Experiments were performed on about ninety thousand articles published by the New York Times in 2005. Results of applying the algorithm to the collection are discussed.",

keywords = "Document classification, classification improvement, classification evaluation, mutual information",

author = "E. Mozzherina",

year = "2013",

doi = "10.1007/978-3-642-41360-5_7",

language = "English",

isbn = "9783642413599",

pages = "83--91",

booktitle = "Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings",

publisher = "Springer Nature",

address = "Germany",

}

RIS

TY - CONF

T1 - An Approach to Improving the Classification of the New York Times Annotated Corpus

AU - Mozzherina, E.

PY - 2013

Y1 - 2013

N2 - The New York Times Annotated Corpus contains over 1.5 million manually tagged articles, and could become a useful source for evaluation of algorithms for document clustering. Since documents were labeled over twenty years, it is argued that classification may contain errors due to possible dissent between experts, and the necessity to add tags over time. This paper presents an approach to improve classification quality by using assigned tags as a starting point. It is assumed that tags can be described by a set of features. These features are selected based on the value of mutual information between the tag and stems from documents with it. An algorithm for reassigning tags in case the document does not contain features of its labels is presented. Experiments were performed on about ninety thousand articles published by the New York Times in 2005. Results of applying the algorithm to the collection are discussed.

AB - The New York Times Annotated Corpus contains over 1.5 million manually tagged articles, and could become a useful source for evaluation of algorithms for document clustering. Since documents were labeled over twenty years, it is argued that classification may contain errors due to possible dissent between experts, and the necessity to add tags over time. This paper presents an approach to improve classification quality by using assigned tags as a starting point. It is assumed that tags can be described by a set of features. These features are selected based on the value of mutual information between the tag and stems from documents with it. An algorithm for reassigning tags in case the document does not contain features of its labels is presented. Experiments were performed on about ninety thousand articles published by the New York Times in 2005. Results of applying the algorithm to the collection are discussed.

KW - Document classification

KW - classification improvement

KW - classification evaluation

KW - mutual information

U2 - 10.1007/978-3-642-41360-5_7

DO - 10.1007/978-3-642-41360-5_7

M3 - Conference contribution

SN - 9783642413599

SP - 83

EP - 91

BT - Communications in Computer and Information Science. Vol. 394: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings

PB - Springer Nature

ER -
