An opinion word lexicon and a training dataset for Russian sentiment analysis of social media

Standard

An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. / Koltsova, O. Yu; Alexeeva, S. V.; Kolcov, S. N.

In: Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Vol. 2016, 01.01.2016, pp. 277-287.

Research output: Contribution to journal › Conference article › peer-review

Harvard

Koltsova, OY, Alexeeva, SV & Kolcov, SN 2016, 'An opinion word lexicon and a training dataset for Russian sentiment analysis of social media', Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, vol. 2016, pp. 277-287.

APA

Koltsova, O. Y., Alexeeva, S. V., & Kolcov, S. N. (2016). An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, 2016, 277-287.

Vancouver

Koltsova OY, Alexeeva SV, Kolcov SN. An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii. 2016 Jan 1;2016:277-287.

Author

Koltsova, O. Yu ; Alexeeva, S. V. ; Kolcov, S. N. / An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. In: Komp'juternaja Lingvistika i Intellektual'nye Tehnologii. 2016 ; Vol. 2016. pp. 277-287.

BibTeX

@article{0238e50d47d4478abda2436d7e97a273,

title = "An opinion word lexicon and a training dataset for Russian sentiment analysis of social media",

abstract = "Automatic assessment of sentiment in large text corpora is an important goal in social sciences. This paper describes a methodology and the results of the development of a system for Russian-language sentiment analysis that includes a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed based on other dictionaries and on topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts where the word occurred, while the texts received their sentiment scores from the same volunteers. Both texts and words were scored from -2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 received non-neutral sentiment scores. The quality of the lexicon was assessed with SentiStrength software by comparing human text scores with the scores obtained automatically based on the created lexicon. 93\% of texts were classified correctly at the error level of ±1 class, which closely matches the result of SentiStrength's initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.",

keywords = "Crowdsourcing sentiment markup, Livejournal, Russian blogosphere, Sentiment lexicon, Test collection, Topic modeling, Web interface",

author = "Koltsova, {O. Yu} and Alexeeva, {S. V.} and Kolcov, {S. N.}",

year = "2016",

month = jan,

day = "1",

language = "English",

volume = "2016",

pages = "277--287",

journal = "Компьютерная лингвистика и интеллектуальные технологии",

issn = "2221-7932",

publisher = "Российский государственный гуманитарный университет",

}

RIS

TY - JOUR

T1 - An opinion word lexicon and a training dataset for Russian sentiment analysis of social media

AU - Koltsova, O. Yu

AU - Alexeeva, S. V.

AU - Kolcov, S. N.

PY - 2016/1/1

Y1 - 2016/1/1

N2 - Automatic assessment of sentiment in large text corpora is an important goal in social sciences. This paper describes a methodology and the results of the development of a system for Russian-language sentiment analysis that includes a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed based on other dictionaries and on topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts where the word occurred, while the texts received their sentiment scores from the same volunteers. Both texts and words were scored from -2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 received non-neutral sentiment scores. The quality of the lexicon was assessed with SentiStrength software by comparing human text scores with the scores obtained automatically based on the created lexicon. 93% of texts were classified correctly at the error level of ±1 class, which closely matches the result of SentiStrength's initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.

AB - Automatic assessment of sentiment in large text corpora is an important goal in social sciences. This paper describes a methodology and the results of the development of a system for Russian-language sentiment analysis that includes a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed based on other dictionaries and on topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts where the word occurred, while the texts received their sentiment scores from the same volunteers. Both texts and words were scored from -2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 received non-neutral sentiment scores. The quality of the lexicon was assessed with SentiStrength software by comparing human text scores with the scores obtained automatically based on the created lexicon. 93% of texts were classified correctly at the error level of ±1 class, which closely matches the result of SentiStrength's initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.

KW - Crowdsourcing sentiment markup

KW - Livejournal

KW - Russian blogosphere

KW - Sentiment lexicon

KW - Test collection

KW - Topic modeling

KW - Web interface

UR - http://www.scopus.com/inward/record.url?scp=85020375816&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85020375816

VL - 2016

SP - 277

EP - 287

JO - Компьютерная лингвистика и интеллектуальные технологии

JF - Компьютерная лингвистика и интеллектуальные технологии

SN - 2221-7932

ER -

ID: 104815086