Automatic recognition of messages from virtual communities of drug addicts

Standard

Automatic recognition of messages from virtual communities of drug addicts. / Фирсанова, Виктория Игоревна.

In: Journal of applied linguistics and lexicography, Vol. 2, No. 1, 25.12.2020, p. 16.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{5977b1f485e342bd88627180fd083e3e,

title = "Automatic recognition of messages from virtual communities of drug addicts",

abstract = "The paper describes building a binary classifier with a Convolutional Neural Network (CNN) using two different types of word vector representations, Bag-of-Words and Word Embeddings. The purpose of the classifier is to recognise messages published in virtual communities of drug-addicted people. This system may find application in healthcare as a tool for automatic identification of addicts{\textquoteright} communities. It may also provide insights into the features of addicts{\textquoteright} online discourse. The classifier is based on a dataset from Russian-speaking online VK (VKontakte) communities. The dataset comprises texts of publications and comments posted in two types of open communities. The first type includes communities which actively discuss problems of addiction to psychotropic and psychoactive substances. The second type of community focuses on the discussion of private issues: the users share their life stories and ask for help or advice. In the latter case, publications are not related to drug addiction issues. The experiments centred around the development, evaluation and comparative analysis of two models, based on Bag-of-Words and Word Embeddings, respectively. The neural network training was implemented with the Tesla T4 graphics processing unit on the Google Colab platform. The model with the best performance showed 0.99 F1-Score and 0.95 Accuracy; however, after testing the programme, a few weaknesses were found. The programme still requires retraining on a supplemented dataset which includes publications collected from both addicts{\textquoteright} and non-addicts{\textquoteright} communities describing various mental conditions including depression, anxiety and nervous disorders. This opens up an opportunity to create software that can automatically distinguish publications made by people struggling with depression caused by the use of psychoactive substances from publications made by people suffering from depressive disorders of a different kind.",

keywords = "text classification, Word Embeddings, Bag-of-Words, Convolutional Neural Networks, supervised learning, text categorisation, neural networks, one-hot encoding, classification algorithm",

author = "Фирсанова, {Виктория Игоревна}",

year = "2020",

month = dec,

day = "25",

doi = "10.33910/2687-0215-2020-2-1-16-27",

language = "English",

volume = "2",

pages = "16",

journal = "Journal of applied linguistics and lexicography",

issn = "2687-0215",

publisher = "Издательство РГПУ им. А.И. Герцена",

number = "1",

}

RIS

TY - JOUR

T1 - Automatic recognition of messages from virtual communities of drug addicts

AU - Фирсанова, Виктория Игоревна

PY - 2020/12/25

Y1 - 2020/12/25

N2 - The paper describes building a binary classifier with a Convolutional Neural Network (CNN) using two different types of word vector representations, Bag-of-Words and Word Embeddings. The purpose of the classifier is to recognise messages published in virtual communities of drug-addicted people. This system may find application in healthcare as a tool for automatic identification of addicts’ communities. It may also provide insights into the features of addicts’ online discourse. The classifier is based on a dataset from Russian-speaking online VK (VKontakte) communities. The dataset comprises texts of publications and comments posted in two types of open communities. The first type includes communities which actively discuss problems of addiction to psychotropic and psychoactive substances. The second type of community focuses on the discussion of private issues: the users share their life stories and ask for help or advice. In the latter case, publications are not related to drug addiction issues. The experiments centred around the development, evaluation and comparative analysis of two models, based on Bag-of-Words and Word Embeddings, respectively. The neural network training was implemented with the Tesla T4 graphics processing unit on the Google Colab platform. The model with the best performance showed 0.99 F1-Score and 0.95 Accuracy; however, after testing the programme, a few weaknesses were found. The programme still requires retraining on a supplemented dataset which includes publications collected from both addicts’ and non-addicts’ communities describing various mental conditions including depression, anxiety and nervous disorders. This opens up an opportunity to create software that can automatically distinguish publications made by people struggling with depression caused by the use of psychoactive substances from publications made by people suffering from depressive disorders of a different kind.

KW - text classification

KW - Word Embeddings

KW - Bag-of-Words

KW - Convolutional Neural Networks

KW - supervised learning

KW - text categorisation

KW - neural networks

KW - one-hot encoding

KW - classification algorithm

U2 - 10.33910/2687-0215-2020-2-1-16-27

DO - 10.33910/2687-0215-2020-2-1-16-27

M3 - Article

VL - 2

SP - 16

JO - Journal of applied linguistics and lexicography

JF - Journal of applied linguistics and lexicography

SN - 2687-0215

IS - 1

ER -
