Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Data Encoding for Social Media: Comparing Twitter, Reddit, and Telegram. / Blekanov, Ivan Stanislavovich; Tarasov, Nikita Andreevich; Nepiyushchikh, Dmitry Viktorovich; Bodrunova, Svetlana Sergeevna.
Networks in the Global World VI. NetGloW 2022. Springer Nature, 2023. p. 114–122 (Lecture Notes in Networks and Systems; Vol. 663).
TY - GEN
T1 - Data Encoding for Social Media: Comparing Twitter, Reddit, and Telegram
AU - Blekanov, Ivan Stanislavovich
AU - Tarasov, Nikita Andreevich
AU - Nepiyushchikh, Dmitry Viktorovich
AU - Bodrunova, Svetlana Sergeevna
PY - 2023
Y1 - 2023
N2 - Social networking platforms have become a major source of data for most textual machine learning models. Applications of encodings in earlier language models, as well as advances in model reuse, have opened new possibilities for case studies with limited or unsupervised data. In this paper, the authors test whether the semantic similarity of large-scale data from three platforms allows for applying the same transfer-learning language models to data from various social media. For this, the authors perform a comparative case study to outline linguistic differences and measure the similarity of deep neural encodings for the case data. In particular, semantic similarity is evaluated using traditional text similarity metrics, structural metrics of the corpora, and RUBERT encodings that provide general semantic characteristics of the text data in the three datasets. We show that, by both linguistic metrics and semantic encodings, the platforms are semantically similar enough for transfer learning models to be applied. We also demonstrate, however, that, despite the difference in average text length, Twitter is more similar to Reddit than to Telegram by linguistic metrics, which hints at the idea of ‘platformization’ of social media speech. We conclude by stating the speech factors that may lead to platform dissimilarity.
AB - Social networking platforms have become a major source of data for most textual machine learning models. Applications of encodings in earlier language models, as well as advances in model reuse, have opened new possibilities for case studies with limited or unsupervised data. In this paper, the authors test whether the semantic similarity of large-scale data from three platforms allows for applying the same transfer-learning language models to data from various social media. For this, the authors perform a comparative case study to outline linguistic differences and measure the similarity of deep neural encodings for the case data. In particular, semantic similarity is evaluated using traditional text similarity metrics, structural metrics of the corpora, and RUBERT encodings that provide general semantic characteristics of the text data in the three datasets. We show that, by both linguistic metrics and semantic encodings, the platforms are semantically similar enough for transfer learning models to be applied. We also demonstrate, however, that, despite the difference in average text length, Twitter is more similar to Reddit than to Telegram by linguistic metrics, which hints at the idea of ‘platformization’ of social media speech. We conclude by stating the speech factors that may lead to platform dissimilarity.
KW - Linguistic metrics
KW - RUBERT
KW - Reddit
KW - Semantic neural encodings
KW - Social network analysis
KW - Telegram
KW - Text similarity assessment
KW - Twitter
UR - https://link.springer.com/chapter/10.1007/978-3-031-29408-2_8
UR - https://www.mendeley.com/catalogue/bd283c29-4eec-39a0-b826-95ac142c586b/
U2 - 10.1007/978-3-031-29408-2_8
DO - 10.1007/978-3-031-29408-2_8
M3 - Conference contribution
SN - 978-3-031-29407-5
T3 - Lecture Notes in Networks and Systems
SP - 114
EP - 122
BT - Networks in the Global World VI. NetGloW 2022
PB - Springer Nature
Y2 - 22 June 2022 through 24 June 2022
ER -
ID: 110777150
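
To make the evaluation described in the abstract more concrete, the following is a minimal sketch, not the authors' code, of comparing platform texts via RUBERT encodings. It assumes the DeepPavlov/rubert-base-cased checkpoint from Hugging Face, mean pooling over token embeddings, and cosine similarity; the embed helper and the sample texts are hypothetical illustrations.

# Minimal sketch (not the authors' code): RUBERT embeddings + cosine similarity.
# Assumes the DeepPavlov/rubert-base-cased checkpoint on Hugging Face.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")
model.eval()

def embed(texts):
    # Tokenize a batch of texts and mean-pool the last hidden states,
    # masking out padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)                # ignore padding
    return summed / mask.sum(dim=1)                    # mean over real tokens

# Hypothetical example: one short post per platform.
emb = embed(["tweet text here", "reddit comment here", "telegram post here"])
sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")

In practice, such pairwise similarities would be aggregated over large samples from each platform, since the study compares corpora rather than individual posts.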