This paper presents a topic-modelling method for identifying texts with propagandistic content. The method attempts to incorporate the transfer-learning idea of obtaining an effective vector representation from a large unlabelled or (semi-)automatically labelled dataset, while minimizing the amount of manual expert labelling required by introducing high-level labelling (either manual or automatic) of some explicit document property. The proposed method includes four key stages: forming a partition of the corpus; computing a topic model of the united corpus; calculating corpora imbalance estimates for each topic; and extrapolating the results of the imbalance estimation to all documents. The method was cross-validated on a labelled subsample of 1000 news items and achieved high predictive power (ROC AUC 0.73).
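The last two stages can be illustrated with a minimal sketch. The abstract does not state which imbalance estimator or topic model the paper uses, so everything here is an assumption for illustration: the document-topic weights `theta` are taken as given from any topic model, the partition is binary, imbalance is a normalized difference of mean topic weights, and the function names `topic_imbalance` and `document_score` are hypothetical.

```python
from typing import List, Sequence


def topic_imbalance(theta: Sequence[Sequence[float]],
                    partition: Sequence[int]) -> List[float]:
    """Per-topic imbalance between the two parts of the corpus.

    theta     -- document-topic weights, one row per document (assumed given)
    partition -- high-level label per document, 0 or 1 (both parts non-empty)

    Assumed estimator (not taken from the paper): the difference of the mean
    topic weights in the two parts, divided by their sum, so each value lies
    in [-1, 1].
    """
    n_topics = len(theta[0])
    sums = {0: [0.0] * n_topics, 1: [0.0] * n_topics}
    counts = {0: 0, 1: 0}
    for row, label in zip(theta, partition):
        counts[label] += 1
        for k, weight in enumerate(row):
            sums[label][k] += weight

    imbalance = []
    for k in range(n_topics):
        mean_1 = sums[1][k] / counts[1]
        mean_0 = sums[0][k] / counts[0]
        total = mean_1 + mean_0
        imbalance.append((mean_1 - mean_0) / total if total > 0 else 0.0)
    return imbalance


def document_score(row: Sequence[float], imbalance: Sequence[float]) -> float:
    """Extrapolate topic imbalance to one document.

    The score is the average topic imbalance weighted by the document's
    (normalized) topic weights; the row must have a positive sum.
    """
    total = sum(row)
    return sum(w * s for w, s in zip(row, imbalance)) / total


# Toy data: two topics, four documents, the first two in part 1.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
partition = [1, 1, 0, 0]

imbalance = topic_imbalance(theta, partition)
scores = [document_score(row, imbalance) for row in theta]
unseen_score = document_score([0.5, 0.5], imbalance)
```

In this toy setup, documents dominated by the topic over-represented in part 1 receive positive scores, and a document split evenly between the two topics scores near zero. Such scores could then be compared against a small expert-labelled sample, for example with ROC AUC as in the abstract.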
Original language: English
Pages (from-to): 205-212
Journal: Procedia Computer Science
Volume: 178
Early online date: 7 Dec 2020
Publication status: Published - 2020
Event: 9th International Young Scientists Conference in Computational Science - Heraklion, Greece
Duration: 22 Jun 2020 to 27 Jun 2020

ID: 72704648