This paper presents a method based on topic modelling for identifying texts with propagandistic content. The method attempts to incorporate the transfer learning idea of obtaining an effective vector representation from a large unlabelled or (semi-)automatically labelled dataset, while also minimizing the amount of manual expert labelling required by introducing high-level labelling (either manual or automatic) on some explicit document property. The proposed method has four key stages: partitioning the corpus, computing a topic model of the united corpus, estimating the imbalance of each topic across the corpora, and extrapolating the imbalance estimates to all documents. The method was cross-validated on a labelled subsample of 1000 news items and achieved a ROC AUC of 0.73.
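The last two stages can be illustrated with a minimal sketch. It assumes the corpus has already been partitioned into two parts and that a topic model has produced document-topic weights. The imbalance formula (normalised difference of mean topic weights) and the names `topic_imbalance` and `document_scores` are assumptions made for illustration; they are not taken from the paper.

```python
# Illustrative sketch only: the imbalance formula and all names here are
# assumptions for demonstration, not the formulas used in the paper.

def topic_imbalance(theta, partition):
    """Return one imbalance value in [-1, 1] per topic.

    theta: document-topic weights, one row per document (from any topic model).
    partition: 0/1 label per document, i.e. the high-level corpus partitioning.
    """
    n_topics = len(theta[0])
    mass = {0: [0.0] * n_topics, 1: [0.0] * n_topics}
    counts = {0: 0, 1: 0}
    for row, part in zip(theta, partition):
        counts[part] += 1
        for t, weight in enumerate(row):
            mass[part][t] += weight
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("both corpus parts must contain at least one document")

    imbalance = []
    for t in range(n_topics):
        mean_1 = mass[1][t] / counts[1]  # mean topic weight in part 1
        mean_0 = mass[0][t] / counts[0]  # mean topic weight in part 0
        total = mean_1 + mean_0
        imbalance.append((mean_1 - mean_0) / total if total > 0 else 0.0)
    return imbalance


def document_scores(theta, imbalance):
    """Extrapolate topic imbalance to documents as a topic-weighted average."""
    return [sum(w * s for w, s in zip(row, imbalance)) for row in theta]


# Toy data: four documents, two topics; the first two documents are in part 1.
theta = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.1, 0.9],
]
partition = [1, 1, 0, 0]

imbalance = topic_imbalance(theta, partition)
scores = document_scores(theta, imbalance)
```

In this toy example the first topic is over-represented in part 1, so documents dominated by it receive positive scores and the others negative ones. Scores of this kind could then be compared against expert labels, for instance with ROC AUC.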
Original language: English
Pages (from-to): 205-212
Journal: Procedia Computer Science
Volume: 178
Early online date: 7 Dec 2020
State: Published - 2020
Event: 9th International Young Scientists Conference in Computational Science - Heraklion, Greece
Duration: 22 Jun 2020 to 27 Jun 2020

    Research areas

  • propaganda, natural language processing, topic modelling, text classification, mass media analysis

ID: 72704648