An Active Learning Framework for Trustworthy Classification over COVID-19 Multilingual Fake News Dataset
Abstract
During the COVID-19 pandemic, rumors have spread on social media in many languages and forms alongside the virus's global spread, and believing these rumors can cause significant harm to people worldwide. However, most existing COVID-19-related fake news datasets are in English only, so reliable fake news detection data in other languages is valuable for research in the fight against this international 'infodemic'. We address this gap by creating a multilingual COVID-19 fake news dataset of 6,500 data points in five world languages: English, Spanish, Chinese, Korean, and Russian. We fine-tune Multilingual-BERT, a state-of-the-art NLP model, on the produced dataset to solve multilingual fake news detection problems, and obtain a best F1-score of 0.971 on the overall test set. We also apply active learning techniques to find the best way to leverage limited data for multilingual fake news detection.
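As an illustration of the active learning idea mentioned above, the following is a minimal sketch of one common query strategy, entropy-based uncertainty sampling. The abstract does not state which strategies the dissertation evaluates, so the strategy, the function names (`binary_entropy`, `select_most_uncertain`), and the example probabilities are all assumptions made for illustration only.

```python
import math


def binary_entropy(p_fake):
    """Entropy (in bits) of a binary prediction where P(fake) = p_fake."""
    if p_fake <= 0.0 or p_fake >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    p_real = 1.0 - p_fake
    return -(p_fake * math.log2(p_fake) + p_real * math.log2(p_real))


def select_most_uncertain(pool_probs, k):
    """Return indices of the k unlabeled items with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: binary_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical P(fake) scores from a classifier for five unlabeled posts.
pool_probs = [0.98, 0.52, 0.10, 0.45, 0.80]

# The items closest to 0.5 are the ones the model is least sure about,
# so they would be sent to annotators first.
print(select_most_uncertain(pool_probs, k=2))  # prints [1, 3]
```

In a full loop, the selected items would be labeled, added to the training set, and the classifier retrained before the next round of selection.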
Subject Area
Computer science|Mass communications|Artificial intelligence
Recommended Citation
Pu, Tianyao, "An Active Learning Framework for Trustworthy Classification over COVID-19 Multilingual Fake News Dataset" (2023). ETD Collection for Fordham University. AAI29998543.
https://research.library.fordham.edu/dissertations/AAI29998543