Overcoming Language Disparity in Online Content Classification with Multimodal Learning

Resources for the ICWSM 2022 paper, "Overcoming Language Disparity in Online Content Classification with Multimodal Learning"
Authors: Gaurav Verma, Rohit Mujumdar, Zijie J. Wang, Munmun De Choudhury, and Srijan Kumar
Paper link: https://arxiv.org/abs/2205.09744
Webpage: https://multimodality-language-disparity.github.io/

Overview

Figure description: An example of a social media post that is correctly classified in English but misclassified in Spanish. Including the corresponding image leads to correct classification in Spanish as well as in other non-English languages. F1 scores on all examples are also shown (the non-English score is the average F1 across all non-English languages).

Code

We release the code for fine-tuning BERT-based monolingual and multilingual classifiers for the following languages: English, Spanish, Portuguese, French, Chinese, and Hindi. Please refer to the files inside language-models/ for more details. We also release the code to fine-tune a VGG-16 image classifier and to train a fusion-based multimodal classifier. Please refer to the files inside image-models/ for more details.
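To give a rough idea of what a fusion-based classifier does, here is a minimal sketch that assumes fusion by concatenation: a text feature vector and an image feature vector are joined and passed through a single linear layer with a softmax. This is an illustration only, not the code in image-models/. The function names, the tiny feature sizes, and the choice of concatenation are all assumptions made for the example; the released scripts are the reference for the actual architecture.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(text_features, image_features, weights, biases):
    """Hypothetical late fusion: concatenate features, then linear layer + softmax.

    weights holds one row per class; each row must be as long as the
    concatenated feature vector. biases holds one value per class.
    """
    fused = list(text_features) + list(image_features)
    if len(weights) != len(biases):
        raise ValueError("need exactly one bias per weight row")
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)
```

In practice the text features would come from a fine-tuned BERT encoder and the image features from VGG-16, and the fusion layers would be trained rather than supplied by hand.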

Datasets

In this work, we consider three social computing tasks that have existing multimodal datasets available. Please download the datasets from their respective webpages:

Crisis humanitarianism (CrisisMMD): https://crisisnlp.qcri.org/crisismmd
Fake news detection: https://github.com/shiivangii/SpotFakePlus
Emotion classification: https://github.com/emoclassifier/emoclassifier.github.io (note: if you cannot access the dataset at its original source, proposed in this paper, please contact us for the Reddit URLs we used in our work.)

Human-translated evaluation set

As part of our evaluation, we create a human-translated subset of the CrisisMMD dataset. The human-translated subset contains about 200 multimodal examples in English, each translated to Spanish, Portuguese, French, Chinese, and Hindi (a total of ~1200 translations). The translations for the five non-English languages are available in human-translated-eval-set/. The Twitter IDs for the original examples from the CrisisMMD dataset are available in the file named human-translated-eval-set/tweet_ids.txt; the lines in the rest of the translation files correspond to these IDs.
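Because the translation files are aligned with tweet_ids.txt line by line, a translation can be matched to its original tweet by line position. The sketch below shows one way to do that. The helper names are made up for this example, and the translation file name in the usage comment (es.txt) is a placeholder, so check the directory for the actual file names.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_aligned_translations(ids_path, translation_path):
    """Pair each tweet ID with the translation found on the same line number.

    Returns a list of (tweet_id, translated_text) tuples in file order.
    Raises ValueError if the two files differ in length, since the
    line-by-line alignment would then be unreliable.
    """
    tweet_ids = [line.strip() for line in read_lines(ids_path)]
    translations = read_lines(translation_path)
    if len(tweet_ids) != len(translations):
        raise ValueError(
            f"line count mismatch: {len(tweet_ids)} IDs vs "
            f"{len(translations)} translations"
        )
    return list(zip(tweet_ids, translations))


# Example usage (the translation file name is a placeholder):
# pairs = load_aligned_translations(
#     "human-translated-eval-set/tweet_ids.txt",
#     "human-translated-eval-set/es.txt",
# )
```

The length check is deliberate: if a file was truncated or edited, failing early is safer than silently pairing translations with the wrong tweets.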

Bibtex

@inproceedings{verma2022overcoming,
    title={Overcoming Language Disparity in Online Content Classification with Multimodal Learning},
    author={Verma, Gaurav and Mujumdar, Rohit and Wang, Zijie J and De Choudhury, Munmun and Kumar, Srijan},
    booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
    year={2022}
}
