This repository contains code for the paper "RP-DNN: A Tweet level propagation context based deep neural networks for early rumor detection in Social Media" By J. Gao, S. Han, X. Song, etc, available via https://arxiv.org/abs/2002.12683v2
This paper has been accepted for an Oral presentation at the 12th International Conference on Language Resources and Evaluation:
The LOO-CV and CV rumor source dataset (train set, validation set and test set) are available at our in "data/cv_dataset".
For all the social context corpus (12 events in total) used in this paper, please download them from https://zenodo.org/record/3249977 and https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078
Our models are developed with Allennlp framework.
The trained models reported in our paper is available at figshare project site (shef.data.11558520.v1).
Due to limited quote of available space, we release full model only.
If you are interested in other models examined in our experiment, please contact us.
The code and models are developed and tested in following environment with Conda:
- Python 3.6
- CUDA 9.1.85
- cudnn 7.0 (binary-cuda-9.1.85)
- gcc 4.9.4
- conda 4.3.17
Following resources were used to train our models:
- 2 x large-memory nodes, with 2x Intel E5-2630-v3 CPUs and 256GB RAM
- 2 x NVIDIA Kepler K40M GPUs (Each K40M GPU has 12GB)
- NVIDIA K80 nodes (Each GPU unit have 24 GByes of memory)
- pandas>=0.23.4
- allennlp==0.8.2
- tqdm>=4.31.1
- gensim
- h5pynltk
- overrides
- regex==2018.01.10
All setting steps can optionally be done in a virtual environment using tools such as conda
Prerequisite: To use our source code either for training or for loading pre-trained RPDNN model, you need to setup two important resource.
a) ELMo model;
It is recommended to set symlink for elmo model in resource/elmo_model/
. Please see a template script
symlink_elmo_model.sh
in the root directory. Alternatively, you can also copy latest model file into this directory. For the fine-tuned ELMo, please see details in https://github.com/soojihan/Multitask4Veracity.
Please cite our following paper if you are using this model in your research.
Han S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD,New Orleans, Louisiana, US
b) social context corpus;
set symlink for social context directory (organised in
PHEME corpus structure)
in data/social_context/all-rnr-annotated-threads-retweets
.
A template script symlink_social_context_directory.sh
is provided.
Allennlp is used as a library in this project and its Jsonet based configuration is not developed and supported.
An alternative trainer utility script rumour_dnn_trainer.py
is developed to support training of our Allennlp based model.
Key inputs are:
- train_set_path ("-t"): train dataset path
- heldout_set_path("--heldout"): validation dataset path
- evaluationset("-e"): evaluation (test) dataset
- model_file_prefix("-p"): set model file prefix name for model weight output file
- feature_setting("-f"): feature options that used to train or evaluate RPDNN model and have 5 options available
- max_cxt_size: maximum social context size (default 200)
- n_gpu ("-g"): gpu device(s) to use (-1: no gpu, 0: 1 gpu). only support int value for device no.
- epochs: set num_epochs for training
For more settings, please find in the trainer script.
The model will be output into output/
directory by default with timestamp in a new subdirectory.
Example usage:
$ python /RPDNN/src/rumour_dnn_trainer.py -t /data/loocv_set_20191002/sydneysiege/all_rnr_train_set_combined.csv --heldout /data/loocv_set_20191002/sydneysiege/all_rnr_heldout_set_combined.csv -e /data/loocv_set_20191002/sydneysiege/all_rnr_test_set_combined.csv -p "sydneysiege_full" -g 0 -f -1 --max_cxt_size 200 --epochs 10
To test your model or evaluate our trained model, you need to use our evaluator script rumour_dnn_evaluator.py
Key inputs are:
- testset("-t"): test set csv file path;
- model("-m"): pre-trained model directory to be evaluated;
- feature_setting("-f"): feature options that used to train or evaluate RPDNN model and have 5 options available. In test mode, this setting must be the same as the setting in training
- n_gpu("-g"): gpu device(s) to use (-1: no gpu, 0: 1 gpu). only support int value for device no.
For more settings, please find in the evaluator script.
Example usage:
python /RPDNN/src/rumour_dnn_evaluator.py -t /data/loocv_set_20191002/ferguson/all_rnr_test_set_combined.csv -m /model/RPDNN_model_output_201910/full/ferguson_full201910121555 -g 0 -f -1 --max_cxt_size 200