This is a work-in-progress repository for the CLiPS HAte speech DEtection System (HADES).
Currently, the repository contains the supplementary materials from the paper: "A Dictionary-based Approach to Racism Detection in Dutch Social Media", presented at the TA-COS workshop at LREC 2016.
The dictionaries in this repository are available under a CC BY-SA 4.0 License. If you use the dictionaries in your work, please cite:
@inproceedings{tulkens2016a,
title={A Dictionary-based Approach to Racism Detection in {Dutch} Social Media},
author={Tulkens, St\'{e}phan and Hilte, Lisa and Lodewyckx, Elise and Verhoeven, Ben and Daelemans, Walter},
booktitle={Proceedings of the LREC 2016 Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS)},
year={2016},
organization={European Language Resources Association (ELRA)}
}
Note that we expanded the TA-COS submission into a journal paper, which was published in the CLIN Journal.
If you use the dictionary expansion techniques from this paper, please also consider citing it:
@article{tulkens2016automated,
title={The automated detection of racist discourse in dutch social media},
author={Tulkens, St{\'e}phan and Hilte, Lisa and Lodewyckx, Elise and Verhoeven, Ben and Daelemans, Walter},
journal={Computational Linguistics in the Netherlands Journal},
volume={6},
number={1},
pages={3--20},
year={2016}
}
The dictionaries are in .csv format. The first word of each line is the category name, while the other words are the words in that category. Included is a python (2.7 & 3.x) script which reads in the dictionaries and outputs relative frequencies. It can be used for similar dictionaries, such as the LIWC dictionaries.
from dictfeaturizer import DictFeaturizer
# Load from csv
d = DictFeaturizer.load("expanded.csv")
text = "this is an example text".split()
score = d.transform(text)
# Direct initialization
direct = {"good": ["good", "splendid"], "bad": ["bad", "useless"]}
d = DictFeaturizer(direct, relative=False)
text = "This stuff is splendid".split()
score_2 = d.transform(text)