Tool for modelling change and persistence in newspaper content. For an exposition of the underlying method, see Persistent News: The Information Dynamics of Nordic Newspapers; for the design, see the News-fluxus design specification.
Publications:
- K. L. Nielbo, R. B. Baglini, P. B. Vahlstrup, K. C. Enevoldsen, A. Bechmann, and A. Roepstorff, “News Information Decoupling: An Information Signature of Catastrophes in Legacy News Media,” arXiv:2101.02956 [cs].
To run in a virtual environment (recommended), assuming Python 3.7+ is installed:
$ sudo pip3 install virtualenv
$ virtualenv -p /usr/bin/python3.7 venv
$ source venv/bin/activate
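To confirm that the environment is active and the interpreter is recent enough, a small check like the following can be run inside it. This is a minimal sketch; the helper names `in_virtualenv` and `python_is_supported` are illustrative and are not part of this repository.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def python_is_supported(minimum=(3, 7)):
    """Return True when the running Python is at least `minimum`."""
    return sys.version_info[:2] >= minimum


print("virtual environment active:", in_virtualenv())
print("Python 3.7+:", python_is_supported())
```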
Clone the repository and install the requirements:
$ git clone https://github.com/centre-for-humanities-computing/newsFluxus.git
$ cd newsFluxus
$ pip3 install -r requirements.txt
Currently the requirements file installs `torch` and `torchvision` without support for GPU acceleration. If you want to use your accelerator(-s) comment out `torch` and `torchvision` in the requirements file, uninstall with pip (if relevant), and run `pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html` for your desired CUDA version (in this case 11.0+).
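After installing, one way to see which build is active is to ask torch directly. The sketch below uses `torch.cuda.is_available()` from the PyTorch API; the helper name `describe_torch_backend` is illustrative, and the function falls back to a message when torch is not installed.

```python
def describe_torch_backend():
    """Return a short description of the installed torch build."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # torch.cuda.is_available() is True only for a CUDA-enabled build
    # running on a machine with a usable GPU and driver.
    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


print(describe_torch_backend())
```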
Clone and install Mallet (plus dependencies)
$ sudo apt-get install default-jdk
$ sudo apt-get install ant
$ git clone git@github.com:mimno/Mallet.git
$ cd Mallet/
$ ant
Change the path to the local Mallet installation in src/tekisuto/models/latentsemantics.py:
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import LdaMallet
>>> path_to_mallet_binary = "/path/to/mallet/binary"
>>> model = LdaMallet(path_to_mallet_binary, corpus=common_corpus, num_topics=20, id2word=common_dictionary)
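A wrong path tends to surface only when the model is trained, so it can help to verify the binary first. The following is a minimal sketch: the helper `mallet_binary_ok` and the location `~/Mallet/bin/mallet` are assumptions for illustration and should be adjusted to wherever Mallet was cloned and built.

```python
import os


def mallet_binary_ok(path):
    """Return True if `path` is an existing, executable regular file."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Hypothetical location; replace with the path used in latentsemantics.py.
path_to_mallet_binary = os.path.expanduser("~/Mallet/bin/mallet")

if mallet_binary_ok(path_to_mallet_binary):
    print("Mallet binary found:", path_to_mallet_binary)
else:
    print("Mallet binary missing or not executable:", path_to_mallet_binary)
```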
$ python downloader.py --language <language-code>
# ex. for Danish language resources
$ python downloader.py --language da
You will be prompted for a location to store the data; just use the default. To find language codes, see Stanza.
>>> import stanza
>>> nlp = stanza.Pipeline(lang="da")
>>> doc = nlp("Rap! rap! sagde hun, og så rappede de sig alt hvad de kunne, og så til alle sider under de grønne blade, og moderen lod dem se så meget de ville, for det grønne er godt for øjnene.")
>>> doc.sentences[0].print_dependencies()
$ bash main.sh
Or run each step individually:
$ python src/bow_mdl.py --dataset <path-to-dataset> --language <language-code> --bytestore <frequency-of-backup> --sourcename <name-of-dataset> --estimate "<start stop step>" --verbose <frequency-of-log>
$ python src/signal_extraction.py --model <path-to-serialized-model>
# ex. for Danish sample
$ python bow_mdl.py --dataset ../dat/sample.ndjson --language da --bytestore 100 --estimate "20 50 10" --sourcename sample --verbose 100
$ python src/signal_extraction.py --model mdl/da_sample_model.pcl
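The `--estimate` argument takes a quoted `"<start stop step>"` triple. How the script interprets it is not documented here; one plausible reading is a grid of candidate values (for example, numbers of topics) expanded like Python's `range`. The sketch below illustrates that reading only. The function `parse_estimate` is hypothetical, and whether the stop value is included in the actual script is an assumption.

```python
def parse_estimate(spec):
    """Turn a "start stop step" string into a list of candidate values."""
    parts = spec.split()
    if len(parts) != 3:
        raise ValueError(f'expected "start stop step", got {spec!r}')
    start, stop, step = (int(part) for part in parts)
    if step <= 0:
        raise ValueError("step must be a positive integer")
    # Assumption: half-open interval as in range(); stop itself is excluded.
    return list(range(start, stop, step))


print(parse_estimate("20 50 10"))  # [20, 30, 40]
```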
Requires matplotlib
$ python src/news_uncertainty.py --dataset mdl/da_sample_signal.json --window 7 --figure "fig"
The resulting visualizations are written to fig/.
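This README does not describe the extracted signals themselves. Assuming they follow the windowed novelty and resonance measures discussed in the publication listed above (an assumption, not something confirmed here), the idea can be sketched in plain Python. The function names, the choice of Jensen-Shannon divergence, and the toy input are illustrative and are not taken from src/; the `window` parameter of the sketch is not claimed to match the `--window` flag.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def novelty_resonance(dists, window):
    """Return (t, novelty, resonance) for every t with a full window on both sides."""
    out = []
    for t in range(window, len(dists) - window):
        # Novelty: mean divergence from the `window` preceding documents.
        novelty = sum(jsd(dists[t], dists[t - j]) for j in range(1, window + 1)) / window
        # Transience: mean divergence from the `window` following documents.
        transience = sum(jsd(dists[t], dists[t + j]) for j in range(1, window + 1)) / window
        out.append((t, novelty, novelty - transience))
    return out


# Toy topic distributions: the content switches at index 2 and then persists.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
for t, novelty, resonance in novelty_resonance(toy, window=1):
    print(t, novelty, resonance)
```

In the toy input, the document at index 2 differs from its past and matches its future, so it scores high on both novelty and resonance, which is the pattern of content that is new and then persists.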
- Fork it!
- Create your feature branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin my-new-feature`
- Submit a pull request 😈
Edition | Date | Comment |
---|---|---|
v1.0 | June 04 2020 | Launch |
v1.1 | January 14 2020 | New NLP pipeline |
Kristoffer L. Nielbo
This project is licensed under the MIT License - see the LICENSE.md file for details
Stopwords ISO for their multilingual collection of stopwords.