Using Twitter data to predict primary election results.
Follow these steps to set up MongoDB on an Ubuntu machine.
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
On a Mac:
brew update
brew install mongodb
Start and stop the MongoDB service (Ubuntu):
sudo service mongod start
sudo service mongod stop
Install the Python dependencies:
pip install tweepy
pip install pymongo
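With pymongo installed, you can sanity-check that the local MongoDB server is reachable. This is just a quick check, not part of the project code:

from pymongo import MongoClient

# Assumes mongod is running locally on the default port (27017).
client = MongoClient("localhost", 27017)
print(client.server_info()["version"])  # raises an error if the server is unreachable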
In the ~/.bashrc file (Ubuntu) or ~/.bash_profile file (Mac), add the following:
export TWITTER_CONSUMER_KEY='your key'
export TWITTER_CONSUMER_KEY_SECRET='your secret'
export TWITTER_ACCESS_TOKEN_KEY='your key'
export TWITTER_ACCESS_TOKEN_KEY_SECRET='your secret'
export PYTHONPATH="<location of your project folder, in this case the 'Elections' folder>":$PYTHONPATH
Then either open a new terminal or run
source ~/.bashrc
or
source ~/.bash_profile
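The actual collection is done by extract-tweets.py in this repo. Purely as an illustration of how the pieces fit together (assuming tweepy 3.x, whose streaming API differs from 4.x; the track terms, database, and collection names below are placeholders, not the ones used here), a collector that reads these environment variables might look like:

import os
import tweepy
from pymongo import MongoClient

auth = tweepy.OAuthHandler(os.environ["TWITTER_CONSUMER_KEY"],
                           os.environ["TWITTER_CONSUMER_KEY_SECRET"])
auth.set_access_token(os.environ["TWITTER_ACCESS_TOKEN_KEY"],
                      os.environ["TWITTER_ACCESS_TOKEN_KEY_SECRET"])

db = MongoClient("localhost", 27017).elections  # placeholder database name

class MongoListener(tweepy.StreamListener):
    def on_status(self, status):
        db.tweets.insert_one(status._json)  # store the raw tweet JSON

stream = tweepy.Stream(auth=auth, listener=MongoListener())
stream.filter(track=["primary", "election"])  # placeholder track terms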
A tmux session named collect_twitter_data has been created. It keeps running in the background until it is terminated explicitly. extract-tweets.py is currently running in that session on the AWS instance, collecting data. To see what is happening, 'attach' to the session: ssh to our instance from a terminal and run:
tmux attach -t collect_twitter_data
Then you can use it like a normal Ubuntu terminal. To detach without stopping the session, press Ctrl-b and then d.
For more info on tmux - https://robots.thoughtbot.com/a-tmux-crash-course
Streamparse lets you run Python code against real-time streams of data via Apache Storm. With streamparse you can create Storm bolts and spouts in Python without having to write a single line of Java. It also provides handy CLI utilities for managing Storm clusters and projects.
You can get started right away with a basic wordcount topology, with no Java required.
Important: the installed version of lein must be 2.x. If it is not, upgrade it before continuing.
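As a rough sketch of what a Python component looks like (assuming streamparse 3.x; the class below follows the wordcount quickstart pattern and is not code from this repo):

from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    outputs = ["word", "count"]

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]          # the incoming word
        self.counts[word] += 1
        self.emit([word, self.counts[word]])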
For sentiment analysis we use a Naive Bayes classifier from NLTK. The training data currently used is a set of 800,000 tagged tweets.
All examples below are run from within the sentiment directory.
Trim and keep all 800,000 tweets
$ python3 trim.py
Trim to 10,000 tweets (faster for testing, use any number up to 800,000)
$ python3 trim.py training_set.csv 10000
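trim.py ships with the repo; conceptually it just keeps the first N rows of training_set.csv and writes them to trimmed.csv. A stand-alone sketch of that idea (illustrative only, not the repo's actual script):

import csv
import sys

# Usage: python3 trim.py [input.csv] [row_limit]
source = sys.argv[1] if len(sys.argv) > 1 else "training_set.csv"
limit = int(sys.argv[2]) if len(sys.argv) > 2 else 800000

with open(source, newline="") as src, open("trimmed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src)):
        if i >= limit:
            break
        writer.writerow(row)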
Train the classifier (requires the trimmed.csv file from above)
$ python3 trainer.py
Train the classifier with 10 iterations over each set (this does not improve accuracy, but more trials give a better basis for comparison)
$ python3 trainer.py 10
The most accurate classifier found during training is saved at the end, and that is the classifier loaded here. Edit your config.json to use a different classifier.
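For reference, training and saving an NLTK Naive Bayes classifier follows this general pattern. The bag-of-words features and file names below are only an illustration; the feature extraction and persistence in trainer.py may differ:

import pickle
import random
import nltk

# labeled: (tweet_text, "pos"/"neg") pairs, e.g. parsed from trimmed.csv
labeled = [("this is awesome", "pos"), ("this sucks", "neg")]  # toy data

def features(text):
    return {word: True for word in text.lower().split()}  # simple bag of words

featuresets = [(features(text), label) for text, label in labeled]
random.shuffle(featuresets)
cut = max(1, len(featuresets) // 10)
train_set, test_set = featuresets[cut:], featuresets[:cut]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))

with open("classifier.pickle", "wb") as f:  # placeholder file name
    pickle.dump(classifier, f)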
To test using the interactive CLI
$ python3 cli.py
? this is awesome!
pos
? this sucks
neg
You can also use the Analyzer class from within another program. See the cli.py file for an example.
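The exact import path and method names are defined in the repo, so check cli.py for the real interface; purely as a hypothetical illustration of the pattern:

# Hypothetical usage; the real module path and method name are in cli.py.
from analyzer import Analyzer   # assumed module name

analyzer = Analyzer()                          # presumably loads the saved classifier
print(analyzer.classify("this is awesome!"))   # expected to print "pos"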
Run crontab -e and add the following line to the file.
*/2 * * * * /absolute_file_path/generate_wordcloud.py
This runs the script every 2 minutes. Note: change the absolute paths inside generate_wordcloud.py to match your setup.
A useful link: http://www.thesitewizard.com/general/set-cron-job.shtml
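generate_wordcloud.py is the script the cron job runs. As a general illustration of building a word cloud from stored tweets with the wordcloud package and pymongo (database, collection, field, and output path are placeholders, not necessarily what the repo uses):

#!/usr/bin/env python3
# Illustrative only; database, collection, and output paths are placeholders.
from pymongo import MongoClient
from wordcloud import WordCloud

db = MongoClient("localhost", 27017).elections
text = " ".join(doc.get("text", "") for doc in db.tweets.find({}, {"text": 1}))

WordCloud(width=800, height=400).generate(text).to_file("/absolute_file_path/wordcloud.png")

Note that for cron to execute the script directly it needs a shebang line and execute permission (chmod +x generate_wordcloud.py).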