Topic - to take Twitter tweets and classify each tweet as positive (reflecting positive sentiment) or negative (reflecting negative sentiment)
I have used a Kaggle data set: Click here
Training and testing are done on the provided data set.
The data set has about 50k positive tweets and 40k negative tweets.
Plot of word frequencies against the words
This graph follows Zipf's law. Learn more about Zipf's law Here
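A minimal sketch of how such a frequency plot can be produced, assuming the tweets are available as a list of strings; the variable name `tweets` and the use of matplotlib are assumptions, not the project's actual code:

```python
# Minimal sketch: plot word frequency against rank to visualise Zipf's law.
# `tweets` is a placeholder for a list of tweet strings from the data set.
from collections import Counter
import matplotlib.pyplot as plt

def plot_word_frequencies(tweets):
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    frequencies = sorted(counts.values(), reverse=True)       # frequencies ordered by rank
    plt.loglog(range(1, len(frequencies) + 1), frequencies)   # Zipf's law looks roughly linear on log-log axes
    plt.xlabel("word rank")
    plt.ylabel("frequency")
    plt.title("Word frequency vs. rank")
    plt.show()
```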
To train a classifier, we first have to convert the input tweet into a format that can be given to the classifier; this step is called preprocessing.
It involves several steps:
A word or phrase preceded by a hash sign (#), used on social media websites and applications, especially Twitter, to identify messages on a specific topic.
Used to share links to other sites in tweets.
We have permanently removed links from our input text, as they do not provide any information about the sentiment of the text.
Emojis are widely used nowadays on social networking sites to represent human expressions. Currently we have removed these emojis.
How useful emojis are for the purpose of sentiment analysis remains part of the future work.
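A minimal sketch of this cleaning step using simple regular expressions; the exact patterns, and the choice to keep the hashtag text while dropping the '#' sign, are assumptions rather than the project's original code:

```python
# Minimal sketch of tweet cleaning: drop links and emojis, keep hashtag text.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")                    # links (removed permanently)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")    # common emoji ranges (assumed)

def clean_tweet(text):
    text = URL_RE.sub("", text)    # remove links
    text = EMOJI_RE.sub("", text)  # remove emojis (their usefulness is future work)
    text = text.replace("#", "")   # keep the hashtag word, drop the '#' sign (assumed)
    return text

print(clean_tweet("Loving the new phone! #tech https://t.co/abc 😍"))
```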
To remove punctuation from the input text
input - Arjun said "Aditya is a good boy"
output - Arjun said Aditya is a good boy
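A minimal sketch of punctuation removal; using Python's built-in `string.punctuation` character set is an assumption about the implementation:

```python
# Minimal sketch: strip all punctuation characters from the text.
import string

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation('Arjun said "Aditya is a good boy"'))
# Arjun said Aditya is a good boy
```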
To remove repeating characters from the text
input - yayyyyy ! i got the job
output - yayy ! i got the job
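A minimal sketch that reproduces the example above by collapsing runs of three or more identical characters down to two; the original implementation may differ:

```python
# Minimal sketch: reduce any character repeated 3+ times to exactly 2 repeats.
import re

def reduce_repeats(text):
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(reduce_repeats("yayyyyy ! i got the job"))
# yayy ! i got the job
```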
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem".
A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
The Porter stemmer is used here.
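A short example with NLTK's `PorterStemmer`; the project may wire the stemmer in differently:

```python
# Stemming the examples mentioned above with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["cats", "fishing", "fished", "argues", "argument"]])
# ['cat', 'fish', 'fish', 'argu', 'argument']
```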
We have used
- unigrams
- bigrams
- unigrams + bigrams
- unigrams + bigrams + trigrams
as features (a feature-extraction sketch is shown below).
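A minimal sketch of building these n-gram feature sets with scikit-learn's `CountVectorizer`; the project's own feature extraction may be implemented differently (for example with NLTK):

```python
# Minimal sketch: the four feature sets listed above as CountVectorizer configurations.
from sklearn.feature_extraction.text import CountVectorizer

feature_sets = {
    "unigrams":            CountVectorizer(ngram_range=(1, 1)),
    "bigrams":             CountVectorizer(ngram_range=(2, 2)),
    "unigrams + bigrams":  CountVectorizer(ngram_range=(1, 2)),
    "uni + bi + trigrams": CountVectorizer(ngram_range=(1, 3)),
}

X = feature_sets["uni + bi + trigrams"].fit_transform(["i got the job", "i lost the job"])
print(feature_sets["uni + bi + trigrams"].get_feature_names_out())
```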
We have used three models with the above-mentioned features. Note that all the results shown here are test results, obtained by submitting the output on the test file to Kaggle.
For all of the classifiers shown above, we can see that using only unigrams gives the lowest accuracy, whereas the maximum accuracy is achieved by the Maximum Entropy classifier with unigrams + bigrams + trigrams as features.
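A hedged sketch of the best-performing setup described above, using scikit-learn's logistic regression as the maximum-entropy classifier over unigram + bigram + trigram counts; this is an illustration, not the project's exact training script, and `train_tweets`/`train_labels` are placeholders:

```python
# Illustration only: MaxEnt-style classifier (logistic regression) over uni+bi+tri-gram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # unigrams + bigrams + trigrams
    LogisticRegression(max_iter=1000),    # maximum entropy == multinomial logistic regression
)

# Placeholders for the preprocessed Kaggle training data:
# model.fit(train_tweets, train_labels)
# predictions = model.predict(test_tweets)
```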
We used the Sentiment140 data set, which contains nearly 1.6 million (16 lakh) tweets with positive, negative, and neutral comments.
The data set is also provided in the data folder.
We then used the pull_tweets.py file to pull data from Twitter corresponding to a particular hashtag and predict the results. Here we have used the Maximum Entropy classifier with unigram + bigram + trigram features; we have not tried any other models due to lack of processing power (a conceptual sketch follows the hashtag list below).
We pulled tweets from two hashtags
- ramdaan
- SaveDemocracy
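A conceptual sketch of what the pull_tweets.py step does, assuming Tweepy is used to search Twitter for a hashtag and the trained model classifies each tweet; the credentials, API calls, and variable names here are placeholders for illustration only and may not match the actual script:

```python
# Illustration only: pull tweets for a hashtag and classify them with a trained model.
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def classify_hashtag(model, hashtag, limit=100):
    tweets = tweepy.Cursor(api.search_tweets, q=hashtag, lang="en").items(limit)
    return [(t.text, model.predict([t.text])[0]) for t in tweets]

# results = classify_hashtag(model, "#SaveDemocracy")
```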
Results are shown below
I am open to pull requests for further modifications to this project:
- to use other sets of features and classifiers to improve accuracy
- to use emojis as a feature for sentiment analysis and check how this affects the accuracy of the classifier