Topic - to take Twitter tweets and classify each tweet as positive (reflecting positive sentiment) or negative (reflecting negative sentiment)
I have used a Kaggle data set: Click here
Training and testing are done on the provided data set.
The data set has about 50k positive tweets and 40k negative tweets.
Plot of word frequencies against the words
This graph follows Zipf's law. Learn more about Zipf's law Here
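A minimal sketch of how such a frequency plot can be produced, assuming the tweets are available as a list of strings; the variable name `tweets` and the use of matplotlib are assumptions, not the project's actual code:

```python
# Minimal sketch: plot word frequency against rank to visualise Zipf's law.
# `tweets` is a placeholder for a list of tweet strings from the data set.
from collections import Counter
import matplotlib.pyplot as plt

def plot_word_frequencies(tweets):
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    frequencies = sorted(counts.values(), reverse=True)       # frequencies ordered by rank
    plt.loglog(range(1, len(frequencies) + 1), frequencies)   # Zipf's law looks roughly linear on log-log axes
    plt.xlabel("word rank")
    plt.ylabel("frequency")
    plt.title("Word frequency vs. rank")
    plt.show()
```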
To train a classifier, we first have to convert the input tweet into a format that can be given to the classifier; this step is called preprocessing.
It involves several steps:
A word or phrase preceded by a hash sign (#), used on social media websites and applications, especially Twitter, to identify messages on a specific topic.
Used to share links to other sites in tweets.
We have permanently removed links from our input text, as they do not provide any information about the sentiment of the text.
Emojis are widely used nowadays on social networking sites to represent human expressions. Currently we have removed these emojis.
How useful emojis are for the purpose of sentiment analysis remains part of the future work.
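A minimal sketch of this cleaning step using simple regular expressions; the exact patterns, and the choice to keep the hashtag text while dropping the '#' sign, are assumptions rather than the project's original code:

```python
# Minimal sketch of tweet cleaning: drop links and emojis, keep hashtag text.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")                    # links (removed permanently)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")    # common emoji ranges (assumed)

def clean_tweet(text):
    text = URL_RE.sub("", text)    # remove links
    text = EMOJI_RE.sub("", text)  # remove emojis (their usefulness is future work)
    text = text.replace("#", "")   # keep the hashtag word, drop the '#' sign (assumed)
    return text

print(clean_tweet("Loving the new phone! #tech https://t.co/abc 😍"))
```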
To remove punctuation from the input text
input - Arjun said "Aditya is a good boy"
output - Arjun said Aditya is a good boy
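A minimal sketch of punctuation removal; using Python's built-in `string.punctuation` character set is an assumption about the implementation:

```python
# Minimal sketch: strip all punctuation characters from the text.
import string

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation('Arjun said "Aditya is a good boy"'))
# Arjun said Aditya is a good boy
```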
To remove repeating characters from the text
input - yayyyyy ! i got the job
output - yayy ! i got the job
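A minimal sketch that reproduces the example above by collapsing runs of three or more identical characters down to two; the original implementation may differ:

```python
# Minimal sketch: reduce any character repeated 3+ times to exactly 2 repeats.
import re

def reduce_repeats(text):
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(reduce_repeats("yayyyyy ! i got the job"))
# yayy ! i got the job
```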
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem".
A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
The Porter stemmer is used here.
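A short example with NLTK's `PorterStemmer`; the project may wire the stemmer in differently:

```python
# Stemming the examples mentioned above with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["cats", "fishing", "fished", "argues", "argument"]])
# ['cat', 'fish', 'fish', 'argu', 'argument']
```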
We have used
- unigrams
- bigrams
- unigrams + bigrams
- unigrams + bigrams + trigrams
as features (a feature-extraction sketch is shown below).
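A minimal sketch of building these n-gram feature sets with scikit-learn's `CountVectorizer`; the project's own feature extraction may be implemented differently (for example with NLTK):

```python
# Minimal sketch: the four feature sets listed above as CountVectorizer configurations.
from sklearn.feature_extraction.text import CountVectorizer

feature_sets = {
    "unigrams":            CountVectorizer(ngram_range=(1, 1)),
    "bigrams":             CountVectorizer(ngram_range=(2, 2)),
    "unigrams + bigrams":  CountVectorizer(ngram_range=(1, 2)),
    "uni + bi + trigrams": CountVectorizer(ngram_range=(1, 3)),
}

X = feature_sets["uni + bi + trigrams"].fit_transform(["i got the job", "i lost the job"])
print(feature_sets["uni + bi + trigrams"].get_feature_names_out())
```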
We have used three models with the above-mentioned features. Note that all the results shown here are test results, obtained by submitting the output on the test file to Kaggle.
For all of the classifiers shown above, we can see that using only unigrams gives the lowest accuracy, whereas the maximum accuracy is achieved by the Maximum Entropy classifier with unigrams + bigrams + trigrams as features.
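A hedged sketch of the best-performing setup described above, using scikit-learn's logistic regression as the maximum-entropy classifier over unigram + bigram + trigram counts; this is an illustration, not the project's exact training script, and `train_tweets`/`train_labels` are placeholders:

```python
# Illustration only: MaxEnt-style classifier (logistic regression) over uni+bi+tri-gram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # unigrams + bigrams + trigrams
    LogisticRegression(max_iter=1000),    # maximum entropy == multinomial logistic regression
)

# Placeholders for the preprocessed Kaggle training data:
# model.fit(train_tweets, train_labels)
# predictions = model.predict(test_tweets)
```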
We used the Sentiment140 data set, which contains nearly 1.6 million (16 lakh) tweets with positive, negative, and neutral comments.
The data set is also provided in the data folder.
We then used the pull_tweets.py file to pull data from Twitter corresponding to a particular hashtag and predict the results. Here we have used the Maximum Entropy classifier with unigram + bigram + trigram features; we have not tried any other models due to lack of processing power (a conceptual sketch follows the hashtag list below).
We pulled tweets from two hashtags
- ramdaan
- SaveDemocracy
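A conceptual sketch of what the pull_tweets.py step does, assuming Tweepy is used to search Twitter for a hashtag and the trained model classifies each tweet; the credentials, API calls, and variable names here are placeholders for illustration only and may not match the actual script:

```python
# Illustration only: pull tweets for a hashtag and classify them with a trained model.
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def classify_hashtag(model, hashtag, limit=100):
    tweets = tweepy.Cursor(api.search_tweets, q=hashtag, lang="en").items(limit)
    return [(t.text, model.predict([t.text])[0]) for t in tweets]

# results = classify_hashtag(model, "#SaveDemocracy")
```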
Results are shown below
I am open to pull requests for further modifications to this project:
- to use other sets of features and classifiers to improve accuracy
- to use emojis as a feature for sentiment analysis and check how this affects the accuracy of the classifier