1. INTRODUCTION

In Sentiment Analysis, there are two approaches.

Machine Learning Algorithm Based Sentiment Analysis
Dictionary(Rules) Based Sentiment Analysis

2. IMPLEMENTATION

In this project, I implemented the second approach which is Dictionary(Rules) Based Sentiment Analysis. This

project is constructed on the combination of three sub projects.

Constructing Turkish Dictionary(Rules)
Creating Your Own Sklearn Classifier
Implementing Turkish Dictionary(Rules) Based Sentiment Analyzing

2.1. CONSTRUCTING TURKISH DICTIONARY(RULES)

At first, I examined English Dictionary Based Sentiment Analyzing tools. Then, I tried to create Dictionary of

Turkish Rules. Because of inadequate Turkish resources, I examined the English resources that have words with their

polarities. Finally, I found the SentiWordNet dataset. In this dataset, there are lots of words with their polarities. After

that, I started to process the dataset. At first, I smoothed the data and then translated the words into Turkish equivalents

using Google's Translation tool. Finally, I had sample rules (words with their polarities). For more

rules, please contact with me After setting rules, I stepped into the second phase of my project.

2.2. CREATING YOUR OWN SKLEARN CLASSIFIER

In this part, I searched for Sklearn Classifiers in order to understand the concept of them and create my own

Sklearn Classifier.

2.2.1. BUILDING AN OBJECT

I created an Sklearn Text Classifier object by inheriting from Sklearn BaseEstimator and ClassifierMixin

classes. After that, I created some functions to implement my own Sklearn Classifier.

2.2.2. ABIDING SKLEARN RULES

All arguments of init() must have default value. It helps us to initialize the Sklearn Classifier just by typing

DictionaryBasedSentimentAnalyzer()

2.2.3. FIT AND PREDICT METHODS

Every Sklearn Classsifier requires fit() and predict() method for classification

2.2.4. EXPLANATION OF SKLEARN CLASSIFIER'S METHODS

init()

Initially read the rules from the dictionary

fit()

Method for fitting the data

_generate_ngrams()

Method that generates unigrams and bigrams of the sentences

_data_prepare()

Method that preprocesses the unlabeled data

predictor()

Method that calculates the polarity of the data

predict()

Method for analyzing the sentiment of the data

score()

Method for calculating the accuracy score of predicted data

predict_proba()

Method for calculating the prediction probabilities of predicted data

Eventually, our new Sklearn Classifier is built. Then, I stepped into the last phase of my project.

2.3. IMPLEMENTING TURKISH DICTIONARY(RULES) BASED SENTIMENT ANALYZING

2.3.1. DICTIONARY(RULES) BASED SENTIMENT ANALYZING STEPS

2.3.1.1. DATA FETCHING

The first rule is to get adequate dataset to train your model efficiently. Here, I have sample movie

critics from Beyazperde. You can find it from this link.

2.3.1.2. DATA PREPROCESSING

This step is the crucial step for any kind of Machine Learning model training. Real life data is

not always clean. So, you must process your dataset as possible as. In Machine Learning, there

is a ratio that is, data preprocessing/cleaning is 80% and modelling is 20% of overall work. So, I

also splitted data preprocessing into sub steps.

2.3.1.2.1. LOAD DATASET

Dataset is in the form of csv file. For more dataset, please contact with me

2.3.1.2.2. ELIMINATE NAN VALUES

Nan values is not useful for training model

2.3.1.2.3. ELIMINATE PUNCTUATIONS

Punctuations are unnecessary for training model

2.3.1.2.4. NORMALIZATION

This corrects the miswritten words and throws meaningless words away

2.3.1.2.5. STEMMING/LEMMATIZATION

This removes the suffixes and gives us the root of each word

2.3.1.3. DATA CLASSIFICATION

In this step, I use my own Sklearn Classifier which is DictionaryBasedSentimentAnalyzer for Turkish

Dictionary Based Sentiment Analyzing. This classifier checks ngrams(unigram and bigram) of the sentences and

captures whether it matches our Sentiment Rules or not. If it matches, then it returns the polarity value

of that token. I also, splitted data classification into some sub steps.

2.3.1.3.1. FIT AND PREDICT

The model learns rules by fitting and analyzes the sentiment of the data by predicting

2.3.1.3.2. OBSERVING ACCURACY, F1, PRECISION AND RECALL SCORES

This scores are useful for comparing model’s success

2.3.1.3.3. OBSERVING CONFUSION MATRIX AND PREDICTION PROBABILITIES

This gives us an intuition of how confidently the model makes the predictions

2.3.1.3.4. TEN-FOLD CROSS VALIDATION

This shows us whether our model performs correctly or not

2.3.1.4. MODEL PIPELINING AND PICKLING

In this step, I create a pipeline for the model. Pipelining prevents us from repeating all steps again

and again. With the help of pipelining, when I give any raw unlabeled data, at first, the model preprocess

it and then, makes prediction. So, it makes our model reusable.

Pickling a model means transforming it into binary form. It makes our model portable. When you want to

use the model in different projects, by just loading this pickled file, you can use the model and get

predictions wherever you want.

4. IMPROVEMENTS

The model score can be improved by increasing the number of "Turkish Dictionary(Rules)".

5. CONCLUSION

As I mentioned before, this project constructed on three sub projects which are different branches of Sentiment

Analyzing. In the first step, I learned how to prepare Dictionary(Rules) for Sentiment Analysis. And, this

Dictionary(Rules) will be a good basis for Turkish Dictionary(Rules) Based Sentiment Analyzing field. I'm glad

to prepare this Dictionary(Rules). In the second step, I learned how to create a Text Classifier

for Sentiment Analysis. This gained me a good understanding of Sklearn Classifiers and helped me to create my

own Classifier. Creation of this kind of Classifier is a good chance for me to dive deep into the Machine Learning

Algorithms and their working principles. In the last step, I learned to apply Dictionary(Rules) Based Sentiment

Analyzing/Text Classification. It also provided me very useful knowledge about Natural Language Processing.

Since, fetching and preprocessing the dataset is the crucial part of any Machine Learning model training. In

conclusion, this project gained me lots of NLP concepts that is very crucial for a Data Scientist. I hope this

study will be useful for everyone.

6. RESOURCES/THANKS

I completed this project in cooperation with Verius Technology Company.The training dataset (Beyazperde)

and data preprocessing tools (normalization, stemming) are provided me by them. I also used python libraries

such as: Sklearn, Pandas, Numpy, Nltk.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
classsifier		classsifier
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Turkish_Dictionary(Rules)_Based_Sentiment_Analysis.ipynb		Turkish_Dictionary(Rules)_Based_Sentiment_Analysis.ipynb
sample_beyazperde_dataset.csv		sample_beyazperde_dataset.csv
stopwords.txt		stopwords.txt

License

slmttndrk/Turkish_Dictionary-Rules-_Based_Sentiment_Analysis

Folders and files

Latest commit

History

Repository files navigation