Homework 2: Text Classification with Naive Bayes and Logistic Regression

Description

This homework introduces scikit-learn, a Python library widely used for NLP and machine learning tasks. Specifically, you will learn how to use scikit-learn to carry out feature engineering and supervised learning for sentiment classification of movie reviews.

Download and unzip the training and test corpora available on the class webpage. Each dataset consists of plain-text files grouped into two folders: pos and neg. Every file in the pos folder carries a positive sentiment label, and every file in the neg folder carries a negative sentiment label.
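The folder layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not the code in hw2.py: the function name load_reviews, the 1/0 label encoding, and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path


def load_reviews(root):
    """Read every file under root/pos and root/neg.

    Returns (texts, labels), where label 1 means positive and 0 means
    negative. The 1/0 encoding is an assumption of this sketch.
    """
    texts, labels = [], []
    for label, folder in ((1, "pos"), (0, "neg")):
        # sorted() keeps the file order reproducible across runs.
        for path in sorted((Path(root) / folder).iterdir()):
            if path.is_file():
                # UTF-8 is assumed here; adjust if the corpus differs.
                texts.append(path.read_text(encoding="utf-8"))
                labels.append(label)
    return texts, labels
```

Calling load_reviews on the training folder and again on the test folder gives the two (texts, labels) pairs that the later steps consume.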

Use the CountVectorizer and TfidfVectorizer classes provided by scikit-learn to obtain bag-of-words and tf-idf representations of the raw text, respectively.
With the feature representation as input, train the Naive Bayes and Logistic Regression classifiers to carry out text classification.
Evaluate your classifiers on the test set by reporting accuracy, precision, recall and F-score. Additionally, carry out these experiments:
Compare the effect of the bag-of-words and tf-idf representations on model performance.
Look into how stop words can be removed, and observe the effect of removing them on model performance.
Compare L1 regularization, L2 regularization and no regularization with Logistic Regression, and observe the effect on model performance.
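The steps above can be sketched end to end as follows. This is an illustrative outline rather than the contents of hw2.py: the toy sentences are made up to keep the example self-contained, the helper name evaluate is hypothetical, and MultinomialNB is assumed as the Naive Bayes variant because it is the usual choice for word-count features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the real corpora (made up for illustration).
train_texts = [
    "a great and moving film",
    "great acting, loved it",
    "a dull and boring film",
    "boring plot, hated it",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (assumed encoding)
test_texts = ["loved this great film", "hated this boring film"]
test_labels = [1, 0]


def evaluate(vectorizer, classifier):
    """Fit on the training texts and score on the test texts."""
    # The vocabulary (and idf weights) are learned from training data only.
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    classifier.fit(X_train, train_labels)
    predicted = classifier.predict(X_test)
    accuracy = accuracy_score(test_labels, predicted)
    precision, recall, f_score, _ = precision_recall_fscore_support(
        test_labels, predicted, average="binary"
    )
    return accuracy, precision, recall, f_score


results = {}
for rep_name, vectorizer_cls in (("bow", CountVectorizer), ("tfidf", TfidfVectorizer)):
    for clf_name, classifier_cls in (("nbayes", MultinomialNB), ("regression", LogisticRegression)):
        # Pass stop_words="english" to the vectorizer to drop stop words.
        scores = evaluate(vectorizer_cls(), classifier_cls())
        results[(rep_name, clf_name)] = scores
        print(rep_name, clf_name, scores)
```

Swapping vectorizer_cls() for vectorizer_cls(stop_words="english") is one way to run the stop-word experiment; scikit-learn's built-in English list is only one possible stop-word source.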

Instructions

Extract the aclImdb_v1.tar.gz archive.

Install dependencies:

Linux or macOS

pip3 install -r requirements.txt

Windows

pip install -r requirements.txt

To run the program, enter the following at the command line:

Linux or macOS

python3 hw2.py <path-to-train-set> <path-to-test-set> <representation> <classifier> <stop-words> <regularization>

Windows

python hw2.py <path-to-train-set> <path-to-test-set> <representation> <classifier> <stop-words> <regularization>

Valid arguments:

representation ∈ {bow, tfidf}
classifier ∈ {nbayes, regression}
stop-words ∈ {0, 1}
regularization ∈ {no, l1, l2}
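One way the arguments listed above could map onto scikit-learn objects is sketched below. The actual logic in hw2.py is not shown in this README, so the function names build_parser and build_model are hypothetical, as are three assumptions: that stop-words = 1 means "remove stop words", that scikit-learn's built-in English list is used, and that "no" regularization is approximated by a very large C (which keeps the sketch independent of how a given scikit-learn version spells an unpenalized model).

```python
import argparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


def build_parser():
    """Positional arguments in the order shown in the run command."""
    parser = argparse.ArgumentParser(prog="hw2.py")
    parser.add_argument("train_path")
    parser.add_argument("test_path")
    parser.add_argument("representation", choices=["bow", "tfidf"])
    parser.add_argument("classifier", choices=["nbayes", "regression"])
    parser.add_argument("stop_words", type=int, choices=[0, 1])
    parser.add_argument("regularization", choices=["no", "l1", "l2"])
    return parser


def build_model(args):
    """Return a (vectorizer, classifier) pair for the parsed arguments."""
    # Assumption: 1 removes stop words, 0 keeps them.
    stop_words = "english" if args.stop_words == 1 else None
    vectorizer_cls = CountVectorizer if args.representation == "bow" else TfidfVectorizer
    vectorizer = vectorizer_cls(stop_words=stop_words)

    if args.classifier == "nbayes":
        # The regularization argument has no effect on Naive Bayes here.
        return vectorizer, MultinomialNB()

    if args.regularization == "l1":
        # The liblinear solver supports the L1 penalty.
        classifier = LogisticRegression(penalty="l1", solver="liblinear")
    elif args.regularization == "l2":
        classifier = LogisticRegression(penalty="l2", solver="liblinear")
    else:
        # Approximation: a very large C makes the L2 penalty negligible.
        classifier = LogisticRegression(penalty="l2", C=1e12, solver="liblinear")
    return vectorizer, classifier
```

A script would then call build_model(build_parser().parse_args()) and pass the resulting pair to its training and evaluation code.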

NOTE: Python version >=3.6.1 is recommended.
