TextClassification

Simple practice for text classification using Python

File	Content
Rocchio.py	Text classification using Rocchio's algorithm. Each document is represented in a vector space. In the training phase, centroids for each class of documents are found. In the testing phase, test document's distance to each centroid is calculated and document is assigned to the closest centroid's class.
NaiveBayes.py	Text classification using Naive Bayes' algorithm. Each document in represented in a vector space. In the training phase, class priors and class conditional probabilities for every term of dictionary are learnt. In the testing phase, document is assigned to the class having the maximum posterior probability given the test document.
sklearn-text classification.ipynb	This is an IPython notebook showing a complete, though simple, text classification pipeline using scikits-learn machine learning library. Pipeline starts with text cleaning and tokenization, then each document is projected into a vector space. Tfidf weighting is used to nornalize vectors. Then some classifiers are tested; using their default parameters. Finally using 10-fold cross validation over a brute force parameter grid search, best paramters for some of the classifiers are found and classification is performed using the newly-found parameters.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
20_newsgroup		20_newsgroup
.gitignore		.gitignore
LICENSE		LICENSE
NaiveBayes.py		NaiveBayes.py
README.md		README.md
Rocchio.py		Rocchio.py
sklearn-text classification.ipynb		sklearn-text classification.ipynb
util.py		util.py

Provide feedback