ngram model

This project builds a trigram language model and then applies it to a text classification task: scoring ETS TOEFL essays. The probability distributions are not pre-computed. Instead, the model stores the raw counts of ngram occurrences and computes probabilities on demand. The model is evaluated by its perplexity over an entire corpus. The project is composed of the following parts:

extracting ngrams from a sentence
counting ngrams in a corpus
raw ngram probabilities
smoothed probabilities
sentence probability
perplexity
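
The parts listed above can be sketched end to end as follows. This is a minimal illustration and not the code in ngram_model.py: the names `get_ngrams` and `TrigramModel`, the `START`/`STOP` boundary markers, the back-off for unseen contexts, and the equal interpolation weights are all assumptions made for the example. Handling of out-of-vocabulary words is omitted, so a sentence containing an unseen word gets log probability of negative infinity.

```python
import math
from collections import Counter

START, STOP = "START", "STOP"  # hypothetical sentence boundary markers


def get_ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, padded with boundary markers."""
    padded = [START] * max(1, n - 1) + list(tokens) + [STOP]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class TrigramModel:
    def __init__(self, corpus):
        # Only raw counts are stored; every probability is derived on demand.
        self.unigrams, self.bigrams, self.trigrams = Counter(), Counter(), Counter()
        for sentence in corpus:
            self.unigrams.update(get_ngrams(sentence, 1))
            self.bigrams.update(get_ngrams(sentence, 2))
            self.trigrams.update(get_ngrams(sentence, 3))
        # Token total excludes START, which is never predicted.
        self.total = sum(c for g, c in self.unigrams.items() if g != (START,))

    def raw_unigram(self, g):
        return self.unigrams[g] / self.total if self.total else 0.0

    def raw_bigram(self, g):
        context = self.unigrams[g[:1]]
        # Assumption: back off to the unigram estimate for an unseen context.
        return self.bigrams[g] / context if context else self.raw_unigram(g[1:])

    def raw_trigram(self, g):
        # (START, START) never appears as a bigram, so use the sentence count.
        if g[:2] == (START, START):
            context = self.unigrams[(START,)]
        else:
            context = self.bigrams[g[:2]]
        # Assumption: back off to the bigram estimate for an unseen context.
        return self.trigrams[g] / context if context else self.raw_bigram(g[1:])

    def smoothed_trigram(self, g, lambdas=(1 / 3, 1 / 3, 1 / 3)):
        # Linear interpolation; the equal weights are an illustrative choice.
        l3, l2, l1 = lambdas
        return (l3 * self.raw_trigram(g)
                + l2 * self.raw_bigram(g[1:])
                + l1 * self.raw_unigram(g[2:]))

    def sentence_logprob(self, sentence):
        total = 0.0
        for g in get_ngrams(sentence, 3):
            p = self.smoothed_trigram(g)
            if p == 0.0:
                return float("-inf")  # unseen word; no UNK handling in this sketch
            total += math.log2(p)
        return total

    def perplexity(self, corpus):
        log_sum = sum(self.sentence_logprob(s) for s in corpus)
        m = sum(len(s) + 1 for s in corpus)  # word tokens plus one STOP each
        return 2 ** (-log_sum / m)


if __name__ == "__main__":
    toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    model = TrigramModel(toy_corpus)
    print("Toy training perplexity:", model.perplexity(toy_corpus))
```

Working in log space avoids underflow when many small probabilities are multiplied, and keeping only counts means the interpolation weights can be changed without retraining.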

Usage

python ngram_model.py data/brown_train.txt data/brown_test.txt

Output

Training perplexity:  18.02694455227229
Testing perplexity:  300.17653468276933
Essay Scoring Accuracy:  0.848605577689243
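
This README does not describe how the essay scoring accuracy is computed. One plausible setup, assumed here purely for illustration, trains one language model per essay class and assigns each essay to the class whose model gives it the lowest perplexity. The names `classify` and `accuracy`, the labels, and the toy perplexity functions below are hypothetical stand-ins, not part of ngram_model.py.

```python
def classify(essay, perplexity_by_label):
    """Return the label whose model assigns the essay the lowest perplexity."""
    return min(perplexity_by_label,
               key=lambda label: perplexity_by_label[label](essay))


def accuracy(labeled_essays, perplexity_by_label):
    """Fraction of (essay, gold_label) pairs that are classified correctly."""
    correct = sum(classify(essay, perplexity_by_label) == gold
                  for essay, gold in labeled_essays)
    return correct / len(labeled_essays)


# Toy stand-ins for trained models: each maps a tokenized essay to a perplexity.
toy_models = {
    "high": lambda essay: 5.0 if "moreover" in essay else 50.0,
    "low": lambda essay: 20.0,
}

toy_data = [
    (["moreover", "it", "follows"], "high"),
    (["it", "good"], "low"),
    (["it", "follows"], "high"),  # misclassified by the toy models
]

if __name__ == "__main__":
    print("Toy accuracy:", accuracy(toy_data, toy_models))
```

In a real run, each entry of the dictionary would be the perplexity method of a model trained on essays of that class.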

Data

The ETS data set is proprietary and licensed to Columbia University for research and educational use only.

Repository: yyteng-hci/nlpngram