NLTK Slovenian POS tagger

This project uses the IJS JOS-1M corpus to train a part-of-speech (POS) tagger for the Slovenian language.

Quick usage

The POS tagger is available on PyPI with a prebuilt dictionary. Installation:

pip install slopos

Usage:

import slopos

slopos.tag("Jaz sem iz okolice Ljubljane")

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
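As a small illustration of working with the returned list of (word, tag) pairs, the sketch below groups words by the first character of their tag. It assumes, as the sample output suggests, that the leading character of a tag identifies the word class; the tag reference files remain the authoritative description. The helper name `group_by_tag_prefix` is hypothetical and not part of slopos, and the input is the sample output copied from above rather than a live call to the library.

```python
from collections import defaultdict


def group_by_tag_prefix(tagged):
    """Group words by the first character of their tag (assumed to be the word class)."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag[:1]].append(word)
    return dict(groups)


# Sample output copied from the usage example; not produced by calling slopos here.
tagged = [('Jaz', 'ZOP-EI'), ('sem', 'GP-SPE-N'), ('iz', 'DR'),
          ('okolice', 'SOZER'), ('Ljubljane', 'SLZER.')]

groups = group_by_tag_prefix(tagged)
print(groups)
```

In real use, `tagged` would be the return value of `slopos.tag(...)`.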

Prepared files

The corpus was processed in several steps to prepare it for NLTK consumption. The intermediate files are included in this repository.

Original corpus

The original corpus is split across multiple XML files, which are stored in the xml directory.

Partial text files

The XML files have been converted into an NLTK-readable word/tag format using the convert_xml_to_txt.py script. The processed files are stored in the txt directory.

NLTK tagged corpus

The files from the txt directory have been combined into a single file, stored in the data/tagged_corpus directory, for nltk-trainer consumption.
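To make the word/tag format concrete, here is a minimal sketch of how one tagged sentence could be written as a line of text and read back. It assumes `/` as the separator, which is the default for NLTK's TaggedCorpusReader; the exact output of convert_xml_to_txt.py may differ in details. The function names `format_sentence` and `parse_sentence` are hypothetical and do not come from this repository.

```python
def format_sentence(tagged_sentence, sep="/"):
    """Serialize [(word, tag), ...] as 'word/tag word/tag ...'."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in tagged_sentence)


def parse_sentence(line, sep="/"):
    """Parse a 'word/tag ...' line back into [(word, tag), ...].

    Each token is split on its last separator, so a word that itself
    contains the separator still keeps its tag.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


sentence = [("iz", "DR"), ("okolice", "SOZER")]
line = format_sentence(sentence)
print(line)
print(parse_sentence(line))
```

Splitting on the last separator rather than the first is a deliberate choice here, since tags in this corpus do not contain `/` while words occasionally might.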

Training the POS tagger

The POS tagger is trained using the nltk-trainer project, which is included as a git submodule in this repository.

Install dependencies

virtualenv .
source bin/activate
pip install -r requirements.txt
pip install numpy
python nltk-trainer/setup.py install

Convert input files

python convert_xml_to_txt.py

Train

From the top-level project directory, run the trainer:

python nltk-trainer/train_tagger.py data/tagged_corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --filename slopos/sl-tagger.pickle

Training takes a short while, and you should see output similar to:

loading data/tagged_corpus
15758 tagged sents, training on 15758
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=11492>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=109127>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=130795>
evaluating TrigramTagger
accuracy: 0.930942
creating directory out
dumping TrigramTagger to out/sl-tagger.pickle

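The trained tagger is written to the path given by --filename. The sample output above comes from a run that wrote to out/sl-tagger.pickle; with the command shown, the file is saved as slopos/sl-tagger.pickle instead.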
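The training log shows a chain of taggers in which each one falls back to the previous one when it has no answer. The toy sketch below illustrates that backoff idea in plain Python for two levels only: a whole-word lookup backed by a three-character suffix lookup. It is not NLTK's implementation, it omits the bigram and trigram context levels, and the class and function names (`ToyTagger`, `ToySuffixTagger`, `most_frequent`) as well as the tiny training list are made up for illustration.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Look up a tag in a table; defer to the backoff tagger when there is no entry."""

    def __init__(self, table, backoff=None):
        self.table = table
        self.backoff = backoff

    def key(self, word):
        # The whole word is the lookup key at this level.
        return word

    def tag_word(self, word):
        tag = self.table.get(self.key(word))
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word)
        return tag


class ToySuffixTagger(ToyTagger):
    def key(self, word):
        # Only the last three characters are used as the lookup key.
        return word[-3:]


def most_frequent(pairs):
    """Map each key to the tag seen most often with it."""
    counts = defaultdict(Counter)
    for key, tag in pairs:
        counts[key][tag] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Hypothetical miniature training set.
train = [("okolice", "SOZER"), ("iz", "DR"), ("Ljubljane", "SLZER")]

suffix = ToySuffixTagger(most_frequent((w[-3:], t) for w, t in train))
unigram = ToyTagger(most_frequent(train), backoff=suffix)

print(unigram.tag_word("iz"))     # known word, resolved by the word table
print(unigram.tag_word("ulice"))  # unseen word, resolved by the suffix "ice"
print(unigram.tag_word("xyz"))    # no match at any level
```

In the real chain, the bigram and trigram taggers additionally condition on the tags of the preceding words before backing off.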

Using the POS tagger

After training, the POS tagger is stored as a Python pickle file. Loading it requires NLTK to be installed.

Usage:

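import pickle

# Load the trained tagger; adjust the path to wherever --filename pointed during training.
with open('out/sl-tagger.pickle', 'rb') as f:
    sl_tagger = pickle.load(f)

sl_tagger.tag(["Jaz", "sem", "iz", "okolice", "Ljubljane"])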

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

Note that punctuation should be stripped from words before tagging so that they are recognized correctly. The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
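One simple way to strip punctuation before tagging is sketched below. It assumes whitespace tokenization is good enough and only removes leading and trailing ASCII punctuation, so typographic quotes such as » and « would need to be added separately. The helper name `prepare_tokens` is hypothetical and not part of this project.

```python
import string


def prepare_tokens(text):
    """Split on whitespace and strip leading/trailing ASCII punctuation from each token."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in stripped if tok]


tokens = prepare_tokens("Jaz sem iz okolice Ljubljane.")
print(tokens)
```

The resulting list can be passed directly to `sl_tagger.tag(tokens)`.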
