This project uses the IJS JOS-1M corpus to train a part-of-speech (POS) tagger for the Slovenian language.
The POS tagger is available on PyPI with a prebuilt dictionary. Installation:
pip install slopos
Usage:
import slopos
slopos.tag("Jaz sem iz okolice Ljubljane")
> [('Jaz', 'ZOP-EI'),
('sem', 'GP-SPE-N'),
('iz', 'DR'),
('okolice', 'SOZER'),
('Ljubljane', 'SLZER.')]
The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
The corpus was processed in several steps to prepare it for use with NLTK. The intermediate files are part of this repository.
- Original corpus
  The original corpus is stored in multiple split XML files, kept here in the xml directory.
- Partial text files
  The XML files have been converted into an NLTK-readable word/tag format using the convert_xml_to_txt.py script. The processed files are stored in the txt directory.
- NLTK tagged corpus
  Files from the txt directory have been combined into a single file and stored in the data/tagged_corpus directory for nltk-trainer consumption.
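To illustrate the word/tag format mentioned above, here is a minimal sketch of converting between (word, tag) pairs and a single text line. It assumes the common NLTK convention of one sentence per line with "/" as the separator; the function names are hypothetical and this is not the code from convert_xml_to_txt.py, which may work differently.

```python
def to_word_tag_line(tagged_sentence, sep="/"):
    """Join (word, tag) pairs into one 'word/tag word/tag ...' line."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in tagged_sentence)


def from_word_tag_line(line, sep="/"):
    """Split a 'word/tag ...' line back into (word, tag) pairs.

    rpartition splits on the LAST separator, so a word that itself
    contains the separator keeps its text intact.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


# Sample pairs taken from the usage example in this README.
sentence = [("Jaz", "ZOP-EI"), ("sem", "GP-SPE-N"), ("iz", "DR")]
line = to_word_tag_line(sentence)
print(line)  # Jaz/ZOP-EI sem/GP-SPE-N iz/DR
```

Splitting on the last separator is a deliberate choice here: tags in this tag set contain hyphens but no slashes, so the final "/" in a token can safely be treated as the word/tag boundary.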
The POS tagger is trained using the nltk-trainer project, which is included as a submodule in this repository.
virtualenv .
pip install -r requirements
pip install numpy
python nltk-trainer/setup.py install
python convert_xml_to_txt.py
From the top-level project directory, run the trainer:
python nltk-trainer/train_tagger.py data/tagged_corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --filename slopos/sl-tagger.pickle
Training takes a short while, and you should see output similar to:
loading data/tagged_corpus
15758 tagged sents, training on 15758
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=11492>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=109127>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=130795>
evaluating TrigramTagger
accuracy: 0.930942
creating directory out
dumping TrigramTagger to out/sl-tagger.pickle
The trained tagger is written to the out directory as sl-tagger.pickle.
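The training log above shows a chain of taggers in which each one falls back to the previous one when it has no answer. The following is a simplified, pure-Python sketch of that backoff idea, covering only a suffix lookup and a whole-word lookup. The class name LookupTagger and the helper suffix3 are hypothetical, the length rule for suffixes is an assumption, and the bigram and trigram stages (which also look at preceding tags) are omitted. It is not how NLTK or nltk-trainer implement their taggers.

```python
from collections import Counter, defaultdict


class LookupTagger:
    """Tag a word by looking up a key derived from it; defer to `backoff` on a miss."""

    def __init__(self, tagged_sents, key, backoff=None):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                k = key(word)
                if k is not None:
                    counts[k][tag] += 1
        self.key = key
        self.backoff = backoff
        # Keep only the most frequent tag seen for each key.
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}

    def tag_word(self, word):
        tag = self.table.get(self.key(word))
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word)
        return tag

    def tag(self, words):
        return [(w, self.tag_word(w)) for w in words]


def suffix3(word):
    """Last three characters, or None for words too short to have a stem."""
    return word[-3:] if len(word) > 3 else None


# Toy training data taken from the usage example in this README.
train = [[("iz", "DR"), ("okolice", "SOZER")]]
affix = LookupTagger(train, suffix3)
unigram = LookupTagger(train, lambda w: w, backoff=affix)

result = unigram.tag(["iz", "ulice", "Jaz"])
print(result)  # [('iz', 'DR'), ('ulice', 'SOZER'), ('Jaz', None)]
```

In this toy run, "iz" is found directly, "ulice" is unseen and falls back to the suffix "ice" learned from "okolice", and "Jaz" matches nothing, so it gets None. The guessed tag only shows the mechanism and says nothing about real tagging accuracy.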
The trained POS tagger is stored as a Python pickle file. NLTK must be installed to load and use it.
Usage:
import pickle
with open('out/sl-tagger.pickle', 'rb') as f:
    sl_tagger = pickle.load(f)
sl_tagger.tag(["Jaz", "sem", "iz", "okolice", "Ljubljane"])
> [('Jaz', 'ZOP-EI'),
('sem', 'GP-SPE-N'),
('iz', 'DR'),
('okolice', 'SOZER'),
('Ljubljane', 'SLZER.')]
Note that punctuation should be stripped from words before tagging for proper detection. The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
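One possible way to prepare a sentence for the pickled tagger is sketched below. It splits on whitespace and strips leading and trailing punctuation from each token. The function name tokenize_for_tagging is hypothetical, and string.punctuation covers ASCII punctuation only, so other characters (for example typographic quotes) would need to be added separately.

```python
import string


def tokenize_for_tagging(sentence):
    """Split on whitespace and strip leading/trailing ASCII punctuation."""
    words = (token.strip(string.punctuation) for token in sentence.split())
    # Drop tokens that consisted of punctuation only.
    return [w for w in words if w]


tokens = tokenize_for_tagging("Jaz sem iz okolice Ljubljane.")
print(tokens)  # ['Jaz', 'sem', 'iz', 'okolice', 'Ljubljane']
```

The resulting list can then be passed to sl_tagger.tag(...) as in the usage example above. Punctuation inside a word, such as a hyphen, is left untouched.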