A novel stemmer for the Ukrainian language trained with AI

amakukha/stemmers_ukrainian

Stemmers for Ukrainian

This repository introduces a new stemmer for the Ukrainian language (tree_stem) created via machine learning. Measured by the error rate relative to truncation (ERRT; Paice 1994), it outperforms all other stemmers available to date, as well as some lemmatizers. It also has the lowest percentage of understemming errors among the available stemming algorithms.

The proposed algorithm uses no dictionary lookups and remains reasonably small (48 KB of Python bytecode). It runs 24 times faster than the lemmatization approach and is also faster than the other stemming algorithms.

In addition to the new algorithm, this repository also contains Python ports of some of the previously published stemmers.

Comparison of stemmers for the Ukrainian language

| Stemmer | Languages | UI | OI | ERRT |
|---|---|---|---|---|
| Dictionary-based (reference) | | 0.0172 | 3.59e-06 | 0.0244 |
| tree_stem | Python | 0.0907 | 2.71e-06 | 0.125 |
| pymorphy2 (Paper) | Python | 0.324 | 2.01e-07 | 0.391 |
| stemka | C++ | 0.329 | 2.34e-06 | 0.447 |
| tapkomet | Snowball, C, Java | 0.447 | 2.73e-06 | 0.603 |
| vgrichina | Groovy, Python | 0.497 | 1.05e-06 | 0.651 |
| drupal | JS, Python | 0.511 | 7.54e-07 | 0.666 |
| tochytskyi (Paper) | PHP, Python | 0.623 | 5.72e-07 | 0.795 |
| No stemming | | 1.00 | 1.69e-08 | |

where:

  • UI – understemming index
  • OI – overstemming index
  • ERRT – error rate relative to truncation
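
To make the three metrics concrete, here is a minimal sketch of how they can be computed, following the pair-counting definitions in Paice (1994) as commonly stated. It is not the evaluation code used in this repository. The function names `paice_indices` and `errt`, the toy word groups, and the two-point truncation line are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI) for a stemmer over a list of concept groups.

    groups: list of lists; each inner list holds word forms that
            should share one stem (hypothetical input format).
    stem:   callable mapping a word to its stem.
    """
    total_words = sum(len(g) for g in groups)
    desired_merges = desired_nonmerges = unachieved_merges = 0.0
    by_stem = defaultdict(Counter)  # stem -> {group index: word count}

    for gi, words in enumerate(groups):
        n = len(words)
        # Pairs inside the group should merge; pairs crossing groups should not.
        desired_merges += 0.5 * n * (n - 1)
        desired_nonmerges += 0.5 * n * (total_words - n)
        counts = Counter(stem(w) for w in words)
        # Pairs inside the group that ended up with different stems.
        unachieved_merges += 0.5 * sum(u * (n - u) for u in counts.values())
        for s, u in counts.items():
            by_stem[s][gi] += u

    wrong_merges = 0.0
    for counts in by_stem.values():
        n_s = sum(counts.values())
        # Pairs sharing a stem although they come from different groups.
        wrong_merges += 0.5 * sum(v * (n_s - v) for v in counts.values())

    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_nonmerges if desired_nonmerges else 0.0
    return ui, oi


def errt(point, truncation_line):
    """ERRT = |OP| / |OT|, where T is the point at which the ray from the
    origin through P = (UI, OI) crosses the truncation line.

    truncation_line: (UI, OI) points measured for plain truncation
    stemmers, ordered along the line (assumed to be supplied by the caller).
    """
    px, py = point
    for (ax, ay), (bx, by) in zip(truncation_line, truncation_line[1:]):
        dx, dy = bx - ax, by - ay
        denom = px * dy - py * dx
        if denom == 0:
            continue  # segment is parallel to the ray
        s = -(px * ay - py * ax) / denom
        if 0.0 <= s <= 1.0:
            tx, ty = ax + s * dx, ay + s * dy
            t = (tx * px + ty * py) / (px * px + py * py)
            if t > 0:
                return 1.0 / t
    raise ValueError("ray through the point does not cross the truncation line")


# Toy example: two concept groups, stemmed by truncation to two letters.
toy_groups = [["кіт", "кота", "коти"], ["код", "коду"]]
toy_ui, toy_oi = paice_indices(toy_groups, lambda w: w[:2])
# toy_ui == 0.5 (2 of 4 desired merges missed)
# toy_oi == 4/6 (4 of 6 cross-group pairs wrongly merged)
```

A lower ERRT means the stemmer's (UI, OI) point lies closer to the origin than the plain truncation baseline along the same direction.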

Notes:

  • pymorphy2 is a dictionary-assisted lemmatizer and morphological analyzer, included in this comparison for reference. The most probable normal form is used as the stem.
  • Training and testing were performed on a dictionary of word forms.
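
The first note can be illustrated with a short sketch of treating a lemmatizer's top-ranked normal form as a stem. The helper name `lemma_as_stem` is hypothetical and this is not necessarily how the comparison was run. It only assumes an analyzer object whose `parse(word)` method returns candidates ordered by decreasing score, each with a `normal_form` attribute, as pymorphy2's `MorphAnalyzer` does.

```python
def lemma_as_stem(word, analyzer):
    """Return the most probable normal form of `word`, used in place of a stem.

    `analyzer` is any object with a pymorphy2-style `parse` method.
    Falls back to the word itself if the analyzer returns no parses.
    """
    parses = analyzer.parse(word)
    return parses[0].normal_form if parses else word


# Usage with pymorphy2 (requires the Ukrainian dictionaries to be installed):
#
#   import pymorphy2
#   morph = pymorphy2.MorphAnalyzer(lang='uk')
#   lemma_as_stem('котами', morph)
```

Passing the analyzer in as an argument keeps the expensive `MorphAnalyzer` construction out of the per-word call.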

References