This repository introduces a new stemmer for the Ukrainian language (tree_stem), created via machine learning. Measured by the error rate relative to truncation (ERRT; Paice 1994), it outperforms all other stemmers available to date, as well as some lemmatizers. It also has the lowest percentage of understemming errors among the available stemming algorithms.
The proposed algorithm does not use dictionary lookups and remains reasonably small (48 KB of Python bytecode). It is about 24 times faster than the lemmatization approach and is also faster than the other stemming algorithms.
In addition to the new algorithm, this repository also contains Python ports of some of the previously published stemmers.
Stemmer | Implementation languages | UI | OI | ERRT |
---|---|---|---|---|
Dictionary-based (reference) | – | 0.0172 | 3.59e-06 | 0.0244 |
tree_stem | Python | 0.0907 | 2.71e-06 | 0.125 |
pymorphy2 (Paper) | Python | 0.324 | 2.01e-07 | 0.391 |
stemka | C++ | 0.329 | 2.34e-06 | 0.447 |
tapkomet | Snowball, C, Java | 0.447 | 2.73e-06 | 0.603 |
vgrichina | Groovy, Python | 0.497 | 1.05e-06 | 0.651 |
drupal | JS, Python | 0.511 | 7.54e-07 | 0.666 |
tochytskyi (Paper) | PHP, Python | 0.623 | 5.72e-07 | 0.795 |
No stemming | – | 1.00 | 1.69e-08 | – |
where:
- UI – understemming index
- OI – overstemming index
- ERRT – error rate relative to truncation
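The three metrics can be sketched in Python as follows. This is a minimal illustration based on the definitions in Paice (1994), not the evaluation code used to produce the table: the function names `paice_indices` and `errt`, the toy word groups, and the 4-character truncation stemmer are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI) for `stem` over groups of related word forms.

    groups: list of lists, each holding the word forms of one lemma.
    stem:   callable mapping a word form to its stem.
    """
    total_words = sum(len(g) for g in groups)
    dmt = dnt = umt = 0.0            # desired merge / non-merge, unachieved merge
    by_stem = defaultdict(Counter)   # stem -> {group index: number of forms}

    for gi, group in enumerate(groups):
        n = len(group)
        dmt += 0.5 * n * (n - 1)
        dnt += 0.5 * n * (total_words - n)
        stem_counts = Counter(stem(w) for w in group)
        # Pairs inside the group that ended up with different stems.
        umt += 0.5 * sum(u * (n - u) for u in stem_counts.values())
        for s, c in stem_counts.items():
            by_stem[s][gi] += c

    wmt = 0.0                        # wrongly merged total
    for counts in by_stem.values():
        n_s = sum(counts.values())
        # Pairs sharing a stem although they come from different groups.
        wmt += 0.5 * sum(v * (n_s - v) for v in counts.values())

    ui = umt / dmt if dmt else 0.0
    oi = wmt / dnt if dnt else 0.0
    return ui, oi


def errt(point, truncation_line):
    """ERRT = |OP| / |OT|, where T is the point at which the ray from the
    origin through P = (UI, OI) crosses the truncation line, given as a
    list of (UI, OI) points for successive truncation lengths."""
    px, py = point
    for (ax, ay), (bx, by) in zip(truncation_line, truncation_line[1:]):
        dx, dy = bx - ax, by - ay
        denom = px * dy - py * dx             # cross product P x (B - A)
        if denom == 0:
            continue                          # ray is parallel to this segment
        s = -(px * ay - py * ax) / denom      # position of T along A -> B
        if 0.0 <= s <= 1.0:
            tx, ty = ax + s * dx, ay + s * dy
            t = (tx * px + ty * py) / (px * px + py * py)
            if t > 0:
                return 1.0 / t
    return None                               # no intersection found


# Toy data: two lemmas and a 4-character truncation "stemmer".
groups = [["книга", "книги", "книгу"], ["кінь", "коня"]]
ui, oi = paice_indices(groups, lambda w: w[:4])
print(ui, oi)
```

In a real evaluation the truncation line would be built by running `paice_indices` with truncation stemmers of several lengths on the same word groups and passing the resulting points to `errt`.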
Notes:
- pymorphy2 is a dictionary-assisted lemmatizer and morphological analyzer, included in this comparison for reference. Its most probable normal form is used as the stem.
- Training and testing were performed on a dictionary of word forms.
- Paice, C. (1994). An Evaluation Method for Stemming Algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42-50.