Automatic neologism detector for the Latvian language. Author: Pavels Ivanovs
This is Pavels Ivanovs' project for the bachelor's thesis Automatic Neologism Detection [1].
The goal of this project is to create an NLP tool that extracts from a submitted text the words most likely to be added to a vocabulary of the Latvian language, specifically Tēzaurs.lv, the biggest publicly available thesaurus of Latvian.
Two main approaches are used to achieve the goal of the project:
- Exclusion lists. Words from the input text are filtered out if their lemmas are found in the vocabulary. Lemmatization is provided by LVTagger and NLP-PIPE.
- Classification by a machine-learning model. A neural network performs the classification: input features, such as word length and the Levenshtein distance to the closest vocabulary entry, are extracted from each word and fed to the network, which outputs the probability that the word will be included in the vocabulary.
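The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`levenshtein`, `filter_candidates`, `extract_features`) and the tiny vocabulary are hypothetical, and the input is assumed to be already lemmatized (in the project that job belongs to LVTagger and NLP-PIPE).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_candidates(lemmas, vocabulary):
    """Exclusion-list step: keep only lemmas absent from the vocabulary."""
    return [lemma for lemma in lemmas if lemma.lower() not in vocabulary]


def extract_features(word, vocabulary):
    """Feature step: word length and distance to the closest vocabulary entry."""
    nearest = min(levenshtein(word, entry) for entry in vocabulary)
    return {"length": len(word), "min_edit_distance": nearest}


# Hypothetical data: a three-word vocabulary and four already-lemmatized tokens.
vocabulary = {"māja", "suns", "dators"}
lemmas = ["māja", "datoris", "suns", "blogs"]

candidates = filter_candidates(lemmas, vocabulary)
features = {word: extract_features(word, vocabulary) for word in candidates}
```

In this sketch the resulting feature dictionaries would be turned into numeric vectors before being passed to the classifier. Scanning the whole vocabulary for the nearest entry is fine for a toy example but would likely need an index or a pruning strategy for a vocabulary the size of Tēzaurs.lv.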
After training, the model's performance is as follows (x-axis: batch number; y-axis: metric):
- Accuracy (Pareizība): 77.86%
- Precision (Precizitāte): 40.56%
- Recall (Pārklājums): 61.73%
- F-score (F-mērs): 46.80%
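For reference, the four metrics above are conventionally derived from the confusion-matrix counts of a binary classifier. The sketch below shows those standard definitions; the function name `classification_metrics` and the counts passed to it are hypothetical and are not the project's test results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is F1: the harmonic mean of precision and recall.
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=50, fp=50, fn=25, tn=375)
```

If the reported figures are averages over batches, which is an assumption here, they need not satisfy the harmonic-mean identity exactly.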
The test metrics show that there is still room to improve the model. There are two main options: optimizing the dataset (oversampling and an overall increase in the number of records) and optimizing the model, including changes to the neural network structure and further experiments with the number of epochs and the learning rate.
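As one possible reading of the oversampling option, the sketch below applies plain random oversampling: records of the smaller classes are duplicated at random until every class is as large as the largest one. It assumes the training set is imbalanced, which the text does not state explicitly, and the function name `oversample_minority` and the sample data are hypothetical.

```python
import random


def oversample_minority(records, labels, seed=0):
    """Randomly duplicate records so that every class reaches the majority size."""
    rng = random.Random(seed)

    # Group records by their class label.
    by_label = {}
    for record, label in zip(records, labels):
        by_label.setdefault(label, []).append(record)

    target = max(len(group) for group in by_label.values())

    out_records, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra samples (with replacement) for under-represented classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for record in group + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels


# Hypothetical data: four negative examples (0) and one positive example (1).
records = ["vārds_a", "vārds_b", "vārds_c", "vārds_d", "vārds_e"]
labels = [0, 0, 0, 0, 1]
balanced_records, balanced_labels = oversample_minority(records, labels)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated records do not leak into the test data.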
- Python v3.10
- Docker Compose
[1] P. Ivanovs, "Jaunvārdu automātiska atpazīšana," Bakalaura darbs, Datorikas fakultāte, Latvijas Universitāte, Rīga, Latvija, 2023