GitHub - bgospodinov/bulgarian_dictionary: Bulgarian dictionary adapted for versification

Install the dependencies listed in pkglist.txt (package names may vary from distro to distro). Then, to build the artifact (without Slovnik), run:

./build --no-slovnik

This repository contains code that produces a dictionary of 190 000 Bulgarian lemmata and 2 million wordforms along with:

stress patterns for most wordforms
full Slovnik-style morphosyntactic analysis of all wordforms
pronunciation of all wordforms
complete breakdown by syllables
derivational relationships between some of the lemmata

The dictionary is meant to be used for translating or writing metric and rhymed poetry in Bulgarian. It is based on previous Bulgarian linguistic resources such as Slovnik, RBE, Rechko and Murdarov's dictionary.

The build script requires a Linux distribution (it is developed on a standard Arch Linux setup), although the resulting artifact is a platform-independent SQLite file. To see how to set up on Ubuntu, please refer to .github/workflows/build.yml. More than 2.5 GB of RAM is recommended. The script uses as many cores as possible wherever parallelization is feasible.
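Since the artifact is a plain SQLite file, it can be opened from any language with SQLite bindings. The following is a minimal sketch in Python that lists the tables and columns of a SQLite file without assuming anything about the dictionary's schema. The function name list_tables is hypothetical and not part of this repository, and the path you pass in is whatever file the build produced on your machine.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; it makes no assumption about
    which tables the dictionary artifact actually contains.
    """
    # Open read-only through a file: URI so inspection can never modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling list_tables on the built file returns a dictionary mapping each table name to its column names, which is a reasonable starting point before writing queries against the data.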

For best results, mount /tmp in main memory as tmpfs or ramfs. Build time is ~3 minutes on a quad-core i7 @ 2.5 GHz.
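To check whether /tmp is already memory-backed before building, one option is to read the kernel's mount table. The sketch below assumes a Linux system that exposes /proc/mounts; the function names fs_type and tmp_is_memory_backed are hypothetical and not part of this repository.

```python
def fs_type(mount_point, mounts_text):
    """Return the filesystem type mounted at mount_point, or None if nothing is mounted there."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]  # a later entry shadows an earlier one
    return found


def tmp_is_memory_backed(mounts_path="/proc/mounts"):
    """Return True if /tmp is mounted as tmpfs or ramfs (Linux only)."""
    with open(mounts_path) as handle:
        return fs_type("/tmp", handle.read()) in ("tmpfs", "ramfs")
```

If tmp_is_memory_backed() returns False, /tmp is either on disk or not a separate mount at all, and the build will still work, only more slowly.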

All wordforms in the database are annotated using the BulTreeBank morphosyntactic tagset. To learn more about it, please refer to the technical report inside resources/. The Slovnik dictionary, however, is encrypted, as it is not intended for public use. Pass the --no-slovnik option to skip merging Slovnik into the final artifact. If you want to gain access to Slovnik, contact a BTB member at http://bultreebank.org. Afterwards, I will provide you with the password to decrypt Slovnik.

If you find any mistakes in the scripts or the dictionary itself, please raise an issue or contact the maintainer: b g o s p o d i n ov at pr ot on m ail dot c o m.

Folders and files:

.githooks
corpora/chitanka
inc
pipeline
plugins/ngrams
resources
src
tests
tools
.gitattributes
.gitignore
.gitmodules
CONTRIBUTING.md
LICENSE
Makefile
README.md
build
pkglist.txt
setup-devel
