GitHub - agmmnn/turkish-nlp-resources: 🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.

artwork: Mihrap, Osman Hamdi Bey

Turkish NLP Resources

Turkish NLP (Türkçe Doğal Dil İşleme) related Tools, Libraries, Models, Datasets and other resources.

Tools/Libraries

ITU Turkish NLP (Web Based & API) : Tools of Istanbul Technical University, Natural Language Processing Group.
spaCy Turkish models (Python) : Turkish language models for spaCy.
VNLP (Python) : State-of-the-art, lightweight NLP tools for the Turkish language.
TDD - Tools (Web based) : Online tools provided by Turkish Data Depository (TDD) project.
Zemberek-NLP (Java) : Zemberek-NLP provides Natural Language Processing tools for Turkish.
Zemberek-Python (Python) : Python implementation of Zemberek.
Zemberek-Server (Docker) : REST Docker Server on Zemberek Turkish NLP Java Library.
Mukayese (Python) : A benchmarking platform for various Turkish NLP tools and tasks, ranging from spell-checking to NLU tasks.
SadedeGel (Python) : A library initially designed for unsupervised extraction-based news summarization using several old and new NLP techniques.
Turkish Stemmer (Python) : Stemmer algorithm for the Turkish language.
sinKAF (Python) : An ML library for profanity detection in Turkish sentences.
TrTokenizer (Python) : Sentence and word tokenizers for the Turkish language.
Tools for Turkish NLP provided by Starlang (Multi/Python) : Morphological Analysis, Spell Checker, Dependency Parser, Deasciifier, NER.
snnclsr/NER (Python) : Named Entity Recognition system for the Turkish Language.


Models

BERTurk : Turkish BERT/DistilBERT, ELECTRA and ConvBERT models.
ELMO For ManyLangs : Pre-trained ELMo Representations for Many Languages.
Fasttext - Word Vector : Pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText.
Loodos/Turkish Language Models : Transformer-based Turkish language models and related tools.
Hugging Face - Models/Turkish

Word Embeddings

Floret Embeddings : Turkish Floret Embeddings, large and medium sized.
VNLP Word Embeddings : Word2Vec Turkish word embeddings.
TurkishGloVe : Turkish GloVe word embeddings.
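Pre-trained word vectors are often distributed as plain text, one word per line followed by its components. The sketch below assumes that layout (the fastText `.vec` convention with an optional `<vocab_size> <dim>` header line, or GloVe text files without a header); whether a given resource above ships in this format should be checked on its download page. The function names and the toy vectors are illustrative, not taken from any of the listed projects.

```python
import math
from typing import Dict, Iterable, List


def load_text_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse whitespace-separated word vectors from an iterable of text lines."""
    vectors: Dict[str, List[float]] = {}
    for i, raw in enumerate(lines):
        parts = raw.rstrip().split(" ")
        # fastText .vec files begin with "<vocab_size> <dim>"; skip that header.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, made up for illustration only.
TOY_VEC = "3 2\nkedi 1.0 0.0\nköpek 0.8 0.6\nmasa 0.0 1.0"

if __name__ == "__main__":
    vecs = load_text_vectors(TOY_VEC.splitlines())
    print(round(cosine(vecs["kedi"], vecs["köpek"]), 3))
```

For real files, pass an open file handle (`open(path, encoding="utf-8")`) instead of the toy string. Large vocabularies are better served by a dedicated library than by Python lists.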


Datasets

TDD - Türkçe Dil Deposu (Turkish Language Repository) : The Turkish Natural Language Processing Project, one of the main projects of the Turkey Open Source Platform, aims to prepare the datasets needed for the processing of Turkish texts.
ITU NLP Group - Datasets : Datasets of Istanbul Technical University, Natural Language Processing Group.
Boğaziçi University TABI - NLI-TR : Natural Language Inference in Turkish (NLI-TR) is a set of two large-scale datasets obtained by translating the foundational NLI corpora (SNLI and MultiNLI) using Amazon Translate.
Turkish NLP Suite Datasets : Turkish NLP Suite Project offers diverse linguistic resources for Turkish NLP. The repo currently contains several NER datasets, medical NLP datasets and sentiment analysis datasets including movie reviews, product reviews and more.

Multilingual Datasets:

Amazon MASSIVE : MASSIVE is a parallel dataset of 1M utterances across 51 languages with annotations for the NLU tasks of intent prediction and slot annotation.
OPUS: en-tr : OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
CC-100 : Monolingual datasets from web crawl data. This corpus comprises monolingual data for 100+ languages.
OSCAR : A huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
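As a rough illustration of the intent and slot annotations mentioned for Amazon MASSIVE, the sketch below reads one JSON record and extracts the slots. It assumes the JSON Lines layout described in the dataset's documentation, with `intent`, `utt` and `annot_utt` fields and slots written inline as `[slot : value]`. The Turkish utterance is made up for illustration and is not taken from the dataset, and `parse_record` is a hypothetical helper name.

```python
import json
import re
from typing import Dict, List, Tuple

# Matches inline slot annotations of the assumed form "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^\]:]+?)\s*:\s*([^\]]+?)\s*\]")


def parse_record(line: str) -> Dict[str, object]:
    """Return the intent, raw text and (slot, value) pairs of one JSON line."""
    record = json.loads(line)
    slots: List[Tuple[str, str]] = [
        (m.group(1), m.group(2))
        for m in SLOT_PATTERN.finditer(record["annot_utt"])
    ]
    return {"intent": record["intent"], "text": record["utt"], "slots": slots}


# Hand-written example record (an assumption, not real dataset content).
SAMPLE_LINE = json.dumps(
    {
        "id": "0",
        "locale": "tr-TR",
        "intent": "alarm_set",
        "utt": "beni yarın saat beşte uyandır",
        "annot_utt": "beni [date : yarın] [time : saat beşte] uyandır",
    },
    ensure_ascii=False,
)

if __name__ == "__main__":
    print(parse_record(SAMPLE_LINE))
```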

Treebank:

Universal Dependencies : An international cooperative project to create treebanks of the world's languages. The project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages.
UD Turkish Kenet : The Turkish-Kenet UD Treebank consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples from TDK.
UD Turkish BOUN : The BOUN Treebank was created by TABILAB and supported by TÜBİTAK. This corpus contains 9,761 sentences and 121,214 tokens.
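Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line after each sentence. A minimal sketch for reproducing sentence and token counts like the ones quoted above might look as follows. The sample sentences and their annotations are hand-written for illustration and do not come from either treebank, and `count_conllu` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_count, token_count) for CoNLL-U formatted lines."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        token_id = line.split("\t", 1)[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        tokens += 1
        in_sentence = True
    if in_sentence:
        sentences += 1  # file did not end with a blank line
    return sentences, tokens


# Hand-written sample; unannotated columns are left as "_".
SAMPLE = (
    "# text = Kitabı okudum.\n"
    "1\tKitabı\tkitap\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "2\tokudum\toku\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Eve gittim.\n"
    "1\tEve\tev\tNOUN\t_\t_\t2\tobl\t_\t_\n"
    "2\tgittim\tgit\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_conllu(SAMPLE.splitlines()))  # (2, 6)
```

Counts obtained this way can differ from published figures depending on whether multiword tokens are counted as surface tokens or as syntactic words; this sketch counts syntactic words.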

Other Data:

Other Sources:


Other Resources

Books:

Turkish Natural Language Processing (Theory and Applications of Natural Language Processing)

Videos:

Articles:

Sample Notebooks/Snippets:

Blog Posts:

Other Lists:

ITU NLP Group - Tools and Resources : List of various tools and resources for Turkish and Turkic languages.
Açık Veri Kaynakları : List of Turkey's open data sources. Official Institutions, Municipalities, Universities, International Institutions.
Awesome Turkish NLP : Yet another Turkish NLP list.
Türkçe Yapay Zeka Kaynakları : List of AI resources in Turkish.


Contributing

Your contributions are welcome. If you want to contribute to this list, send a pull request or open a new issue.

