Note: dictionary data in this repo is a read-only mirror (translated to open formats for data interchange) of the official Unitex repository, where active development is ongoing.
This repository holds the Unitex primary sources for the Brazilian Portuguese (pt-BR) vocabulary and its morphological definitions, in an open data (FrictionlessData) interchange format.
Controlled primary sources:
- pt-BR Alphabet: Alphabet.csv and Alphabet_sort.csv.
- pt-BR DELAS: DELA for Simple words, "Dicionário de Palavras Simples para o Português Brasileiro". ~67500 canonical words and their inflection rules. DELAS.csv.
- pt-BR DELACF: DELA for Compound Forms, "Dicionário de Palavras Compostas Flexionadas para o Português Brasileiro". ~4000 compound words and their morphological classification. DELACF.csv.
- pt-BR Inflections: all *.fst2 (finite state transducer v2) files, the compiled format for inflection graphs (see chapter 14.3 of the Unitex Manual). Each file contains only the basic representation of the graph's transitions: it is unaffected by graph-layout editing and changes only when the topology or classification is modified. Under construction (JSON format), see dumps folder.
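To illustrate what a DELAS entry carries, here is a minimal sketch that parses one line in the classic Unitex text notation, `lemma,code+feature`, where the first code names the inflection graph. The `DelasEntry` class and `parse_delas_line` function are hypothetical names introduced for this example. The sample line `horse,N4+Anl` is the English example from the Unitex manual, not an entry from this repo. The column layout of DELAS.csv may differ from this notation, and escaped commas inside lemmas are not handled here.

```python
from dataclasses import dataclass, field


@dataclass
class DelasEntry:
    """One simple-word entry (hypothetical container for this sketch)."""
    lemma: str                  # canonical form of the word
    inflection_code: str        # first code: name of the inflection graph
    features: list[str] = field(default_factory=list)  # remaining codes


def parse_delas_line(line: str) -> DelasEntry:
    """Parse a classic DELAS line such as 'horse,N4+Anl'.

    Assumption: the lemma contains no escaped commas.
    """
    lemma, sep, codes = line.strip().partition(",")
    if not sep or not lemma or not codes:
        raise ValueError(f"not a DELAS entry: {line!r}")
    # Codes are joined with '+'; the first one selects the inflection graph.
    inflection_code, *features = codes.split("+")
    return DelasEntry(lemma, inflection_code, features)


entry = parse_delas_line("horse,N4+Anl")
print(entry.lemma, entry.inflection_code, entry.features)
```

Keeping the inflection code separate from the other features mirrors how the inflection step uses it: the code selects which compiled .fst2 graph is applied to the lemma.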
Main sources:
- Unitex-GramLab-3.1-usermanual-en, the UNITEX 3.1 USER MANUAL. See:
  - Chapter 3.1, "The DELA dictionaries" (DELAF, DELAS, DELACF);
  - Chapter 3.5, "Automatic inflection";
  - Chapter 13.22, "Grf2Fst2".
- Novo dicionario de formas flexionadas do UNITEX-PB, Avaliação da flexão verbal (2015).
- Date ranges: https://en.wikipedia.org/wiki/Reforms_of_Portuguese_orthography#Timeline_of_spelling_reforms
See the spreadsheets, available for download here as data/*.csv.
Any other file must be validated by software (see SQL back-end).
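As a quick way to inspect the spreadsheets, the sketch below lists the header row of each CSV file in a folder. It assumes the files are UTF-8 encoded, start with a header row, and sit in a `data` folder relative to the working directory. The function name `csv_headers` is hypothetical, and none of these assumptions is confirmed by this README.

```python
import csv
from pathlib import Path


def csv_headers(folder: str = "data") -> dict[str, list[str]]:
    """Map each CSV file name in `folder` to its first (header) row."""
    headers = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            # An empty file yields an empty header list.
            headers[path.name] = next(csv.reader(handle), [])
    return headers


for name, columns in csv_headers().items():
    print(name, columns)
```

This only reads the first row of each file, so it stays fast even on the larger dictionaries. It does not replace the validation mentioned above.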