cross-lingual-tools

Contains tools for working with cross-lingual data, all developed in-house. The folders in this directory are listed below; follow each link to see a folder's contents.

Root Directory

Files that are not part of any project and can be used on their own as stand-alone files.
Parallel Data

Tools for checking the accuracy of alignments and the quality of parallel data.
Tagset Converter

Convert the PDT, Penn Treebank, Perseus and PDT-based Tamil tagsets into the UD tagset.

Root Directory

langCodes.tsv

TSV file containing the language codes for 134 languages, arranged alphabetically by language name, with their codes in 4 major standards. The columns are named Language and Standard Code; the second holds a comma-separated value in the order ISO 639-1 code, ISO 639-2 code, ISO 639-3 code, WALS code.

The following notations are used in the comma-separated values:

Notation	Implication
`XXX`	The list of codes is too long to fit here
`abc [A, B, C]`	`abc` is the inclusive code, used along with the codes in square brackets
`[A, B, C]`	All the listed codes are used, each for a different dialect or variation of the language
`-`	The language is not coded under this standard
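The notation above can be read programmatically. The following is a minimal sketch, assuming each Standard Code cell holds exactly four comma-separated fields in the order listed earlier; the function names, the dictionary keys and the sample cell are hypothetical and not taken from the repository.

```python
import re

# Hypothetical labels for the four fields, in the order the README gives them.
STANDARDS = ("iso639_1", "iso639_2", "iso639_3", "wals")


def split_top_level(cell):
    """Split a cell on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in cell:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_code(field):
    """Interpret one field according to the notation table."""
    if field == "-":  # language not coded under this standard
        return {"status": "uncoded", "inclusive": None, "variants": []}
    if field == "XXX":  # list too long to be stored in the file
        return {"status": "too_many", "inclusive": None, "variants": []}
    match = re.fullmatch(r"(\S+)?\s*\[(.*)\]", field)
    if match:  # "abc [A, B, C]" or "[A, B, C]"
        variants = [v.strip() for v in match.group(2).split(",") if v.strip()]
        return {"status": "coded", "inclusive": match.group(1), "variants": variants}
    return {"status": "coded", "inclusive": field, "variants": []}


def parse_standard_code(cell):
    """Map each standard to its parsed field."""
    return {name: parse_code(f) for name, f in zip(STANDARDS, split_top_level(cell))}


# Made-up cell that exercises every notation except XXX.
sample = parse_standard_code("ab, abc [A, B], [X, Y], -")
```

Splitting only on commas outside brackets matters because the bracketed variant lists themselves contain commas.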

Information on WALS can be found here.

wals.py

Python 3 script to:
- Find the most similar languages to given language.
- Find the centroid language of a given genus, i.e. a language most similar to other languages of the genus.
- Find languages that are most dissimilar to any other language in the given genus.
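The three tasks above can be illustrated with a small sketch. The README does not state how wals.py scores similarity, so the measure here (the share of matching values among the features both languages define) is an assumption, as are the function names, the reading of "most dissimilar" as lowest mean similarity within the genus, and the toy language codes and feature values.

```python
def similarity(a, b):
    """Share of matching values among features defined for both languages."""
    shared = [f for f in a if f in b and a[f] and b[f]]
    if not shared:
        return 0.0
    return sum(a[f] == b[f] for f in shared) / len(shared)


def most_similar(code, languages, n):
    """Return the n (code, score) pairs most similar to the given language."""
    scores = [
        (other, similarity(languages[code], feats))
        for other, feats in languages.items()
        if other != code
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


def mean_similarities(codes, languages):
    """Mean similarity of each language to the other languages in the group."""
    means = {}
    for code in codes:
        others = [o for o in codes if o != code]
        total = sum(similarity(languages[code], languages[o]) for o in others)
        means[code] = total / len(others) if others else 0.0
    return means


def centroid(codes, languages):
    """Language with the highest mean similarity to the rest of the genus."""
    means = mean_similarities(codes, languages)
    return max(means, key=means.get)


def most_dissimilar(codes, languages, n):
    """The n languages with the lowest mean similarity within the genus."""
    means = mean_similarities(codes, languages)
    return sorted(means.items(), key=lambda pair: pair[1])[:n]


# Toy data: hypothetical language codes with simplified feature values.
toy = {
    "aaa": {"81A": "SOV", "85A": "Postpositions", "87A": "AdjN"},
    "bbb": {"81A": "SOV", "85A": "Postpositions", "87A": "NAdj"},
    "ccc": {"81A": "SVO", "85A": "Prepositions", "87A": "NAdj"},
}
genus = ["aaa", "bbb", "ccc"]
```

With this toy data, `centroid(genus, toy)` picks the language that agrees most with the other two, while `most_dissimilar(genus, toy, 1)` picks the one that agrees least.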
List of Arguments (all compulsory):
- -i or --input: Input file containing the WALS data in TSV format
List of positional arguments (mutually exclusive) and their sub-arguments:
- similarity: Display the WALS codes and similarity scores of the languages most similar to the language with the given WALS code.
  
  Sub-Arguments	Function
  `-c` or `--code`	Input WALS code for the source language
  `-n` or `--number`	Number of languages to be displayed in the output
- centroid: Display the WALS code and similarity scores for the centroid language of an input genus, i.e. a language most similar to other languages of the genus.
  
  Sub-Arguments	Function
  `-g` or `--genus`	Input genus to find the centroid for
- dissimilarity: Display the WALS Code and similarity scores of the languages that are most dissimilar to any other language in the given genus.
  
  Sub-Arguments	Function
  `-g` or `--genus`	Input genus to find the most dissimilar languages in
  `-n` or `--number`	Number of languages to be displayed in the output
  
  The input file for the task can be downloaded from here.
Usage:
- `python3 wals.py -i input_file similar -c <wals_code> -n <output_count>`
- `python3 wals.py -i input_file centroid -g <genus_name>`
- `python3 wals.py -i input_file dissimilar -g <genus_name> -n <output_count>`
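For readers who want to see how such a command line fits together, here is a minimal sketch of an equivalent interface built with `argparse` sub-commands. It is not the actual source of wals.py. The sub-command names follow the usage lines above, and the help strings, the integer type for `-n` and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="wals.py")
    parser.add_argument("-i", "--input", required=True,
                        help="input file containing the WALS data in TSV format")

    # Exactly one sub-command must be chosen, which makes them mutually exclusive.
    sub = parser.add_subparsers(dest="command", required=True)

    similar = sub.add_parser("similar", help="languages most similar to a language")
    similar.add_argument("-c", "--code", required=True,
                         help="WALS code of the source language")
    similar.add_argument("-n", "--number", type=int, required=True,
                         help="number of languages to display")

    centroid = sub.add_parser("centroid", help="centroid language of a genus")
    centroid.add_argument("-g", "--genus", required=True, help="genus name")

    dissimilar = sub.add_parser("dissimilar",
                                help="most dissimilar languages in a genus")
    dissimilar.add_argument("-g", "--genus", required=True, help="genus name")
    dissimilar.add_argument("-n", "--number", type=int, required=True,
                            help="number of languages to display")
    return parser


# Parse a sample command line; "wals.tsv" and "hin" are placeholder values.
args = build_parser().parse_args(["-i", "wals.tsv", "similar", "-c", "hin", "-n", "5"])
```

Because `-i` belongs to the top-level parser, it has to appear before the sub-command name, which matches the order shown in the usage lines.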

Akshayanti/cross-lingual-tools

Folders and files

Latest commit

History

Repository files navigation

cross-lingual-tools

Root Directory

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages