STAPLE 2020 README

Welcome to the Duolingo 2020 Shared Task! The shared task website is here: sharedtask.duolingo.com.

This repository has code for:

Scoring a predictions file
Training an example baseline model with fairseq

Python 3.6+ is required. It is strongly recommended that you run this in a virtual environment.
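
One common way to set up such an environment is sketched below, using Python's built-in venv module. The directory name .venv is an arbitrary choice rather than something this repository requires, and the sketch assumes python3 is on your PATH.

```shell
# Create a virtual environment in ./.venv (the directory name is an assumption).
python3 -m venv .venv

# Activate it for the current shell session.
. .venv/bin/activate

# Confirm that the interpreter in use is 3.6 or newer.
python --version
```

Run deactivate to leave the environment when you are done.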

Scoring

Requirements

There are no special requirements for running the scoring function.

Code

You can score a predictions file as follows (using the AWS baseline as an example, and running from the repo's top-level directory):

$ python staple_2020_scorer.py --goldfile staple-2020-train/en_vi/train.en_vi.2020-01-13.gold.txt --predfile staple-2020-train/en_vi/train.en_vi.aws_baseline.pred.txt

Training models

If all you want to do is evaluation, then ignore this section.

Most participants will probably write their own code for this task, but we also provide code for training vanilla sequence-to-sequence models using fairseq. This does not produce the best results for this task, but it is an obvious baseline and may give you a jumpstart. The code is adapted from the fairseq translation tutorials.

Requirements

Certain scripts require Perl to run. If you are on macOS or Linux, you probably already have it. See here for more details.

Next, get these repositories:

$ git clone https://github.com/moses-smt/mosesdecoder
$ git clone https://github.com/rsennrich/subword-nmt

Go to the variables.sh file and set the paths for MOSES and SUBWORDNMT accordingly.
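
As a rough sketch, assuming both repositories were cloned into the repo's top-level directory and that the scripts are run from there, the relevant lines in variables.sh might look like the following. The variable names MOSES and SUBWORDNMT come from the instruction above; the paths and the plain-assignment form are assumptions, so keep whatever form the file already uses.

```shell
# Hypothetical values: point these at wherever you cloned the two repositories.
# $PWD is assumed to be the top-level directory of this repo.
MOSES="$PWD/mosesdecoder"
SUBWORDNMT="$PWD/subword-nmt"
```

Absolute paths are the safer choice if you plan to run the scripts from other directories.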

Install the Python requirements:

$ pip install fairseq sacremoses subword_nmt sacrebleu tqdm

Code

The following files are provided:

variables.sh : common Bash variables
preprocess.sh : preprocesses the data for training with fairseq
train.sh : trains the model using the preprocessed data
run_pretrained.sh : runs pretrained fairseq models
my_cands_extract.py : converts outputs from fairseq into shared task format files (used in run_pretrained.sh)
get_traintest_data.py : converts shared task format files into fairseq-readable format (used in preprocess.sh)

The most relevant files are preprocess.sh, train.sh, and run_pretrained.sh.

Good luck!

If you have questions, feel free to check or post to the mailing list.
