Welcome to the Duolingo 2020 Shared Task! The shared task website is here: sharedtask.duolingo.com.
This repository has code for:
- Scoring a predictions file
- Training an example baseline model with fairseq
Python 3.6+ is required. It is strongly recommended that you run this in a virtual environment.
There are no special requirements for running the scoring function.
You can score a predictions file as follows (using the AWS baseline as an example, and running from the top-level directory of the repo):
$ python staple_2020_scorer.py --goldfile staple-2020-train/en_vi/train.en_vi.2020-01-13.gold.txt --predfile staple-2020-train/en_vi/train.en_vi.aws_baseline.pred.txt
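If you want to score several prediction files from your own code, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_scorer_command` and `run_scorer` are hypothetical and not part of this repository, and the only scorer options used are the `--goldfile` and `--predfile` flags shown above. It assumes you run it from the repo top-level directory and simply returns whatever the scorer prints.

```python
import subprocess
import sys
from pathlib import Path


def build_scorer_command(goldfile, predfile, scorer="staple_2020_scorer.py"):
    """Return the argument list for one scorer run (hypothetical helper)."""
    return [
        sys.executable,            # use the interpreter of the active environment
        str(scorer),
        "--goldfile", str(goldfile),
        "--predfile", str(predfile),
    ]


def run_scorer(goldfile, predfile):
    """Run the scorer and return its standard output as text."""
    # Fail early with a clear error if either input file is missing.
    for path in (goldfile, predfile):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    result = subprocess.run(
        build_scorer_command(goldfile, predfile),
        stdout=subprocess.PIPE,
        universal_newlines=True,   # text mode; works on Python 3.6
        check=True,                # raise if the scorer exits with an error
    )
    return result.stdout
```

Calling `run_scorer` with the gold and prediction paths from the command above should be equivalent to running that command by hand.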
If all you want to do is evaluation, then ignore this section.
Most participants will probably write their own code for this task, but we also provide code for training a vanilla sequence-to-sequence model using fairseq. This does not produce the best results for this task, but it is an obvious baseline and may give you a jumpstart. The code is adapted from the fairseq translation tutorials.
Certain scripts require Perl to run. If you are on macOS or Linux, you probably already have it. See here for more details.
Next, get these repositories:
$ git clone https://github.com/moses-smt/mosesdecoder
$ git clone https://github.com/rsennrich/subword-nmt
Go to the variables.sh file and set the paths for MOSES and SUBWORDNMT accordingly.
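Before preprocessing or training, it can help to confirm that the prerequisites described so far are in place. The following is a rough sketch under stated assumptions: `check_prerequisites` is a hypothetical helper, not a script in this repository, and because the contents of variables.sh are not reproduced here, the two directories are passed in explicitly rather than read from that file.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(moses_dir, subwordnmt_dir):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # The README asks for Python 3.6 or newer.
    if sys.version_info < (3, 6):
        problems.append("Python 3.6+ is required")
    # Certain scripts need perl, so look for it on PATH.
    if shutil.which("perl") is None:
        problems.append("perl was not found on PATH")
    # The two cloned repositories should exist where variables.sh points.
    for name, directory in (("MOSES", moses_dir), ("SUBWORDNMT", subwordnmt_dir)):
        if not Path(directory).is_dir():
            problems.append("{} path is not a directory: {}".format(name, directory))
    return problems
```

For example, `check_prerequisites("mosesdecoder", "subword-nmt")` checks the two directories created by the git clone commands above, assuming they were cloned into the current directory.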
Install python requirements:
$ pip install fairseq sacremoses subword_nmt sacrebleu tqdm
The following files are provided.
- variables.sh: common Bash variables
- preprocess.sh: to preprocess the data for training with fairseq
- train.sh: to train the model using preprocessed data
- run_pretrained.sh: script to run pretrained fairseq models
- my_cands_extract.py: used to convert outputs from fairseq into shared task format files (used in run_pretrained.sh)
- get_traintest_data.py: converts shared task format files into fairseq-readable format (used in preprocess.sh)
The most relevant files are preprocess.sh, train.sh, and run_pretrained.sh.
Good luck!
If you have questions, feel free to check or post to the mailing list.