Let's quickly walk through setting up a development environment.
If you don't mind using a VM (and things being a bit slower), we've provided one for you; the user and password are both "nate". The VM should already have everything you need installed, and will boot with a terminal open in the repository. You just need to activate the Python virtualenv with
~/nate $ source .venv/bin/activate
and then you should be able to skip to "Reproducing the evaluation".
Make sure you clone with `--recursive`, as we have a few submodules.
~ $ git clone --recursive https://github.com/ucsd-progsys/nate.git
~ $ cd nate
This project uses Haskell for feature extraction and Python (2.7) for training and running the models.
There are also some required libraries and tools that won't be installed automatically: ncurses, graphviz, BLAS, Tk, and Java/Ant (if you want to run the SHErrLoc comparison). If you're on Ubuntu, the following command should suffice.
$ sudo apt-get install ncurses-dev graphviz libopenblas-dev python-tk openjdk-8-jdk ant
We recommend building the Haskell components with the `stack` tool.
~/nate $ stack setup && stack build
For Python we recommend using `virtualenv`.
~/nate $ virtualenv .venv
~/nate $ source .venv/bin/activate
~/nate $ pip install -r requirements.txt
Let's run a quick test to make sure everything was built/installed correctly. We'll train a logistic regression on just the local syntactic features.
First, we'll extract the features from the paired programs. This should take a few minutes. You'll also see a bunch of warnings about unbound variables (in particular for variants of `printf`, which we don't yet support); this is expected. OCaml interleaves name resolution and type checking, so it will sometimes abort with a type error before discovering an unbound variable.
~/nate $ stack exec -- generate-features \
--source features/data/ucsd/data/derived/sp14/pairs.json \
--features op \
--out data/sp14
~/nate $ stack exec -- generate-features \
--source features/data/ucsd/data/derived/fa15/pairs.json \
--features op \
--out data/fa15
The result will be a set of `.ml` files in the `data/{sp14,fa15}` directories, containing the individual programs, and a set of `.csv` files in the `data/{sp14,fa15}/op` directories, containing the extracted features for each program.
Next, let's train a logistic regression on the sp14 programs and test it on the fa15 programs. The specific learning parameters don't particularly matter here since we're just trying to make sure everything was built properly.
~/nate $ python learning/learn.py \
--data=data/sp14/op --test_data=data/fa15/op \
--model=linear --learn_rate=0.01 --reg_rate=0.01 \
--batch_size=200 --n_epochs=10 --seed=0
training complete in 5.48 seconds
final accuracy: 0.492 / 0.663 / 0.782
avg/std recall: 0.532 / 0.388
avg / std / med samples: 10.50 / 6.99 / 9.00
avg / std / med changes: 2.74 / 2.34 / 2.00
avg prediction time: 0.179411
testing complete in 430.08 seconds
And for good measure let's try a decision tree too.
~/nate $ python learning/trees.py decision-tree data/sp14/op data/fa15/op
(40938, 49)
(40938, 45)
precision for top 3
top 1
0.534460887949
top 2
0.707822410148
top 3
0.780549682875
recall for top 3
0.415379864114
Running the entire evaluation can take quite a while (i.e. hours) since there are so many different combinations of feature sets and models to train and evaluate. Where possible, we provide commands to run both the full evaluation and a smaller subset, which may be particularly useful if you're using our VM.
If you're running the FASTER commands and want to examine a different feature set, the table below explains the feature-set naming scheme.
| Segment | Description |
|---|---|
| `op` | baseline of local syntax (always comes first) |
| `+context` | add contextual syntax |
| `+type` | add typing features (local and contextual) |
| `+size` | add expression size |
For example, to use the feature set with local and contextual syntax features and expression size, we would use `op+context+size`.
We examine many different combinations of feature sets in our evaluation, so let's go ahead and generate them all.
This will take a while depending on how many cores you have, so you might want to go grab a cup of coffee.
~/nate $ make -j20 csvs
# FASTER: just the full feature set
~/nate $ make -j2 sp14-op+context+type+size-csvs fa15-op+context+type+size-csvs
The first experiment compares the accuracy of our learned models against OCaml and the state-of-the-art SHErrLoc and Mycroft tools.
Let's first train all of NATE's various models. This will also take a while, so if you're in a hurry or on a low-powered machine, try the FASTER command.
~/nate $ make -j5 linear tree hidden
# FASTER: just the MLP-500 with the op+context+type+size features
~/nate $ make op+context+type+size-hidden-500
Running `make -j` will garble the results printed to stdout, but we also store each model's predictions for offline analysis. The predictions are stored in `data/{sp14,fa15}/<features>/<model>/<program>.ml.out`, with a single predicted blame span per line. For example, to see the predictions of the MLP-500 on the op+context+type+size features from the sp14 dataset, we would look at the files in `data/sp14/op+context+type+size/hidden-500`.
We can compute the top-k accuracy summaries with the `results` target.
~/nate $ make -j5 results
# FASTER: just the MLP-500
~/nate $ make op+context+type+size-hidden-500-results
This will produce a `results.csv` file in the same directory as the `<program>.ml.out` files. For example, here are the results for the MLP-500 on the op+context+type+size features.
~/nate $ cat data/sp14/op+context+type+size/hidden-500/results.csv
tool,year,features,model,top-1,top-2,top-3,recall,total
op+context+type+size/hidden-500,sp14,op+context+type+size,hidden-500,0.736,0.864,0.915,0.720,2712
~/nate $ cat data/fa15/op+context+type+size/hidden-500/results.csv
tool,year,features,model,top-1,top-2,top-3,recall,total
op+context+type+size/hidden-500,fa15,op+context+type+size,hidden-500,0.701,0.841,0.907,0.696,2365
You may also want to compare our models against OCaml, SHErrLoc, and Mycroft. We have patched all of these tools slightly to produce source locations in a standard format, so you will have to build the included versions.
Unfortunately, Mycroft is not available publicly so we are not comfortable distributing it ourselves. Please contact the authors for a copy if you wish to rerun the Mycroft benchmarks.
~/nate $ cd eval/ocaml
~/nate/eval/ocaml $ ./configure -prefix $(pwd)/../build
~/nate/eval/ocaml $ make world world.opt
# NOTE: at the moment, ocaml's `make install` seems to
# get stuck in an infinite recursion; just hit Ctrl-C
# after a few seconds and it should be fine. Sorry!
~/nate/eval/ocaml $ make install
# first we have to build sherrloc's version of easyocaml
~/nate $ cd eval/sherrloc/easyocaml++
~/nate/eval/sherrloc/easyocaml++ $ ./configure -prefix $(pwd)/../../build/eocaml
~/nate/eval/sherrloc/easyocaml++ $ ./build/smallworld.sh
~/nate/eval/sherrloc/easyocaml++ $ ./build/install.sh
# now we can build sherrloc itself
~/nate/eval/sherrloc/easyocaml++ $ cd ..
~/nate/eval/sherrloc $ ant
Once you have built the tools, we provide another `make` target to gather the predictions. Note that SHErrLoc in particular is quite slow on some programs, so you may want to use our cached predictions instead of reproducing them yourself.
~/nate $ make -j6 ocaml sherrloc # mycroft
# FASTER: just ocaml
~/nate $ make -j2 ocaml
# EVEN FASTER: use our cached predictions
NOTE: As mentioned in the paper, there are a number of programs that SHErrLoc was unable to check due to unsupported language features or exceeding its 1GB heap limit. These will show up in the logs as errors of the form:
This feature is not supported by EasyOcaml
eval/bin/ecamlc: line 14: error.con: No such file or directory
OR
OutOfMemoryError
The predictions are stored as above, in `data/{sp14,fa15}/<tool>/<program>.ml.out` files. As before, we provide a `make` target to compute the accuracy summaries.
~/nate $ make -j6 ocaml-results sherrloc-results # mycroft-results
# FASTER: just ocaml
~/nate $ make ocaml-sp14-results ocaml-fa15-results
# EVEN FASTER: use the cached results
NOTE: As mentioned in the paper, there are a number of programs for which SHErrLoc and OCaml either did not produce any blame locations (perhaps due to a timeout), or our scripts could not match one of their blame locations with program locations as we see them (basically, off-by-one errors in comparing source spans). In these cases, you will see warnings of the form:
WARN: no blamed spans
OR
WARN: blamed spans not subset of all spans
As before, you should now see `results.csv` files.
~/nate $ cat data/sp14/ocaml/results.csv
tool,year,features,model,top-1,top-2,top-3,recall,total
ocaml,sp14,.,ocaml,0.448,0.448,0.448,0.227,2652
~/nate $ cat data/fa15/ocaml/results.csv
tool,year,features,model,top-1,top-2,top-3,recall,total
ocaml,fa15,.,ocaml,0.435,0.435,0.435,0.219,2328
The actual bar graphs in the paper are produced by LaTeX; to rebuild them after running the benchmarks:
~/nate $ cd paper/oopsla17-submission && latexmk -pdf main
Next, let's see how to reproduce the feature utility results.
This will take quite a while as there are many combinations of feature sets and models to test, and each combination does a 10-fold cross-validation.
~/nate $ make feature-cross # this will take a few hours, and requires all feature sets
# FASTER: just run the MLP-500 on the op+context+type+size features
~/nate $ python learning/learn.py \
--data data/fa15/op+context+type+size:data/sp14/op+context+type+size \
--model=hidden --hidden_layers=500 \
--learn_rate=0.001 --reg_rate=0.001 \
--batch_size=200 --n_epochs=20 --n_folds=10 \
--seed 0
There's no interleaving of output here, so you can scroll through the log if you like, or you can look at the summary CSVs we produce in `models/<model>-<features>.cross.csv`. For example, to see the results for the MLP-500 on the op+context+type+size feature set:
~/nate $ cat models/hidden-500-op+context+type+size.cross.csv
model,features,top-1,top-2,top-3,recall
hidden-500,op+context+type+size,0.773,0.883,0.929,0.737
As before, the bar graphs in the paper are produced directly by LaTeX.
In this final experiment we try to explain specific predictions from our decision-tree classifier. We use the op+context+type feature set since we don't expect the expression size feature to produce a very intuitive explanation.
If you've been following the FASTER path, you may need to first extract the op+context+type feature set and train a decision tree with:
~/nate $ python learning/trees.py decision-tree data/fa15/op+context+type data/sp14/op+context+type
Now, we'll use the `learning/decisionpath.py` script to print out the sequence of decisions made by the tree. When we train the decision tree, we also store a copy of the trained model in `models/<model>-data-<quarter>-<features>.pkl`, so let's use the tree trained on the fa15 data with the op+context+type features to explain the first bogus prediction.
~/nate $ python learning/decisionpath.py models/decision-tree-data-fa15-op+context+type.pkl data/sp14/op+context+type/0967.csv
# This script prints out decision paths for each expression, but we're
# only interested in the recursive call to `clone` here
...
For span (3,42)-(3,47) with confidence 0.369841269841 our prediction is 0.0 should be 1.0
Rules used to predict sample 2:
F-Is-Type-Fun : (= 1.0) > 0.5
F-Is-App : (= 0.0) <= 0.5
F-Is-Fun-C1 : (= 0.0) <= 0.5
F-Is-App-C1 : (= 0.0) <= 0.5
F-Is-Fun-P : (= 0.0) <= 0.5
F-Is-Type-Fun-P : (= 0.0) <= 0.5
F-Is-App-P : (= 1.0) > 0.5
F-Is-Type-Tuple-P : (= 0.0) <= 0.5
F-Is-Type-float-P : (= 0.0) <= 0.5
F-Is-Type-Tuple-C1 : (= 0.0) <= 0.5
F-Is-Type-int-P : (= 0.0) <= 0.5
F-Is-Type-expr-P : (= 0.0) <= 0.5
F-Is-Type-unit-P : (= 0.0) <= 0.5
F-Is-Type-char-P : (= 0.0) <= 0.5
F-Is-Type-bool-P : (= 0.0) <= 0.5
F-Is-Type-string-P : (= 0.0) <= 0.5
F-Is-Type-Fun-C1 : (= 0.0) <= 0.5
F-Is-Type-list-P : (= 1.0) > 0.5
leaf node 2821 reached, no decision here
...
There are a few important pieces of information here. First of all, we can see that we predicted 0 (i.e. no-blame) when the correct answer was 1, and furthermore our "confidence" was 0.369. Confidence is a bit of a misnomer here; it's really the probability that we should predict blame, i.e. < 0.5 means we will predict no-blame, and > 0.5 means we will predict blame.
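To make the thresholding concrete, here is a minimal sketch (not code from the repository) of how such a probability becomes a 0/1 prediction under the 0.5 threshold just described:

```python
# Minimal sketch (not from the NATE codebase): how a "confidence",
# i.e. the predicted probability of blame, becomes a 0/1 prediction.
def predict_blame(confidence, threshold=0.5):
    """Predict blame (1.0) iff the probability exceeds the threshold."""
    return 1.0 if confidence > threshold else 0.0

# The sample above: 0.369... < 0.5, so we predict 0.0 (no-blame),
# even though the correct label was 1.0.
assert predict_blame(0.369841269841) == 0.0
```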
More importantly, we see the sequence of decisions made by the tree. Each line has the form
<feature> : (= <feature-value>) [> <=] 0.5
which says that `<feature>`, with a value of `<feature-value>`, is greater than or less than the threshold (in our case, with binary features, always 0.5). For example, the first line says that the `F-Is-Type-Fun` feature (i.e. the type of this expression mentions the `->` type constructor) is 1 (i.e. true), which is greater than the threshold. Continuing on, we can see that the only enabled features are `F-Is-Type-Fun`, `F-Is-App-P` (the parent expression is an application), and `F-Is-Type-list-P` (the type of the parent expression mentions `list`).
There should also now be a visualization of the entire tree in `models/decision-tree-data-fa15-op+context+type.pkl.pdf`. Each node in the tree lists a decision as above, and is shaded either blue (indicating the classifier is leaning towards blame) or orange (indicating it's leaning towards no-blame). Deeper shades indicate higher confidence. When the condition of a node is true, the tree descends into the left child. Note that all of the conditions have the form `<feature> <= 0.5`, so taking the left path means the feature was disabled, while taking the right path means it was enabled.
There are a few ways in which you may want to build on top of our work. You might have your own dataset that you wish to evaluate NATE's models on. Perhaps you have some clever ideas for additional features or classifiers to use with our data. Or maybe you would like to do your own analyses on the raw student interactions with OCaml.
If you have your own dataset of ill-typed programs paired with fixes, and you would like to see how NATE's models perform on it, it should be fairly straightforward to integrate your data into our processing pipeline.
First, we expect the program pairs to be in a single `pairs.json` file, with a single JSON object on each line. Each object should have a `bad` field with the ill-typed program and a `fix` field with the fixed program.
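If your data lives in some other format, a small conversion script along these lines should do; this is a hypothetical sketch, the only real requirements being the `bad` and `fix` fields:

```python
# Hypothetical conversion sketch (not part of the repository): write a
# pairs.json file with one JSON object per line, each containing a "bad"
# (ill-typed) program and its "fix".
import json

pairs = [
    {"bad": "let x = 1 + true", "fix": "let x = 1 + 2"},
    # ... one entry per (ill-typed, fixed) pair in your dataset
]

with open("pairs.json", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```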
Once you have the `pairs.json` file, run
~/nate $ stack exec -- generate-features \
--source path/to/pairs.json \
--features <features> \
--out path/to/csvs
with a suitable choice of `--features`; based on our experiments, `op+context+type` will probably perform best.
Then you can train and evaluate a model with your data. For example, suppose we want to train on the combined sp14/fa15 data and evaluate on your data, to see how well the models generalize to other datasets. We'll use the MLP-500 classifier and op+context+type features since they give the best results.
~/nate $ python learning/learn.py \
--data=data/sp14/op+context+type:data/fa15/op+context+type \
--test_data=path/to/csvs \
--model=hidden --hidden_layers=500 \
--learn_rate=0.001 --reg_rate=0.001 \
--batch_size=200 --n_epochs=20 \
--seed 0
The `learn_rate`, `reg_rate`, `batch_size`, and `n_epochs` flags are learning parameters; feel free to experiment with them, but these values seem to work well in practice. `seed` is an optional random seed for reproducibility.
Perhaps you have a clever idea for a new feature to improve NATE's accuracy.
At a high level we model features as real-valued functions over typed expressions, but things are slightly more complex in the implementation.
The feature extraction functions are defined in `features/src/NanoML/Classify.hs`, using the type alias
type Feature = ([String], (TExpr -> TExpr -> [Double]))
That is, a `Feature` is a pair of a list of labels and a function that takes two typed expressions and returns a list of `Double`s. The reason we define it this way, with multiple features bundled together, is that each expression is also given the features of its parent and children, so it's very convenient to define a feature function for the current expression and then use `mkContextLabels` and `mkContextFeatures` to lift it to the surrounding context.
As an example, look at how we define the features for the syntactic class of an expression, `preds_tis_ctx`, in terms of `tis_op_ctx`, `tis_op`, etc.
Once you have added your new feature function, you must hook it up to the `generate-features` binary. For this, open up `features/bin/Generate.hs` and look at the case analysis in the `main` function. Here you just need to add a new case for your desired feature set, using the `mkBadFeatures` function. For example, to create a new feature set with local syntax and your new feature, you might add:
"op+myfeature"
-> mkBadFeatures src cls (preds_tis ++ preds_myfeature) jsons
Then, recompile and extract the features:
~/nate $ stack build && stack exec -- generate-features \
--source path/to/pairs.json \
--features op+myfeature \
--out path/to/csvs
Adding a new model could mean a few things. At the simple end of the spectrum, you might want to test different architectures for the MLP. This requires no code changes at all, just a different setting for the `--hidden_layers` flag. It accepts a `-`-separated list of numbers, specifying the number of hidden layers and how many units each should have. For example, if we wanted to train an MLP with an initial hidden layer of 250 units and a second layer of 500 units, we would call
~/nate $ python learning/learn.py \
--model=hidden --hidden_layers=250-500 # ...
You may also want to experiment with different settings for the learning parameters `--learn_rate`, `--reg_rate` (for L2 regularization), `--batch_size`, and `--n_epochs`.
If you want to make more invasive changes, e.g. different activation functions or connections between layers, see the `build_model` function in `learning/hidden.py`.
It's also quite straightforward to use any of scikit-learn's built-in classifiers: just extend the switch at https://github.com/ucsd-progsys/ml2/blob/master/learning/trees.py#L103-L119 with your choice of classifier, as in the sketch below.
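For example (a hedged sketch; the names below are illustrative rather than the actual structure of `trees.py`), adding a random forest amounts to one more case:

```python
# Illustrative sketch only; the real switch lives in learning/trees.py at
# the linked lines. Any scikit-learn classifier exposing fit/predict_proba
# should drop in the same way.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def make_classifier(name, seed=0):
    if name == 'decision-tree':
        return DecisionTreeClassifier(random_state=seed)
    elif name == 'random-forest':  # a hypothetical new case
        return RandomForestClassifier(n_estimators=100, random_state=seed)
    else:
        raise ValueError('unknown classifier: %s' % name)
```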
Of course, you can always write your own classifier from scratch if it doesn't fit nicely into the tensorflow or scikit-learn models. As long as your classifier can read our CSV files and output a sequence of ranked predictions in the one-per-line format described above (see "Comparing Blame Accuracy"), it will fit nicely into our evaluation pipeline. The CSV files themselves are quite simple and look roughly as follows.
SourceSpan,L-NoChange,L-DidChange,F-InSlice,F-Feature1,F-Feature2,etc.
"(1,1)-(2,2)",0,1,1,0,0,...
"(1,5)-(1,8)",1,0,1,1,0,...
...
The `SourceSpan` column contains the source span for each expression, and is what your classifier should output as predictions. The `L-{No,Did}Change` columns are the boolean output labels, indicating whether the expression changed in the fixed program. Having two columns here is redundant; you can just use one of them. The `F-InSlice` column indicates whether the expression is part of a type error slice. Regardless of the feature set, this column will be present, but you can safely ignore it unless you are working with the op+slice dataset, as we usually discard expressions where `F-InSlice` is false during feature extraction. Finally, there are an arbitrary number of `F-Feature` columns for the input features to the model.
If you just want to work with the interaction traces we collected from the students, you can find the full (anonymized) dataset at https://github.com/ucsd-progsys/yunounderstand-data.