- This is an extension of the popular code2vec model, which learns distributed representations of code by taking its AST paths as input.
- Our model takes paths from multiple graph-based representations, calculates a weighted average (using an attention mechanism) over each type of path separately, and aggregates the results to create a final code vector (a toy sketch of this idea follows below). The goal is to use semantic representations such as the Control Flow Graph (CFG) and the Program Dependency Graph (PDG) alongside the usual ASTs.
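To make the aggregation idea concrete, here is a minimal NumPy sketch, not the repository's actual Keras implementation: the attention vectors are random stand-ins, the dimensions are made up, and concatenation is only one possible way to aggregate the per-representation vectors.

```python
import numpy as np

def attention_average(contexts, attention_vector):
    """Attention-weighted average of path-context vectors: (num_paths, dim) -> (dim,)."""
    scores = contexts @ attention_vector        # one score per path
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ contexts                   # weighted average over paths

dim = 4
rng = np.random.default_rng(0)
# Made-up path-context vectors: 3 paths per representation.
contexts = {name: rng.normal(size=(3, dim)) for name in ("ast", "cfg", "pdg")}
# One attention vector per representation (random stand-ins for learned parameters).
attention = {name: rng.normal(size=dim) for name in contexts}

# Attend over each representation's paths separately, then aggregate
# (here by concatenation) into a single code vector.
code_vector = np.concatenate(
    [attention_average(contexts[n], attention[n]) for n in contexts]
)
print(code_vector.shape)  # (12,)
```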
Please refer to code2vec's README for details on installing the required dependencies, the model configuration parameters, releasing/exporting the model, and exporting the code vectors.
The settings and parameters specific to our model are described below:
### Creating a dataset

Use our Path Extractor tool to extract syntactic and semantic paths from C programs and create a dataset.
### Training and Evaluating the model
- Set the variables in `train.sh` to point to the right dataset. By default, it points to our "sumatrapdf_ast_cfg_ddg" dataset, created using the Path Extractor tool.
- Uncomment one of the python commands in `train.sh`, depending on whether you are training the model or evaluating it on the test dataset.
- Set the `--reps` and `--max_contexts` arguments in the python commands (see the example command after this list). Please note that `--reps` should be set to exactly the representations included in the dataset, and the `--max_contexts` value for each type of path must also match the dataset.
  - `--reps` takes a space-separated list of representations. Allowed values are 'ast', 'cfg', 'cdg', 'ddg'.
  - `--max_contexts` takes a JSON object where each key is a representation and its corresponding value is the maximum number of paths for that representation.
- You can edit the configuration hyper-parameters in the file `config.py`, as explained here.
- Run the `train.sh` script:

  ```
  source train.sh
  ```
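For illustration, a training invocation for a dataset containing AST, CFG, and DDG paths might look like the sketch below. The script name, the `--data` and `--save` flags, the dataset path, and the context limits are assumptions based on upstream code2vec; treat the python command already present in `train.sh` as the authoritative template.

```bash
# Illustrative sketch only -- mirror the python command that already exists in train.sh.
# --reps must name exactly the representations present in the dataset, and the
# per-representation limits in --max_contexts must match those used to create it.
python3 -u code2vec.py \
    --data data/sumatrapdf_ast_cfg_ddg/sumatrapdf_ast_cfg_ddg \
    --save models/sumatrapdf_ast_cfg_ddg/saved_model \
    --reps ast cfg ddg \
    --max_contexts '{"ast": 200, "cfg": 100, "ddg": 100}'
```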
Note that we have implemented only the Keras version of the model; a pure TensorFlow implementation is currently not available.
### Dataset Format

- Each of the training, test, and validation splits should be a single text file, where each row is one example.
- Each example is a space-separated list of fields (see the sketch after this list), where:
  - The first field is the method name (the label), with subwords separated by pipes ("|").
  - The following fields are the AST path-contexts, followed by the CFG path-contexts, and then the PDG path-contexts. The only way to tell whether a path-context is an AST, CFG, or PDG path is by counting its position from the start (this is why it is important to set the correct value for the `--max_contexts` argument). If an example has fewer path-contexts than `--max_contexts`, it should be padded with additional spaces to compensate for the missing paths.
  - Each path-context has three components separated by commas (","). None of these components can include spaces or commas. The first and third components of a path-context are the context words, and the second component is the actual path.
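As a minimal sketch, assuming a `--max_contexts` of 2 for each of AST, CFG, and PDG, a row for a method named `openFile` could look like this (all tokens and path strings here are made up):

```
open|file f,astPathA,x x,astPathB,y f,cfgPathA,x x,cfgPathB,y f,pdgPathA,y x,pdgPathB,y
```

The first two path-contexts are read as AST paths, the next two as CFG paths, and the final two as PDG paths; each one is of the form `word,path,word`.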
The datasets we have created for our project can be found here.
You can test whether all the dependencies are properly installed by training the model on the `testdata` we have provided. Before training, just set the following parameters in the `config.py` file:

```python
self.TRAIN_BATCH_SIZE = 2
self.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 2
```
For any queries or suggestions, please feel free to contact:
- Karthik Chandra (cs17b026@iittp.ac.in)
- Dheeraj Vagavolu (cs17b028@iittp.ac.in)
- Sridhar Chimalakonda (ch@iittp.ac.in)