software-vulnerability-detection-imbalance

This project is the PyTorch implementation of the paper "An Empirical Study of the Imbalance Issue in Software Vulnerability Detection".

Project Overview

Dataset
Source code for CodeBERT
Source code for GraphCodeBERT

Environment

Python==3.7
pytorch==1.7.1
torchvision==0.8.2
tree-sitter==0.20.1
transformers==4.24.0
tqdm
numpy

Dataset

All datasets provide function-level source code and come from three open-source repositories:

CodeXGLUE provides the Devign dataset.

Devign provides the ffmpeg and qemu datasets.

Lin2018 provides the Asterisk, FFmpeg, LibPNG, LibTIFF, Pidgin, and VLC datasets.

Each dataset includes training, validation, and test sets (*_train.jsonl, *_valid.jsonl, *_test.jsonl).
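Since the study is about class imbalance, it can help to check the label distribution of a split before training. The following is a minimal sketch, not code from this repository. It assumes each line of a .jsonl file is a JSON object with an integer label stored under the key "target"; that key name is an assumption, so adjust label_key to match the actual files.

```python
import json
from collections import Counter


def label_counts(jsonl_path, label_key="target"):
    """Count examples per label in a JSON Lines file (one JSON object per line).

    label_key is an assumed field name; change it to match the dataset files.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            counts[record[label_key]] += 1
    return counts


def imbalance_ratio(counts):
    """Size of the largest class divided by the smallest (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())
```

Calling label_counts on a training split and passing the result to imbalance_ratio gives a quick measure of how skewed that project is.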

Run

For GraphCodeBERT, tree-sitter must be built first so that code snippets can be parsed and variable names extracted. Build it with the following commands:

cd graphcodebert/python_parser/parser_folder
bash build.sh

CodeBERT and GraphCodeBERT use the same commands for training and testing. The examples below use CodeBERT.

Fine-tuning

python run.py \
    --do_train \
    --training standard \
    --data_root devign \
    --project_name qemu \
    --epochs 50 \
    --evaluate_during_training \
    --seed 123456

Validation

python run.py \
    --do_eval \
    --training standard \
    --data_root devign \
    --project_name qemu

Test

python run.py \
    --do_test \
    --training standard \
    --data_root devign \
    --project_name qemu
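To compare several imbalance strategies on one project, the commands above can be generated in a loop. The sketch below is only an illustration: the helper names (build_command, plan_runs, run_plan) are hypothetical, the flags are the ones shown in this README, and the working directory "codebert" assumes run.py lives in that folder.

```python
import subprocess
import sys

# --training choices as listed under "Parameter setting" in this README.
STRATEGIES = ("standard", "weight", "cbl", "augmentation",
              "down", "focal", "over", "threshold")


def build_command(phase, training, data_root="devign", project_name="qemu", extra=()):
    """Build the argument list for one run.py call.

    phase is one of do_train, do_eval, do_test.
    """
    if phase not in ("do_train", "do_eval", "do_test"):
        raise ValueError("unknown phase: " + phase)
    return [sys.executable, "run.py", "--" + phase,
            "--training", training,
            "--data_root", data_root,
            "--project_name", project_name] + list(extra)


def plan_runs(strategies=STRATEGIES, data_root="devign", project_name="qemu"):
    """Return fine-tuning and test commands for each strategy, in execution order."""
    train_extra = ["--epochs", "50", "--evaluate_during_training", "--seed", "123456"]
    commands = []
    for training in strategies:
        commands.append(build_command("do_train", training, data_root, project_name, train_extra))
        commands.append(build_command("do_test", training, data_root, project_name))
    return commands


def run_plan(commands, workdir="codebert"):
    """Execute the planned commands in order, stopping at the first failure.

    workdir is an assumption about where run.py is located.
    """
    for command in commands:
        subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    # Dry run: print the commands only, nothing is executed.
    for command in plan_runs(["standard", "focal"]):
        print(" ".join(command))
```

Printing the plan first makes it easy to confirm the flags before committing to long training runs; run_plan can then be called on the same list.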

Parameter settings:

--training: the strategy used to address the imbalance issue.
- Choices:
  - standard: use the default settings of CodeBERT and GraphCodeBERT
  - weight: use the mean false error loss
  - cbl: use the class-balanced loss
  - augmentation: use adversarial attack-based augmentation (the generated data are already provided in the dataset folder; you can also generate them with dataset/function-level/identifyP/augment.py)
  - down: use random down-sampling
  - focal: use the focal loss
  - over: use random over-sampling (the re-sampled data are already provided in the dataset folder; you can also generate them with dataset/function-level/identifyP/augment_du.py)
  - threshold: use threshold-moving
--data_root: the source of the data.
- Choices: codexglue, devign, lin2018
--project_name: the name of the dataset.
- Choices: check the names in dataset/function-level/ for each source.
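For readers unfamiliar with the loss-based and threshold-based options, the sketch below shows common textbook formulations of the focal loss, class-balanced weights, the mean false error, and threshold-moving in plain Python. It is not the code used in this repository: the function names, the default values of gamma, alpha, and beta, and the choice of F1 as the criterion for selecting a threshold are all assumptions made for illustration.

```python
import math


def focal_loss(prob_pos, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t).

    prob_pos is the predicted probability of the positive (vulnerable) class;
    label is 1 for positive and 0 for negative.
    """
    p_t = prob_pos if label == 1 else 1.0 - prob_pos
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights (1 - beta) / (1 - beta^n), scaled to sum to the class count."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def mean_false_error(losses, labels):
    """Mean loss over negatives plus mean loss over positives.

    Averaging each class separately keeps the majority class from dominating.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    negatives = [loss for loss, y in zip(losses, labels) if y == 0]
    positives = [loss for loss, y in zip(losses, labels) if y == 1]
    return mean(negatives) + mean(positives)


def best_threshold(probs, labels, candidates=None):
    """Threshold-moving: pick the decision threshold with the highest F1.

    probs and labels should come from held-out (validation) data.
    Ties are resolved in favor of the smallest candidate.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]

    def f1_at(threshold):
        tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(candidates, key=f1_at)
```

In all four cases the idea is the same: errors on the rare vulnerable class are made to count more, either through the loss (focal, class-balanced, mean false error) or through the decision rule applied after training (threshold-moving).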

Publication


@inproceedings{guo2023empirical,
  title={An Empirical Study of the Imbalance Issue in Software Vulnerability Detection},
  author={Yuejun Guo and Qiang Hu and Qiang Tang and Yves Le Traon},
  booktitle={Computer Security -- ESORICS 2023},
  publisher={Springer Nature Switzerland},
  address={Cham},
  pages={371--390},
}

If you use this project, please consider citing us.
