This is the replication package for the paper "Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks".
Our datasets and online appendix can be found here.
Our experiments were run with the following environment:
- CUDA: 11.1
- torch==1.10.2+cu111
- allennlp==2.8.0
- allennlp_models==2.8.0
- transformers==4.12.5
- numpy==1.22.4
- scipy==1.8.1
- torchtext==0.11.2
- torchvision==0.11.3+cu111
- GPU: RTX 3090 24GB
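
One possible way to install the Python packages listed above is sketched below; this is not an official requirements file, and it assumes pip plus the official PyTorch wheel index for the CUDA 11.1 builds.

```bash
# Sketch only: install the pinned versions listed above.
# The extra index (-f) is needed for the +cu111 torch/torchvision builds.
pip install torch==1.10.2+cu111 torchvision==0.11.3+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
pip install torchtext==0.11.2 allennlp==2.8.0 allennlp-models==2.8.0 \
    transformers==4.12.5 numpy==1.22.4 scipy==1.8.1
```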
The repository is organized as follows:
- `common`: Common modules for both pre-training and downstream tasks.
- `data`: Datasets and saved models.
  - `datasets`: Datasets for experiments.
  - `models`: Saved models for intrinsic evaluation and extrinsic evaluation (fine-tuned).
- `dist_importing`: For importing modules when dist_training is enabled for allennlp.
- `downstreams`: Modules, scripts and configs for downstream tasks (extrinsic evaluation).
- `pretrain`: Modules, scripts and configs for the pre-training task (intrinsic evaluation).
- `utils`: Utility functions.
To reproduce the intrinsic evaluation results in Table 1 and Table 2 of our paper:
- Go to the `pretrain` folder (this is important for relative path resolution); a combined run sketch is also given after this list.
- For the partial-code intrinsic evaluation results in Table 1, run:
  ```
  python eval_partial_func_pdg.py
  ```
- For the full-function-only intrinsic evaluation results in Table 2, run:
  ```
  python eval_full_func_pdg.py
  ```
- We converted the test set into a single file named `packed_hybrid_vol_221228.pkl`; the ground truth for control dependency prediction (CDP) and data dependency prediction (DDP) was constructed from the outputs of Joern and is provided in this file.
- A pre-trained model with CDP and DDP heads is provided in `models/intrinsic`, but it is intended only for intrinsic evaluation.
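
Putting the steps above together, a complete intrinsic evaluation run looks like the following sketch (assumed to be executed from the repository root):

```bash
# Intrinsic evaluation (Tables 1 and 2); run from the repository root.
cd pretrain                        # required: the scripts rely on relative paths
python eval_partial_func_pdg.py    # Table 1: partial-code setting
python eval_full_func_pdg.py       # Table 2: full-function setting
```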
We use three vulnerability analysis tasks for extrinsic evaluation: vulnerability detection, vulnerability classification and vulnerability assessment.
To run training and testing as a unified pipeline, you need to make some configurations in `downstream/global_vars.json`.
In detail, the key of the object in `downstream/global_vars.json` should be the name of your machine (run the Python command `import platform; print(platform.node())` to check it), and `python_bin` should be the path where your Python binary is located.
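
A minimal sketch of creating such an entry is shown below. The exact schema (machine name mapped to an object with a `python_bin` field) is our reading of the description above, so adjust it if your copy of `global_vars.json` differs; note that the command overwrites any existing file.

```bash
# Illustrative only: write a global_vars.json entry for the current machine.
# WARNING: this replaces downstream/global_vars.json; merge by hand if it already has entries.
NODE_NAME=$(python -c "import platform; print(platform.node())")
PYTHON_BIN=$(which python)
cat > downstream/global_vars.json <<EOF
{
  "${NODE_NAME}": {
    "python_bin": "${PYTHON_BIN}"
  }
}
EOF
```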
For vulnerability detection:
- Go to the `downstream` folder (this is important for relative path resolution).
- For the three datasets, run:
  - ReVeal:
    ```
    python train_eval_from_config.py -config configs/vul_detect/pdbert_reveal.jsonnet -task_name vul_detect/reveal -average binary
    ```
  - Devign:
    ```
    python train_eval_from_config.py -config configs/vul_detect/pdbert_devign.jsonnet -task_name vul_detect/devign -average binary
    ```
  - BigVul:
    ```
    python train_eval_from_config.py -config configs/vul_detect/pdbert_bigvul.jsonnet -task_name vul_detect/bigvul -average binary
    ```
For vulnerability classification (CWE classification):
- Go to the `downstream` folder (this is important for relative path resolution).
- Run:
  ```
  python train_eval_from_config.py -config configs/cwe_class/pdbert.jsonnet -task_name cwe_class -average macro -extra_averages weighted
  ```
For vulnerability assessment:
- Go to the `downstream` folder (this is important for relative path resolution).
- Run:
  ```
  python train_eval_multi_task_from_config.py -config configs/vul_assess/pdbert.jsonnet -task_name vul_assess -extra_eval_configs "{\"task_names\":\"CPL,AVL,CFD,ITG\"}" -eval_script eval_multi_task_classification -average macro -extra_averages weighted
  ```
Notes:
- If you want to change the configuration of a task, edit the corresponding file under `downstream/configs`.
- GPU execution is enabled by default. If you run these experiments on a GPU with limited memory and encounter a "CUDA out of memory" error, try decreasing `data_loader/batch_size` in the config. To stay consistent with our configuration, you should correspondingly increase `trainer/num_gradient_accumulation_steps`, since the effective batch size is `batch_size * num_gradient_accumulation_steps` (for example, if you halve `batch_size`, double `num_gradient_accumulation_steps`).
- Due to an unavailable network or other connection problems, the process may occasionally fail with errors such as "The TLS connection was non-properly terminated" or "Make sure that 'microsoft/codebert-base' is a correct model identifier listed on 'https://huggingface.co/models'". This happens because our model is built on CodeBERT, and the transformers library needs to fetch CodeBERT's metadata from the remote hub. As a workaround, you can download the archived CodeBERT model and put it in the right path for local loading. Take these steps (combined into a single sketch after this list):
  - Download the CodeBERT model:
    ```
    git lfs install
    git clone https://huggingface.co/microsoft/codebert-base
    ```
  - For intrinsic evaluation, move the downloaded CodeBERT directory to `pretrain/microsoft/codebert-base`.
  - For extrinsic evaluation, move the downloaded CodeBERT directory to `downstreams/microsoft/codebert-base`.
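
The workaround above, combined into one sketch (run from the repository root; copying instead of moving keeps the clone available for both evaluations):

```bash
# Sketch of the local-CodeBERT workaround; target paths follow the steps listed above.
git lfs install
git clone https://huggingface.co/microsoft/codebert-base
mkdir -p pretrain/microsoft downstreams/microsoft
cp -r codebert-base pretrain/microsoft/codebert-base      # used for intrinsic evaluation
cp -r codebert-base downstreams/microsoft/codebert-base   # used for extrinsic evaluation
```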