We propose a novel adversarial example generation technique (i.e., CODA) for testing deep code models. Its key idea is to use code differences between the target input and reference inputs to guide the generation of adversarial examples.


CODA

News: We are updating the repository with the latest code!

To improve test effectiveness on deep code models, we propose a novel perspective: exploiting the code differences between reference inputs and the target input to guide the generation of adversarial examples. From this perspective, we design CODA, which reduces the ingredient space to the one constituted by structure and identifier differences, and designs equivalent structure transformations and identifier-renaming transformations to preserve the original semantics. We conducted an extensive study on 15 subjects. The results demonstrate that CODA reveals more faults in less time than the state-of-the-art techniques (i.e., CARROT and ALERT) and confirm its capability to enhance model robustness.

See Zhao Tian, Junjie Chen, et al. "Code Difference Guided Adversarial Example Generation for Deep Code Models." The 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23).
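At a high level, the approach iteratively applies edits to the target input, using its differences from reference inputs to choose candidates and stopping once the model's prediction flips. The sketch below only illustrates that difference-guided loop on plain strings; the character-level "diff", the single-character edit, and the `model` callable are hypothetical stand-ins, not CODA's actual implementation (in CODA, the edits are the semantics-preserving transformations described in the next section).

```python
def difference_guided_attack(model, target, references, max_iters=50):
    """Toy illustration of a difference-guided search (not CODA itself):
    repeatedly copy one differing position from a reference into the
    candidate until the model's prediction flips."""
    candidate = target
    for ref in references:
        for _ in range(max_iters):
            # Positions where the candidate and the reference still differ.
            diffs = [i for i, (a, b) in enumerate(zip(candidate, ref)) if a != b]
            if not diffs:
                break
            # Apply one edit drawn from the difference (here: copy a character).
            i = diffs[0]
            candidate = candidate[:i] + ref[i] + candidate[i + 1:]
            # Success: the prediction changed, so an adversarial variant is found.
            if model(candidate) != model(target):
                return candidate
    return None  # no adversarial example found within the budget
```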

Overview

[Figure: overview of the CODA workflow]


Code Structure Transformation

Descriptions and examples of the code structure transformation rules in CODA:

  1. R1-loop: equivalent transformation between the for structure and the while structure

(1) while loop → for loop

(2) for loop → while loop

  2. R2-branch: equivalent transformation between the if-else(-if) structure and the if-if structure

(1) if-else(-if) → if-if

(2) if-if → if-else(-if)

  3. R3-calculation: equivalent numerical calculation transformation, e.g., ++, --, +=, -=, *=, /=, %=, <<=, >>=, &=, |=, ^=

(1) i++ → i = i + 1

(2) i-- → i = i - 1

(3) i += j → i = i + j

(4) i -= j → i = i - j

(5) i *= j → i = i * j

(6) i /= j → i = i / j

(7) i %= j → i = i % j

(8) i <<= j → i = i << j

(9) i >>= j → i = i >> j

(10) i &= j → i = i & j

(11) i |= j → i = i | j

(12) i ^= j → i = i ^ j

  4. R4-constant: equivalent transformation between a constant and a variable assigned the same constant

(1) Any literal expression (string, number, character, or boolean) can be replaced with a variable assigned the same value: println("Hello, World!"); → String i = "Hello, World!"; println(i);
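To make the R3 rule concrete, here is a minimal sketch of how compound-assignment and increment/decrement rewrites could be applied to a statement textually. This regex-based approach is only illustrative; CODA itself operates on parse trees (via Tree-sitter), and the function name `expand_r3` and the exact patterns are our own.

```python
import re

# One compound-assignment operator (e.g. `+=`, `<<=`) followed by its operand.
COMPOUND = re.compile(r"(\w+)\s*(\+|-|\*|/|%|<<|>>|&|\||\^)=\s*([^;]+);")
# Postfix increment/decrement, e.g. `i++;` or `i--;`.
INCDEC = re.compile(r"(\w+)(\+\+|--);")

def expand_r3(stmt: str) -> str:
    """Expand R3-style shorthands: `i++;` -> `i = i + 1;`, `i += j;` -> `i = i + j;`."""
    stmt = INCDEC.sub(lambda m: f"{m.group(1)} = {m.group(1)} {m.group(2)[0]} 1;", stmt)
    return COMPOUND.sub(
        lambda m: f"{m.group(1)} = {m.group(1)} {m.group(2)} {m.group(3).strip()};", stmt
    )
```

Because both forms compute the same value, the rewritten program is semantically equivalent to the original, which is exactly what the transformation rules require.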

Folder Structure

.
│  README.md
│  utils.py
│  
├─test
│  ├─AuthorshipAttribution
│  │  │  README.md
│  │  ├─code
│  │  └─dataset
│  │          
│  ├─CloneDetection
│  │  │  README.md
│  │  ├─code  
│  │  └─dataset
│  │          
│  ├─DefectPrediction
│  │  │  README.md
│  │  ├─code
│  │  └─dataset
│  │          
│  ├─FunctionalityClassification
│  │  │  README.md
│  │  ├─code
│  │  └─dataset
│  │          
│  └─VulnerabilityPrediction
│      │  README.md
│      ├─code
│      └─dataset       
├─figs 
└─python_parser
    │  pattern.py
    │  run_parser.py
    └─parser_folder
        ├─tree-sitter-c          
        ├─tree-sitter-cpp       
        ├─tree-sitter-java         
        └─tree-sitter-python

Under each subject's folder in test/ (AuthorshipAttribution/, CloneDetection/, DefectPrediction/, FunctionalityClassification/, and VulnerabilityPrediction/), there are two folders (code/ and dataset/) and one file (README.md). The dataset/ directory stores the original dataset and the data processing program (get_reference.py). The code/ directory contains the test code (test.py and attacker.py). The README.md file lists the commands for data processing and testing. The python_parser/ directory contains Tree-sitter, a parse-tree generation tool, which we use to implement parsing tools for multiple programming languages (C/C++, Java, and Python).


Environment Configuration

Docker

Our experiments were conducted under Ubuntu 20.04. We provide a ready-to-use Docker image for this experiment; all the datasets are also included in the image.

docker pull tianzhao1020/coda:v1.6

Then, assuming you have NVIDIA GPUs, you can create a container from this image. For example:

docker run --name=coda --gpus all -it --mount type=bind,src=/home/coda,dst=/workspace tianzhao1020/coda:v1.6

Subjects

(1) Statistics of the datasets and target models.

[Figure: statistics of the datasets and target models]


Download all the fine-tuned models from this Google Drive Link.

Experiments

Demo

Let's take CodeBERT and the Authorship Attribution task as an example. The dataset/ folder contains the training and evaluation data for this task. Run python test.py in each code/ directory to test the deep code models. For example, run the following commands to test the CodeBERT model on Authorship Attribution:

cd /root/CODA/test/AuthorshipAttribution/code/;
CUDA_VISIBLE_DEVICES=0 python test.py --eval_data_file=../dataset/data_folder/processed_gcjpy/valid.txt --model_name=codebert;

Running Experiments

Refer to the README.md file under each subject's folder to prepare the dataset and test the models on the different tasks.

Acknowledgement

We are very grateful to the authors of Tree-sitter, CodeBERT, GraphCodeBERT, CodeT5, ALERT, and CARROT for making their code publicly available, which allowed us to build this repository on top of their work.

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62322208, 62002256, 62192731, and 62192730, and by the CCF Young Elite Scientists Sponsorship Program (by CAST).

