This folder contains the implementation of the GGSNN and GMN models from DeepMind. The tool consists of two components. The first takes the ACFG disasm data as input and produces a number of intermediate results. The second, which implements the machine learning component, takes those results as input.
The first part of the tool is implemented in a Python3 script called `gnn_preprocessing.py`. We also provide a Docker container with the required dependencies.
The input consists of:
- A folder with the JSON files extracted via the ACFG disasm IDA plugin:
  - For the Datasets we have released, the JSON files are already available in the `features` directory. Please note that the features are downloaded from GDrive as explained in the README.
  - To extract the features for a new set of binaries, run the ACFG disasm IDA plugin following the instructions in the README.
- A JSON file `opcodes_dict.json` that maps the selected opcodes to their (frequency) ranking in the training dataset (required when the script is not run in `--training` mode; see the `-d` option in the examples below).
The script will produce the following output:
- A JSON file `opcodes_dict.json` that maps the selected opcodes to their (frequency) ranking in the training dataset (only if the script is launched in `--training` mode).
- A JSON file `graph_func_dict_opc_{}.json` with the selected intermediate features.
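To make the idea of a frequency ranking concrete, here is a minimal sketch of how such a mapping could be built. It is not the code of `gnn_preprocessing.py`: the function name `build_opcode_ranking`, the `top_n` parameter, the sample opcode lists, and the choice of 0-based ranks are all assumptions made for illustration.

```python
from collections import Counter
import json


def build_opcode_ranking(opcode_sequences, top_n):
    """Map the top_n most frequent opcodes to a rank (0 = most frequent).

    Hypothetical helper: the real script may select and number opcodes
    differently.
    """
    counts = Counter()
    for sequence in opcode_sequences:
        counts.update(sequence)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {opcode: rank for rank, (opcode, _) in enumerate(ordered[:top_n])}


# Hypothetical opcode sequences, one list per training function.
training_functions = [
    ["push", "mov", "mov", "call", "ret"],
    ["mov", "add", "mov", "ret"],
    ["push", "call", "ret"],
]

opcodes_dict = build_opcode_ranking(training_functions, top_n=3)
print(json.dumps(opcodes_dict))
```

Because the ranking is computed on the training dataset only, the same dictionary has to be reused for the validation and testing data so that opcode indices stay consistent across runs.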
The following are the concrete steps to run the analysis within the provided Docker container:
- Build the docker image:
docker build --no-cache Preprocessing/ -t gnn-preprocessing
- Run the main script within the docker container:
docker run --rm \
-v <path_to_the_acfg_disasm_dir>:/input \
-v <path_to_the_training_data>:/training_data \
-v <path_to_the_output_dir>:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input [--training] -o /output
You can see all options of the `gnn_preprocessing.py` command with:
docker run --rm -it gnn-preprocessing /code/gnn_preprocessing.py --help
- Example: run `gnn_preprocessing.py` in training mode on the Dataset-1_training:
docker run --rm \
-v $(pwd)/../../DBs/Dataset-1/features/training/acfg_disasm_Dataset-1_training:/input \
-v $(pwd)/Preprocessing/Dataset-1_training:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input --training -o /output
- Example: run `gnn_preprocessing.py` on the Dataset-1_validation:
docker run --rm \
-v $(pwd)/../../DBs/Dataset-1/features/validation/acfg_disasm_Dataset-1_validation:/input \
-v $(pwd)/Preprocessing/Dataset-1_training:/training_data \
-v $(pwd)/Preprocessing/Dataset-1_validation:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input -d /training_data/opcodes_dict.json -o /output
- Example: run `gnn_preprocessing.py` on the Dataset-1_testing:
docker run --rm \
-v $(pwd)/../../DBs/Dataset-1/features/testing/acfg_disasm_Dataset-1_testing:/input \
-v $(pwd)/Preprocessing/Dataset-1_training:/training_data \
-v $(pwd)/Preprocessing/Dataset-1_testing:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input -d /training_data/opcodes_dict.json -o /output
- Example: run `gnn_preprocessing.py` on the Dataset-2:
docker run --rm \
-v $(pwd)/../../DBs/Dataset-2/features/acfg_disasm_Dataset-2:/input \
-v $(pwd)/Preprocessing/Dataset-1_training:/training_data \
-v $(pwd)/Preprocessing/Dataset-2:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input -d /training_data/opcodes_dict.json -o /output
- Example: run `gnn_preprocessing.py` on the Dataset-Vulnerability:
docker run --rm \
-v $(pwd)/../../DBs/Dataset-Vulnerability/features/acfg_disasm_Dataset-Vulnerability:/input \
-v $(pwd)/Preprocessing/Dataset-1_training:/training_data \
-v $(pwd)/Preprocessing/Dataset-Vulnerability:/output \
-it gnn-preprocessing /code/gnn_preprocessing.py -i /input -d /training_data/opcodes_dict.json -o /output
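The examples above differ only in the mounted folders and in whether the script runs in `--training` mode or reuses an existing `opcodes_dict.json` via `-d`. The following sketch assembles the same command line programmatically. The helper name `preprocessing_command` and the `/data/...` paths are hypothetical; only the image name, mount points, and flags are taken from the examples.

```python
import shlex


def preprocessing_command(input_dir, output_dir, training_dir=None):
    """Return the docker argv for one preprocessing run (hypothetical helper).

    With training_dir=None the script runs in --training mode and writes a
    new opcodes_dict.json; otherwise it reuses the one in training_dir.
    """
    cmd = ["docker", "run", "--rm", "-v", f"{input_dir}:/input"]
    if training_dir is not None:
        cmd += ["-v", f"{training_dir}:/training_data"]
    cmd += [
        "-v", f"{output_dir}:/output",
        "-it", "gnn-preprocessing", "/code/gnn_preprocessing.py",
        "-i", "/input",
    ]
    if training_dir is None:
        cmd.append("--training")
    else:
        cmd += ["-d", "/training_data/opcodes_dict.json"]
    cmd += ["-o", "/output"]
    return cmd


# Hypothetical absolute paths; replace them with your own folders.
train_cmd = preprocessing_command(
    "/data/acfg_disasm_Dataset-1_training",
    "/data/Preprocessing/Dataset-1_training",
)
valid_cmd = preprocessing_command(
    "/data/acfg_disasm_Dataset-1_validation",
    "/data/Preprocessing/Dataset-1_validation",
    training_dir="/data/Preprocessing/Dataset-1_training",
)

for cmd in (train_cmd, valid_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

The commands are printed rather than executed, since `-it` expects an interactive terminal. The training run must come first, because the other runs read the `opcodes_dict.json` it produces.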
Run the unit tests:
docker run --rm \
-v $(pwd)/Preprocessing/testdata/:/input \
-v $(pwd)/Preprocessing/testdata/gnn_intermediate:/output \
-it gnn-preprocessing /bin/bash -c "( cd /code && python3 -m unittest test_gnn_preprocessing.py )"
The second part implements the machine learning component. We also provide a Docker container with TensorFlow 1.14 and the other required dependencies.
The neural network model takes as input:
- The CSV files with the functions to train on, or the pairs of functions to validate and test the model. These files are already available for the Datasets we have released. Their paths are hardcoded in the `config.py` file, based on the dataset type.
- The `graph_func_dict_opc_{}.json` file from Part 1.
- The model checkpoint (only if the model is used in inference mode, i.e., during validation and testing).
The model will produce the following output:
- A set of CSV files with the similarity (column `sim`) for the functions selected for validation and testing.
- A `config.json` file with the configuration used to run the test. This includes the parameters and the paths of the input CSV and JSON files. This file is useful for debugging and tracking different experiments.
- A `gnn.log` file with the logs from the neural network. For more verbose logging, use the `--debug` (`-d`) option.
- The model checkpoint (only if the model is trained).
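As an illustration of how the result CSV files can be consumed, the sketch below loads the `sim` column and lists the most similar pairs. The sample rows, the presence of the `idb_path_1`, `fva_1`, `idb_path_2`, `fva_2` columns in the output, and the `0.5` threshold are assumptions made for the example; check the actual files produced by your run.

```python
import csv
import io

# Hypothetical excerpt of a result file: pair key columns plus `sim`.
SAMPLE = """idb_path_1,fva_1,idb_path_2,fva_2,sim
a.i64,0x401000,b.i64,0x8000,0.93
a.i64,0x401000,c.i64,0x9000,0.12
a.i64,0x402000,b.i64,0x8100,0.78
"""


def load_similarities(handle):
    """Read the rows of a result CSV and convert `sim` to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["sim"] = float(row["sim"])
        rows.append(row)
    return rows


def top_matches(rows, threshold):
    """Return the pairs with sim >= threshold, most similar first."""
    kept = [row for row in rows if row["sim"] >= threshold]
    return sorted(kept, key=lambda row: row["sim"], reverse=True)


rows = load_similarities(io.StringIO(SAMPLE))
matches = top_matches(rows, threshold=0.5)  # arbitrary illustrative threshold
for row in matches:
    print(row["idb_path_1"], row["fva_1"], row["idb_path_2"], row["fva_2"], row["sim"])
```

To read a real file, pass `open(path, newline="")` to `load_similarities` instead of the in-memory sample.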
The following are the concrete steps to run the neural network using our Docker container:
- Build the docker image:
docker build --no-cache NeuralNetwork/ -t gnn-neuralnetwork
- Run the neural network within the Docker container:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/Preprocessing:/preprocessing \
-v $(pwd)/NeuralNetwork/:/output \
-it gnn-neuralnetwork /code/gnn.py (--train | --validate | --test) [--num_epochs 10] \
--model_type {embedding,matching} --training_mode {pair,triplet} \
--features_type {opc,nofeatures} --dataset {one,two,vuln} \
-c /code/model_checkpoint \
-o /output/Dataset-x
- You can see all options of the `gnn.py` command with:
docker run --rm -it gnn-neuralnetwork /code/gnn.py --help
- Example: run the training on the Dataset-1 for the GGSNN model (`embedding`) with `opc` features in `pair` mode:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/NeuralNetwork:/output \
-v $(pwd)/Preprocessing:/preprocessing \
-it gnn-neuralnetwork /code/gnn.py --train --num_epochs 10 \
--model_type embedding --training_mode pair \
--features_type opc --dataset one \
-c /output/model_checkpoint_$(date +'%Y-%m-%d') \
-o /output/Dataset-1_training_GGSNN_opc_pair
The newly trained model will be saved in `$(pwd)/NeuralNetwork/model_checkpoint_$(date +'%Y-%m-%d')`. Use the `--restore` option to continue the training from an existing checkpoint.
- Example: run the training on the Dataset-1 for the GGSNN model (`embedding`) with `nofeatures` in `pair` mode:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/NeuralNetwork:/output \
-v $(pwd)/Preprocessing:/preprocessing \
-it gnn-neuralnetwork /code/gnn.py --train --num_epochs 10 \
--model_type embedding --training_mode pair \
--features_type nofeatures --dataset one \
-c /output/model_checkpoint_$(date +'%Y-%m-%d') \
-o /output/Dataset-1_training_GGSNN_nofeatures_pair
- Example: run the training on the Dataset-1 for the GMN model (`matching`) with `opc` features in `pair` mode:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/NeuralNetwork:/output \
-v $(pwd)/Preprocessing:/preprocessing \
-it gnn-neuralnetwork /code/gnn.py --train --num_epochs 16 \
--model_type matching --training_mode pair \
--features_type opc --dataset one \
-c /output/model_checkpoint_$(date +'%Y-%m-%d') \
-o /output/Dataset-1_training_GMN_opc_pair
- Example: run the validation on Dataset-1 using the model_checkpoint that we trained on Dataset-1:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/NeuralNetwork/:/output \
-v $(pwd)/Preprocessing:/preprocessing \
-it gnn-neuralnetwork /code/gnn.py --validate \
--model_type embedding --training_mode pair \
--features_type opc --dataset one \
-c /code/model_checkpoint_GGSNN_pair \
-o /output/Dataset-1_validation
- Example: run the testing on Dataset-1 using the model_checkpoint that we trained on Dataset-1:
docker run --rm \
-v $(pwd)/../../DBs:/input \
-v $(pwd)/NeuralNetwork/:/output \
-v $(pwd)/Preprocessing:/preprocessing \
-it gnn-neuralnetwork /code/gnn.py --test \
--model_type embedding --training_mode pair \
--features_type opc --dataset one \
-c /code/model_checkpoint_GGSNN_pair \
-o /output/Dataset-1_testing
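When several experiments are launched from a script, it can help to validate the option values before starting the container. The sketch below mirrors the choices shown in the usage line above. It is a hypothetical wrapper, not part of `gnn.py`; the names `gnn_arguments`, `CHOICES`, and `MODES` are invented for this example.

```python
# Allowed values, copied from the usage line of gnn.py shown above.
CHOICES = {
    "model_type": ("embedding", "matching"),
    "training_mode": ("pair", "triplet"),
    "features_type": ("opc", "nofeatures"),
    "dataset": ("one", "two", "vuln"),
}
MODES = ("train", "validate", "test")


def gnn_arguments(mode, checkpoint, output, num_epochs=None, **options):
    """Build the gnn.py argument list, rejecting unknown option values."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    unknown = set(options) - set(CHOICES)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    args = ["/code/gnn.py", f"--{mode}"]
    if num_epochs is not None:
        args += ["--num_epochs", str(num_epochs)]
    for name, allowed in CHOICES.items():
        if name not in options:
            raise ValueError(f"missing option: {name}")
        if options[name] not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {options[name]!r}")
        args += [f"--{name}", options[name]]
    args += ["-c", checkpoint, "-o", output]
    return args


train_args = gnn_arguments(
    "train",
    checkpoint="/output/model_checkpoint",
    output="/output/Dataset-1_training_GGSNN_opc_pair",
    num_epochs=10,
    model_type="embedding",
    training_mode="pair",
    features_type="opc",
    dataset="one",
)
print(" ".join(train_args))
```

The resulting list can be appended to the `docker run ... -it gnn-neuralnetwork` prefix used in the examples.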
The following are the main steps that are needed to run the models on a new dataset of functions.
For training:
- Create a CSV file with the selected functions for training. Example here. `idb_path` and `fva` are the "primary keys" used to uniquely identify a function. The only requirement is that the same function (i.e., the same function name) is compiled under different settings (e.g., compilers, architectures, optimizations). The more variants each function has, the better the model can generalize.
- Extract the features using the ACFG disasm IDA plugin following the instructions in the README. The `idb_path` for the selected functions must be a valid path to an IDB file for the IDA plugin to run correctly.
- Run the GGSNN/GMN preprocessing tool following the instructions in Part 1.
- Run the GGSNN/GMN neural network in training mode (`--train`) following the instructions in Part 2.
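The requirement on the training CSV (each function name must appear in more than one compiled variant) can be checked before training. The sketch below groups rows by function name and reports names with fewer than two variants. The column name `func_name` and the sample rows are assumptions; only `idb_path` and `fva` are named in the steps above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical training CSV; `func_name` is an assumed column name.
SAMPLE = """idb_path,fva,func_name
IDBs/x64-gcc-O0_libfoo.i64,0x401000,parse_header
IDBs/arm32-clang-O2_libfoo.i64,0x8f00,parse_header
IDBs/x64-gcc-O0_libfoo.i64,0x401200,free_ctx
"""


def variants_per_function(handle):
    """Group the (idb_path, fva) keys by function name."""
    groups = defaultdict(set)
    for row in csv.DictReader(handle):
        groups[row["func_name"]].add((row["idb_path"], row["fva"]))
    return groups


def functions_without_variants(groups):
    """Return the function names that have fewer than two variants."""
    return sorted(name for name, keys in groups.items() if len(keys) < 2)


groups = variants_per_function(io.StringIO(SAMPLE))
print(functions_without_variants(groups))
```

Functions reported by this check cannot form a similar pair with another variant of themselves, so they are candidates for removal or for additional compilation settings.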
For validation and testing:
- Create a CSV file with the pairs of functions selected for validation and testing. Example here. (`idb_path_1`, `fva_1`) and (`idb_path_2`, `fva_2`) are the "primary keys".
- Extract the features using the ACFG disasm IDA plugin following the instructions in the README. `idb_path_1` and `idb_path_2` for the selected functions must be valid paths to the IDB files for the IDA plugin to run correctly.
- Run the GGSNN/GMN preprocessing tool following the instructions in Part 1.
- Run the GGSNN/GMN neural network in testing mode (`--test`) following the instructions in Part 2.
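A pairs file with the four key columns described above can be generated from a list of functions. The sketch below enumerates all unordered pairs, which is only one possible selection strategy. The sample functions and the helper name `write_pairs` are hypothetical, and the released CSV files may contain additional columns.

```python
import csv
import io
from itertools import combinations

# Hypothetical (idb_path, fva) keys of the functions to compare.
functions = [
    ("IDBs/x64-gcc-O0_libfoo.i64", "0x401000"),
    ("IDBs/arm32-clang-O2_libfoo.i64", "0x8f00"),
    ("IDBs/mips32-gcc-O3_libfoo.i64", "0x4a10"),
]


def write_pairs(functions, handle):
    """Write every unordered pair of functions and return the pair count."""
    writer = csv.writer(handle)
    writer.writerow(["idb_path_1", "fva_1", "idb_path_2", "fva_2"])
    count = 0
    for (idb_1, fva_1), (idb_2, fva_2) in combinations(functions, 2):
        writer.writerow([idb_1, fva_1, idb_2, fva_2])
        count += 1
    return count


buffer = io.StringIO()
num_pairs = write_pairs(functions, buffer)
print(num_pairs)
print(buffer.getvalue())
```

To write to disk, replace the in-memory buffer with `open(path, "w", newline="")`.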
- The GMN neural network requires two functions as input to compute their similarity. This limits the scalability of the approach because the model does not translate each function into a standalone embedding representation.
- The model checkpoint we provide was trained using the functions of Dataset-1, which have been compiled for Linux using three architectures (x86-64, ARM 32/64 and MIPS 32/64), five optimizations, and two compilers (GCC and CLANG). Do not use the model to infer the similarity of functions compiled in different settings (e.g., for Windows). Instead, retrain it following the instructions above.
- The implementation supports different types of loss functions, features, and training modes (pair or triplet). More information is available in the `gnn.py` and `config.py` files.
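The scalability note about the GMN model can be made concrete by counting model evaluations when every function in a set is compared with every other. The sketch assumes that the embedding model needs one forward pass per function (with similarities then computed on the resulting vectors) and that the matching model needs one forward pass per pair. The function names are invented for illustration.

```python
def embedding_model_calls(num_functions):
    """One forward pass per function; vectors are compared afterwards."""
    return num_functions


def matching_model_calls(num_functions):
    """One forward pass per unordered pair of functions."""
    return num_functions * (num_functions - 1) // 2


for n in (10, 1_000, 100_000):
    print(n, embedding_model_calls(n), matching_model_calls(n))
```

Under these assumptions the number of evaluations grows linearly with the number of functions for the embedding model and quadratically for the matching model.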
The NeuralNetwork implementation includes part of the code from https://github.com/deepmind/deepmind-research/blob/master/graph_matching_networks/graph_matching_networks.ipynb, which is licensed under the Apache License 2.0.