LLM4Decompile Source Code Analysis

📊 Results | 🤗 Models | 🚀 Quick Start | 📚 HumanEval-Decompile | 📎 Citation | 📝 Paper

Reverse Engineering: Decompiling Binary Code with Large Language Models

Updates

  • [2024-05-13]: Release of the V1.5 series. The V1.5 models are trained with a larger dataset (15B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous models.
  • [2024-03-16]: Added the llm4decompile-6.7b-uo model, which is trained without prior knowledge of the optimization levels (O0~O3); its average re-executability is around 0.219, the best among our models.

About

  • LLM4Decompile is the pioneering open-source large language model dedicated to decompilation. Its current version supports decompiling Linux x86_64 binaries, ranging from GCC's O0 to O3 optimization levels, into human-readable C source code. Our team is committed to expanding this tool's capabilities, with ongoing efforts to incorporate a broader range of architectures and configurations.
  • HumanEval-Decompile is the first decompilation benchmark that focuses on assessing the re-executability of decompiled code. It is the C-language adaptation of the HumanEval dataset and provides a suite of C solutions and test assertions for evaluating the practical utility of decompiled code.

Evaluation Results

Metrics

  • Re-executability evaluates whether the decompiled code can execute properly and pass all the predefined test cases. It serves as a critical indicator of the effectiveness of a decompilation process and provides a direct measure of semantic correctness: by re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program logic and behavior. A minimal sketch of this check follows.
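
The sketch below assumes the decompiled function and its C test assertions are available as strings; decompiled_func, test_assertions, and check_reexecutability are placeholder names used here for illustration, not part of the repository's evaluation harness.

import subprocess
import tempfile
import os

def check_reexecutability(decompiled_func, test_assertions, timeout=10):
    """Recompile the decompiled function together with its test assertions
    and report whether the resulting binary runs and exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "combined.c")
        exe = os.path.join(tmp, "combined")
        with open(src, "w") as f:
            f.write(decompiled_func + "\n" + test_assertions)
        # If the decompiled output is not even valid C, re-compilation fails.
        if subprocess.run(["gcc", "-o", exe, src, "-lm"]).returncode != 0:
            return False
        # The test program asserts on the function's outputs; exit code 0 means all tests pass.
        try:
            return subprocess.run([exe], timeout=timeout).returncode == 0
        except subprocess.TimeoutExpired:
            return False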

Benchmarks

  • HumanEval-Decompile: a collection of 164 functions that rely exclusively on standard C libraries.
  • ExeBench: a collection of 2,621 functions drawn from real projects, each using user-defined functions, structures, and macros.


Figure 1 presents the steps involved in our decompilation evaluation. First, the source code (denoted as src) is compiled by the GCC compiler with specific parameters, such as optimization levels, to produce the executable binary. This binary is then disassembled into assembly language (asm) using the objdump tool. The assembly instructions are subsequently decompiled to reconstruct the source code in a human-readable form (denoted as src'). To assess the quality of the decompiled code (src'), it is re-compiled and tested against the predefined test assertions (re-executability).

Results


Models

Our LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face.

| Model | Checkpoint | Size | Re-executability | Note |
|---|---|---|---|---|
| llm4decompile-1.3b | 🤗 HF Link | 1.3B | 10.6% | - |
| llm4decompile-6.7b | 🤗 HF Link | 6.7B | 21.4% | - |
| llm4decompile-33b | 🤗 HF Link | 33B | 21.5% | - |
| llm4decompile-6.7b-nsp | 🤗 HF Link | 6.7B | 20.9% | Note 1 |
| llm4decompile-6.7b-uo | 🤗 HF Link | 6.7B | 21.9% | Note 2 |
| llm4decompile-1.3b-v1.5 | 🤗 HF Link | 1.3B | 29.7% | Note 3 |
| llm4decompile-6.7b-v1.5 | 🤗 HF Link | 6.7B | 47.7% | Note 3 |

Note 1: The NSP model is trained with assembly code; its average re-executability is around 0.17.

Note 2: The unified optimization (UO) model is trained without prior knowledge of the optimization levels (O0~O3); its average re-executability is around 0.21. The pre-processing for the UO model is slightly different (it assumes no prior knowledge of the optimization level); please check the model page.

Note 3: The V1.5 series is trained with a larger dataset (15B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous models.

Quick Start

Here is an example of how to use our model (revised for V1.5; for previous models, please check the corresponding model page on Hugging Face). Note: replace func0 with the name of the function you want to decompile.

Preprocessing: Compile the C code into binary, and disassemble the binary into assembly instructions.

import subprocess
import os

OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # 'path/to/file' (without the .c extension)

for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'

    # Compile the code with GCC on Linux at the given optimization level.
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'
    subprocess.run(compile_command, shell=True, check=True)

    # Disassemble the binary into assembly instructions.
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'
    subprocess.run(compile_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # assembly file
        asm = f.read()
        if '<func0>:' not in asm:  # IMPORTANT: replace func0 with the function name
            raise ValueError("compile fails")
        # Keep only the target function's block of the disassembly.
        asm = '<func0>:' + asm.split('<func0>:')[-1].split('\n\n')[0]  # IMPORTANT: replace func0 with the function name
        asm_clean = ""
        asm_sp = asm.split("\n")
        for tmp in asm_sp:
            if len(tmp.split("\t")) < 3 and '00' in tmp:
                continue
            idx = min(len(tmp.split("\t")) - 1, 2)
            tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the binary code
            tmp_asm = tmp_asm.split("#")[0].strip()  # remove the comments
            asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()

    # Wrap the assembly with the model's prompt.
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)

Assembly instructions should be in the format:

<FUNCTION_NAME>:\nOPERATIONS\nOPERATIONS\n

Typical assembly instructions may look like this:

<func0>:
endbr64
lea    (%rdi,%rsi,1),%eax
retq

Decompilation: Use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v1.5'  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + '_' + OPT[0] + '.asm', 'r') as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)  # the context window is 4,096 tokens; keep prompt + new tokens within that budget
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])  # strip the prompt tokens and the final token

with open(fileName + '.c', 'r') as f:  # original file
    func = f.read()

# Note: we only decompile one function, while the original file may contain multiple functions.
print(f'original function:\n{func}')
print(f'decompiled function:\n{c_func_decompile}')
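
As a quick sanity check on the output (an illustrative addition, not part of the repository's pipeline), the decompiled function can be run through GCC's syntax-only mode before any deeper evaluation; the output file name below is a hypothetical one derived from fileName.

import subprocess

# Write the decompiled function to a file and ask GCC to parse it only
# (-fsyntax-only reports syntax/type errors without producing a binary).
# Missing #include lines in the decompiled code may still trigger warnings or errors.
with open(fileName + '_decompiled.c', 'w') as f:
    f.write(c_func_decompile)
ret = subprocess.run(['gcc', '-fsyntax-only', fileName + '_decompiled.c'])
print('syntax check passed' if ret.returncode == 0 else 'syntax check failed')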

HumanEval-Decompile

Data are stored in llm4decompile/decompile-eval/decompile-eval.json as a JSON list. There are 164*4 (O0, O1, O2, O3) samples, each with five keys (a short loading snippet follows the list):

  • task_id: the ID of the problem.
  • type: the optimization level, one of [O0, O1, O2, O3].
  • c_func: the C solution for the HumanEval problem.
  • c_test: the C test assertions.
  • input_asm_prompt: the assembly instructions with prompts, which can be derived as in our preprocessing example.
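
A minimal snippet for loading and inspecting the benchmark, based on the keys listed above (the file path is the one given in this section):

import json
from collections import Counter

with open('llm4decompile/decompile-eval/decompile-eval.json') as f:
    samples = json.load(f)  # a JSON list with 164*4 entries

print(len(samples))                         # total number of samples
print(Counter(s['type'] for s in samples))  # samples per optimization level (O0-O3)
print(samples[0]['task_id'])                # peek at one entry's problem ID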

To run the evaluation on a single GPU and single process:

cd LLM4Decompile
python ./evaluation/run_evaluation_llm4decompile_singleGPU.py

To run the evaluation using TGI (10x faster, with support for multiple GPUs and multi-processing), first install text-generation-inference following the official link.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
pip install -r requirements.txt

# Before running the evaluation script, please update the model_path to your local model path.
bash ./scripts/run_evaluation_llm4decompile.sh

Ongoing

  • Larger training dataset with a cleaning process. (done: 2024-05-13)
  • Support for popular languages/platforms and settings.
  • Support for executable binaries. (done: 2024-05-13)
  • Integration with decompilation tools (e.g., Ghidra, Rizin).

License

This code repository is licensed under the MIT and DeepSeek licenses.

Citation

@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, 
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}
