Edit model card

1. Introduction of LLM4Decompile

LLM4Decompile aims to decompile x86 assembly instructions into C. It is finetuned from Deepseek-Coder on 4B tokens of assembly-C pairs compiled from AnghaBench.

Note: The unified optimization (UO) model is trained without prior knowledge of the optimization levels (O0~O3), the average re-executability is arond 0.21.

2. Evaluation Results

Model Re-compilability Re-executability
Optimization-level O0 O1 O2 O3 Avg. O0 O1 O2 O3 Avg.
GPT4 0.92 0.94 0.88 0.84 0.895 0.1341 0.1890 0.1524 0.0854 0.1402
DeepSeek-Coder-33B 0.0659 0.0866 0.1500 0.1463 0.1122 0.0000 0.0000 0.0000 0.0000 0.0000
LLM4Decompile-1b 0.8780 0.8732 0.8683 0.8378 0.8643 0.1573 0.0768 0.1000 0.0878 0.1055
LLM4Decompile-6b 0.8817 0.8951 0.8671 0.8476 0.8729 0.3000 0.1732 0.1988 0.1841 0.2140
LLM4Decompile-33b 0.8134 0.8195 0.8183 0.8305 0.8204 0.3049 0.1902 0.1817 0.1817 0.2146

3. How to Use

Note: For the UO model, it is trained without prior knowledge of the optimization levels (O0~O3), therefore, the prompt is slightly different.

Here give an example of how to use our model. First compile the C code into binary, disassemble the binary into assembly instructions:

import subprocess
import os
import re

digit_pattern = r'\b0x[a-fA-F0-9]+\b'# binary codes in Hexadecimal
zeros_pattern = r'^0+\s'#0s
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'path/to/file'
with open(fileName+'.c','r') as f:#original file
    c_func = f.read()
for opt_state in OPT:
    output_file = fileName +'_' + opt_state
    input_file = fileName+'.c'
    compile_command = f'gcc -c -o {output_file}.o {input_file} -{opt_state} -lm'#compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'#disassemble the binary file into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)
    
    input_asm = ''
    with open(output_file+'.s') as f:#original file
        asm= f.read()
    asm = asm.split('Disassembly of section .text:')[-1].strip()
    for tmp in asm.split('\n'):
        tmp_asm = tmp.split('\t')[-1]#remove the binary code
        tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
        input_asm+=tmp_asm+'\n'
    input_asm = re.sub(zeros_pattern, '', input_asm)
    before = f"# This is the assembly code:\n"#prompt different for the UO model
    after = "\n# What is the source code?\n"#prompt
    input_asm_prompt = before+input_asm.strip()+after
    with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
        f.write(input_asm_prompt)

Then use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'arise-sustech/llm4decompile-6.7b-uo'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + opt_state +'.asm','r') as f:#original file
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

4. License

This code repository is licensed under the DeepSeek License.

5. Contact

If you have any questions, please raise an issue.

6. Citation

@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, 
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}
Downloads last month
22
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.