File size: 4,619 Bytes
7ab5803 1f757eb 7ab5803 e40fa06 7ab5803 e40fa06 cfca3e7 e40fa06 50b13e7 e40fa06 050c78d e40fa06 7ab5803 9d4e11c 7ab5803 e40fa06 7ab5803 c8fc5c3 7ab5803 9683109 7ab5803 c8fc5c3 7ab5803 e40fa06 7ab5803 a1f8a46 7ab5803 50b13e7 7ab5803 6eeb2fd df73d89 6eeb2fd df73d89 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
---
language: code # <-- my language
widget:
- text: "# This is the assembly code with O0 optimization:\n<func0>:\nendbr64\nlea (%rdi,%rsi,1),%eax\nretq\n# What is the source code?\n"
license: other
tags:
- decompile
- binary
---
### 1. Introduction of LLM4Decompile
LLM4Decompile aims to decompile x86 assembly instructions into C. It is finetuned from Deepseek-Coder on 4B tokens of assembly-C pairs compiled from AnghaBench.
- **Github Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
- **Paper link:** For more details check out the [paper](https://arxiv.org/abs/2403.05286).
### 2. Evaluation Results
| Model | Re-compilability | | | | | Re-executability | | | | |
|--------------------|:----------------:|:---------:|:---------:|:---------:|:---------:|:----------------:|-----------|-----------|-----------|:---------:|
| Optimization-level | O0 | O1 | O2 | O3 | Avg. | O0 | O1 | O2 | O3 | Avg. |
| GPT4 | 0.92 | 0.94 | 0.88 | 0.84 | 0.895 | 0.1341 | 0.1890 | 0.1524 | 0.0854 | 0.1402 |
| DeepSeek-Coder-33B | 0.0659 | 0.0866 | 0.1500 | 0.1463 | 0.1122 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| LLM4Decompile-1b | 0.8780 | 0.8732 | 0.8683 | 0.8378 | 0.8643 | 0.1573 | 0.0768 | 0.1000 | 0.0878 | 0.1055 |
| LLM4Decompile-6b | 0.8817 | 0.8951 | 0.8671 | 0.8476 | 0.8729 | 0.3000 | 0.1732 | 0.1988 | 0.1841 | 0.2140 |
| LLM4Decompile-33b | 0.8134 | 0.8195 | 0.8183 | 0.8305 | 0.8204 | 0.3049 | 0.1902 | 0.1817 | 0.1817 | 0.2146 |
### 3. How to Use
Here give an example of how to use our model.
First compile the C code into binary, disassemble the binary into assembly instructions:
```python
import subprocess
import os
import re
digit_pattern = r'\b0x[a-fA-F0-9]+\b'# binary codes in Hexadecimal
zeros_pattern = r'^0+\s'#0s
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'path/to/file'
with open(fileName+'.c','r') as f:#original file
c_func = f.read()
for opt_state in OPT:
output_file = fileName +'_' + opt_state
input_file = fileName+'.c'
compile_command = f'gcc -c -o {output_file}.o {input_file} -{opt_state} -lm'#compile the code with GCC on Linux
subprocess.run(compile_command, shell=True, check=True)
compile_command = f'objdump -d {output_file}.o > {output_file}.s'#disassemble the binary file into assembly instructions
subprocess.run(compile_command, shell=True, check=True)
input_asm = ''
with open(output_file+'.s') as f:#asm file
asm= f.read()
asm = asm.split('Disassembly of section .text:')[-1].strip()
for tmp in asm.split('\n'):
tmp_asm = tmp.split('\t')[-1]#remove the binary code
tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
input_asm+=tmp_asm+'\n'
input_asm = re.sub(zeros_pattern, '', input_asm)
before = f"# This is the assembly code with {opt_state} optimization:\n"#prompt
after = "\n# What is the source code?\n"#prompt
input_asm_prompt = before+input_asm.strip()+after
with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
f.write(input_asm_prompt)
```
Then use LLM4Decompile to translate the assembly instructions into C:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_path = 'arise-sustech/llm4decompile-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()
with open(fileName +'_' + opt_state +'.asm','r') as f:#original file
asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=512)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
```
### 4. License
This code repository is licensed under the DeepSeek License.
### 5. Contact
If you have any questions, please raise an issue.
### 6. Citation
```
@misc{tan2024llm4decompile,
title={LLM4Decompile: Decompiling Binary Code with Large Language Models},
author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
year={2024},
eprint={2403.05286},
archivePrefix={arXiv},
primaryClass={cs.PL}
}
``` |