Commit 211d31e, committed by sandspeare
1 Parent(s): d440779

optrans 1.0

Files changed (1)
  1. README.md +33 -66
README.md CHANGED
@@ -7,20 +7,30 @@ license: mit
7
  <h4 align="center">
8
  <p>
9
  <a href=#about>About</a> |
10
- <a href=#news>News</a> |
11
  <a href=#quickstart>QuickStart</a> |
12
- <a href=#details>Details</a> |
13
  <p>
14
  </h4>
15
 
16
  ## About
17
 
18
- OpTrans (Re-Optimization Transformer), is an innovative framework fuses binary code optimization techniques with the transformer model for BCSD. OpTrans employs an algorithm based on binary program analysis to determine which functions should be inlined, followed by binary rewriting techniques to effectuate re-optimization on binaries. Our goal is to provide an effective tool for researchers and practitioners in binary code similarity detection, with our models accessible on the Hugging Face Model Hub.
19
 
20
- ## News
21
 
22
- - [2024/3/27] OpTrans is available on Hugging Face Model Hub (https://huggingface.co/sandspeare/optrans).
23
 
24
 
25
  ## QuickStart
26
 
@@ -46,80 +56,37 @@ from transformers import AutoModel, AutoTokenizer
46
 
47
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
48
 
49
- asm_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
50
- text_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
51
- asm_encoder = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
52
- text_encoder = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)
53
  ```
54
 
55
  ### Example Use Cases
56
- **Fine-Grained Sorting Algorithm Classification (Zero-Shot)**
57
 
58
- 1. Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing assembly code snippets related to bubble sort:
59
 
60
  ```python
61
- with open("./CaseStudy/bubblesort.json") as fp:
62
-     asm = json.load(fp)
63
  ```
64
 
65
- 2. Define your classification prompts:
66
- ```python
67
- prompts = [
68
- "This is a function related to bubble sort",
69
- "This is a function related to selection sort",
70
- ...
71
- ]
72
- ```
73
- 3. Encode the assembly code and prompts, then perform classification:
74
 
75
  ```python
76
- # Encode assembly code
77
- asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
78
- asm_embedding = asm_encoder(**asm_input)
79
-
80
- # Encode prompts
81
- text_input = text_tokenizer(prompts, return_tensors='pt').to(device)
82
- text_embeddings = text_encoder(**text_input)
83
 
84
- # Classification
85
- logits = torch.einsum("nc,ck->nk", [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
86
- preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()
87
 
88
- # Output predictions
89
- for i, prompt in enumerate(prompts):
90
-     print(f"Probability: {preds[i]*100:.3f}%, Text: {prompt}")
91
  ```
92
 
93
- ## Details
94
- In this document, we provide an overview of the contents of this repository and instructions for accessing the materials.
95
-
96
- 1. **CaseStudy.ipynb**: A Jupyter Notebook showcasing the zero-shot performance of our proposed model using a case study. Please open this file to get an in-depth view of how our model works and the results it produces.
97
-
98
- 2. **CaseStudy**: A folder containing IDB files and rebased assembly code for the case study used in the Jupyter Notebook. These files are used to generate the results in the Jupyter Notebook. We provide three different senarios for the case study,
99
- including a bubble sort program, SHA-3 crypto algorithms and a real-world malware sample (which can be found at [virustotal](https://www.virustotal.com/gui/file/cd677242197cdc89d7b8e2e3056030fe2bb9b384c95a7a027a7eee8182b8426f/)). We conduct three zero-shot (without any further training) case studies, the results are shown in the Jupyter Notebook.
100
-
101
- 3. **Prompts**: A folder containing prompts for explaining source code and zero-shot evaluation in crypto identification task and protocol categorization.
102
 
103
- 3. **HumanEvaluationExamples**: A folder containing screenshots of human evaluations procedure performed while evaluating our data engine. These examples serve as supplementary evidence to support the claims made in the paper.
104
-
105
- ### Instructions
106
-
107
- To access the materials, please follow these steps:
108
-
109
- 1. You can view these materials with your brower. Just open the CaseStudy.ipynb file to view the case study and the performance of our model. And you can browse the **HumanEvaluationExamples** folder to view the screenshots of human evaluations performed during the assessment of our shadow models.
110
-
111
- 2. Download or clone this repository to your local machine.
112
-
113
- 1. Ensure you have a recent version of Jupyter Notebook installed on your system. Or you can use VSCode to open the Jupyter Notebook file.
114
-
115
- 2. Open the CaseStudy.ipynb file with Jupyter Notebook to view the case study and the performance of our model.
116
-
117
- 3. We provide IDB files and rebased assembly code for the case study in the **CaseStudy** folder. You can use IDA Pro to open the IDB files and view the assembly code. Or you can view the rebased assembly code in any text editor.
118
-
119
- Thank you for your interest in our work, and we hope these materials help you better understand our research and findings.
120
-
121
- ### Processing Data
122
- We provide a example script to process the assembly code. The script is located at `scripts/process_asm.py`. You can use the script to process your own binaries.
123
- ```bash
124
- /path/to/idat64 -c -A -Sscripts/process_asm.py -obinary.idb /path/to/binary
125
  ```
 
7
  <h4 align="center">
8
  <p>
9
  <a href=#about>About</a> |
10
+ <a href=#intuition>Intuition</a> |
11
  <a href=#quickstart>QuickStart</a> |
 
12
  <p>
13
  </h4>
14
 
15
  ## About
16
 
17
+ OpTrans (Re-Optimization Transformer) is a framework that fuses binary code optimization techniques with the transformer model for binary code similarity detection (BCSD). OpTrans employs an algorithm based on binary program analysis to determine which functions should be inlined, then applies binary rewriting techniques to re-optimize the binaries. Our goal is to provide an effective tool for researchers and practitioners in binary code similarity detection, with our models accessible on the Hugging Face Model Hub.
18
 
 
19
 
20
+ ## Intuition
21
 
22
+ This section shows how function inlining optimization improves binary code similarity detection.
23
+
24
+ Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O0 (sc3-plugins-HOAEncLebedev501.so-O0.i64)
25
+ ![O0](./Intuition/O0.jpg)
26
+
27
+ Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O3 (sc3-plugins-HOAEncLebedev501.so-O3.i64)
28
+ ![O3](./Intuition/O3.jpg)
29
+
30
+ Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O0 and processed by function inlining optimization (sc3-plugins-HOAEncLebedev501.so-O0-inline.i64)
31
+ ![O0-inline](./Intuition/O0-inline.jpg)
32
+
33
+ The IDB files in ./Intuition were generated with IDA 8.3.
34
 
35
  ## QuickStart
36
 
 
56
 
57
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
58
 
59
+ tokenizer = AutoTokenizer.from_pretrained("sandspeare/optrans", trust_remote_code=True)
60
+ encoder = AutoModel.from_pretrained("sandspeare/optrans", trust_remote_code=True).to(device)
61
+ tokenizer.pad_token = tokenizer.unk_token
 
62
  ```
63
 
64
  ### Example Use Cases
65
+ **Function inlining optimization for BCSD**
66
 
67
+ 1. Load your binary code dataset. For demonstration, we use a JSON file containing binary code snippets for similarity comparison.
68
 
69
  ```python
70
+ with open("./CaseStudy/casestudy.json") as fp:
71
+     data = json.load(fp)
72
  ```
73
 
74
+ 2. Encode the binary code.
75
 
76
  ```python
77
+ asm_O0 = tokenizer([data["O0"]], padding=True, return_tensors="pt").to(device)
78
+ asm_embedding_O0 = encoder(**asm_O0)
79
 
80
+ asm_O0_inline = tokenizer([data["O0_inline"]], padding=True, return_tensors="pt").to(device)
81
+ asm_embedding_O0_inline = encoder(**asm_O0_inline)
 
82
 
83
+ asm_O3 = tokenizer([data["O3"]], padding=True, return_tensors="pt").to(device)
84
+ asm_embedding_O3 = encoder(**asm_O3)
 
85
  ```
86
 
87
+ 3. Perform similarity comparison:
88
 
89
+ ```python
90
+ sim_O0vsO3 = torch.mm(asm_embedding_O0, asm_embedding_O3.T).squeeze() / 0.07
91
+ sim_O0_inlinevsO3 = torch.mm(asm_embedding_O0_inline, asm_embedding_O3.T).squeeze() / 0.07
92
  ```
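
The comparison step in the new README takes a matrix product of two embeddings and divides it by 0.07. The sketch below illustrates that arithmetic on plain Python lists so it runs without a model download. It rests on several assumptions that the diff does not confirm: that each function is represented by a single embedding vector, that similarity is meant as cosine similarity (the sketch normalizes explicitly, since the README does not say whether the encoder output is already unit length), and that 0.07 acts as a temperature. The helper names and the toy vectors are hypothetical and do not reflect real OpTrans outputs.

```python
import math

TEMPERATURE = 0.07  # the constant the README divides by; its role as a temperature is an assumption


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def scaled_similarity(a, b, temperature=TEMPERATURE):
    """Cosine similarity of two vectors, divided by a temperature."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n)) / temperature


# Hypothetical toy embeddings standing in for encoder outputs.
emb_O0 = [1.0, 0.0, 0.0]
emb_O0_inline = [0.6, 0.8, 0.0]
emb_O3 = [0.0, 1.0, 0.0]

sim_O0_vs_O3 = scaled_similarity(emb_O0, emb_O3)
sim_O0_inline_vs_O3 = scaled_similarity(emb_O0_inline, emb_O3)

# A higher score means the two functions are judged more similar.
print(f"O0 vs O3:        {sim_O0_vs_O3:.3f}")
print(f"O0-inline vs O3: {sim_O0_inline_vs_O3:.3f}")
```

With real embeddings, the same two numbers would indicate whether the inlined `-O0` function sits closer to the `-O3` function than the original `-O0` function does. Dividing by a constant does not change which pair ranks higher; it only rescales the scores.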