seanmor5 committed on
Commit
fa89a38
1 Parent(s): 524eb93

Update README.md

Files changed (1)
  1. README.md +47 -0
README.md CHANGED
@@ -1,3 +1,50 @@
  ---
  license: apache-2.0
+ datasets:
+ - dyngnosis/function_names_v2
  ---
+
+ A simple Phi-2 model fine-tuned to identify the names of disassembled binary functions. It outputs the function name as a JSON object. You can use the following code to identify a function name:
+
+ ```python
+ import json
+
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "seanmor5/phi-2-function-identification",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.bfloat16,
+ )
+ model.to(torch.device("cuda"))
+ tokenizer = AutoTokenizer.from_pretrained("seanmor5/phi-2-function-identification")
+
+ def prompt(code):
+     return (
+         "Input: Given the following disassembled code, provide a descriptive"
+         + " function name for the code. Your function name should"
+         + " accurately describe the purpose of the code. It should"
+         + " be formatted in C style with lowercase and snakecase."
+         + f" Only output the name as valid JSON, e.g. {json.dumps({'name': 'function_name'})}"
+         + f"\nCode: {code}\nOutput:"
+     )
+
+ def identify_function(code):
+     # Stop at the closing `"}` of the JSON object or at end-of-text.
+     eos_tokens = tokenizer.convert_tokens_to_ids(['"}', "<|endoftext|>"])
+     inputs = tokenizer(prompt(code), return_tensors="pt")
+     inputs = inputs.to(torch.device("cuda"))
+
+     outputs = model.generate(**inputs, max_new_tokens=64, eos_token_id=eos_tokens)
+     # Decode only the newly generated tokens, skipping the prompt.
+     text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1] :])[0]
+     return text
+
+ func = """
+ void fcn.140030b80(ulong param_1, ulong param_2, ulong param_3) {
+ ulong uVar1; uVar1 = fcn.140030ae0(param_3);
+ fcn.14002efc0(param_1, param_2, uVar1); return;
+ }
+ """
+
+ print(identify_function(func))
+ ```
+
+ The model tends to repeat itself excessively, so you should set the EOS token to `"}` when generating.
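
Because the model is described as emitting a JSON object and stopping at `"}`, the raw generated text can be reduced to a plain function name with a small post-processing step. The sketch below is not part of the commit. `extract_function_name` is a hypothetical helper, and it assumes the output looks like ` {"name": "..."}`, possibly followed by a `<|endoftext|>` marker or by repeated objects.

```python
import json
from typing import Optional


def extract_function_name(generated: str) -> Optional[str]:
    """Return the predicted name from raw generated text, or None if unparseable.

    Hypothetical helper: assumes the model emits a flat JSON object such as
    {"name": "function_name"}, as described in the README.
    """
    # Drop the end-of-text marker if it survived decoding.
    text = generated.replace("<|endoftext|>", "")
    # Keep only the first {...} span, which discards any repeated output.
    start = text.find("{")
    end = text.find("}", start)
    if start == -1 or end == -1:
        return None
    try:
        payload = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    name = payload.get("name") if isinstance(payload, dict) else None
    return name if isinstance(name, str) else None
```

Under those assumptions it could be applied as `extract_function_name(identify_function(func))`. Returning `None` on malformed output leaves it to the caller to decide whether to retry or skip the function.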