ahmedheakl committed (verified)
Commit 3a7d20d · Parent(s): 7df2c33

Update README.md

Files changed (1): README.md (+111 −11)
model-index:
- name: asm2asm-deepseek-1.3b-500k-2ep-x86-O0-risc
  results: []
datasets:
- ahmedheakl/asm2asm_O0_500000_gnueabi_gcc
metrics:
- exact_match
- accuracy
---

# CISC-to-RISC

A fine-tuned version of [deepseek-ai/deepseek-coder-1.3b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct) specialized in converting x86 assembly code to 64-bit RISC-V (riscv64) assembly.

## Model Overview

**asm2asm-deepseek1.3b-xtokenizer-risc** is designed to assist developers in converting x86 assembly instructions to riscv64 assembly. Leveraging the capabilities of the base model, this fine-tuned variant improves the accuracy and efficiency of assembly-code transpilation.

## Intended Use

This model is intended for:

- **Assembly Code Conversion**: Translating x86 assembly instructions to the riscv64 architecture.
- **Educational Purposes**: Helping learners understand the differences between x86 and riscv64 assembly and how one maps to the other.
- **Code Optimization**: Facilitating optimization by converting and refining assembly code across architectures.

## Limitations

- **Dataset Specificity**: The model is fine-tuned on a specific dataset, which may limit its performance on assembly instructions outside the training distribution.
- **Complex Instructions**: It may struggle with highly complex or unconventional assembly instructions that are under-represented in the training data.
- **Error Propagation**: Inaccuracies in the generated riscv64 code can introduce functional discrepancies or bugs if the output is not reviewed.

## Training procedure

The following hyperparameters were used during training:

- lr_scheduler_type: linear
- num_epochs: 2

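The linear scheduler decays the learning rate from its initial value to zero over the total number of training steps. A minimal sketch of that schedule (the base learning rate, warmup, and step counts below are illustrative assumptions, not values from this card):

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Linear warmup (optional), then linear decay to zero, mirroring a linear LR schedule."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# With 2 epochs over an assumed 1,000 steps/epoch:
total = 2 * 1000
print(linear_lr(0, total))     # 2e-05 at the start (no warmup)
print(linear_lr(1000, total))  # 1e-05 halfway through training
print(linear_lr(2000, total))  # 0.0 at the end
```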
## Usage

All models and datasets are available on [Hugging Face](https://huggingface.co/collections/ahmedheakl/cisc-to-risc-672727bd996db985473d146e). Below is an example of using this model to convert x86 assembly to riscv64.

### Inference Code

````python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your Hugging Face token
hf_token = "your_hf_token_here"

model_name = "ahmedheakl/asm2asm-deepseek1.3b-risc"

instruction = """<|begin▁of▁sentence|>You are a helpful coding assistant assistant on converting from x86 to RISCv64 assembly.
### Instruction:
Convert this x86 assembly into RISCv64
```asm
{asm_x86}
```
### Response:
```asm
{asm_risc}
"""

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=hf_token,
)


def inference(asm_x86: str) -> str:
    # Fill the template; the response side is left empty for the model to complete.
    prompt = instruction.format(asm_x86=asm_x86, asm_risc="")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        use_cache=True,
        num_return_sequences=1,
        max_new_tokens=8000,
        do_sample=False,
        num_beams=8,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    outputs = tokenizer.batch_decode(generated_ids)[0]
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Keep only the assembly between the final ```asm fence and the EOS token.
    return outputs.split("```asm\n")[-1].split(f"```{tokenizer.eos_token}")[0]


x86 = "DWORD PTR -248[rbp] movsx rdx"
converted_risc = inference(x86)
print(converted_risc)
````
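The return line of `inference` strips the prompt and code fences from the decoded output. That post-processing can be sketched and tested in isolation (the sample `decoded` string below is illustrative, not real model output):

```python
def extract_asm(decoded: str, eos_token: str) -> str:
    """Return the text between the last ```asm fence and the closing fence + EOS token."""
    return decoded.split("```asm\n")[-1].split(f"```{eos_token}")[0]


# Illustrative decoded text: the prompt (with its own ```asm blocks) followed by the completion.
decoded = (
    "Convert this x86 assembly into RISCv64\n```asm\nmov eax, 1\n```\n"
    "### Response:\n```asm\nli a0, 1\n```<eos>"
)
print(extract_asm(decoded, "<eos>"))  # prints "li a0, 1"
```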

## Experiments and Results

| **Model** | **Average Edit Distance** (↓) | **Exact Match** (↑) | **Test Accuracy** (↑) |
|-----------------------------------------------|-------------------------------|---------------------|-----------------------|
| GPT4o | 1296 | 0% | 8.18% |
| DeepSeekCoder2-16B | 1633 | 0% | 7.36% |
| Yi-Coder-9B | 1653 | 0% | 6.33% |
| **Yi-Coder-1.5B** | 275 | 16.98% | 49.69% |
| **DeepSeekCoder-1.3B** | 107 | 45.91% | 77.23% |
| **DeepSeekCoder-1.3B-xTokenizer-int4** | 119 | 46.54% | 72.96% |
| **DeepSeekCoder-1.3B-xTokenizer-int8** | **96** | 49.69% | 75.47% |
| **DeepSeekCoder-1.3B-xTokenizer** | 165 | **50.32%** | **79.25%** |

*Table: Comparison of model performance on the x86 to ARM transpilation task, measured by Edit Distance (lower is better), Exact Match (higher is better), and Test Accuracy (higher is better). The top section lists pre-existing models; the bottom section lists models trained by us. The best result in each metric is highlighted in bold.*

| **Model** | **Average Edit Distance** (↓) | **Exact Match** (↑) | **Test Accuracy** (↑) |
|----------------------------------------|-------------------------------|---------------------|-----------------------|
| GPT4o | 1293 | 0% | 7.55% |
| DeepSeekCoder2-16B | 1483 | 0% | 6.29% |
| **DeepSeekCoder-1.3B-xTokenizer-int4** | 112 | 14.47% | 68.55% |
| **DeepSeekCoder-1.3B-xTokenizer-int8** | 31 | 69.81% | 88.05% |
| **DeepSeekCoder-1.3B-xTokenizer** | **27** | **69.81%** | **88.68%** |

**Table:** Comparison of model performance on the _x86 to riscv64_ transpilation task (top section: pre-existing models; bottom section: models trained by us).
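The edit-distance and exact-match numbers above can be reproduced in spirit with a small sketch: exact match compares normalized output strings, and edit distance is the Levenshtein distance between predicted and reference assembly (a generic illustration, not the authors' exact evaluation script):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def exact_match(pred: str, ref: str) -> bool:
    """Whitespace-normalized string equality."""
    return pred.strip() == ref.strip()


print(edit_distance("li a0, 1", "li a0, 2"))   # 1
print(exact_match("li a0, 1\n", "li a0, 1"))   # True
```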