---
license: apache-2.0
language:
- en
tags:
- code
- knowledge extraction
- tiny
- small
---

A model that can **extract the knowledge points** involved in a given piece of **C code**.

The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs on my own training set with the [QLoRA](https://github.com/artidoro/qlora) method.

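The exact training configuration is not published here, so the following is only a minimal sketch of a comparable QLoRA setup using the `peft` and `bitsandbytes` libraries; every hyperparameter in it is an illustrative assumption, not the one actually used for this model.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load the base model in 4-bit NF4 quantization, as QLoRA does
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", quantization_config=bnb_config
)
base = prepare_model_for_kbit_training(base)

# attach LoRA adapters to the fused QKV projection of GPT-NeoX
lora_config = LoraConfig(
    r=8,                                 # illustrative rank
    lora_alpha=16,                       # illustrative scaling
    target_modules=["query_key_value"],  # GPT-NeoX attention layer name
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
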
A usage example follows. First, load the model and prepare the code:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
device = 'cuda'

model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

instruction = '[Summarize the knowledge points in the code below]\n'  # instruction template
# any C snippet you like; partial functions or bare statements work too
input_content = '''```c
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

void quickSort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}
```'''
text = instruction + input_content
```

Then generate:

```python
inputs = tokenizer(text, return_tensors="pt").to(device)
tokens = model.generate(
    **inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=32,
)
# the prompt is echoed back, so keep only the newly generated part
response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
```

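Since the same tokenize-generate-parse steps recur below, they can be wrapped in a small convenience function; `extract_knowledge` is a hypothetical helper name for this sketch, not part of any published API:

```python
def extract_knowledge(text: str, **gen_kwargs) -> str:
    """Hypothetical wrapper around the tokenize/generate/parse steps above."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        **gen_kwargs,
    )
    # keep only the part generated after the echoed prompt
    return tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
```
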
However, in practical use, it's recommended to run inference several times to get more diverse answers. Don't worry: the model is really small, so repeated inference doesn't take much time. For example:

```python
ans_dict = {}
def increment_insert(key):
    ans_dict[key] = ans_dict.get(key, 0) + 1

inputs = tokenizer(text, return_tensors="pt").to(device)  # the prompt never changes, so tokenize once
for i in range(30):  # 20 runs or even fewer may be enough
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        do_sample=True,
        temperature=2.0,  # high temperature for diversity
        top_p=0.95,
        top_k=30,
    )
    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
    increment_insert(response)

print(ans_dict)
### sample output; take the high-frequency answers
### {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
```
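
To reduce the tallies to one final label, one simple option (a sketch, not part of the original workflow) is to take the most frequent answer:

```python
# pick the highest-frequency answer as the final label
best = max(ans_dict, key=ans_dict.get)
print(best)  # 'Quick sort' for the sample tallies above
```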