---
license: apache-2.0
language:
- en
tags:
- code
- knowledge extraction
- tiny
- small
---
This model **extracts the knowledge points** involved in a given piece of **C code**.
The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs with the [QLoRA](https://github.com/artidoro/qlora) method on my own training set.
A usage example follows. First, load the model and prepare the code:
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer
model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
device = 'cuda'  # set to 'cpu' if no GPU is available
model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
instruction = '[Summarize the knowledge points in the code below]\n'  # instruction template
# any piece of C code works, including partial functions or single statements
input_content = '''```c
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}
void quickSort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}
```'''
text = instruction + input_content
```
Then generate:
```python
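inputs = tokenizer(text, return_tensors="pt").to(device)
tokens = model.generate(
    **inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=32,
)
# Keep only the text after the last code fence (drops the echoed prompt),
# then cut at the first '<' to drop special tokens such as <|endoftext|>.
response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]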
```
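The post-processing step above can be pulled out into a small helper so it can be checked without loading the model. This is a minimal sketch: the name `extract_answer`, its parameters, and the extra whitespace stripping are assumptions for illustration, not part of the original snippet. The fence string is built with `"`" * 3` only to avoid writing a literal code fence inside the example.

```python
FENCE = "`" * 3  # the triple-backtick code fence used in the prompt


def extract_answer(decoded: str, fence: str = FENCE, stop: str = "<") -> str:
    """Return the generated answer from a decoded sequence (hypothetical helper)."""
    # Everything up to the last fence is the echoed prompt.
    tail = decoded.split(fence)[-1]
    # Special tokens such as <|endoftext|> start with '<'; drop them and what follows.
    # The final strip() is an addition over the original one-liner.
    return tail.split(stop)[0].strip()


# A hand-written decoded string in the same shape as the prompt plus a completion.
sample = (
    "[Summarize the knowledge points in the code below]\n"
    + FENCE + "c\nint x = 0;\n" + FENCE
    + "Quick sort<|endoftext|>"
)
print(extract_answer(sample))  # Quick sort
```

With the real model, the call would be `extract_answer(tokenizer.decode(tokens[0]))`.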
In practice, it's recommended to run inference several times with sampling to get more diverse answers. The model is really small, so this doesn't take much time:
```python
ans_dict = {}  # answer -> number of times it was generated

def increment_insert(key):
    ans_dict[key] = ans_dict.get(key, 0) + 1

# The prompt does not change, so tokenize it once outside the loop.
inputs = tokenizer(text, return_tensors="pt").to(device)
for _ in range(30):  # 20 runs or fewer may be enough
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        do_sample=True,
        temperature=2.0,  # high temperature for diversity
        top_p=0.95,
        top_k=30,
    )
    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
    increment_insert(response)
print(ans_dict)
### Example output is shown below; keep the high-frequency answers
### {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
```
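One way to "take the high-frequency answers" is to rank the counts and drop anything below a minimum share of the runs. The sketch below is only an illustration: `top_answers`, `k`, and `min_share` are hypothetical names and thresholds, not something the model card prescribes. The sample counts are the example output shown above.

```python
from collections import Counter


def top_answers(counts: dict, k: int = 2, min_share: float = 0.1) -> list:
    """Return up to k (answer, count) pairs whose share of all runs is at least min_share."""
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common sorts by count, highest first.
    ranked = Counter(counts).most_common(k)
    return [(label, n) for label, n in ranked if n / total >= min_share]


# Counts copied from the example output above (30 runs in total).
sample_counts = {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
print(top_answers(sample_counts))  # [('Quick sort', 25)]
```

Here 'Recurrence' is ranked second but appears in only 2 of 30 runs, below the assumed 10% cutoff, so only 'Quick sort' is kept. Lowering `min_share` keeps more of the rarer answers.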