Mxode/Pythia-70m-C-Language-KnowledgeExtract

	---
	license: apache-2.0
	language:
	- en
	tags:
	- code
	- knowledge extraction
	- tiny
	- small
	---
	A model that extracts the knowledge points involved in a given piece of C code.

	The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs with the [QLoRA](https://github.com/artidoro/qlora) method on my own training set.

	A usage example follows. First, load the model and prepare the code:

	```python
	import torch
	from transformers import GPTNeoXForCausalLM, AutoTokenizer

	model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
	device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is available

	model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
	tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

	instruction = '[Summarize the knowledge points in the code below]\n' # instruction template
	# any C snippet works, including partial functions or single statements
	input_content = '''```c
	int partition(int arr[], int low, int high) {
	    int pivot = arr[high];
	    int i = (low - 1);
	    for (int j = low; j <= high - 1; j++) {
	        if (arr[j] < pivot) {
	            i++;
	            swap(&arr[i], &arr[j]);
	        }
	    }
	    swap(&arr[i + 1], &arr[high]);
	    return (i + 1);
	}

	void quickSort(int arr[], int low, int high) {
	    if (low < high) {
	        int pi = partition(arr, low, high);
	        quickSort(arr, low, pi - 1);
	        quickSort(arr, pi + 1, high);
	    }
	}
	```'''
	text = instruction + input_content
	```
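To reuse the template with other snippets, the prompt construction above can be wrapped in a small helper. This is only a minimal sketch: `build_prompt`, `INSTRUCTION`, and `FENCE` are hypothetical names that are not part of the model repository, and the helper assumes the instruction line plus a `c` code fence is all the formatting the model expects.

```python
INSTRUCTION = '[Summarize the knowledge points in the code below]\n'
FENCE = '`' * 3  # three backticks, built this way to avoid nesting code fences


def build_prompt(c_code: str) -> str:
    """Wrap a C snippet in the same instruction template as the example above."""
    body = c_code.strip('\n')  # drop surrounding blank lines, keep indentation
    return f"{INSTRUCTION}{FENCE}c\n{body}\n{FENCE}"


# Hypothetical usage with a one-line snippet
demo_text = build_prompt('int add(int a, int b) { return a + b; }')
print(demo_text)
```

The result can be passed to the tokenizer in place of `text`.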

	Then generate:

	```python
	inputs = tokenizer(text, return_tensors="pt").to(device)
	tokens = model.generate(
	    **inputs,
	    pad_token_id=tokenizer.eos_token_id,
	    max_new_tokens=32,
	)
	# keep only the text after the last code fence and before the first '<', which drops the echoed prompt
	response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
	```
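The post-processing in the last line above can be tried without loading the model. The sketch below applies the same two splits to a hand-written string; `extract_answer` is a hypothetical helper name, the sample string is illustrative rather than real model output, and it assumes the end-of-sequence marker starts with `<` (as `<|endoftext|>` does). Unlike the original line, it also strips surrounding whitespace.

```python
FENCE = '`' * 3  # three backticks, built this way to avoid nesting code fences


def extract_answer(decoded: str) -> str:
    """Return the text after the last code fence, cut at the first '<'."""
    after_prompt = decoded.split(FENCE)[-1]   # everything after the echoed prompt
    return after_prompt.split('<')[0].strip() # drop the end-of-text marker


# Illustrative decoded string: prompt, fenced code, answer, end-of-text marker
sample = (
    '[Summarize the knowledge points in the code below]\n'
    + FENCE + 'c\nint x;\n' + FENCE
    + 'Quick sort<|endoftext|>'
)
print(extract_answer(sample))  # Quick sort
```

One consequence of this approach is that an answer containing a `<` character would be cut short at that point.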



	In practice, to get more diverse answers, it's recommended to run inference several times with sampling. The model is really small, so the extra runs don't take much time:

	```python
	ans_dict = {}
	def increment_insert(key):
	    ans_dict[key] = ans_dict.get(key, 0) + 1

	for i in range(30): # 20 runs or fewer may be enough
	    inputs = tokenizer(text, return_tensors="pt").to(device)
	    tokens = model.generate(
	        **inputs,
	        pad_token_id=tokenizer.eos_token_id,
	        max_new_tokens=32,
	        do_sample=True,
	        temperature=2.0, # high temperature for diversity
	        top_p=0.95,
	        top_k=30,
	    )
	    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
	    increment_insert(response)

	print(ans_dict)
	### example output below; take the high-frequency answers
	### {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
	```
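Picking the high-frequency answers from such a dictionary could look like the sketch below. `frequent_answers` and the 20% cut-off (`min_share`) are assumptions made for illustration, not something the model card prescribes; the example dictionary is the output shown above.

```python
from collections import Counter


def frequent_answers(counts: dict, min_share: float = 0.2) -> list:
    """Return answers whose share of all runs is at least min_share, most common first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        answer
        for answer, n in Counter(counts).most_common()
        if n / total >= min_share
    ]


example = {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
print(frequent_answers(example))  # ['Quick sort']
```

A lower `min_share` keeps rarer answers such as `'Recurrence'`, at the cost of letting more noise through.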