	---
	license: apache-2.0
	tags:
	- generated_from_trainer
	- HC3
	- chatGPT
	- assistant
	datasets:
	- pszemraj/HC3-textgen-qa
	metrics:
	- accuracy
	inference: false
	base_model: EleutherAI/pythia-6.9b-deduped
	---

	# pythia-6.9b-deduped for general QA

	<a href="https://colab.research.google.com/gist/pszemraj/e19747c911697b20f3bedf6e21dee0a5/pythia-6-9b-hc3-notebook-v2.ipynb">
	<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
	</a>

	This model is a fine-tuned version of [EleutherAI/pythia-6.9b-deduped](https://huggingface.co/EleutherAI/pythia-6.9b-deduped) on the pszemraj/HC3-textgen-qa dataset.
	It achieves the following results on the evaluation set:
	- Loss: 1.2372
	- Accuracy: 0.6769
	- Perplexity: 3.446
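
The perplexity figure is consistent with the loss if the loss is the mean per-token cross-entropy in nats. That is the usual convention for causal language modeling, but this card does not state it, so treat it as an assumption. A minimal check, with `eval_loss` as an illustrative variable name:

```python
import math

# Evaluation loss reported above; assumed to be the mean per-token
# cross-entropy in nats (not stated explicitly in this card).
eval_loss = 1.2372

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

print(f"perplexity: {perplexity:.3f}")
```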

	## Model description

	A text-generation model trained on the HC3 text data, which pairs human questions with ChatGPT answers.

	![example](https://i.imgur.com/iMqPDXU.png)


	### Usage

	Install necessary packages for inference (_unless you have a big boi GPU_)
	```bash
	pip install -U -q transformers bitsandbytes accelerate
	```

	Basic inference example:

	```python
	import pprint as pp

	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")

	# load in 8-bit; shards are ~4GB each, there are eight total
	model = AutoModelForCausalLM.from_pretrained(
	    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
	)

	# the prompt is the question followed by the "<answer>" marker
	prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
	# move the inputs to wherever device_map placed the model
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(
	    **inputs, max_new_tokens=300
	)  # default generation config (+ 300 tokens)
	result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
	# keep only the text before the "<end_answer>" marker
	result = result.split("<end_answer>")[0].strip()

	pp.pprint(result)
	```
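
The prompt handling in the example above can be wrapped in two small helpers. This is only a sketch: `build_prompt`, `extract_answer`, and the two constants are hypothetical names, not part of the model or of `transformers`. It assumes the decoded text echoes the prompt (including `<answer>`) before the generated answer. Unlike the example above, it also drops that echoed prompt.

```python
ANSWER_START = "<answer>"
ANSWER_END = "<end_answer>"


def build_prompt(question: str) -> str:
    """Append the answer marker to a question, as in the example prompt."""
    return f"{question.strip()} {ANSWER_START}"


def extract_answer(decoded: str) -> str:
    """Return the text between the answer markers of a decoded generation."""
    # Drop the echoed prompt: keep whatever follows the first start marker.
    _, sep, after_start = decoded.partition(ANSWER_START)
    # If no start marker was found, fall back to the whole string.
    text = after_start if sep else decoded
    # Cut at the end marker, if the model emitted one.
    return text.split(ANSWER_END)[0].strip()


prompt = build_prompt("I was wondering how much wood a woodchuck could chuck?")
# Made-up decoded output, for illustration only (not a real model response).
decoded = prompt + " A hypothetical reply. <end_answer> trailing text"
print(extract_answer(decoded))
```

With a real model, `decoded` would be the string returned by `tokenizer.batch_decode(...)[0]`.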

	The default `GenerationConfig` uses contrastive search with `top_k=4` and `penalty_alpha=0.6`. For more information on inference and the parameters to use, see [the transformers docs](https://huggingface.co/docs/transformers/generation_strategies#decoding-strategies).

	## Intended uses & limitations

	- Intended use: research and exploration comparing RLHF tuning with "guided"/specific fine-tuning on "quality" datasets and responses of _"what the human would want as answer anyway"_.
	- This model is not trained or fine-tuned with RLHF, so it will not be as helpful, generalizable, or safe as ChatGPT (_aside from the fact that this model is ~30x smaller_).

	## Training and evaluation data

	```yaml
	model-index:
	- name: pythia-6.9b-hc3-qa-assistant
	  results:
	  - task:
	      name: Causal Language Modeling
	      type: text-generation
	    dataset:
	      name: pszemraj/HC3-textgen-qa
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.6768941789814655
	```


	## Training procedure

	Two epochs on the `pszemraj/HC3-textgen-qa` dataset.

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| 1.2598        | 0.99  | 79   | 1.3291          | 0.6496   |
	| 0.7446        | 1.99  | 158  | 1.2372          | 0.6769   |


	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_pszemraj__pythia-6.9b-HC3)

	| Metric              | Value |
	|---------------------|-------|
	| Avg.                | 33.33 |
	| ARC (25-shot)       | 36.52 |
	| HellaSwag (10-shot) | 61.76 |
	| MMLU (5-shot)       | 26.94 |
	| TruthfulQA (0-shot) | 45.05 |
	| Winogrande (5-shot) | 60.77 |
	| GSM8K (5-shot)      | 0.0   |
	| DROP (3-shot)       | 2.23  |