---
license: apache-2.0
tags:
- generated_from_trainer
- HC3
- chatGPT
- assistant
datasets:
- pszemraj/HC3-textgen-qa
metrics:
- accuracy
inference: false
base_model: EleutherAI/pythia-6.9b-deduped
---
# pythia-6.9b-deduped for general QA
<a href="https://colab.research.google.com/gist/pszemraj/e19747c911697b20f3bedf6e21dee0a5/pythia-6-9b-hc3-notebook-v2.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
This model is a fine-tuned version of [EleutherAI/pythia-6.9b-deduped](https://huggingface.co/EleutherAI/pythia-6.9b-deduped) on the pszemraj/HC3-textgen-qa dataset.
It achieves the following results on the evaluation set:
- Loss: 1.2372
- Accuracy: 0.6769
- Perplexity: 3.446 (exp of the validation loss)
## Model description
A text-generation model trained on the HC3 data of human questions paired with chatGPT answers.
![example](https://i.imgur.com/iMqPDXU.png)
### Usage
Install the necessary packages for 8-bit inference (_unless you have a big boi GPU_):
```bash
pip install -U -q transformers bitsandbytes accelerate
```
Basic inference example:
```python
import pprint as pp

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")
model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
)  # shards are ~4GB each, there are eight total

# the question is followed by an <answer> marker, which the model completes
prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs, max_new_tokens=300
)  # default generation config (+ 300 tokens)

# decode and keep only the text before the <end_answer> marker
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
result = result.split("<end_answer>")[0].strip()
pp.pprint(result)
```
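The example above hard-codes the `<answer>` marker and strips everything after `<end_answer>`. A small helper, sketched here on the assumption that these two markers are the only formatting the fine-tune expects and that they survive decoding as plain text, keeps prompt construction and answer extraction in one place:

```python
def ask(question: str, max_new_tokens: int = 300) -> str:
    """Wrap a question in the <answer> ... <end_answer> format used above and return the answer text."""
    prompt = f"{question} <answer>"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    # drop the echoed prompt and anything after the <end_answer> marker
    return text.split("<answer>")[-1].split("<end_answer>")[0].strip()


print(ask("How much wood could a woodchuck chuck?"))
```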
The default `GenerationConfig` uses contrastive search with `top_k=4` and `penalty_alpha=0.6`. For more information on inference and the available parameters, see [the transformers docs](https://huggingface.co/docs/transformers/generation_strategies#decoding-strategies).
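If you want to make the decoding settings explicit (or tweak them) rather than relying on the checkpoint's saved defaults, they can be passed per call. The values below simply mirror the defaults mentioned above; anything else is illustrative:

```python
from transformers import GenerationConfig

# sketch: pass the contrastive-search parameters explicitly instead of using the saved defaults
gen_config = GenerationConfig(
    penalty_alpha=0.6,   # contrastive search degeneration penalty
    top_k=4,             # candidate pool size for contrastive search
    max_new_tokens=300,  # illustrative cap on the generated continuation
)
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```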
## Intended uses & limitations
- **Intended use:** research/exploration into how RLHF tuning compares with "guided"/specific tuning on "quality" datasets of responses, i.e. _"what the human would want as an answer anyway"_
- This model is **not** trained/fine-tuned with RLHF and therefore will not be as helpful/generalizable/safe as chatGPT (_quite apart from the fact that it is ~30x smaller_)
## Training and evaluation data
The model was trained and evaluated on the `pszemraj/HC3-textgen-qa` dataset; the reported metrics are summarized in the `model-index` below:
```yaml
model-index:
- name: pythia-6.9b-hc3-qa-assistant
results:
- task:
name: Causal Language Modeling
type: text-generation
dataset:
name: pszemraj/HC3-textgen-qa
metrics:
- name: Accuracy
type: accuracy
value: 0.6768941789814655
```
## Training procedure
The model was fine-tuned for two epochs on the `pszemraj/HC3-textgen-qa` dataset.
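The exact training hyperparameters are not recorded in this card. The sketch below only illustrates how a causal-LM fine-tune of the base checkpoint on this dataset could be wired up with the `Trainer` API; every value marked as a placeholder, as well as the column and split names, is an assumption rather than the actual recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "EleutherAI/pythia-6.9b-deduped"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # assumption: pad with EOS for batching
model = AutoModelForCausalLM.from_pretrained(base)

ds = load_dataset("pszemraj/HC3-textgen-qa")
# assumption: a single "text" column holds the formatted question/answer pairs
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=ds["train"].column_names,
)

args = TrainingArguments(
    output_dir="pythia-6.9b-hc3-qa-assistant",
    num_train_epochs=2,              # matches the two epochs reported above
    per_device_train_batch_size=1,   # placeholder
    gradient_accumulation_steps=16,  # placeholder
    learning_rate=1e-5,              # placeholder
    bf16=True,                       # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds.get("test"),     # assumption: the eval split is named "test"
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```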
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.2598 | 0.99 | 79 | 1.3291 | 0.6496 |
| 0.7446 | 1.99 | 158 | 1.2372 | 0.6769 |
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_pszemraj__pythia-6.9b-HC3).
| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 33.33 |
| ARC (25-shot) | 36.52 |
| HellaSwag (10-shot) | 61.76 |
| MMLU (5-shot) | 26.94 |
| TruthfulQA (0-shot) | 45.05 |
| Winogrande (5-shot) | 60.77 |
| GSM8K (5-shot) | 0.0 |
| DROP (3-shot) | 2.23 |