---
license: apache-2.0
tags:
- generated_from_trainer
- HC3
- chatGPT
- assistant
datasets:
- pszemraj/HC3-textgen-qa
metrics:
- accuracy
inference: false
base_model: EleutherAI/pythia-6.9b-deduped
---
# pythia-6.9b-deduped for general QA
<a href="https://colab.research.google.com/gist/pszemraj/e19747c911697b20f3bedf6e21dee0a5/pythia-6-9b-hc3-notebook-v2.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
This model is a fine-tuned version of [EleutherAI/pythia-6.9b-deduped](https://huggingface.co/EleutherAI/pythia-6.9b-deduped) on the pszemraj/HC3-textgen-qa dataset.
It achieves the following results on the evaluation set:
- Loss: 1.2372
- Accuracy: 0.6769
- Perplexity: 3.446 (exp of the validation loss)
## Model description
A text-generation model trained on the HC3 data of human questions paired with chatGPT answers.
![example](https://i.imgur.com/iMqPDXU.png)
### Usage
Install the necessary packages for 8-bit inference (_unless you have a big boi GPU_):
```bash
pip install -U -q transformers bitsandbytes accelerate
```
Basic inference example:
```python
import pprint as pp

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")
model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
)  # shards are ~4GB each, there are eight total

# the question is followed by an <answer> marker, which the model completes
prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs, max_new_tokens=300
)  # default generation config (+ 300 tokens)

# decode and keep only the text before the <end_answer> marker
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
result = result.split("<end_answer>")[0].strip()
pp.pprint(result)
```
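The example above hard-codes the `<answer>` marker and strips everything after `<end_answer>`. A small helper, sketched here on the assumption that these two markers are the only formatting the fine-tune expects and that they survive decoding as plain text, keeps prompt construction and answer extraction in one place:

```python
def ask(question: str, max_new_tokens: int = 300) -> str:
    """Wrap a question in the <answer> ... <end_answer> format used above and return the answer text."""
    prompt = f"{question} <answer>"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    # drop the echoed prompt and anything after the <end_answer> marker
    return text.split("<answer>")[-1].split("<end_answer>")[0].strip()


print(ask("How much wood could a woodchuck chuck?"))
```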
The default `GenerationConfig` uses contrastive search with `top_k=4` and `penalty_alpha=0.6`. For more information on inference and the available parameters, see [the transformers docs](https://huggingface.co/docs/transformers/generation_strategies#decoding-strategies).
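If you want to make the decoding settings explicit (or tweak them) rather than relying on the checkpoint's saved defaults, they can be passed per call. The values below simply mirror the defaults mentioned above; anything else is illustrative:

```python
from transformers import GenerationConfig

# sketch: pass the contrastive-search parameters explicitly instead of using the saved defaults
gen_config = GenerationConfig(
    penalty_alpha=0.6,   # contrastive search degeneration penalty
    top_k=4,             # candidate pool size for contrastive search
    max_new_tokens=300,  # illustrative cap on the generated continuation
)
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```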
## Intended uses & limitations
- **Intended use:** research/exploration into how RLHF tuning compares with "guided"/specific tuning on "quality" datasets of responses, i.e. _"what the human would want as an answer anyway"_
- This model is **not** trained/fine-tuned with RLHF and therefore will not be as helpful/generalizable/safe as chatGPT (_quite apart from the fact that it is ~30x smaller_)
## Training and evaluation data
The model was trained and evaluated on the `pszemraj/HC3-textgen-qa` dataset; the reported metrics are summarized in the `model-index` below:
```yaml
model-index:
- name: pythia-6.9b-hc3-qa-assistant
results:
- task:
name: Causal Language Modeling
type: text-generation
dataset:
name: pszemraj/HC3-textgen-qa
metrics:
- name: Accuracy
type: accuracy
value: 0.6768941789814655
```
## Training procedure
The model was fine-tuned for two epochs on the `pszemraj/HC3-textgen-qa` dataset.
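The exact training hyperparameters are not recorded in this card. The sketch below only illustrates how a causal-LM fine-tune of the base checkpoint on this dataset could be wired up with the `Trainer` API; every value marked as a placeholder, as well as the column and split names, is an assumption rather than the actual recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "EleutherAI/pythia-6.9b-deduped"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # assumption: pad with EOS for batching
model = AutoModelForCausalLM.from_pretrained(base)

ds = load_dataset("pszemraj/HC3-textgen-qa")
# assumption: a single "text" column holds the formatted question/answer pairs
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=ds["train"].column_names,
)

args = TrainingArguments(
    output_dir="pythia-6.9b-hc3-qa-assistant",
    num_train_epochs=2,              # matches the two epochs reported above
    per_device_train_batch_size=1,   # placeholder
    gradient_accumulation_steps=16,  # placeholder
    learning_rate=1e-5,              # placeholder
    bf16=True,                       # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds.get("test"),     # assumption: the eval split is named "test"
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```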
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.2598 | 0.99 | 79 | 1.3291 | 0.6496 |
| 0.7446 | 1.99 | 158 | 1.2372 | 0.6769 |
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_pszemraj__pythia-6.9b-HC3).
| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 33.33 |
| ARC (25-shot) | 36.52 |
| HellaSwag (10-shot) | 61.76 |
| MMLU (5-shot) | 26.94 |
| TruthfulQA (0-shot) | 45.05 |
| Winogrande (5-shot) | 60.77 |
| GSM8K (5-shot) | 0.0 |
| DROP (3-shot) | 2.23 |