pythia-6.9b-deduped for general QA

This model is a fine-tuned version of EleutherAI/pythia-6.9b-deduped on the pszemraj/HC3-textgen-qa dataset. It achieves the following results on the evaluation set:

Loss: 1.2372
Accuracy: 0.6769
Perplexity: 3.446
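
The perplexity figure follows from the loss. A minimal check, assuming the reported loss is the mean token-level cross-entropy in nats (the usual convention for causal LM evaluation); the variable names are illustrative only:

```python
import math

# Evaluation loss reported above, assumed to be mean cross-entropy in nats.
eval_loss = 1.2372

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

print(f"{perplexity:.3f}")  # 3.446
```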

Model description

A text generation model trained on the HC3 text data: human questions paired with ChatGPT answers.

Usage

Install the packages needed for inference. bitsandbytes is only required for 8-bit loading, which you can skip if your GPU has enough memory for the full model:

pip install -U -q transformers bitsandbytes accelerate

Basic inference example:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
)  # shards are ~4GB each, there are eight total

prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # follow device_map placement
outputs = model.generate(
    **inputs, max_new_tokens=300
)  # default generation config (+ 300 tokens)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
result = result.split("<end_answer>")[0].strip()

import pprint as pp

pp.pprint(result)
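
The example above relies on a simple prompt convention: the question is followed by `<answer>`, and the generated text is cut at `<end_answer>`. A small sketch of helpers that wrap this convention is shown below. The function names `build_prompt` and `extract_answer` are hypothetical and not part of this repository, and dropping the echoed question before `<answer>` is an added convenience that the original example does not perform.

```python
ANSWER_TAG = "<answer>"        # marker appended to the question in the example above
END_TAG = "<end_answer>"       # marker the example above splits the output on


def build_prompt(question: str) -> str:
    """Format a question the way the basic inference example does (hypothetical helper)."""
    return f"{question.strip()} {ANSWER_TAG}"


def extract_answer(decoded: str) -> str:
    """Return only the answer text from a decoded generation (hypothetical helper).

    Everything up to and including the first ANSWER_TAG is dropped if the tag is
    present, and anything after the first END_TAG is discarded.
    """
    after_tag = decoded.split(ANSWER_TAG, 1)[-1]
    return after_tag.split(END_TAG)[0].strip()


if __name__ == "__main__":
    print(build_prompt("How much wood could a woodchuck chuck?"))
    print(extract_answer("Q? <answer> Quite a lot. <end_answer> trailing text"))
```

With these helpers, `prompt = build_prompt(question)` would replace the hand-written prompt string, and `extract_answer(result)` would replace the manual `split` call.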

The default GenerationConfig uses contrastive search with top_k=4 and penalty_alpha=0.6. For more information on inference and the parameters available, see the transformers docs.
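
If you prefer to make these settings explicit rather than rely on the stored config, they can be passed to `generate` directly. This is a sketch only: it assumes `model` and `inputs` come from the basic inference example, and that your installed transformers version selects contrastive search when `penalty_alpha` and `top_k` are both given, which may vary between releases.

```python
# Generation settings matching the defaults described above.
# The dict name is illustrative, not part of the repository.
contrastive_kwargs = {
    "penalty_alpha": 0.6,   # degeneration penalty used by contrastive search
    "top_k": 4,             # number of candidate tokens considered per step
    "max_new_tokens": 300,  # same budget as the basic example
}

# Usage, assuming `model` and `inputs` from the basic inference example:
# outputs = model.generate(**inputs, **contrastive_kwargs)
print(contrastive_kwargs)
```

Overriding individual values here is an easy way to experiment, for example raising `top_k` to see how the candidate pool affects output quality.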

Intended uses & limitations

Intended use: research and exploration comparing RLHF tuning with "guided", task-specific tuning on "quality" datasets, i.e. responses that approximate what a human would want as an answer anyway.
This model was not trained or fine-tuned with RLHF, so it will not be as helpful, generalizable, or safe as ChatGPT (quite apart from being ~30x smaller).

Training and evaluation data

model-index:
- name: pythia-6.9b-hc3-qa-assistant
  results:
  - task:
      name: Causal Language Modeling
      type: text-generation
    dataset:
      name: pszemraj/HC3-textgen-qa
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.6768941789814655

Training procedure

Two epochs on the pszemraj/HC3-textgen-qa dataset.

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
1.2598	0.99	79	1.3291	0.6496
0.7446	1.99	158	1.2372	0.6769

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	33.33
ARC (25-shot)	36.52
HellaSwag (10-shot)	61.76
MMLU (5-shot)	26.94
TruthfulQA (0-shot)	45.05
Winogrande (5-shot)	60.77
GSM8K (5-shot)	0.0
DROP (3-shot)	2.23
