pythia-6.9b-deduped for general QA

This model is a fine-tuned version of EleutherAI/pythia-6.9b-deduped on the pszemraj/HC3-textgen-qa dataset. It achieves the following results on the evaluation set:

  • Loss: 1.2372
  • Accuracy: 0.6769
  • Perplexity: 3.446
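
The perplexity figure follows from the loss. As a quick check, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language modeling, but not stated in this card), exponentiating it should give the listed perplexity:

```python
import math

# Evaluation loss reported above; assumed to be mean cross-entropy in nats.
eval_loss = 1.2372

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.3f}")
```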

Model description

Text generation model trained on the HC3 text data of human questions paired with ChatGPT answers.



Install the packages needed for inference (8-bit loading via bitsandbytes is only required if your GPU cannot hold the full model):

pip install -U -q transformers bitsandbytes accelerate

Basic inference example:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
)  # shards are ~4GB each, there are eight total

prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs, max_new_tokens=300
)  # default generation config (+ 300 tokens)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
result = result.split("<end_answer>")[0].strip()

import pprint as pp

pp.pprint(result)  # print the trimmed answer

The default GenerationConfig uses contrastive search with top_k=4 and penalty_alpha=0.6. For more information on inference and parameters to use, see the transformers docs.
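
The inference example relies on two conventions: the prompt ends with an `<answer>` marker, and the decoded output is cut at `<end_answer>`. A minimal sketch of those steps as reusable helpers is shown below, together with the contrastive search settings written out explicitly. The helper names (`build_prompt`, `extract_answer`) and the `CONTRASTIVE_KWARGS` dict are hypothetical and not part of the model repo. Passing these values directly to `generate` assumes a transformers version in which contrastive search is still built in.

```python
# Hypothetical helpers for illustration; these names are not part of the model repo.
ANSWER_START = "<answer>"    # marker appended to the question, as in the example prompt
ANSWER_END = "<end_answer>"  # marker the example splits on to trim the output

# Contrastive search settings quoted from the default GenerationConfig described above,
# plus the 300-token limit used in the inference example.
CONTRASTIVE_KWARGS = {"top_k": 4, "penalty_alpha": 0.6, "max_new_tokens": 300}


def build_prompt(question: str) -> str:
    """Append the answer marker to a question."""
    return f"{question.strip()} {ANSWER_START}"


def extract_answer(decoded: str) -> str:
    """Keep only the text before the first end marker, mirroring the example."""
    return decoded.split(ANSWER_END)[0].strip()


prompt = build_prompt("I was wondering how much wood a woodchuck could chuck?")

# With a model and tokenizer loaded as in the inference example, one could then call:
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, **CONTRASTIVE_KWARGS)
# answer = extract_answer(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

As in the original example, the decoded text still begins with the prompt itself; only the text after the end marker is dropped.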

Intended uses & limitations

  • Intended use: research/exploration comparing RLHF tuning with "guided"/specific tuning on "quality" datasets/responses of "what the human would want as an answer anyway".
  • This model is not trained or fine-tuned with RLHF and therefore will not be as helpful, generalizable, or safe as ChatGPT (quite apart from the fact that this model is ~30x smaller).

Training and evaluation data

- name: pythia-6.9b-hc3-qa-assistant
  - task:
      name: Causal Language Modeling
      type: text-generation
  - dataset:
      name: pszemraj/HC3-textgen-qa
  - metrics:
    - name: Accuracy
      type: accuracy
      value: 0.6768941789814655

Training procedure

Two epochs on the pszemraj/HC3-textgen-qa dataset.

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| 1.2598        | 0.99  | 79   | 1.3291          | 0.6496   |
| 0.7446        | 1.99  | 158  | 1.2372          | 0.6769   |

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|-------|
| Avg.                | 33.33 |
| ARC (25-shot)       | 36.52 |
| HellaSwag (10-shot) | 61.76 |
| MMLU (5-shot)       | 26.94 |
| TruthfulQA (0-shot) | 45.05 |
| Winogrande (5-shot) | 60.77 |
| GSM8K (5-shot)      | 0.0   |
| DROP (3-shot)       | 2.23  |