Edit model card
Deita banner

Model Card for Deita Quality Scorer

GitHub | Paper

Deita is an open-sourced project designed to facilitate Automatic Data Selection for instruction tuning in Large Language Models (LLMs).

Deita Quality Scorer is a tool for automatically annotating the Instruction Quality of SFT data.

Model description

  • Model type: Model fine tuned to automatically annotate the Instruction-Response Pair Quality
  • Language(s) (NLP): Primarily English
  • Finetuned from model: Llama-1-13b-hf

Model Sources

Performance

Model Align Data Size MT-Bench AlpacaEval(%) OpenLLM (Avg.)
Proprietary Models
GPT-4-Turbo ? -- 9.32 97.70 --
GPT-4 SFT + PPO -- 8.99 95.03 --
Claude-2 SFT + PPO -- 8.06 91.36 --
GPT-3.5-turbo SFT + PPO -- 7.94 89.37 --
Open-sourced Models based on LLaMA-1-13B
LIMA SFT 1K SFT 4.29 41.98 59.82
WizardLM-13B SFT 70K SFT 6.35 75.31 58.96
Vicuna-13B-v1.3 SFT 125K SFT 6.39 82.11 60.01
Random SFT 10K SFT 6.03 71.52 60.14
DEITA-LLaMA1-13B-v1.0-sft SFT 10K SFT 6.60 78.01 64.27
Open-sourced Models based on LLaMA-2-13B
Tulu-2-13B SFT 326K SFT 6.70 78.90 --
Tulu-2-13B+DPO SFT + DPO 326K SFT + 60K DPO 7.00 89.50 --
LLaMA2-13B-Chat SFT + PPO -- 6.65 81.09 --
WizardLM-13B-v1.2 SFT >70K SFT 7.09 89.17 --
Vicuna-13B-v1.5 SFT 125K SFT 6.57 78.80 61.63
Random SFT 10K SFT 5.78 65.19 61.32
DEITA-LLaMA2-13B-v1.0-sft SFT 10K SFT 6.79 81.09 62.71
Open-sourced Models based on Mistral-7B
Mistral-7B-Instruct-v0.1 -- -- 6.84 69.65 60.45
Zephyr-7B-sft SFT 200K SFT 5.32 75.12 60.93
$\text{Zephyr-7B-}\beta$ SFT + DPO 200K SFT + 60K DPO 7.34 90.60 66.36
OpenChat-3.5 C-RLFT >> 70K C-RLFT 7.81 88.51 --
Starling-7B C-RLFT + APA >>70K C-RLFT + 183K APA 8.09 91.99 --
Random SFT 10K SFT 5.89 56.90 61.72
DEITA-7B-v1.0-sft (6K) SFT 6K SFT 7.22 80.78 64.94
DEITA-7B-v1.0-sft (10K) SFT 10K SFT 7.32 81.67 64.00
DEITA-7B-v1.0 SFT + DPO 6K SFT + 10K DPO 7.55 90.06 69.86

Usage

Please use the following format to score the quality of Instruction-Response Pair

from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax
model_name = "hkust-nlp/deita-quality-scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def infer_Quality(model, tokenizer, input_text, resp_text):
    quality_template = ("You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: ")
    user_input = quality_template.format(instruction=input_text, output=resp_text)
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    max_length = 512
    outputs = model.generate(input_ids, max_length=512, num_return_sequences=1, return_dict_in_generate=True, output_scores=True)
    logprobs_list = outputs.scores[0][0]
    score_logits = []
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1,2,3,4,5,6])
    for k in id2score:
        score_logits.append(logprobs_list[k])
    score_logits = np.array(score_logits)
    score_npy = softmax(score_logits, axis=0)
    score_npy = score_npy * score_template

    score_npy = np.sum(score_npy, axis=0)
    return score_npy

input_text = "word to describe UI with helpful tooltips" # Example Input
output_text = "User-friendly or intuitive UI" # Example Output
quality_score = infer_quality(model, tokenizer, input_text)

print(quality_score)

Citation

If you find the content of this project helpful, please cite our paper as follows:

@misc{liu2023what,
      title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning}, 
      author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
      year={2023},
      eprint={2312.15685},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Downloads last month
263

Collection including hkust-nlp/deita-quality-scorer