---
license: apache-2.0
---
# Model Card for Deita Complexity Scorer
Deita is an open-source project designed to facilitate Automatic Data Selection for instruction tuning in Large Language Models (LLMs).

The Deita Complexity Scorer is a tool for automatically annotating the instruction complexity of SFT (supervised fine-tuning) data.
## Model description
- Model type: a model fine-tuned to automatically annotate instruction complexity
- Language(s) (NLP): Primarily English
- Finetuned from model: Llama-1-13b-hf
## Model Sources
- Repository: https://github.com/hkust-nlp/deita
- Model Family: other models and the dataset are available in the Deita collection.
## Usage
Please use the following format:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax

model_name = "hkust-nlp/Deita-Complexity-Scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def infer_complexity(model, tokenizer, input_text):
    # Wrap the query in the prompt template the scorer was trained with.
    complexity_template = ("You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: ")
    user_input = complexity_template.format(instruction=input_text)
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    max_length = 512
    # Generate with per-step scores so the logits of the first generated
    # token (the predicted complexity digit) can be read back.
    outputs = model.generate(input_ids,
                             max_length=max_length,
                             num_return_sequences=1,
                             return_dict_in_generate=True,
                             output_scores=True)
    logprobs_list = outputs.scores[0][0]
    score_logits = []
    # Token ids of the digits "1"-"6" in the Llama tokenizer vocabulary.
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1, 2, 3, 4, 5, 6])
    for k in id2score:
        score_logits.append(logprobs_list[k])
    # Softmax over the six digit logits, then take the probability-weighted
    # average to obtain a continuous complexity score between 1 and 6.
    score_logits = np.array(score_logits)
    score_npy = softmax(score_logits, axis=0)
    score_npy = score_npy * score_template
    score_npy = np.sum(score_npy, axis=0)
    return score_npy

# example input
input_text = "write a performance review for a junior data scientist"
complexity_score = infer_complexity(model, tokenizer, input_text)
print(complexity_score)
```
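The returned value is the probability-weighted average of the candidate scores 1-6, so it is a float between 1 and 6, with higher values indicating more complex instructions. As a minimal sketch of how the scorer could feed into data selection, the snippet below (not part of the deita package; the instruction list is purely illustrative) scores a few instructions with the `infer_complexity` function defined above and sorts them from most to least complex:

```python
# Hypothetical example: rank a small pool of SFT instructions by complexity.
instructions = [
    "say hello",
    "write a performance review for a junior data scientist",
    "prove that there are infinitely many prime numbers",
]

# Score each instruction and sort descending, so the most complex come first.
scored = sorted(
    ((infer_complexity(model, tokenizer, text), text) for text in instructions),
    key=lambda pair: pair[0],
    reverse=True,
)

for score, text in scored:
    print(f"{score:.2f}\t{text}")
```

For large instruction pools, you would likely want to batch the prompts rather than score them one at a time.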