
flageval_judgemodel Card

Model Details

flageval_judgemodel is a judge LLM developed by the FlagEval team (https://flageval.baai.ac.cn/#/home).

  • Developed by: FlagEval, BAAI
  • Model type: An auto-regressive language model based on the transformer architecture.
  • License: Non-commercial license
  • Finetuned from model: Vicuna.
  • Model size: 32.5B parameters (BF16, Safetensors).

Uses

The flageval_judgemodel is designed to evaluate the performance of large language models on the CLCC (Chinese Linguistics & Cognition Challenge) dataset (https://huggingface.co/datasets/eyuansu71/CLCC_v1). It aims to provide automated evaluation, potentially replacing human judgment in assessing model outputs.
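To judge a model's CLCC outputs in bulk, the dataset can be pulled from the Hub with the datasets library. A minimal sketch; the splits and column names are not documented here, so inspect the loaded object for the actual schema before wiring it into the judging loop below:

from datasets import load_dataset

# Load CLCC from the Hugging Face Hub and print its splits and columns.
# The schema is not specified on this card, so check it before use.
clcc = load_dataset("eyuansu71/CLCC_v1")
print(clcc)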

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def promptify(prompt, pred, gold):
    # Build the pairwise-comparison prompt: Assistant 1 is the gold reference,
    # Assistant 2 is the prediction to be judged.
    system_msg = "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."
    prompt_template = f"You are a helpful and precise assistant for checking the quality of the answer.\n[Question]\n{prompt}\n\n[The Start of Assistant 1's Answer]\n{gold}\n\n[The End of Assistant 1's Answer]\n\n[The Start of Assistant 2's Answer]\n{pred}\n\n[The End of Assistant 2's Answer]\n\n[System]\n{system_msg}\n\n### Response:10"
    return prompt_template

# flash_attention_2 requires the flash-attn package; drop attn_implementation if it is not installed.
model = AutoModelForCausalLM.from_pretrained(
    "FlagEval/flageval_judgemodel",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).cuda()
tokenizer = AutoTokenizer.from_pretrained("FlagEval/flageval_judgemodel")

# Example CLCC item: "Do the two sentences mean the same thing?";
# both the gold answer and the prediction are 不一样 ("not the same").
prompt, pred, gold = "1、约翰喜欢看电影,玛丽也喜欢。\n2、约翰也喜欢看足球比赛。\n请问以上两句话是否是一个意思?", "不一样", "不一样"

with torch.no_grad():
    data_sample = promptify(prompt, pred, gold)
    input_ids = tokenizer(data_sample, return_tensors="pt").input_ids.cuda()
    output_ids = model.generate(input_ids, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    # The decoded text starts with the prompt; slice it off by character length
    # to keep only the judge's generated answer.
    ans = text[len(data_sample):].strip()
    # Count the prediction as correct (label 1) only when the judge scores it 10.
    pred_label = 1 if int(ans) == 10 else 0
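The quickstart assumes the judge replies with a bare integer; int(ans) raises a ValueError otherwise. A minimal, more defensive parser is sketched below (the regex-based extraction is an assumption for illustration, not part of the official pipeline):

import re

def parse_judge_score(answer: str) -> int:
    # Map the judge's free-form reply to a 0/1 label, mirroring the
    # quickstart rule: label 1 only when the score is exactly 10.
    match = re.search(r"\d+", answer)   # first integer in the reply, if any
    if match is None:
        return 0                        # unparseable reply counts as a miss
    return 1 if int(match.group()) == 10 else 0

pred_label = parse_judge_score(ans)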