---
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- chat
- evaluate
---
# flageval_judgemodel Card

## Model Details

flageval_judgemodel is a judge LLM developed by the FlagEval team (https://flageval.baai.ac.cn/#/home).

- Developed by: [FlagEval](https://flageval.baai.ac.cn/#/home), [BAAI](https://www.baai.ac.cn/english.html)
- Model type: An auto-regressive language model based on the transformer architecture
- License: Non-commercial license
- Finetuned from model: [Vicuna](https://vicuna.lmsys.org)

## Uses

flageval_judgemodel is designed to evaluate the performance of large language models on the CLCC dataset (https://huggingface.co/datasets/eyuansu71/CLCC_v1), a Chinese Linguistics & Cognition Challenge dataset. It aims to provide automated evaluation that could potentially replace human judgment when assessing model outputs.
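
One way to check how well an automated judge can stand in for human annotators is to measure how often its binary labels agree with human labels on the same items. The following is a minimal sketch under that assumption; the function name `agreement_rate` and the sample labels are illustrative and not part of the released tooling.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Fraction of items on which the judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge_labels and human_labels must have the same length")
    if len(judge_labels) == 0:
        raise ValueError("at least one labelled item is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical labels: 1 = answer judged correct, 0 = answer judged incorrect.
judge = [1, 0, 1, 1]
human = [1, 0, 0, 1]
print(agreement_rate(judge, human))  # 0.75
```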


## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def promptify(prompt, pred, gold):
    """Build the judge prompt. Assistant 1 is the reference answer, Assistant 2 is the prediction."""
    sys = "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."
    # The template ends with "### Response:10", so the reference answer's score appears to be
    # pre-filled as 10 and the model only has to generate the score for Assistant 2.
    prompt_template = f"You are a helpful and precise assistant for checking the quality of the answer.\n[Question]\n{prompt}\n\n[The Start of Assistant 1's Answer]\n{gold}\n\n[The End of Assistant 1's Answer]\n\n[The Start of Assistant 2's Answer]\n{pred}\n\n[The End of Assistant 2's Answer]\n\n[System]\n{sys}\n\n### Response:10"
    return prompt_template


# attn_implementation="flash_attention_2" requires the flash-attn package and a supported GPU.
model = AutoModelForCausalLM.from_pretrained("FlagEval/flageval_judgemodel", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, attn_implementation="flash_attention_2").cuda()
tokenizer = AutoTokenizer.from_pretrained("FlagEval/flageval_judgemodel")

prompt, pred, gold = '1、约翰喜欢看电影，玛丽也喜欢。\n2、约翰也喜欢看足球比赛。\n请问以上两句话是否是一个意思？', "不一样", "不一样"

with torch.no_grad():
    data_sample = promptify(prompt, pred, gold)
    input_ids = tokenizer(data_sample, return_tensors="pt").input_ids.cuda()
    output_ids = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens. Slicing the decoded string by len(data_sample)
# is fragile, because decoding does not always reproduce the prompt character for character.
new_tokens = output_ids[0][input_ids.shape[1]:]
ans = tokenizer.decode(new_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=True).strip()
# The prediction is labelled correct (1) only when the judge gives it a score of 10.
pred_label = 1 if int(ans) == 10 else 0
```
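
In the Quickstart, `int(ans)` raises a `ValueError` whenever the generated text is anything other than a bare integer, for example when the score is followed by an explanation. The sketch below shows one possible way to parse the output more defensively and to aggregate labels into an accuracy figure. The helpers `parse_judge_score`, `to_label`, and `accuracy` are hypothetical names introduced here for illustration, and treating an unparseable output as incorrect is an assumption, not documented behaviour of the model.

```python
import re
from typing import Optional, Sequence


def parse_judge_score(ans: str) -> Optional[int]:
    """Return the leading integer score if it lies in 1..10, otherwise None."""
    match = re.match(r"\s*(\d+)", ans)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def to_label(ans: str) -> int:
    """Map judge output to a binary label: 1 only when the score is exactly 10."""
    # Assumption: unparseable or out-of-range output counts as incorrect (0).
    return 1 if parse_judge_score(ans) == 10 else 0


def accuracy(labels: Sequence[int]) -> float:
    """Share of predictions that the judge labelled as correct."""
    if len(labels) == 0:
        raise ValueError("at least one label is required")
    return sum(labels) / len(labels)


# Made-up judge outputs, used only to exercise the helpers.
outputs = ["10", " 10\nBoth answers agree.", "3", "no score given"]
labels = [to_label(o) for o in outputs]
print(labels)            # [1, 1, 0, 0]
print(accuracy(labels))  # 0.5
```

Anchoring the regular expression at the start of the output means a number that appears later, inside an explanation, is never mistaken for the score.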