---
base_model:
- google/gemma-2-2b-it
library_name: transformers
tags:
- mergekit
- merge
license: apache-2.0
datasets:
- prometheus-eval/Preference-Collection
- prometheus-eval/Feedback-Collection
language:
- en
---
# prometheus2-2B
A Gemma-2-2B-Instruct model fine-tuned to replicate [prometheus-7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0), using [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it) as the base model.
Training hyperparameters:
* 3 epochs
* Learning rate: 1e-5
* Effective batch size: 4
* Cosine annealing schedule
* ~5% warmup
Supports both feedback (Likert-scale) evaluation and preference evaluation, using the same prompt formats as prometheus-7b-v2.0. See the examples below.
# Feedback Evaluation
```python
ABSOLUTE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate:
{}
###Response to evaluate:
{}
###Reference Answer (Score 5):
{}
###Score Rubrics:
{}
###Feedback: """
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained("zli12321/prometheus2-2B").to(device)
tokenizer = AutoTokenizer.from_pretrained("zli12321/prometheus2-2B")
# Define your own instruction, response, reference, and rubric below.
prompt = ABSOLUTE_PROMPT.format(instruction, response, reference, rubric)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
input_length = input_ids.shape[1]
outputs = model.generate(input_ids, output_logits=True, return_dict_in_generate=True, max_new_tokens=4096)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```
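The model places its verdict after a `[RESULT]` tag, so the score can be pulled out of the decoded text with a small regex helper. A minimal sketch; `parse_absolute_result` is an illustrative name, not part of the original card:

```python
import re

def parse_absolute_result(output: str):
    """Extract the 1-5 integer score following the [RESULT] tag, or None."""
    match = re.search(r"\[RESULT\]\s*([1-5])", output)
    return int(match.group(1)) if match else None

print(parse_absolute_result("Feedback: concise and accurate. [RESULT] 4"))  # → 4
```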
# Preference Evaluation Template
Follow the same steps as above, substituting the preference evaluation template below. It takes five inputs: the instruction, the two responses, a reference answer, and the score rubric; the model outputs `A` or `B` after the `[RESULT]` tag.
```
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.
###Instruction:
{}
###Response A:
{}
###Response B:
{}
###Reference Answer:
{}
###Score Rubric:
{}
###Feedback:
```
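A sketch of filling and parsing the preference template, reusing the model/tokenizer setup from the feedback example for generation. The toy inputs and the `parse_relative_result` helper are illustrative assumptions, not part of the original card:

```python
import re

# The preference template above, with five positional slots (instruction,
# response A, response B, reference answer, score rubric).
RELATIVE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.
###Instruction:
{}
###Response A:
{}
###Response B:
{}
###Reference Answer:
{}
###Score Rubric:
{}
###Feedback: """

# Toy inputs for illustration only.
instruction = "Explain why the sky is blue."
response_a = "Rayleigh scattering: short wavelengths scatter more than long ones."
response_b = "Because the sky reflects the color of the ocean."
reference = "Sunlight is scattered by air molecules; blue light scatters the most."
rubric = "Is the explanation scientifically accurate?"

prompt = RELATIVE_PROMPT.format(instruction, response_a, response_b, reference, rubric)

# Generate with the same model/tokenizer setup as the feedback example, then
# parse the A/B verdict from the decoded output text:
def parse_relative_result(output: str):
    match = re.search(r"\[RESULT\]\s*([AB])", output)
    return match.group(1) if match else None
```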
# Citations
```bibtex
@misc{kim2023prometheus,
title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
year={2023},
eprint={2310.08491},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
```bibtex
@misc{kim2024prometheus,
title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
year={2024},
eprint={2405.01535},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |