viethoangtranduong committed
Commit 7b7b07b
Parent(s): 5db967d
Update README.md

README.md CHANGED
@@ -44,7 +44,7 @@ On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
-
+- When post-processing the model outputs with PairRM best-of-16, which involved generating 16 responses and selecting the highest-scoring response by PairRM, we scored **34.86** - ranked 2nd.
 The best model on the leaderboard is "gpt-4-turbo", which is also the judge of optimal responses.
 
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
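The PairRM best-of-16 post-processing added in this commit is a standard best-of-n selection: sample n candidate responses and keep the one the reward model scores highest. A minimal sketch, where `generate` and `score` are illustrative stand-ins for the actual LLM sampling call and the PairRM reward model (not a real API):

```python
import random


def best_of_n(prompt, generate, score, n=16):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs end to end; in the actual pipeline,
# `generate` would call the fine-tuned model and `score` would query PairRM.
def toy_generate(prompt):
    return f"response-{random.randint(0, 999)}"


def toy_score(prompt, candidate):
    # Note: PairRM actually ranks responses pairwise; reducing it to a
    # scalar per-candidate score is a simplification for this sketch.
    return len(candidate)


best = best_of_n("Explain the methodology.", toy_generate, toy_score, n=16)
```

With n=16 this matches the "best-of-16" setting reported above; larger n trades more inference compute for a better expected response.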