viethoangtranduong committed
Commit 7b7b07b
Parent(s): 5db967d
Update README.md

README.md CHANGED
@@ -44,7 +44,7 @@ On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
-
+- When post-processing the model outputs with PairRM best-of-16, which involved generating 16 responses and selecting the highest-scoring response by PairRM, we scored **34.86** - ranked 2nd.
 The best model on the leaderboard is "gpt-4-turbo", which is also the judge of optimal responses.
 
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
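The PairRM best-of-16 post-processing added in this commit is a standard best-of-n selection: sample n candidate responses and keep the one the reward model scores highest. A minimal sketch, where `generate` and `score` are illustrative stand-ins for the actual LLM sampling call and the PairRM reward model (not a real API):

```python
import random


def best_of_n(prompt, generate, score, n=16):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs end to end; in the actual pipeline,
# `generate` would call the fine-tuned model and `score` would query PairRM.
def toy_generate(prompt):
    return f"response-{random.randint(0, 999)}"


def toy_score(prompt, candidate):
    # Note: PairRM actually ranks responses pairwise; reducing it to a
    # scalar per-candidate score is a simplification for this sketch.
    return len(candidate)


best = best_of_n("Explain the methodology.", toy_generate, toy_score, n=16)
```

With n=16 this matches the "best-of-16" setting reported above; larger n trades more inference compute for a better expected response.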