AlpacaEval results by ChatGPT! Community run

#38
by Yhyu13 - opened

Hi,

Since phi-2 requires remote code, which the HF Open LLM Leaderboard does not accept at the moment, I ran phi-2 and my SFT of it on alpaca gpt4 en (https://huggingface.co/Yhyu13/phi-2-sft-alpaca_gpt4_en-ep1-lora/tree/main) through the AlpacaEval benchmark:

https://tatsu-lab.github.io/alpaca_eval/

Here are the results, evaluated by ChatGPT: https://github.com/tatsu-lab/alpaca_eval/pull/183

                       win_rate  standard_error  n_total  avg_length
gpt4                      73.79            1.54      805        1365
claude                    70.37            1.60      805        1082
chatgpt                   66.09            1.66      805         811
wizardlm-13b              65.16            1.67      805         985
vicuna-13b                64.10            1.69      805        1037
guanaco-65b               62.36            1.71      805        1249
oasst-rlhf-llama-33b      62.05            1.71      805        1079
alpaca-farm-ppo-human     60.25            1.72      805         803
falcon-40b-instruct       56.52            1.74      805         662
phi-2-alpaca-gpt4(new)    54.23            1.75      804        1138
text_davinci_003          50.00            0.00      805         307
alpaca-7b                 45.22            1.74      805         396
phi-2(new)                43.79            1.74      805         924
text_davinci_001          28.07            1.56      805         296
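As a rough sanity check on the numbers above, the reported standard errors can be approximated with a plain binomial formula, and the gap to the 50.00 text_davinci_003 baseline can be expressed in standard-error units. This is only a sketch: the helper names are made up for illustration, and the binomial formula is an assumption rather than the exact AlpacaEval computation, so small differences from the table are expected.

```python
import math

def binomial_standard_error(win_rate_pct, n_total):
    """Approximate standard error (in percentage points) of a win rate.

    Assumption: each comparison is an independent win/loss trial.
    AlpacaEval's own computation may differ slightly (e.g. for ties).
    """
    p = win_rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_total)

def gap_in_standard_errors(win_rate_pct, n_total, baseline_pct=50.0):
    """How many standard errors the win rate sits above the baseline."""
    return (win_rate_pct - baseline_pct) / binomial_standard_error(win_rate_pct, n_total)

# (win_rate, n_total) copied from the two new rows of the table.
new_rows = {
    "phi-2-alpaca-gpt4": (54.23, 804),
    "phi-2": (43.79, 805),
}

for name, (win_rate, n_total) in new_rows.items():
    se = binomial_standard_error(win_rate, n_total)
    gap = gap_in_standard_errors(win_rate, n_total)
    print(f"{name}: approx. SE = {se:.2f}, gap to baseline = {gap:+.2f} SE")
```

Under this approximation the fine-tuned model sits a bit more than two standard errors above the baseline, while base phi-2 sits clearly below it.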

This could be a milestone for small models: we finally have an open model that everyone can run and that surpasses GPT-3.5 (text_davinci_003 in the table above)!

Cheers!

Wow, nice to see a community replication. I place a lot of weight on these, as they are independent. Looks like it's confirmed to be best-for-size.

It would be interesting if someone used Claude as a judge. GPT-4 was presumably used to generate the training data, so there is a bit of a bias when it is used as the judge (of course it likes its own student, it says exactly what it would say!).

Microsoft org

Thanks a lot, @Yhyu13! I will leave this as closed (just for the sake of controlling answers), but please feel free to re-open it if you prefer.

gugarosa changed discussion status to closed
