Benchmark Evals?

#3 · opened by senseable

Just discovered this model, and I agree its writing and reasoning depth seem greatly improved.
Are you going to submit it to the Hugging Face Open LLM Leaderboard? I'm interested in seeing its benchmarks.
Nice work!

@senseable

I just tried comparing another 7B model (not this one) with its extended version (built using the same config) on the Open LLM Leaderboard results. Here is what I get:


| Metric | Diff | Extended (10.7B) | Original (7B) |
|---|---:|---:|---:|
| Avg. | -3.76 | 69.75 | 73.51 |
| AI2 Reasoning Challenge (25-shot) | -3.07 | 68.09 | 71.16 |
| HellaSwag (10-shot) | -0.66 | 87.10 | 87.76 |
| MMLU (5-shot) | -0.34 | 64.43 | 64.77 |
| TruthfulQA (0-shot) | -0.97 | 64.28 | 65.25 |
| Winogrande (5-shot) | -0.31 | 82.72 | 83.03 |
| GSM8k (5-shot) | -17.21 | 51.86 | 69.07 |

But the effect in chat seems good and stable. Thanks for this great config!
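For anyone who wants to sanity-check numbers like these locally instead of waiting in the leaderboard queue, here is a minimal sketch using EleutherAI's lm-evaluation-harness Python API. The model name, dtype, batch size, and single task are illustrative assumptions; the hosted leaderboard pins its own harness version and prompt settings, so local scores will not match it exactly.

```python
# Minimal local eval sketch with lm-evaluation-harness (lm-eval >= 0.4).
# Assumptions: model id, dtype, batch size, and task choice are placeholders,
# not the leaderboard's exact configuration.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=senseable/WestLake-7B-v2,dtype=bfloat16",
    tasks=["arc_challenge"],  # ARC is run 25-shot on the leaderboard
    num_fewshot=25,
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```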

@seyf1elislam Interesting, thanks for sharing. Nothing a little fine-tuning couldn't fix, potentially with a higher ceiling on evals like MMLU.

@senseable Exactly: we have the potential to build some amazing larger models with the great Mistral-7B as a base. Your fine-tune is the perfect starting point. I think the process should go fine-tune > self-merge > fine-tune > self-merge > fine-tune, and so on.

After each self-merge, reapplying the original fine-tune should help realign the layers and remove the errors introduced by the self-merge. It should also yield a new model that can be further self-merged. If you would like to try reapplying your WestLake fine-tune to this 10.7B self-merge, I would love to see how far we can push it. I expect the next good self-merge could yield a 16-20B model, and maybe it is possible to push it all the way to 34B.
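For reference, here is a hedged do-it-by-hand sketch of what a passthrough self-merge does: duplicate a range of decoder layers of a Mistral-7B-based model to get a deeper model. The slice ranges, output path, and model id below are illustrative assumptions, not the actual recipe used for WestLake-10.7B-v2 (which was built with a merge config), so treat it as a conceptual demo only.

```python
# Conceptual "passthrough" self-merge: stack two overlapping layer ranges of a
# 7B model to build a deeper model (SOLAR-style depth up-scaling).
# Assumptions: slice ranges (0-24, 8-32), output directory name, and the use of
# the transformers API instead of the original merge tooling.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "senseable/WestLake-7B-v2"
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Keep layers 0-23, then repeat layers 8-31, for 48 decoder layers in total.
slices = [(0, 24), (8, 32)]

new_layers = torch.nn.ModuleList()
for start, end in slices:
    for idx in range(start, end):
        layer = copy.deepcopy(model.model.layers[idx])
        # Re-index the attention module so KV-cache bookkeeping stays consistent.
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = len(new_layers)
        new_layers.append(layer)

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)

out_dir = "WestLake-10.7B-selfmerge"  # hypothetical output name
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(BASE).save_pretrained(out_dir)
```

The duplicated middle layers are what the follow-up fine-tune would need to realign, since they now see activations from a different depth than they were trained at.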

Here is the HF Open LLM Leaderboard comparison:


| Metric | Diff | WestLake-10.7B-v2 | WestLake-7B-v2 |
|---|---:|---:|---:|
| Avg. | -5.14 | 70.28 | 75.42 |
| AI2 Reasoning Challenge (25-shot) | -1.88 | 71.16 | 73.04 |
| HellaSwag (10-shot) | -0.72 | 87.93 | 88.65 |
| MMLU (5-shot) | -0.90 | 63.81 | 64.71 |
| TruthfulQA (0-shot) | -2.15 | 64.91 | 67.06 |
| Winogrande (5-shot) | -1.58 | 85.40 | 86.98 |
| GSM8k (5-shot) | -19.18 | 48.45 | 67.63 |
