Benchmark Evals?

#3 · opened by senseable

Just discovered this model, and I agree its writing and reasoning depth seem greatly improved.
Are you going to submit it to the Hugging Face Open LLM Leaderboard? I'm interested in seeing its benchmarks.
Nice work!

@senseable

I just tried comparing another 7B model (not this one) with its extended version (built using the same config) on the Open LLM Leaderboard results. Here is what I get:


| Metric | Diff | Extended (10.7B) | Original (7B) |
|---|---:|---:|---:|
| Avg. | -3.76 | 69.75 | 73.51 |
| AI2 Reasoning Challenge (25-shot) | -3.07 | 68.09 | 71.16 |
| HellaSwag (10-shot) | -0.66 | 87.10 | 87.76 |
| MMLU (5-shot) | -0.34 | 64.43 | 64.77 |
| TruthfulQA (0-shot) | -0.97 | 64.28 | 65.25 |
| Winogrande (5-shot) | -0.31 | 82.72 | 83.03 |
| GSM8k (5-shot) | -17.21 | 51.86 | 69.07 |

But the effect in chat seems good and stable. Thanks for this great config!
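For anyone who wants to sanity-check numbers like these locally instead of waiting in the leaderboard queue, here is a minimal sketch using EleutherAI's lm-evaluation-harness Python API. The model name, dtype, batch size, and single task are illustrative assumptions; the hosted leaderboard pins its own harness version and prompt settings, so local scores will not match it exactly.

```python
# Minimal local eval sketch with lm-evaluation-harness (lm-eval >= 0.4).
# Assumptions: model id, dtype, batch size, and task choice are placeholders,
# not the leaderboard's exact configuration.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=senseable/WestLake-7B-v2,dtype=bfloat16",
    tasks=["arc_challenge"],  # ARC is run 25-shot on the leaderboard
    num_fewshot=25,
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```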

@seyf1elislam Interesting, thanks for sharing. Nothing a little fine-tuning couldn't fix, potentially with a higher ceiling on evals like MMLU.

@senseable Exactly: we have the potential to build some amazing larger models with the great Mistral-7B as a base. Your fine-tune is the perfect starting point. I think the process should go fine-tune > self-merge > fine-tune > self-merge > fine-tune, and so on.

After each self-merge, reapplying the original fine-tune should help realign the layers and remove the errors introduced by the self-merge. It should also yield a new model that can be further self-merged. If you would like to try reapplying your WestLake fine-tune to this 10.7B self-merge, I would love to see how far we can push it. I expect the next good self-merge could yield a 16-20B model, and maybe it is possible to push it all the way to 34B.
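For reference, here is a hedged do-it-by-hand sketch of what a passthrough self-merge does: duplicate a range of decoder layers of a Mistral-7B-based model to get a deeper model. The slice ranges, output path, and model id below are illustrative assumptions, not the actual recipe used for WestLake-10.7B-v2 (which was built with a merge config), so treat it as a conceptual demo only.

```python
# Conceptual "passthrough" self-merge: stack two overlapping layer ranges of a
# 7B model to build a deeper model (SOLAR-style depth up-scaling).
# Assumptions: slice ranges (0-24, 8-32), output directory name, and the use of
# the transformers API instead of the original merge tooling.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "senseable/WestLake-7B-v2"
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Keep layers 0-23, then repeat layers 8-31, for 48 decoder layers in total.
slices = [(0, 24), (8, 32)]

new_layers = torch.nn.ModuleList()
for start, end in slices:
    for idx in range(start, end):
        layer = copy.deepcopy(model.model.layers[idx])
        # Re-index the attention module so KV-cache bookkeeping stays consistent.
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = len(new_layers)
        new_layers.append(layer)

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)

out_dir = "WestLake-10.7B-selfmerge"  # hypothetical output name
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(BASE).save_pretrained(out_dir)
```

The duplicated middle layers are what the follow-up fine-tune would need to realign, since they now see activations from a different depth than they were trained at.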

Here is the HF Open LLM Leaderboard comparison:


| Metric | Diff | WestLake-10.7B-v2 | WestLake-7B-v2 |
|---|---:|---:|---:|
| Avg. | -5.14 | 70.28 | 75.42 |
| AI2 Reasoning Challenge (25-shot) | -1.88 | 71.16 | 73.04 |
| HellaSwag (10-shot) | -0.72 | 87.93 | 88.65 |
| MMLU (5-shot) | -0.90 | 63.81 | 64.71 |
| TruthfulQA (0-shot) | -2.15 | 64.91 | 67.06 |
| Winogrande (5-shot) | -1.58 | 85.40 | 86.98 |
| GSM8k (5-shot) | -19.18 | 48.45 | 67.63 |
