Another trial of merging models with different sizes. It is still under testing and should be more stable, but I have no idea whether it improves or degrades the base model.
In this one I changed something to get more WestLake into the merge. Recipe:
merge_method: task_anysize
base_model: princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
models:
  - model: senseable/WestLake-7B-v2
    parameters:
      weight: 1.0
dtype: bfloat16
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 36.18 |
AI2 Reasoning Challenge (25-Shot) | 36.18 |
HellaSwag (10-Shot) | 57.54 |
MMLU (5-Shot) | 24.20 |
TruthfulQA (0-shot) | 42.39 |
Winogrande (5-shot) | 56.75 |
GSM8k (5-shot) | 0.00 |
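
The Avg. row can be checked against the six benchmark scores. The sketch below assumes Avg. is the plain unweighted mean of the rows in the table (the card itself does not say how it is computed); the variable names are illustrative only.

```python
# Benchmark scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 36.18,
    "HellaSwag (10-Shot)": 57.54,
    "MMLU (5-Shot)": 24.20,
    "TruthfulQA (0-shot)": 42.39,
    "Winogrande (5-shot)": 56.75,
    "GSM8k (5-shot)": 0.00,
}

# Assumption: Avg. is the unweighted mean of the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Avg. {average:.2f}")
```

Under that assumption the mean rounds to the reported 36.18, so the match between the Avg. and the AI2 Reasoning Challenge score looks like a coincidence rather than a copy error.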