This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student. It slightly improves on the base model's performance on the tasks below (WIP). I suspect I can do much better than this and will try again.
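For context, here is a minimal sketch of what plain logit distillation between these two checkpoints could look like, assuming standard KL divergence on temperature-softened teacher and student logits (the SmolLM2 models share a tokenizer, so the vocabulary dimensions line up). The temperature `T`, mixing weight `alpha`, and the toy batch are illustrative assumptions, not the recipe actually used for this experiment.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Teacher/student pair from this experiment; both use the same SmolLM2
# tokenizer, so their logits share a vocabulary dimension.
teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
teacher.eval()

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary next-token cross-entropy against the data.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1)
    )
    return alpha * kd + (1.0 - alpha) * ce

# One illustrative step on a toy batch.
batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
labels = batch["input_ids"][:, 1:]            # next-token targets
with torch.no_grad():
    teacher_logits = teacher(**batch).logits[:, :-1]
student_logits = student(**batch).logits[:, :-1]
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```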
| Task | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---|---|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
## Eval Results: aloobun/d-SmolLM2-360M (WIP)

Todo:

- ifeval (0-shot, generative)
- Math-lvl-5 (4-shot, generative, Minerva version)
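The tables below are lm-evaluation-harness output. A hedged sketch of reproducing them with the harness's Python API (assuming a recent lm-eval release; the dtype and batch size are illustrative choices, and the few-shot counts, e.g. 3-shot BBH and 5-shot MMLU-Pro, come from the leaderboard task configs themselves):

```python
import lm_eval

# The leaderboard_* task names match the groups reported in the tables below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=[
        "leaderboard_gpqa",
        "leaderboard_musr",
        "leaderboard_bbh",
        "leaderboard_mmlu_pro",
        "leaderboard_ifeval",
    ],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```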
### GPQA
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2071 | ± 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2308 | ± 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2679 | ± 0.0209 |
### MUSR

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.2383 | ± 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.4400 | ± 0.0315 |
### BBH

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm ↑ | 0.4652 | ± 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm ↑ | 0.1560 | ± 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm ↑ | 0.3120 | ± 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm ↑ | 0.2040 | ± 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm ↑ | 0.5000 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm ↑ | 0.2240 | ± 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3320 | ± 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm ↑ | 0.2440 | ± 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm ↑ | 0.5800 | ± 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm ↑ | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm ↑ | 0.2123 | ± 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm ↑ | 0.1320 | ± 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm ↑ | 0.2480 | ± 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm ↑ | 0.2120 | ± 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm ↑ | 0.5281 | ± 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm ↑ | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm ↑ | 0.2800 | ± 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm ↑ | 0.1720 | ± 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3000 | ± 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |
### MMLU_PRO

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.1173 | ± 0.0029 |
### IFEVAL

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc ↑ | 0.2866 | N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.2770 | N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.1497 | ± 0.0154 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.1423 | ± 0.0150 |
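A minimal usage sketch for the distilled checkpoint, using standard transformers generation (the prompt and sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aloobun/d-SmolLM2-360M")
model = AutoModelForCausalLM.from_pretrained(
    "aloobun/d-SmolLM2-360M", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Gravity is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```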