still in training. Trained on about ~21 billion tokens so far.

Tasks Version Filter n-shot Metric Value Stderr
Open LLM Leaderboard N/A
- arc_challenge 1 none 25 acc 0.2005 ± 0.0117
none 25 acc_norm 0.2406 ± 0.0125
- gsm8k 3 flexible-extract 5 exact_match 0.0083 ± 0.0025
strict-match 5 exact_match 0.0000 ± 0.0000
- hellaswag 1 none 10 acc 0.2724 ± 0.0044
none 10 acc_norm 0.2838 ± 0.0045
- mmlu 2 none acc 0.2290 ± 0.0035
- humanities 2 none acc 0.2380 ± 0.0062
- formal_logic 1 none 5 acc 0.2460 ± 0.0385
- high_school_european_history 1 none 5 acc 0.1818 ± 0.0301
- high_school_us_history 1 none 5 acc 0.2647 ± 0.0310
- high_school_world_history 1 none 5 acc 0.2911 ± 0.0296
- international_law 1 none 5 acc 0.2149 ± 0.0375
- jurisprudence 1 none 5 acc 0.2685 ± 0.0428
- logical_fallacies 1 none 5 acc 0.2209 ± 0.0326
- moral_disputes 1 none 5 acc 0.2457 ± 0.0232
- moral_scenarios 1 none 5 acc 0.2369 ± 0.0142
- philosophy 1 none 5 acc 0.1865 ± 0.0221
- prehistory 1 none 5 acc 0.1975 ± 0.0222
- professional_law 1 none 5 acc 0.2432 ± 0.0110
- world_religions 1 none 5 acc 0.3099 ± 0.0355
- other 2 none acc 0.2375 ± 0.0076
- business_ethics 1 none 5 acc 0.3200 ± 0.0469
- clinical_knowledge 1 none 5 acc 0.2226 ± 0.0256
- college_medicine 1 none 5 acc 0.1965 ± 0.0303
- global_facts 1 none 5 acc 0.1800 ± 0.0386
- human_aging 1 none 5 acc 0.3004 ± 0.0308
- management 1 none 5 acc 0.1942 ± 0.0392
- marketing 1 none 5 acc 0.2735 ± 0.0292
- medical_genetics 1 none 5 acc 0.3000 ± 0.0461
- miscellaneous 1 none 5 acc 0.2478 ± 0.0154
- nutrition 1 none 5 acc 0.2222 ± 0.0238
- professional_accounting 1 none 5 acc 0.2021 ± 0.0240
- professional_medicine 1 none 5 acc 0.1912 ± 0.0239
- virology 1 none 5 acc 0.2590 ± 0.0341
- social sciences 2 none acc 0.2203 ± 0.0075
- econometrics 1 none 5 acc 0.2368 ± 0.0400
- high_school_geography 1 none 5 acc 0.2020 ± 0.0286
- high_school_government_and_politics 1 none 5 acc 0.1865 ± 0.0281
- high_school_macroeconomics 1 none 5 acc 0.2205 ± 0.0210
- high_school_microeconomics 1 none 5 acc 0.2143 ± 0.0267
- high_school_psychology 1 none 5 acc 0.1908 ± 0.0168
- human_sexuality 1 none 5 acc 0.2672 ± 0.0388
- professional_psychology 1 none 5 acc 0.2386 ± 0.0172
- public_relations 1 none 5 acc 0.1727 ± 0.0362
- security_studies 1 none 5 acc 0.2367 ± 0.0272
- sociology 1 none 5 acc 0.2488 ± 0.0306
- us_foreign_policy 1 none 5 acc 0.2600 ± 0.0441
- stem 2 none acc 0.2157 ± 0.0073
- abstract_algebra 1 none 5 acc 0.2200 ± 0.0416
- anatomy 1 none 5 acc 0.1778 ± 0.0330
- astronomy 1 none 5 acc 0.1908 ± 0.0320
- college_biology 1 none 5 acc 0.2778 ± 0.0375
- college_chemistry 1 none 5 acc 0.2200 ± 0.0416
- college_computer_science 1 none 5 acc 0.2100 ± 0.0409
- college_mathematics 1 none 5 acc 0.2100 ± 0.0409
- college_physics 1 none 5 acc 0.2157 ± 0.0409
- computer_security 1 none 5 acc 0.2700 ± 0.0446
- conceptual_physics 1 none 5 acc 0.2638 ± 0.0288
- electrical_engineering 1 none 5 acc 0.2483 ± 0.0360
- elementary_mathematics 1 none 5 acc 0.2037 ± 0.0207
- high_school_biology 1 none 5 acc 0.1774 ± 0.0217
- high_school_chemistry 1 none 5 acc 0.2020 ± 0.0282
- high_school_computer_science 1 none 5 acc 0.2500 ± 0.0435
- high_school_mathematics 1 none 5 acc 0.2148 ± 0.0250
- high_school_physics 1 none 5 acc 0.2053 ± 0.0330
- high_school_statistics 1 none 5 acc 0.1481 ± 0.0242
- machine_learning 1 none 5 acc 0.3125 ± 0.0440
- truthfulqa_gen 3 none 0 bleu_acc 0.2362 ± 0.0149
none 0 bleu_diff -1.0138 ± 0.2569
none 0 bleu_max 7.9522 ± 0.4088
none 0 rouge1_acc 0.2595 ± 0.0153
none 0 rouge1_diff -1.9129 ± 0.4349
none 0 rouge1_max 21.7885 ± 0.7307
none 0 rouge2_acc 0.1200 ± 0.0114
none 0 rouge2_diff -1.9771 ± 0.3475
none 0 rouge2_max 9.0199 ± 0.5842
none 0 rougeL_acc 0.2570 ± 0.0153
none 0 rougeL_diff -1.8812 ± 0.4185
none 0 rougeL_max 19.6284 ± 0.6850
- truthfulqa_mc1 2 none 0 acc 0.1983 ± 0.0140
- truthfulqa_mc2 2 none 0 acc 0.3861 ± 0.0147
- winogrande 1 none 5 acc 0.4972 ± 0.0141
Groups Version Filter n-shot Metric Value Stderr
- mmlu 2 none acc 0.2290 ± 0.0035
- humanities 2 none acc 0.2380 ± 0.0062
- other 2 none acc 0.2375 ± 0.0076
- social sciences 2 none acc 0.2203 ± 0.0075
- stem 2 none acc 0.2157 ± 0.0073
Tasks Version Filter n-shot Metric Value Stderr
agieval_nous 0 none acc_norm 0.2133 ± 0.0081
- agieval_aqua_rat 1 none 0 acc 0.2047 ± 0.0254
none 0 acc_norm 0.1969 ± 0.0250
- agieval_logiqa_en 1 none 0 acc 0.2043 ± 0.0158
none 0 acc_norm 0.2304 ± 0.0165
- agieval_lsat_ar 1 none 0 acc 0.1739 ± 0.0250
none 0 acc_norm 0.1957 ± 0.0262
- agieval_lsat_lr 1 none 0 acc 0.1549 ± 0.0160
none 0 acc_norm 0.1608 ± 0.0163
- agieval_lsat_rc 1 none 0 acc 0.1636 ± 0.0226
none 0 acc_norm 0.2119 ± 0.0250
- agieval_sat_en 1 none 0 acc 0.2670 ± 0.0309
none 0 acc_norm 0.2621 ± 0.0307
- agieval_sat_en_without_passage 1 none 0 acc 0.2670 ± 0.0309
none 0 acc_norm 0.2621 ± 0.0307
- agieval_sat_math 1 none 0 acc 0.2182 ± 0.0279
none 0 acc_norm 0.2318 ± 0.0285
arc_challenge 1 none 0 acc 0.1945 ± 0.0116
none 0 acc_norm 0.2372 ± 0.0124
truthfulqa_mc2 2 none 0 acc 0.3861 ± 0.0147
Groups Version Filter n-shot Metric Value Stderr
agieval_nous 0 none acc_norm 0.2133 ± 0.0081

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 4.13
IFEval (0-Shot) 16.39
BBH (3-Shot) 1.78
MATH Lvl 5 (4-Shot) 0.00
GPQA (0-shot) 0.00
MuSR (0-shot) 5.15
MMLU-PRO (5-shot) 1.47
Downloads last month
243
Safetensors
Model size
248M params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train M4-ai/TinyMistral-248M-v3

Space using M4-ai/TinyMistral-248M-v3 1

Evaluation results