noneUsername/Mistral-Small-24B-Instruct-2501-abliterated-W8A8-better

vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-abliterated,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.912	±	0.0180
		strict-match	5	exact_match	↑	0.908	±	0.0183

vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-abliterated,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.898	±	0.0135
		strict-match	5	exact_match	↑	0.894	±	0.0138
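
The Stderr column can be sanity-checked against the sample limit. The sketch below assumes the reported value is the standard error of the mean of per-question 0/1 scores, computed with the sample (n - 1) variance; the helper name `binomial_stderr` is illustrative and not part of any library. Under that assumption, the first gsm8k table is consistent with 250 questions and the second with 500.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the (n - 1) sample variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# flexible-extract accuracies copied from the two gsm8k tables above
print(round(binomial_stderr(0.912, 250), 4))  # compare with the reported 0.0180
print(round(binomial_stderr(0.898, 500), 4))  # compare with the reported 0.0135
```

Doubling the limit from 250 to 500 shrinks the standard error by roughly a factor of sqrt(2), which is why the 500-question runs are the tighter estimates.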

vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-abliterated,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.8000	±	0.0130
- humanities	2	none	acc	↑	0.8410	±	0.0260
- other	2	none	acc	↑	0.8154	±	0.0264
- social sciences	2	none	acc	↑	0.8500	±	0.0251
- stem	2	none	acc	↑	0.7298	±	0.0248
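
The configuration lines map fairly directly onto an lm-evaluation-harness command line. The following is a minimal sketch, assuming the `lm_eval` CLI with the vLLM backend is installed; the helper `lm_eval_command` is hypothetical and only assembles the command string, it does not run the evaluation. The fixed `tensor_parallel_size=2` and `dtype=bfloat16` values are copied from the configuration lines in this card.

```python
import shlex


def lm_eval_command(pretrained, task, max_model_len, limit, batch_size, num_fewshot=None):
    """Assemble an lm_eval command mirroring the configuration lines in this card."""
    model_args = ",".join([
        f"pretrained={pretrained}",
        "add_bos_token=true",
        f"max_model_len={max_model_len}",
        "tensor_parallel_size=2",
        "dtype=bfloat16",
    ])
    argv = [
        "lm_eval", "--model", "vllm",
        "--model_args", model_args,
        "--tasks", task,
        "--limit", str(limit),
        "--batch_size", str(batch_size),
    ]
    # The mmlu runs report num_fewshot: None, so the flag is omitted in that case.
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    return shlex.join(argv)


base = "/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-abliterated"
print(lm_eval_command(base, "gsm8k", 2048, 250, "auto", num_fewshot=5))
print(lm_eval_command(base, "mmlu", 700, 15, 1))
```

Swapping `pretrained` for one of the other checkpoint paths reproduces the remaining configurations.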

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.896	±	0.0193
		strict-match	5	exact_match	↑	0.892	±	0.0197

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.900	±	0.0134
		strict-match	5	exact_match	↑	0.894	±	0.0138

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7942	±	0.0130
- humanities	2	none	acc	↑	0.8256	±	0.0264
- other	2	none	acc	↑	0.8154	±	0.0269
- social sciences	2	none	acc	↑	0.8500	±	0.0252
- stem	2	none	acc	↑	0.7228	±	0.0245

vllm (pretrained=/root/autodl-tmp/86-2048,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.912	±	0.0180
		strict-match	5	exact_match	↑	0.912	±	0.0180

vllm (pretrained=/root/autodl-tmp/86-2048,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.914	±	0.0126
		strict-match	5	exact_match	↑	0.908	±	0.0129

vllm (pretrained=/root/autodl-tmp/86-2048,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.8000	±	0.0129
- humanities	2	none	acc	↑	0.8308	±	0.0263
- other	2	none	acc	↑	0.8103	±	0.0258
- social sciences	2	none	acc	↑	0.8500	±	0.0253
- stem	2	none	acc	↑	0.7404	±	0.0248

vllm (pretrained=/root/autodl-tmp/output-876-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.908	±	0.0183
		strict-match	5	exact_match	↑	0.900	±	0.0190

vllm (pretrained=/root/autodl-tmp/output-876-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.908	±	0.0129
		strict-match	5	exact_match	↑	0.902	±	0.0133

vllm (pretrained=/root/autodl-tmp/output-876-512,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.8035	±	0.0128
- humanities	2	none	acc	↑	0.8410	±	0.0255
- other	2	none	acc	↑	0.8205	±	0.0259
- social sciences	2	none	acc	↑	0.8556	±	0.0250
- stem	2	none	acc	↑	0.7333	±	0.0248

vllm (pretrained=/root/autodl-tmp/output-876-2048,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.904	±	0.0187
		strict-match	5	exact_match	↑	0.900	±	0.0190

vllm (pretrained=/root/autodl-tmp/output-876-2048,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.898	±	0.0135
		strict-match	5	exact_match	↑	0.892	±	0.0139

vllm (pretrained=/root/autodl-tmp/output-876-2048,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7977	±	0.0130
- humanities	2	none	acc	↑	0.8256	±	0.0261
- other	2	none	acc	↑	0.8154	±	0.0266
- social sciences	2	none	acc	↑	0.8556	±	0.0248
- stem	2	none	acc	↑	0.7298	±	0.0248

vllm (pretrained=/root/autodl-tmp/output-89-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.900	±	0.0190
		strict-match	5	exact_match	↑	0.900	±	0.0190

vllm (pretrained=/root/autodl-tmp/output-89-512,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.902	±	0.0133
		strict-match	5	exact_match	↑	0.898	±	0.0135

vllm (pretrained=/root/autodl-tmp/output-89-512,add_bos_token=true,max_model_len=700,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7988	±	0.0129
- humanities	2	none	acc	↑	0.8256	±	0.0261
- other	2	none	acc	↑	0.8154	±	0.0269
- social sciences	2	none	acc	↑	0.8500	±	0.0255
- stem	2	none	acc	↑	0.7368	±	0.0243
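
To judge whether the checkpoints differ meaningfully, each aggregate mmlu score can be compared with the first checkpoint listed (used here as the reference, which is an assumption about its role). The sketch below computes a rough z-score from the reported accuracies and standard errors. It treats the runs as independent, which is a simplification because every checkpoint was scored on the same questions, so read the output as a coarse check and not a formal test.

```python
import math

# (accuracy, stderr) of the aggregate mmlu row for each checkpoint, copied from the tables above
mmlu = {
    "Mistral-Small-24B-Instruct-2501-abliterated": (0.8000, 0.0130),
    "85-512": (0.7942, 0.0130),
    "86-2048": (0.8000, 0.0129),
    "output-876-512": (0.8035, 0.0128),
    "output-876-2048": (0.7977, 0.0130),
    "output-89-512": (0.7988, 0.0129),
}


def z_score(reference, candidate):
    """Accuracy difference divided by the combined standard error of the two runs."""
    (acc_ref, se_ref), (acc_cand, se_cand) = reference, candidate
    return (acc_cand - acc_ref) / math.hypot(se_ref, se_cand)


reference = mmlu["Mistral-Small-24B-Instruct-2501-abliterated"]
for name, result in mmlu.items():
    delta = result[0] - reference[0]
    print(f"{name}: delta={delta:+.4f}, z={z_score(reference, result):+.2f}")
```

Under this assumption every |z| stays well below 1, so the mmlu differences between checkpoints are within the noise of a 15-questions-per-subtask run.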
