	---
	license: agpl-3.0
	tags:
	- merge
	- wizardlm
	- mique
	pipeline_tag: text-generation
	model-index:
	- name: WizardLM-Math-70B-v0.1
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 67.06
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 86.01
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 69.14
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 57.07
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 81.77
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 64.44
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/WizardLM-Math-70B-v0.1
	      name: Open LLM Leaderboard
	---



	## Examples



	There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room? Explain your reasoning step by step.

	```
	Step 1: Identify the initial situation
	There are three killers in a room, which we can label as A, B, and C.

	Step 2: Describe the event
	Someone (let's call this person D) enters the room and kills one of the killers. For the sake of simplicity, let's say that Killer A is killed.

	Step 3: Analyze the new situation
	After the event, there are now two killers left in the room: B and C. The person who entered the room, D, is also still present. However, since D did not leave the room after killing A, it means that D is also a killer. Therefore, there are actually three killers in the room: B, C, and D.

	Final Answer: After the event, there are three killers left in the room: B, C, and D.
	```

	Mark is faster than Mary, Mary is faster than Joe. Is Joe faster than Mark? Let's think step by step.

	```
	Step 1: Understand the given information
	We know that Mark is faster than Mary, and Mary is faster than Joe. This means that Mark is the fastest among the three, followed by Mary, and then Joe.

	Step 2: Analyze the question
	The question asks if Joe is faster than Mark. Based on the given information, we can conclude that Joe is not faster than Mark. In fact, Mark is faster than Joe.

	Final Answer: No, Joe is not faster than Mark.
	```

	If I lay 2 wet shirts out in the sun flat on the ground to dry and it takes 5 hours until they are dry, how long would it take to dry 4 shirts of the same type that way simultaneously?

	```
	The time taken for a shirt to dry depends on several factors such as the material of the shirt, the humidity, temperature, wind speed, etc. However, if we assume that all four shirts have similar properties and the weather conditions remain constant, then you can expect the drying time to be proportional to the number of shirts.

	In other words, if two shirts take 5 hours to dry, then four shirts will likely take twice as long, or 10 hours. This is because there is more fabric to dry, which means it will absorb more water from the air, thus taking longer to reach equilibrium with the surrounding environment.

	However, this is just an estimate, and the actual drying time may vary depending on the specific circumstances.
	```

	Wrong: the shirts dry in parallel, so 4 shirts take the same 5 hours as 2!
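
The correction above rests on the shirts drying side by side rather than one after another. A minimal sketch of that reasoning, assuming every shirt gets the same sun and enough space to dry independently (the function name `parallel_drying_hours` is hypothetical and only used for illustration):

```python
def parallel_drying_hours(per_shirt_hours):
    """Shirts laid out at the same time dry concurrently,
    so the total time is that of the slowest shirt, not the sum."""
    return max(per_shirt_hours)


# Assumption: each shirt needs 5 hours, as stated in the prompt.
two_shirts = parallel_drying_hours([5, 5])
four_shirts = parallel_drying_hours([5, 5, 5, 5])

print(two_shirts, four_shirts)
```

Under these assumptions both calls return the same value, which is why the model's "twice as long" answer is marked as wrong.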


	explain step by step 25-4*2+3=?
	```
	Step 1: Perform the multiplication operation first.
	25 - 4 * 2 + 3 = 25 - 8 + 3

	Step 2: Now, perform the subtraction and addition operations from left to right.
	25 - 8 + 3 = 17 + 3

	Step 3: Finally, perform the remaining addition operation.
	17 + 3 = 20

	So, 25-4*2+3 equals 20.
	```
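
The steps in the answer above can be checked directly. This is a small sketch that mirrors the three steps; the variable names are illustrative only:

```python
# Step 1: multiplication binds tighter than addition and subtraction.
product = 4 * 2            # 8

# Step 2: subtraction and addition are then applied left to right.
after_subtraction = 25 - product   # 17

# Step 3: the remaining addition.
result = after_subtraction + 3     # 20

# Python applies the same precedence rules to the full expression.
assert result == 25 - 4 * 2 + 3
print(result)
```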

	## Eval


	### WizardLM/WizardLM-70B-V1.0 vs. MaziyarPanahi/WizardLM-Math-70B-v0.1

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/5fd5e18a90b6dc4633f6d292/7NWjisnm1gfZN2cMg4c5X.png)


	### Leaderboard

	```python
	{
	"all": {
	"acc": 0.6914116069568377,
	"acc_stderr": 0.03063431437342948,
	"acc_norm": 0.6938613221179539,
	"acc_norm_stderr": 0.031238741076549784,
	"mc1": 0.40269277845777235,
	"mc1_stderr": 0.01716883093518722,
	"mc2": 0.5707095526544473,
	"mc2_stderr": 0.01525040450448649
	},
	"harness|arc:challenge|25": {
	"acc": 0.6322525597269625,
	"acc_stderr": 0.014090995618168482,
	"acc_norm": 0.6706484641638225,
	"acc_norm_stderr": 0.013734057652635474
	},
	"harness|hellaswag|10": {
	"acc": 0.6746664011153157,
	"acc_stderr": 0.0046754187743142306,
	"acc_norm": 0.8600876319458275,
	"acc_norm_stderr": 0.0034618713240671846
	},
	"harness|hendrycksTest-abstract_algebra|5": {
	"acc": 0.34,
	"acc_stderr": 0.04760952285695236,
	"acc_norm": 0.34,
	"acc_norm_stderr": 0.04760952285695236
	},
	"harness|hendrycksTest-anatomy|5": {
	"acc": 0.6518518518518519,
	"acc_stderr": 0.041153246103369526,
	"acc_norm": 0.6518518518518519,
	"acc_norm_stderr": 0.041153246103369526
	},
	"harness|hendrycksTest-astronomy|5": {
	"acc": 0.7894736842105263,
	"acc_stderr": 0.03317672787533157,
	"acc_norm": 0.7894736842105263,
	"acc_norm_stderr": 0.03317672787533157
	},
	"harness|hendrycksTest-business_ethics|5": {
	"acc": 0.73,
	"acc_stderr": 0.04461960433384741,
	"acc_norm": 0.73,
	"acc_norm_stderr": 0.04461960433384741
	},
	"harness|hendrycksTest-clinical_knowledge|5": {
	"acc": 0.7283018867924528,
	"acc_stderr": 0.027377706624670713,
	"acc_norm": 0.7283018867924528,
	"acc_norm_stderr": 0.027377706624670713
	},
	"harness|hendrycksTest-college_biology|5": {
	"acc": 0.8194444444444444,
	"acc_stderr": 0.032166008088022675,
	"acc_norm": 0.8194444444444444,
	"acc_norm_stderr": 0.032166008088022675
	},
	"harness|hendrycksTest-college_chemistry|5": {
	"acc": 0.5,
	"acc_stderr": 0.050251890762960605,
	"acc_norm": 0.5,
	"acc_norm_stderr": 0.050251890762960605
	},
	"harness|hendrycksTest-college_computer_science|5": {
	"acc": 0.57,
	"acc_stderr": 0.049756985195624284,
	"acc_norm": 0.57,
	"acc_norm_stderr": 0.049756985195624284
	},
	"harness|hendrycksTest-college_mathematics|5": {
	"acc": 0.37,
	"acc_stderr": 0.04852365870939099,
	"acc_norm": 0.37,
	"acc_norm_stderr": 0.04852365870939099
	},
	"harness|hendrycksTest-college_medicine|5": {
	"acc": 0.6878612716763006,
	"acc_stderr": 0.035331333893236574,
	"acc_norm": 0.6878612716763006,
	"acc_norm_stderr": 0.035331333893236574
	},
	"harness|hendrycksTest-college_physics|5": {
	"acc": 0.35294117647058826,
	"acc_stderr": 0.047551296160629475,
	"acc_norm": 0.35294117647058826,
	"acc_norm_stderr": 0.047551296160629475
	},
	"harness|hendrycksTest-computer_security|5": {
	"acc": 0.7,
	"acc_stderr": 0.046056618647183814,
	"acc_norm": 0.7,
	"acc_norm_stderr": 0.046056618647183814
	},
	"harness|hendrycksTest-conceptual_physics|5": {
	"acc": 0.676595744680851,
	"acc_stderr": 0.030579442773610337,
	"acc_norm": 0.676595744680851,
	"acc_norm_stderr": 0.030579442773610337
	},
	"harness|hendrycksTest-econometrics|5": {
	"acc": 0.40350877192982454,
	"acc_stderr": 0.046151869625837026,
	"acc_norm": 0.40350877192982454,
	"acc_norm_stderr": 0.046151869625837026
	},
	"harness|hendrycksTest-electrical_engineering|5": {
	"acc": 0.5793103448275863,
	"acc_stderr": 0.04113914981189261,
	"acc_norm": 0.5793103448275863,
	"acc_norm_stderr": 0.04113914981189261
	},
	"harness|hendrycksTest-elementary_mathematics|5": {
	"acc": 0.4497354497354497,
	"acc_stderr": 0.02562085704293665,
	"acc_norm": 0.4497354497354497,
	"acc_norm_stderr": 0.02562085704293665
	},
	"harness|hendrycksTest-formal_logic|5": {
	"acc": 0.46825396825396826,
	"acc_stderr": 0.04463112720677172,
	"acc_norm": 0.46825396825396826,
	"acc_norm_stderr": 0.04463112720677172
	},
	"harness|hendrycksTest-global_facts|5": {
	"acc": 0.46,
	"acc_stderr": 0.05009082659620332,
	"acc_norm": 0.46,
	"acc_norm_stderr": 0.05009082659620332
	},
	"harness|hendrycksTest-high_school_biology|5": {
	"acc": 0.8129032258064516,
	"acc_stderr": 0.022185710092252252,
	"acc_norm": 0.8129032258064516,
	"acc_norm_stderr": 0.022185710092252252
	},
	"harness|hendrycksTest-high_school_chemistry|5": {
	"acc": 0.5369458128078818,
	"acc_stderr": 0.035083705204426656,
	"acc_norm": 0.5369458128078818,
	"acc_norm_stderr": 0.035083705204426656
	},
	"harness|hendrycksTest-high_school_computer_science|5": {
	"acc": 0.79,
	"acc_stderr": 0.040936018074033256,
	"acc_norm": 0.79,
	"acc_norm_stderr": 0.040936018074033256
	},
	"harness|hendrycksTest-high_school_european_history|5": {
	"acc": 0.8363636363636363,
	"acc_stderr": 0.02888787239548795,
	"acc_norm": 0.8363636363636363,
	"acc_norm_stderr": 0.02888787239548795
	},
	"harness|hendrycksTest-high_school_geography|5": {
	"acc": 0.8686868686868687,
	"acc_stderr": 0.024063156416822502,
	"acc_norm": 0.8686868686868687,
	"acc_norm_stderr": 0.024063156416822502
	},
	"harness|hendrycksTest-high_school_government_and_politics|5": {
	"acc": 0.927461139896373,
	"acc_stderr": 0.018718998520678178,
	"acc_norm": 0.927461139896373,
	"acc_norm_stderr": 0.018718998520678178
	},
	"harness|hendrycksTest-high_school_macroeconomics|5": {
	"acc": 0.7025641025641025,
	"acc_stderr": 0.023177408131465953,
	"acc_norm": 0.7025641025641025,
	"acc_norm_stderr": 0.023177408131465953
	},
	"harness|hendrycksTest-high_school_mathematics|5": {
	"acc": 0.34814814814814815,
	"acc_stderr": 0.02904560029061626,
	"acc_norm": 0.34814814814814815,
	"acc_norm_stderr": 0.02904560029061626
	},
	"harness|hendrycksTest-high_school_microeconomics|5": {
	"acc": 0.7941176470588235,
	"acc_stderr": 0.02626502460827588,
	"acc_norm": 0.7941176470588235,
	"acc_norm_stderr": 0.02626502460827588
	},
	"harness|hendrycksTest-high_school_physics|5": {
	"acc": 0.4503311258278146,
	"acc_stderr": 0.04062290018683776,
	"acc_norm": 0.4503311258278146,
	"acc_norm_stderr": 0.04062290018683776
	},
	"harness|hendrycksTest-high_school_psychology|5": {
	"acc": 0.8954128440366973,
	"acc_stderr": 0.013120530245265593,
	"acc_norm": 0.8954128440366973,
	"acc_norm_stderr": 0.013120530245265593
	},
	"harness|hendrycksTest-high_school_statistics|5": {
	"acc": 0.5787037037037037,
	"acc_stderr": 0.03367462138896078,
	"acc_norm": 0.5787037037037037,
	"acc_norm_stderr": 0.03367462138896078
	},
	"harness|hendrycksTest-high_school_us_history|5": {
	"acc": 0.9166666666666666,
	"acc_stderr": 0.019398452135813905,
	"acc_norm": 0.9166666666666666,
	"acc_norm_stderr": 0.019398452135813905
	},
	"harness|hendrycksTest-high_school_world_history|5": {
	"acc": 0.8860759493670886,
	"acc_stderr": 0.020681745135884565,
	"acc_norm": 0.8860759493670886,
	"acc_norm_stderr": 0.020681745135884565
	},
	"harness|hendrycksTest-human_aging|5": {
	"acc": 0.757847533632287,
	"acc_stderr": 0.028751392398694755,
	"acc_norm": 0.757847533632287,
	"acc_norm_stderr": 0.028751392398694755
	},
	"harness|hendrycksTest-human_sexuality|5": {
	"acc": 0.8702290076335878,
	"acc_stderr": 0.029473649496907065,
	"acc_norm": 0.8702290076335878,
	"acc_norm_stderr": 0.029473649496907065
	},
	"harness|hendrycksTest-international_law|5": {
	"acc": 0.8181818181818182,
	"acc_stderr": 0.03520893951097655,
	"acc_norm": 0.8181818181818182,
	"acc_norm_stderr": 0.03520893951097655
	},
	"harness|hendrycksTest-jurisprudence|5": {
	"acc": 0.8148148148148148,
	"acc_stderr": 0.03755265865037181,
	"acc_norm": 0.8148148148148148,
	"acc_norm_stderr": 0.03755265865037181
	},
	"harness|hendrycksTest-logical_fallacies|5": {
	"acc": 0.7791411042944786,
	"acc_stderr": 0.03259177392742179,
	"acc_norm": 0.7791411042944786,
	"acc_norm_stderr": 0.03259177392742179
	},
	"harness|hendrycksTest-machine_learning|5": {
	"acc": 0.48214285714285715,
	"acc_stderr": 0.047427623612430116,
	"acc_norm": 0.48214285714285715,
	"acc_norm_stderr": 0.047427623612430116
	},
	"harness|hendrycksTest-management|5": {
	"acc": 0.8446601941747572,
	"acc_stderr": 0.03586594738573974,
	"acc_norm": 0.8446601941747572,
	"acc_norm_stderr": 0.03586594738573974
	},
	"harness|hendrycksTest-marketing|5": {
	"acc": 0.905982905982906,
	"acc_stderr": 0.019119892798924974,
	"acc_norm": 0.905982905982906,
	"acc_norm_stderr": 0.019119892798924974
	},
	"harness|hendrycksTest-medical_genetics|5": {
	"acc": 0.67,
	"acc_stderr": 0.047258156262526066,
	"acc_norm": 0.67,
	"acc_norm_stderr": 0.047258156262526066
	},
	"harness|hendrycksTest-miscellaneous|5": {
	"acc": 0.8697318007662835,
	"acc_stderr": 0.012036729568216054,
	"acc_norm": 0.8697318007662835,
	"acc_norm_stderr": 0.012036729568216054
	},
	"harness|hendrycksTest-moral_disputes|5": {
	"acc": 0.7774566473988439,
	"acc_stderr": 0.02239421566194282,
	"acc_norm": 0.7774566473988439,
	"acc_norm_stderr": 0.02239421566194282
	},
	"harness|hendrycksTest-moral_scenarios|5": {
	"acc": 0.5553072625698324,
	"acc_stderr": 0.016619881988177012,
	"acc_norm": 0.5553072625698324,
	"acc_norm_stderr": 0.016619881988177012
	},
	"harness|hendrycksTest-nutrition|5": {
	"acc": 0.7516339869281046,
	"acc_stderr": 0.024739981355113592,
	"acc_norm": 0.7516339869281046,
	"acc_norm_stderr": 0.024739981355113592
	},
	"harness|hendrycksTest-philosophy|5": {
	"acc": 0.77491961414791,
	"acc_stderr": 0.023720088516179027,
	"acc_norm": 0.77491961414791,
	"acc_norm_stderr": 0.023720088516179027
	},
	"harness|hendrycksTest-prehistory|5": {
	"acc": 0.7962962962962963,
	"acc_stderr": 0.02240967454730417,
	"acc_norm": 0.7962962962962963,
	"acc_norm_stderr": 0.02240967454730417
	},
	"harness|hendrycksTest-professional_accounting|5": {
	"acc": 0.5390070921985816,
	"acc_stderr": 0.029736592526424445,
	"acc_norm": 0.5390070921985816,
	"acc_norm_stderr": 0.029736592526424445
	},
	"harness|hendrycksTest-professional_law|5": {
	"acc": 0.5586701434159062,
	"acc_stderr": 0.012682016335646683,
	"acc_norm": 0.5586701434159062,
	"acc_norm_stderr": 0.012682016335646683
	},
	"harness|hendrycksTest-professional_medicine|5": {
	"acc": 0.7242647058823529,
	"acc_stderr": 0.027146271936625162,
	"acc_norm": 0.7242647058823529,
	"acc_norm_stderr": 0.027146271936625162
	},
	"harness|hendrycksTest-professional_psychology|5": {
	"acc": 0.761437908496732,
	"acc_stderr": 0.017242385828779627,
	"acc_norm": 0.761437908496732,
	"acc_norm_stderr": 0.017242385828779627
	},
	"harness|hendrycksTest-public_relations|5": {
	"acc": 0.7454545454545455,
	"acc_stderr": 0.041723430387053825,
	"acc_norm": 0.7454545454545455,
	"acc_norm_stderr": 0.041723430387053825
	},
	"harness|hendrycksTest-security_studies|5": {
	"acc": 0.7877551020408163,
	"acc_stderr": 0.026176967197866767,
	"acc_norm": 0.7877551020408163,
	"acc_norm_stderr": 0.026176967197866767
	},
	"harness|hendrycksTest-sociology|5": {
	"acc": 0.8805970149253731,
	"acc_stderr": 0.02292879327721974,
	"acc_norm": 0.8805970149253731,
	"acc_norm_stderr": 0.02292879327721974
	},
	"harness|hendrycksTest-us_foreign_policy|5": {
	"acc": 0.9,
	"acc_stderr": 0.030151134457776334,
	"acc_norm": 0.9,
	"acc_norm_stderr": 0.030151134457776334
	},
	"harness|hendrycksTest-virology|5": {
	"acc": 0.5602409638554217,
	"acc_stderr": 0.03864139923699122,
	"acc_norm": 0.5602409638554217,
	"acc_norm_stderr": 0.03864139923699122
	},
	"harness|hendrycksTest-world_religions|5": {
	"acc": 0.8596491228070176,
	"acc_stderr": 0.0266405825391332,
	"acc_norm": 0.8596491228070176,
	"acc_norm_stderr": 0.0266405825391332
	},
	"harness|truthfulqa:mc|0": {
	"mc1": 0.40269277845777235,
	"mc1_stderr": 0.01716883093518722,
	"mc2": 0.5707095526544473,
	"mc2_stderr": 0.01525040450448649
	},
	"harness|winogrande|5": {
	"acc": 0.8176795580110497,
	"acc_stderr": 0.010851565594267207
	},
	"harness|gsm8k|5": {
	"acc": 0.6444275966641395,
	"acc_stderr": 0.013185402252713852
	}
	}
	```
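
The percentage scores reported in the metadata and summary table appear to be the raw harness fractions above, scaled by 100 and rounded to two decimals. A minimal sketch of that conversion, assuming the metric choice per benchmark follows the metadata (`acc_norm` for ARC and HellaSwag, `mc2` for TruthfulQA, `acc` for Winogrande and GSM8k). MMLU is left out because its reported value is an aggregate over the 57 `hendrycksTest-*` entries rather than a single entry:

```python
# Raw fractions copied from the harness output above.
raw_scores = {
    "AI2 Reasoning Challenge (25-Shot)": 0.6706484641638225,  # arc:challenge, acc_norm
    "HellaSwag (10-Shot)": 0.8600876319458275,                # hellaswag, acc_norm
    "TruthfulQA (0-shot)": 0.5707095526544473,                # truthfulqa:mc, mc2
    "Winogrande (5-shot)": 0.8176795580110497,                # winogrande, acc
    "GSM8k (5-shot)": 0.6444275966641395,                     # gsm8k, acc
}

# Scale each fraction to a percentage and round to two decimals.
reported = {name: round(value * 100, 2) for name, value in raw_scores.items()}

for name, score in reported.items():
    print(f"{name}: {score}")
```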

	## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_MaziyarPanahi__WizardLM-Math-70B-v0.1).

	| Metric                            | Value |
	|-----------------------------------|------:|
	| Avg.                              | 70.92 |
	| AI2 Reasoning Challenge (25-Shot) | 67.06 |
	| HellaSwag (10-Shot)               | 86.01 |
	| MMLU (5-Shot)                     | 69.14 |
	| TruthfulQA (0-shot)               | 57.07 |
	| Winogrande (5-shot)               | 81.77 |
	| GSM8k (5-shot)                    | 64.44 |
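
The `Avg.` row is consistent with a plain mean of the six benchmark rows. A minimal sketch that reproduces it from the rounded table values, assuming half-up rounding to two decimals (the leaderboard itself presumably averages the unrounded scores, so the rounding mode here is an assumption):

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the table above; Decimal avoids binary float rounding surprises.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": Decimal("67.06"),
    "HellaSwag (10-Shot)": Decimal("86.01"),
    "MMLU (5-Shot)": Decimal("69.14"),
    "TruthfulQA (0-shot)": Decimal("57.07"),
    "Winogrande (5-shot)": Decimal("81.77"),
    "GSM8k (5-shot)": Decimal("64.44"),
}

# Unweighted mean of the six benchmarks, rounded half-up to two decimals.
mean = sum(scores.values()) / len(scores)
average = mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(average)  # 70.92
```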