---
language:
- en
license: llama2
tags:
- math
- reasoning
datasets:
- EleutherAI/proof-pile-2
- open-web-math/open-web-math
model-index:
- name: llemma_34b
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 55.29
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 75.08
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.93
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 40.31
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 75.53
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 50.87
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
      name: Open LLM Leaderboard
---
<img src="llemma.png" width="400">

[ArXiv](http://arxiv.org/abs/2310.10631) | [Models](https://huggingface.co/EleutherAI/llemma_34b) | [Data](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | [Code](https://github.com/EleutherAI/math-lm) | [Blog](https://blog.eleuther.ai/llemma/) | [Sample Explorer](https://llemma-demo.github.io/)

[Zhangir Azerbayev](https://zhangir-azerbayev.github.io/), [Hailey Schoelkopf](https://github.com/haileyschoelkopf), [Keiran Paster](https://keirp.com), [Marco Dos Santos](https://github.com/dsantosmarco), [Stephen McAleer](https://www.andrew.cmu.edu/user/smcaleer/), [Albert Q. Jiang](https://albertqjiang.github.io/), [Jia Deng](https://www.cs.princeton.edu/~jiadeng/), [Stella Biderman](https://www.stellabiderman.com/), [Sean Welleck](https://wellecks.com/)
**Llemma 34B** is a language model for mathematics. It was initialized with [Code Llama 34B](https://github.com/facebookresearch/codellama) weights, and trained on the [Proof-Pile-2](https://huggingface.co/datasets/EleutherAI/proof-pile-2) for 50B tokens.
This model also comes in a 7B parameter version: [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b).
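For reference, here is a minimal loading-and-generation sketch using Hugging Face `transformers`. It is illustrative rather than an official example: the prompt string, dtype, and decoding settings below are arbitrary choices, not the authors' recommendations.

```
# Minimal usage sketch (illustrative; not an official example).
# Llemma 34B in bf16 needs roughly 70 GB of GPU memory; adjust dtype/device_map for your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/llemma_34b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/llemma_34b",
    torch_dtype=torch.bfloat16,  # half-precision weights to cut memory use
    device_map="auto",           # shard across available GPUs
)

prompt = "Problem:\nCompute the derivative of sin(x^2).\n\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```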
## Evaluations
Llemma models are particularly strong at chain-of-thought mathematical reasoning and using computational tools for mathematics, such as Python and formal theorem provers.
### Chain-of-thought Math
On chain-of-thought mathematics tasks, Llemma models outperform Llama 2 and Code Llama and, when controlling for model size, outperform Minerva. A sketch of the few-shot prompt format these evaluations use follows the table.

| Model | Size | GSM8k | [OCW](https://openreview.net/forum?id=IFXTZERXdM7) | MMLU-STEM | [SAT](https://huggingface.co/datasets/mcaleste/sat_multiple_choice_math_may_23) | MATH |
|------------|------|--------|-------|-----------|-------|-------|
| Llama 2 | 7B | 11.8% | 3.7% | 29.9% | 25% | 3.2% |
| Code Llama | 7B | 10.5% | 4.4% | 25.1% | 9.4% | 4.5% |
| LLEMMA | 7B | **36.4%** | **7.7%** | **37.7%** | **53.1%** | **18.0%** |
| Minerva | 8B | 16.2% | **7.7%** | 35.6% | - | 14.1% |
|------------|------|--------|-------|-----------|-------|-------|
| Code Llama | 34B | 29.6% | 7.0% | 40.5% | 40.6% | 12.2% |
| LLEMMA | 34B | **51.5%** | **11.8%** | **49.0%** | **71.9%** | **25.0%** |
|------------|------|--------|-------|-----------|-------|-------|
| Minerva | 62B | 52.4% | 12.0% | 53.9% | - | 27.6% |
| Minerva | 540B | 58.8% | 17.6% | 63.9% | - | 33.6% |
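Concretely, these evaluations prompt the base model with a few worked examples and let it generate a step-by-step solution. The sketch below shows the general shape of such a prompt; it is not the exact evaluation template from the paper. The worked example is a GSM8k training problem, and the follow-up question and format are illustrative.

```
# Illustrative few-shot chain-of-thought prompt; not the exact template used in the paper.
FEW_SHOT_PROMPT = """\
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: In April, Natalia sold 48 clips. In May, she sold 48 / 2 = 24 clips. Altogether she sold 48 + 24 = 72 clips. The answer is 72.

Question: {question}
Answer:"""

prompt = FEW_SHOT_PROMPT.format(
    question="A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)
# Pass `prompt` to model.generate(...) as in the loading sketch above;
# the model continues with reasoning steps ending in "The answer is N".
```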
Further gains can be obtained with majority voting:

| Model | Size | GSM8k maj@100 | OCW maj@100 | MMLU-STEM maj@16 | SAT maj@16 | MATH maj@256 |
|---------|------|-------------|-----------|-----------------|-----------|------------|
| LLEMMA | 7B | 54.0% | 14.3% | 49.9% | 78.1% | **33.5%** |
| Minerva | 8B | 28.4% | 12.5% | 43.4% | - | 25.4% |
|---------|------|-------------|-----------|-----------------|-----------|------------|
| LLEMMA | 34B | 69.3% | 18.4% | 59.7% | 81.3% | **43.1%** |
|---------|------|-------------|-----------|-----------------|-----------|------------|
| Minerva | 62B | 68.5% | 23.5% | 63.5% | - | 43.4% |
| Minerva | 540B | 78.5% | 30.8% | 75.0% | - | 50.3% |
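Majority voting (maj@k) samples k solutions per problem and returns the most frequent final answer. Below is a minimal sketch, reusing `model` and `tokenizer` from the loading example above; the answer-extraction regex assumes the "The answer is N" convention from the prompt sketch and is not the paper's exact harness.

```
# Majority voting sketch (illustrative; assumes `model` and `tokenizer` from the
# snippets above, and solutions that end with "The answer is <number>").
import re
from collections import Counter

def extract_answer(text):
    matches = re.findall(r"The answer is\s*(-?[\d,.]+)", text)
    if not matches:
        return None
    return matches[-1].rstrip(".").replace(",", "")

def majority_vote(prompt, k=16):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,           # sample so the k solutions differ
        temperature=0.6,          # illustrative value, not the paper's setting
        num_return_sequences=k,
    )
    answers = [extract_answer(tokenizer.decode(o, skip_special_tokens=True)) for o in outputs]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```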
### Tool Use and Theorem Proving
In addition to chain-of-thought reasoning, Llemma has strong capabilities in computational mathematics tasks. For tool use and formal theorem proving evaluations, see [our paper](http://arxiv.org/abs/2310.10631).
### Citation
```
@misc{azerbayev2023llemma,
      title={Llemma: An Open Language Model For Mathematics},
      author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
      year={2023},
      eprint={2310.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_EleutherAI__llemma_34b).

| Metric |Value|
|---------------------------------|----:|
|Avg. |59.34|
|AI2 Reasoning Challenge (25-Shot)|55.29|
|HellaSwag (10-Shot) |75.08|
|MMLU (5-Shot) |58.93|
|TruthfulQA (0-shot) |40.31|
|Winogrande (5-shot) |75.53|
|GSM8k (5-shot) |50.87|