metadata

language:
  - en
license: llama2
tags:
  - math
  - reasoning
datasets:
  - EleutherAI/proof-pile-2
  - open-web-math/open-web-math
model-index:
  - name: llemma_34b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 55.29
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 75.08
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 58.93
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 40.31
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 75.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 50.87
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=EleutherAI/llemma_34b
          name: Open LLM Leaderboard

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Llemma 34B is a language model for mathematics. It was initialized with Code Llama 34B weights, and trained on the Proof-Pile-2 for 50B tokens.

This model also comes in a 7B parameter version: Llemma 7B.

Evaluations

Llemma models are particularly strong at chain-of-thought mathematical reasoning and using computational tools for mathematics, such as Python and formal theorem provers.

Chain-of-thought Math

On chain-of-thought mathematics tasks, Llemma models outperform Llama-2, Code Llama, and when controlled for model size, outperform Minerva.

Model	Size	GSM8k	OCW	MMLU-STEM	SAT	MATH
Llama 2	7B	11.8%	3.7%	29.9%	25%	3.2%
Code Llama	7B	10.5%	4.4%	25.1%	9.4%	4.5%
LLEMMA	7B	36.4%	7.7%	37.7%	53.1%	18.0%
Minerva	8B	16.2%	7.7%	35.6%	-	14.1%
------------	------	--------	-------	-----------	-------	-------
Code Llama	34B	29.6%	7.0%	40.5%	40.6%	12.2%
LLEMMA	34B	51.5%	11.8%	49.0%	71.9%	25.0%
------------	------	--------	-------	-----------	-------	-------
Minerva	62B	52.4%	12.0%	53.9%	-	27.6%
Minerva	540B	58.8%	17.6%	63.9%	-	33.6%

Further performance can be extracted by using majority voting:

Model	Size	GSM8k maj@100	OCW maj@100	MMLU-STEM maj@16	SAT maj@16	MATH maj@256
LLEMMA	7B	54.0%	14.3%	49.9%	78.1%	33.5
Minerva	8B	28.4%	12.5%	43.4%	-	25.4%
---------	------	-------------	-----------	-----------------	-----------	------------
LLEMMA	34B	69.3%	18.4%	59.7%	81.3%	43.1%
---------	------	-------------	-----------	-----------------	-----------	------------
Minerva	62B	68.5%	23.5%	63.5%	-	43.4%
Minerva	540B	78.5%	30.8%	75.0%	-	50.3%

Tool Use and Theorem Proving

In addition to chain-of-thought reasoning, Llemma has strong capabilities in computational mathematics tasks. For tool use and formal theorem proving evaluations, see our paper.

Citation

@misc{azerbayev2023llemma,
      title={Llemma: An Open Language Model For Mathematics}, 
      author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
      year={2023},
      eprint={2310.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	59.34
AI2 Reasoning Challenge (25-Shot)	55.29
HellaSwag (10-Shot)	75.08
MMLU (5-Shot)	58.93
TruthfulQA (0-shot)	40.31
Winogrande (5-shot)	75.53
GSM8k (5-shot)	50.87