arxiv:2310.10631

Llemma: An Open Language Model For Mathematics

Published on Oct 16, 2023
· Featured in Daily Papers on Oct 17, 2023

Abstract

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
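Since the 7B and 34B checkpoints are openly released, they can be loaded with standard tooling. The sketch below is a minimal, hedged example using Hugging Face transformers; the repository id "EleutherAI/llemma_7b" reflects the public release name (swap in the 34B variant as needed), and the prompt and greedy decoding are illustrative choices rather than the paper's evaluation setup.

```python
# Minimal sketch: load the released Llemma 7B checkpoint and generate a
# step-by-step solution. Assumes transformers, torch, and accelerate are
# installed; "EleutherAI/llemma_7b" is the public release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/llemma_7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's stored precision
    device_map="auto",    # place weights on available GPU(s)/CPU
)

prompt = "Problem: Compute the derivative of x^2 * sin(x).\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```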

Community


Introduces the Llemma LLMs for mathematical reasoning: continued pretraining of Code Llama on Proof-Pile-2 (scientific papers, web data containing mathematics, and mathematical code); releases 7B and 34B models (the latter outperforms Google's unreleased Minerva on an equi-parameter basis). A domain-specific language model can give better performance at a smaller size. Proof-Pile-2 combines a custom code dataset (AlgebraicStack), OpenWebMath, the arXiv subset of RedPajama, and general data sources. Uses the standard decoder-only Llama 2 architecture (initialized from Code Llama, which was trained on code) with an autoregressive language-modeling objective on Proof-Pile-2. Trained in bf16 mixed precision using GPT-NeoX with tensor parallelism and ZeRO sharding; also uses Flash Attention 2 for higher throughput and lower memory usage, and RoPE for long-context fine-tuning. Outperforms open models on chain-of-thought mathematical problem solving (GSM8k, OCW, SAT, etc.) and matches Minerva; beats Code Llama at tool use (GSM8k with Python). Best perplexity is obtained with a 2:4:1 arXiv-to-web-to-code mixture. The appendix covers dataset creation (composition and processing), evaluation details, and additional results. From EleutherAI and CMU.
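The 2:4:1 ratio above can be read as per-source sampling weights over the three sub-corpora. Below is a hypothetical sketch of drawing training documents in that proportion; the source names and sampling loop are illustrative and do not reproduce the authors' actual Proof-Pile-2/GPT-NeoX data pipeline.

```python
# Hypothetical sketch: sample documents in a 2:4:1 arXiv:web:code ratio.
import random

mixture = {"arxiv": 2, "web": 4, "code": 1}
total = sum(mixture.values())
weights = {name: count / total for name, count in mixture.items()}
# arxiv ~0.29, web ~0.57, code ~0.14

def next_source(rng: random.Random) -> str:
    """Choose which sub-corpus the next pretraining document comes from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[next_source(rng)] += 1
print(counts)  # counts come out roughly proportional to 2:4:1
```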

Links: arxiv, GitHub

Models citing this paper: 15

Datasets citing this paper: 3

Spaces citing this paper: 43

Collections including this paper: 15