Mistral 7B

Published on Oct 10, 2023


We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
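The sliding window attention mentioned in the abstract can be illustrated with a minimal sketch (not Mistral's actual implementation): with window size W, each query token attends only to the W most recent key positions, so per-token attention cost is O(W) rather than growing with sequence length.

```python
# Illustrative sketch of a causal sliding-window attention mask.
# With window size W, query token i may attend only to key tokens
# i-W+1 .. i (causal, bounded lookback).

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if query token i may attend to key token j."""
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
# Token 4 attends only to tokens 2, 3, and 4.
```

Stacking k such layers lets information propagate up to k x W tokens back, which is how a bounded window can still cover long contexts.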


Very cool work!

Great work on this model! It's fantastic to see a fully permissive licensed model with such impressive performance.

After looking through the paper, I didn't see anything regarding the training data or number of training tokens. Are there plans to release details on that?

Can you share what data sources were used to train Mistral? It will be important for users to know what risks they are taking on with respect to the training data sources.

Great work! Can you provide more details on how the system prompt was formatted? Is it the same template as Llama with `<<SYS>>` tags, or something different?

Introduces the Mistral 7B LLM: better than LLaMA-2-13B across benchmarks and LLaMA-1-34B on reasoning, math, and code generation. Key points:

- Uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) for handling longer (variable-length) sequences at low inference cost; can be deployed on cloud (AWS/GCP/Azure) using vLLM and SkyPilot.
- SWA confines the attention window, preventing quadratic scaling of attention with sequence length; a rolling buffer (fixed-window) cache reduces memory usage; pre-filling and chunking split the prompt so attention combines causal self-attention within a chunk with sliding window attention over the cache.
- Beats LLaMA models on reasoning (HellaSwag, Winogrande, PIQA, ARC), world knowledge (TriviaQA), coding (comparable to Code LLaMA on HumanEval, MBPP), and aggregated benchmarks (AGI-Eval).
- Proposes an instruction fine-tuned model, Mistral-7B-Instruct, trained on datasets from the HuggingFace Hub; better than Alpaca, Vicuna, and LLaMA models (just behind WizardLM) on human preference via MT-Bench LLM eval (LLM Boxing leaderboard). From Mistral AI.
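The rolling buffer cache noted above can be sketched as follows (a hypothetical illustration, not the paper's code): with window size W, the keys/values for position i are written to slot i mod W, so cache memory stays fixed while entries older than the window are overwritten.

```python
# Hypothetical sketch of a rolling-buffer KV cache: fixed memory of
# `window` slots; position i is stored at slot i % window, so entries
# older than the attention window are silently overwritten.

class RollingBufferCache:
    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # fixed-size storage

    def put(self, pos: int, kv) -> None:
        """Store the key/value entry for token position pos."""
        self.slots[pos % self.window] = kv

    def get_window(self, pos: int) -> list:
        """Return the cached entries visible to position pos, oldest first."""
        start = max(0, pos - self.window + 1)
        return [self.slots[p % self.window] for p in range(start, pos + 1)]

cache = RollingBufferCache(window=4)
for i in range(6):
    cache.put(i, f"kv{i}")
# After 6 tokens with window 4, only kv2..kv5 remain retrievable.
```

Because attention under SWA never looks further back than the window, overwriting older slots loses nothing the model could have attended to.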

Links: Blog, arxiv, HuggingFace page (instruct model), LLM Boxing Leaderboard, GitHub (Skypilot, FlashAttention, xformers)



Models citing this paper 171


Datasets citing this paper 1

Spaces citing this paper 1133

Collections including this paper 28