Mistral 7B

Published on Oct 10, 2023


We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
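The sliding window attention mentioned in the abstract can be illustrated with a minimal sketch (not Mistral's actual implementation): with window size W, each query token attends only to the W most recent key positions, so per-token attention cost is O(W) rather than growing with sequence length.

```python
# Illustrative sketch of a causal sliding-window attention mask.
# With window size W, query token i may attend only to key tokens
# i-W+1 .. i (causal, bounded lookback).

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if query token i may attend to key token j."""
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
# Token 4 attends only to tokens 2, 3, and 4.
```

Stacking k such layers lets information propagate up to k x W tokens back, which is how a bounded window can still cover long contexts.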


Very cool work!

Great work on this model! It's fantastic to see a fully permissive licensed model with such impressive performance.

After looking through the paper, I didn't see anything regarding the training data or number of training tokens. Are there plans to release details on that?

Can you share what data sources were used to train Mistral? It will be important for users to know what risks they are taking on with respect to the training data sources.

Great work! Can you provide more details on how the system prompt was formatted? Is it the same template as Llama with `<<SYS>>` tags, or something different?

Introduces the Mistral 7B LLM: better than LLaMA-2-13B across benchmarks and LLaMA-1-34B on reasoning, math, and code generation. Key points:

- Uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) for handling longer (variable-length) sequences at low inference cost; can be deployed on cloud (AWS/GCP/Azure) using vLLM and SkyPilot.
- SWA confines the attention window, preventing quadratic scaling of attention with sequence length; a rolling buffer (fixed-window) cache reduces memory usage; pre-filling and chunking split the prompt so attention combines causal self-attention within a chunk with sliding window attention over the cache.
- Beats LLaMA models on reasoning (HellaSwag, Winogrande, PIQA, ARC), world knowledge (TriviaQA), coding (comparable to Code LLaMA on HumanEval, MBPP), and aggregated benchmarks (AGI-Eval).
- Proposes an instruction fine-tuned model, Mistral-7B-Instruct, trained on datasets from the HuggingFace Hub; better than Alpaca, Vicuna, and LLaMA models (just behind WizardLM) on human preference via MT-Bench LLM eval (LLM Boxing leaderboard). From Mistral AI.
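The rolling buffer cache noted above can be sketched as follows (a hypothetical illustration, not the paper's code): with window size W, the keys/values for position i are written to slot i mod W, so cache memory stays fixed while entries older than the window are overwritten.

```python
# Hypothetical sketch of a rolling-buffer KV cache: fixed memory of
# `window` slots; position i is stored at slot i % window, so entries
# older than the attention window are silently overwritten.

class RollingBufferCache:
    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # fixed-size storage

    def put(self, pos: int, kv) -> None:
        """Store the key/value entry for token position pos."""
        self.slots[pos % self.window] = kv

    def get_window(self, pos: int) -> list:
        """Return the cached entries visible to position pos, oldest first."""
        start = max(0, pos - self.window + 1)
        return [self.slots[p % self.window] for p in range(start, pos + 1)]

cache = RollingBufferCache(window=4)
for i in range(6):
    cache.put(i, f"kv{i}")
# After 6 tokens with window 4, only kv2..kv5 remain retrievable.
```

Because attention under SWA never looks further back than the window, overwriting older slots loses nothing the model could have attended to.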

Links: Blog, arxiv, HuggingFace page (instruct model), LLM Boxing Leaderboard, GitHub (Skypilot, FlashAttention, xformers)



Models citing this paper 171


Datasets citing this paper 1

Spaces citing this paper 1133

Collections including this paper 28