|
--- |
|
license: apache-2.0 |
|
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1 |
|
inference: false |
|
model_link: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 |
|
model_name: mistralai/Mixtral-8x22B-Instruct-v0.1 |
|
pipeline_tag: text-generation |
|
quantized_by: FriendliAI |
|
tags: |
|
- pretrained |
|
--- |
|
|
|
<!-- header start --> |
|
<p align="center"> |
|
<img src="https://i.imgur.com/mNM6Cai.png" width="100%" alt="Friendli Logo"> |
|
</p> |
|
<!-- header end --> |
|
|
|
# Mixtral-8x22B-Instruct-v0.1 - FP8 |
|
|
|
- Model creator: [Mistral AI](https://huggingface.co/mistralai) |
|
- Original model: [Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
|
|
|
## Description |
|
|
|
This repo contains the Mixtral-8x22B-Instruct-v0.1 model quantized to FP8 by FriendliAI, significantly enhancing its inference efficiency while maintaining high accuracy. |
|
Note that FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures. |
|
Check out [FriendliAI documentation](https://docs.friendli.ai/) for more details. |
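
As a quick sanity check before pulling the container, you can confirm that your GPU is FP8-capable. The snippet below is an illustrative sketch, not part of the Friendli tooling; it assumes PyTorch with CUDA is installed and maps FP8 support to compute capability 8.9 (Ada) or newer.

```python
import torch

# FP8 kernels require compute capability 8.9 (Ada) or newer;
# Hopper (9.0) and Blackwell (10.0+) also qualify.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP8 supported:", (major, minor) >= (8, 9))
```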
|
|
|
## Compatibility |
|
|
|
This model is compatible with **[Friendli Container](https://friendli.ai/products/container/)**. |
|
|
|
## Prerequisites |
|
|
|
- Before you begin, make sure you have signed up for [Friendli Suite](https://suite.friendli.ai/). **You can use Friendli Containers free of charge for four weeks.** |
|
- Prepare a Personal Access Token following [this guide](#preparing-personal-access-token). |
|
- Prepare a Friendli Container Secret following [this guide](#preparing-container-secret). |
|
|
|
### Preparing Personal Access Token |
|
|
|
A PAT (Personal Access Token) is the user credential for logging into our container registry.
|
|
|
1. Sign in to [Friendli Suite](https://suite.friendli.ai/).
|
2. Go to **[User Settings > Tokens](https://suite.friendli.ai/user-settings/tokens)** and click **'Create new token'**. |
|
3. Save your created token value. |
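
### Preparing Container Secret

A container secret is the credential used to activate Friendli Container. You pass it to the container through the `FRIENDLI_CONTAINER_SECRET` environment variable in the commands below.

1. Sign in to [Friendli Suite](https://suite.friendli.ai/).

2. Go to **Container > Container Secrets** and click **'Create secret'**.

3. Save your created secret value.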
|
|
|
### Pulling Friendli Container Image |
|
|
|
1. Log in to the Docker client using the personal access token created as outlined in [this guide](#preparing-personal-access-token). |
|
|
|
```sh |
|
export YOUR_EMAIL="YOUR EMAIL"    # the email address registered with Friendli Suite
export FRIENDLI_PAT="YOUR PAT"    # the token created in the previous step

docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_PAT
|
``` |
|
|
|
2. Pull the Friendli Container image.
|
|
|
```sh |
|
docker pull registry.friendli.ai/trial |
|
``` |
|
|
|
## Running Friendli Container |
|
|
|
Once you have pulled the Friendli Container image, you can launch it to create a serving endpoint.
|
|
|
```sh |
|
docker run \ |
|
--gpus '"device=0,1,2,3"' \ |
|
-p 8000:8000 \ |
|
-v ~/.cache/huggingface:/root/.cache/huggingface \ |
|
-e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \ |
|
registry.friendli.ai/trial \ |
|
--web-server-port 8000 \ |
|
--hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8 \ |
|
--num-devices 4 # Use tensor parallelism degree 4 |
|
``` |
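
Once the endpoint is up, you can send it a test request from the host. The sketch below assumes the container exposes an OpenAI-compatible `/v1/chat/completions` route on the published port 8000; check the [FriendliAI documentation](https://docs.friendli.ai/) for the exact API surface, and treat the prompt and parameters as illustrative.

```python
import requests

# Illustrative smoke test against the endpoint launched above, assuming an
# OpenAI-compatible chat completions route is served on localhost:8000.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is FP8 quantization?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```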
|
|
|
### Optimizing Inference Performance with Policy Search |
|
|
|
To serve MoE models efficiently, you should first run a policy search to find the optimal execution policy:
|
|
|
```sh |
|
export POLICY_DIR=$PWD/policy |
|
|
|
mkdir -p $POLICY_DIR |
|
|
|
docker run \ |
|
--gpus '"device=0,1,2,3"' \ |
|
-p 8000:8000 \ |
|
-v ~/.cache/huggingface:/root/.cache/huggingface \ |
|
-v $POLICY_DIR:/policy \ |
|
-e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \ |
|
registry.friendli.ai/trial \ |
|
--web-server-port 8000 \ |
|
--hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8 \ |
|
--algo-policy-dir /policy \
--search-policy true \
--num-devices 4  # Use tensor parallelism degree 4
|
``` |
|
|
|
When the search succeeds, the optimal policy is compiled into a policy file and saved in `$POLICY_DIR`.

You can then create an inference endpoint that uses this policy as follows:
|
|
|
```sh |
|
docker run \ |
|
--gpus '"device=0,1,2,3"' \ |
|
-p 8000:8000 \ |
|
-v ~/.cache/huggingface:/root/.cache/huggingface \ |
|
-v $POLICY_DIR:/policy \ |
|
-e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \ |
|
registry.friendli.ai/trial \ |
|
--web-server-port 8000 \ |
|
--hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8 \ |
|
--algo-policy-dir /policy \
--num-devices 4  # Use tensor parallelism degree 4
|
``` |
|
|
|
--- |
|
|
|
# Original model card: MistralAI's Mixtral-8x22B-Instruct v0.1 |
|
|
|
# Model Card for Mixtral-8x22B-Instruct-v0.1 |
|
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).
|
|
|
## Run the model |
|
```python |
|
import torch
from transformers import AutoModelForCausalLM
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.tool_calls import (
    Tool,
    Function,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

device = "cuda"  # the device to load the model onto

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris"),
    ],
    model="test",
)

encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
model.to(device)

# encode_chat_completion returns a plain list of token ids, so wrap it in a
# batch dimension and convert it to a tensor before generation.
model_inputs = torch.tensor([encodeds], device=device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer
decoded = sp_tokenizer.decode(generated_ids[0].tolist())
print(decoded)
|
``` |
|
Alternatively, you can run this example with the Hugging Face tokenizer, which requires transformers version 4.39.0 or higher:
|
```console |
|
pip install "transformers>=4.39.0"
|
``` |
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {
        "role": "tool_calls",
        "content": [
            {
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France", "format": "celsius"},
            }
        ],
    },
    {
        "role": "tool_results",
        "content": {"content": 22},
    },
    {"role": "assistant", "content": "The current temperature in Paris, France is 22 degrees Celsius."},
    {"role": "user", "content": "What about San Francisco?"},
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the users location.",
                    },
                },
                "required": ["location", "format"],
            },
        },
    }
]

# render the tool use prompt as a string:
tool_use_prompt = tokenizer.apply_chat_template(
    conversation,
    chat_template="tool_use",
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer(tool_use_prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
|
|
|
# Instruct tokenizer |
|
The Hugging Face tokenizer included in this release should match our own. To compare them, first install mistral-common:
|
`pip install mistral-common` |
|
|
|
```py |
|
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="How many experts ?"),
        AssistantMessage(content="8"),
        UserMessage(content="How big ?"),
        AssistantMessage(content="22B"),
        UserMessage(content="Noice 🎉 !"),
    ],
    model="test",
)
hf_messages = mistral_query.model_dump()['messages']

tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1')
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

assert tokenized_hf == tokenized_mistral
|
``` |
|
|
|
# Function calling and special tokens |
|
This tokenizer includes additional special tokens related to function calling:
|
- [TOOL_CALLS] |
|
- [AVAILABLE_TOOLS] |
|
- [/AVAILABLE_TOOLS] |
|
- [TOOL_RESULTS] |
|
- [/TOOL_RESULTS] |
|
|
|
If you want to use this model with function calling, please be sure to apply it similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299). |
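
As a quick check that these markers are registered as single special tokens rather than being split into pieces, you can look up their ids with the Hugging Face tokenizer. This snippet is illustrative; the printed ids are whatever the released tokenizer assigns.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")

# Each function-calling marker should map to a single token id.
for token in ["[TOOL_CALLS]", "[AVAILABLE_TOOLS]", "[/AVAILABLE_TOOLS]",
              "[TOOL_RESULTS]", "[/TOOL_RESULTS]"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))
```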
|
|
|
# The Mistral AI Team |
|
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux, |
|
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault, |
|
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot, |
|
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, |
|
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, |
|
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon, |
|
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, |
|
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen, |
|
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, |
|
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang, |
|
Valera Nemychnikova, William El Sayed, William Marshall |