Neuronx for mistralai/Mistral-7B-Instruct-v0.2 - Updated Mistral 7B Model on AWS Inferentia2 Using AWS Neuron SDK version 2.18~

This model has been exported to the neuron format using specific input_shapes and compiler parameters detailed in the paragraphs below.

Please refer to the 🤗 optimum-neuron documentation for an explanation of these parameters.

Note: To compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to change sliding_window in the model config (either in the config file or on the loaded config object) from null to the default value of 4096.
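A minimal sketch of the file-based variant of this change, assuming the checkpoint has been downloaded locally and its config.json contains "sliding_window": null. The helper name set_sliding_window and the temporary demo file are illustrative and not part of any library.

```python
import json
import tempfile
from pathlib import Path


def set_sliding_window(config_path, value=4096):
    """Set sliding_window in a config.json file if it is missing or null."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    if config.get("sliding_window") is None:
        config["sliding_window"] = value
        path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway file; point the helper at the real config.json instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"sliding_window": None}))
    patched = set_sliding_window(demo_path)

print(patched["sliding_window"])
```

A value that is already set is left untouched, so the helper is safe to run more than once.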

Usage with 🤗 TGI

Refer to the neuronx-tgi container image on the Amazon ECR Public Gallery.

export HF_TOKEN="hf_xxx"

docker run -d -p 8080:80 \
       --name mistral-7b-neuronx-tgi \
       -v $(pwd)/data:/data \
       --device=/dev/neuron0 \
       --device=/dev/neuron1 \
       --device=/dev/neuron2 \
       --device=/dev/neuron3 \
       --device=/dev/neuron4 \
       --device=/dev/neuron5 \
       --device=/dev/neuron6 \
       --device=/dev/neuron7 \
       --device=/dev/neuron8 \
       --device=/dev/neuron9 \
       --device=/dev/neuron10 \
       --device=/dev/neuron11 \
       -e HF_TOKEN=${HF_TOKEN} \
       public.ecr.aws/shtian/neuronx-tgi:latest \
       --model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18 \
       --max-batch-size 4 \
       --max-input-length 16 \
       --max-total-tokens 32
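The two length flags above bound how many tokens a request may generate. The sketch below works through that arithmetic, assuming the server rejects prompts longer than --max-input-length and requires prompt tokens plus max_new_tokens to stay within --max-total-tokens. The helper name max_new_tokens_budget is hypothetical.

```python
def max_new_tokens_budget(input_tokens, max_input_length=16, max_total_tokens=32):
    """Return the largest max_new_tokens that fits the server limits.

    Defaults mirror the docker flags above; the validation rule itself
    is an assumption about how the server enforces them.
    """
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - input_tokens


# A prompt that uses the full 16-token input limit leaves room for 16 new tokens.
print(max_new_tokens_budget(16))
# A shorter prompt leaves more room.
print(max_new_tokens_budget(10))
```

With these limits, a request using max_new_tokens=16 fits for any prompt that passes the input-length check.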

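Sending a list of prompts in a single request does not appear to be supported by the server (refer to this GitHub issue), so the example below issues one request per prompt from a thread pool.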

from huggingface_hub import InferenceClient
import concurrent.futures

client = InferenceClient(model="http://127.0.0.1:8080")
batch_text = ["1+1=", "2+2=", "3+3=", "4+4="]

bs = 4

def format_text_list(text_list):
    return ['[INST] ' + text + ' [/INST]' for text in text_list]

def gen_text(text):
    return client.text_generation(text, max_new_tokens=16)

with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, format_text_list(batch_text)))

print(out)
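With executor.map, the first failed request raises when its result is read, and the remaining results are lost. The variant below is a sketch of one way to keep per-prompt results and errors separate while preserving input order. It takes the generation function as a parameter, and fake_generate is a hypothetical stand-in for a function like gen_text so the snippet runs without a server.

```python
import concurrent.futures


def format_prompt(text):
    return "[INST] " + text + " [/INST]"


def generate_all(prompts, generate_fn, max_workers=4):
    """Call generate_fn once per prompt concurrently.

    Results come back in input order; a failed request is stored as the
    exception object instead of aborting the whole batch.
    """
    results = [None] * len(prompts)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(generate_fn, format_prompt(p)): i
            for i, p in enumerate(prompts)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                results[index] = exc
    return results


def fake_generate(prompt):
    # Hypothetical stand-in for a real client call.
    if "boom" in prompt:
        raise RuntimeError("simulated request failure")
    return "echo: " + prompt


results = generate_all(["1+1=", "boom", "3+3="], fake_generate)
print(results)
```

Passing the generation function in keeps the fan-out logic independent of the client library.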

Usage with 🤗 optimum-neuron pipeline

from optimum.neuron import pipeline

p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

[{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there
now I would take my partner on a romantic getaway where we could lay on the grass in the park,
eat delicious French cheeses and wine, and watch the sunset on the Seine river.'"}]

Usage with 🤗 optimum-neuron NeuronModelForCausalLM

import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

def model_sample(input_prompt):
    input_prompt = "[INST] " + input_prompt + " [/INST]"

    tokens = tokenizer(input_prompt, return_tensors="pt")

    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id
        )
        outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]

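    # skip_special_tokens=True already removes "</s>". str.strip("</s>") would
    # also remove leading/trailing '<', '/', 's', '>' characters from the answer.
    res = outputs[0].split('[/INST]')[1].strip()
    return res + "\n"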

print(model_sample("how are you today?"))
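The answer-extraction step in model_sample can also be written as a standalone helper. The sketch below assumes the decoded text contains the prompt, the "[/INST]" marker, and then the answer. It removes a trailing "</s>" as a suffix, because str.strip("</s>") treats its argument as a set of characters. The name extract_answer and the sample string are illustrative.

```python
def extract_answer(decoded):
    """Return the text after the first [/INST] marker, without a trailing </s>."""
    _, marker, answer = decoded.partition("[/INST]")
    if not marker:
        # No marker found: fall back to the whole string.
        answer = decoded
    answer = answer.strip()
    if answer.endswith("</s>"):
        answer = answer[: -len("</s>")].rstrip()
    return answer


sample = "[INST] how are you today? [/INST] sounds good, thanks</s>"
answer = extract_answer(sample)
print(answer)

# For comparison, character-set stripping damages the answer text:
print("sounds good, thanks</s>".strip("</s>"))
```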

This repository contains tags specific to versions of neuronx. When using it with 🤗 optimum-neuron, load the repo revision that matches the version of neuronx you have installed, so that the right serialized checkpoints are loaded.
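A sketch of pinning a revision, assuming from_pretrained accepts the usual Hugging Face revision argument. The tag names are not listed here, so "<neuronx-tag>" is a placeholder to be replaced with a tag that exists in this repository. The helper name load_pinned is hypothetical.

```python
MODEL_ID = "davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18"


def load_pinned(revision, model_id=MODEL_ID):
    """Load the serialized checkpoint from a specific repo tag or branch."""
    # Imported lazily so this module can be imported without the Neuron SDK.
    from optimum.neuron import NeuronModelForCausalLM

    return NeuronModelForCausalLM.from_pretrained(model_id, revision=revision)


# Usage on an Inferentia2 instance ("<neuronx-tag>" is a placeholder):
# model = load_pinned("<neuronx-tag>")
```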

Arguments passed during export

input_shapes

{
  "batch_size": 4,
  "sequence_length": 2048
}

compiler_args

{
  "auto_cast_type": "bf16",
  "num_cores": 24
}
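For reference, these arguments can be passed to an optimum-neuron export roughly as sketched below. This is an illustration under assumptions, not the exact command used to produce this repository: it assumes the export=True path of NeuronModelForCausalLM.from_pretrained, a source checkpoint whose sliding_window has already been set to 4096 as described in the note above, and a hypothetical output directory name.

```python
INPUT_SHAPES = {"batch_size": 4, "sequence_length": 2048}
COMPILER_ARGS = {"auto_cast_type": "bf16", "num_cores": 24}


def export_model(model_path, save_dir="mistral-7b-neuron"):
    """Compile a checkpoint for Inferentia2 with the arguments listed above.

    model_path is assumed to point at a checkpoint whose config already
    has sliding_window set; save_dir is an illustrative name.
    """
    # Imported lazily so this module can be imported without the Neuron SDK.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        model_path, export=True, **COMPILER_ARGS, **INPUT_SHAPES
    )
    model.save_pretrained(save_dir)
    return model


# Usage on an Inferentia2 instance with 24 Neuron cores:
# export_model("path/to/patched/Mistral-7B-Instruct-v0.2")
```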