Inf2.xlarge support

#1
by josete89 - opened

Hi!

I've tried to run the model on SageMaker using the HuggingFace TGI container 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04-v1.0 with the following config:

import json

batch_size = 1
sequence_length = 2048

llm_env = {
    "HF_MODEL_ID": "aws-neuron/Mistral-7B-Instruct-v0.1-neuron-1x2048-2-cores",  # model_id from hf.co/models
    'MAX_CONCURRENT_REQUESTS': json.dumps(batch_size),   # max number of concurrent requests
    'MAX_INPUT_LENGTH': json.dumps(1024),                 # max length of the input text (tokens)
    'MAX_TOTAL_TOKENS': json.dumps(sequence_length),      # max input + generated tokens per request
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(int(sequence_length * batch_size / 2)),
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(sequence_length * batch_size)
}

and the following instance:

instance_type = "ml.inf2.xlarge"
endpoint_name = "inferentia-mistral"

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    volume_size=100,
)
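
For reference, the `model` object used above would typically be built along these lines (a minimal sketch; the execution-role lookup and the exact HuggingFaceModel arguments are my assumptions, not taken from the original post):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Assumed setup -- illustrative only, adjust role/session to your environment
role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    image_uri="763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04-v1.0",
    env=llm_env,  # the TGI configuration defined above
    role=role,
)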

I got the following error in cloudwatch logs:

2024-02-12T16:57:53.590687Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0
2024-02-12T16:57:53.687244Z ERROR text_generation_launcher: Shard 0 failed to start
2024-02-12T16:57:53.687270Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

I've noticed in the CloudWatch metrics that memory usage is really high, around 91%, so I suppose it is running out of memory, since it keeps restarting all the time.

Then I tried the ml.inf2.8xlarge instance type and it works, but the issue is the cost, which is double :).
Is there any chance to get Mistral working on an ml.inf2.xlarge?

Thanks
