Models for inf2

#33 by AC2132 - opened

Is it possible for me to run 7B models on an inf2 device? I got the cached version of zephyr-7b-beta working, but that had a sequence length of only 256. For the other models that would be useful to me, the AWS repo either does not have the pytorch_model.bin files or it gives me an error about missing NEFF files. Has anyone been able to run a 7B model on inf2? If so, please help!

AWS Inferentia and Trainium org

Several 7B models are available in the cache, and each model card includes a snippet to deploy it on SageMaker (Deploy > Amazon SageMaker > AWS Inferentia & Trainium).

Here is, for instance, the snippet to deploy zephyr-7b-beta:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    "HF_NUM_CORES": "8",            # number of Neuron cores to shard the model across
    "HF_BATCH_SIZE": "1",           # static batch size fixed at compilation time
    "HF_SEQUENCE_LENGTH": "4096",   # static sequence length fixed at compilation time
    "HF_AUTO_CAST_TYPE": "bf16",    # data type the weights are cast to
    "MAX_BATCH_SIZE": "1",          # TGI serving limits; must match the compiled shapes
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.20"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
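
For reference, the response follows the usual Text Generation Inference format, so the generated text can be read back like this (a minimal sketch, assuming the default response shape of the huggingface-neuronx container):

result = predictor.predict(
    {"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 128}}
)
# the container returns a list of {"generated_text": ...} dicts
print(result[0]["generated_text"])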

Alternatively, you can also export the models locally in your EC2 environment, following the instructions here and using the same configuration parameters as the cached version.
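
For example, an export matching the cached zephyr-7b-beta configuration would look roughly like this (a sketch of the optimum-cli export neuron command; verify the exact flags against the optimum-neuron documentation for your version, and note the output directory name is just an example):

optimum-cli export neuron \
  --model HuggingFaceH4/zephyr-7b-beta \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 8 \
  --auto_cast_type bf16 \
  zephyr-7b-beta-neuron/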

AWS Inferentia and Trainium org

Works for me, thank you. Don't forget to upgrade the SageMaker SDK with 'pip install sagemaker --upgrade'. For the record, I used 2.214.0.

Where do you see the available models for AWS inf2? I'm looking for Llama 2 7B chat for "ml.inf2.xlarge".

AWS Inferentia and Trainium org

You can see the list of cached models here:

https://huggingface.co/aws-neuron/optimum-neuron-cache/tree/main/inference-cache-config

Alternatively, you can use the optimum-cli neuron cache lookup command to look for a specific model and see the cached configurations.
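
For example (the meta-llama model id below is an assumption; substitute whichever Hub id you are interested in):

optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf

This lists the compiled configurations (batch size, sequence length, number of cores, dtype) available for that model in the public cache.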

Since you want to deploy on an ml.inf2.xlarge, you need to select a configuration with 2 cores.

The following configuration is available:

batch_size: 1
sequence_length: 4096
num_cores: 2
auto_cast_type: fp16

You can adapt the snippet from the model card (Deploy > Amazon SageMaker > AWS Inferentia & Trainium); a rough sketch of the changes is below.
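
Concretely, relative to the zephyr-7b-beta snippet above, the changes would be along these lines (an untested sketch; the gated meta-llama model id and the Hub token variable are assumptions to check against the model card):

hub = {
    "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",  # assumed Hub id for Llama 2 7B chat
    "HF_NUM_CORES": "2",             # ml.inf2.xlarge exposes 2 Neuron cores
    "HF_BATCH_SIZE": "1",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "fp16",     # matches the cached 2-core configuration above
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HUGGING_FACE_HUB_TOKEN": "<your-token>",  # Llama 2 is gated; the env var name may vary by container version
}

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",  # 2-core instance, matching HF_NUM_CORES
    container_startup_health_check_timeout=1800,
    volume_size=512,
)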

Hi @dacorvo - can I use that model cache with DJL Serving too?
See: https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/tnx_rollingbatch_deploy_llama_7b_int8.html

If not, what steps do I need to take in that case?
