
ENDPOINT CONFIGURATION ON AWS SAGEMAKER

#21 opened by NABARKA

What should MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS, and MAX_BATCH_TOTAL_TOKENS be set to? Any ideas?

SageMaker config

import json  # needed for json.dumps below

instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

Define Model and Endpoint configuration parameters

config = {
    'HF_MODEL_ID': "togethercomputer/Llama-2-7B-32K-Instruct",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(MAX_INPUT_LENGTH),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(MAX_TOTAL_TOKENS),  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(MAX_BATCH_TOTAL_TOKENS),  # Limits the number of tokens that can be processed in parallel during generation
    'HUGGING_FACE_HUB_TOKEN': "HF_TOKEN"  # placeholder; replace with your actual token so the check below works
}

check if token is set

assert config['HUGGING_FACE_HUB_TOKEN'] != "HF_TOKEN", "Please set your Hugging Face Hub token"

create HuggingFaceModel with the image uri

from sagemaker.huggingface import HuggingFaceModel

llm_model = HuggingFaceModel(
    role=role,         # execution role from the earlier SageMaker session setup
    image_uri=llm_image,  # TGI container image URI from the earlier setup
    env=config
)
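The deploy step afterwards would look roughly like this (a sketch using the standard sagemaker HuggingFaceModel.deploy and predict calls; the prompt is just a smoke test):

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # TGI needs time to download and load the weights
)

# Simple smoke test once the endpoint is InService
llm.predict({
    "inputs": "Summarize what Amazon SageMaker does in one sentence.",
    "parameters": {"max_new_tokens": 128}
})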

Together org

Hi @NABARKA ,

This model has been trained to handle a context length of up to 32k, so I would recommend setting MAX_INPUT_LENGTH to at most 32k. MAX_TOTAL_TOKENS also depends on your application, i.e., how long you want the model's answers to be (e.g., for summarization or QA, a generation budget below 512 tokens is usually enough). MAX_BATCH_TOTAL_TOKENS is also affected by your hardware (with more memory you can handle larger batches). I don't know whether SageMaker itself has limitations on these parameters, though.
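To make the relationship between the three values concrete, here is one hypothetical combination (not validated on ml.g5.2xlarge, so treat the numbers only as a starting point): as far as I know, TGI expects MAX_INPUT_LENGTH to be strictly smaller than MAX_TOTAL_TOKENS, and MAX_BATCH_TOTAL_TOKENS to be at least MAX_TOTAL_TOKENS.

# Hypothetical values; the 7B weights plus KV cache must fit in the 24 GB GPU of a g5.2xlarge
MAX_INPUT_LENGTH = 4096        # longest prompt you plan to send (the model itself supports up to 32k)
MAX_TOTAL_TOKENS = 4608        # prompt plus up to 512 generated tokens
MAX_BATCH_TOTAL_TOKENS = 9216  # token budget across all concurrent requests, here about two max-size requests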

Let us know how it goes! :)
