
ENDPOINT CONFIGURATION ON AWS SAGEMAKER

#21 opened by NABARKA

What should MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS, and MAX_BATCH_TOTAL_TOKENS be set to? Any ideas?

SageMaker config

import json  # needed for json.dumps below

instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

Define Model and Endpoint configuration parameters

config = {
    'HF_MODEL_ID': "togethercomputer/Llama-2-7B-32K-Instruct",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(MAX_INPUT_LENGTH),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(MAX_TOTAL_TOKENS),  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(MAX_BATCH_TOTAL_TOKENS),  # Limits the number of tokens that can be processed in parallel during generation
    'HUGGING_FACE_HUB_TOKEN': "HF_TOKEN"  # placeholder; replace with your actual token so the check below works
}

check if token is set

assert config['HUGGING_FACE_HUB_TOKEN'] != "HF_TOKEN", "Please set your Hugging Face Hub token"

create HuggingFaceModel with the image uri

from sagemaker.huggingface import HuggingFaceModel

llm_model = HuggingFaceModel(
    role=role,         # execution role from the earlier SageMaker session setup
    image_uri=llm_image,  # TGI container image URI from the earlier setup
    env=config
)
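The deploy step afterwards would look roughly like this (a sketch using the standard sagemaker HuggingFaceModel.deploy and predict calls; the prompt is just a smoke test):

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # TGI needs time to download and load the weights
)

# Simple smoke test once the endpoint is InService
llm.predict({
    "inputs": "Summarize what Amazon SageMaker does in one sentence.",
    "parameters": {"max_new_tokens": 128}
})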

Together org

Hi @NABARKA ,

This model has been trained to handle a context length of up to 32k, so I would recommend setting MAX_INPUT_LENGTH to at most 32k. MAX_TOTAL_TOKENS also depends on your application, i.e., how long you want the model's answers to be (e.g., for summarization or QA, a generation budget below 512 tokens is usually enough). MAX_BATCH_TOTAL_TOKENS is also affected by your hardware (with more memory you can handle larger batches). I don't know whether SageMaker itself has limitations on these parameters, though.
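To make the relationship between the three values concrete, here is one hypothetical combination (not validated on ml.g5.2xlarge, so treat the numbers only as a starting point): as far as I know, TGI expects MAX_INPUT_LENGTH to be strictly smaller than MAX_TOTAL_TOKENS, and MAX_BATCH_TOTAL_TOKENS to be at least MAX_TOTAL_TOKENS.

# Hypothetical values; the 7B weights plus KV cache must fit in the 24 GB GPU of a g5.2xlarge
MAX_INPUT_LENGTH = 4096        # longest prompt you plan to send (the model itself supports up to 32k)
MAX_TOTAL_TOKENS = 4608        # prompt plus up to 512 generated tokens
MAX_BATCH_TOTAL_TOKENS = 9216  # token budget across all concurrent requests, here about two max-size requests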

Let us know how it goes! :)
