I deployed google/flan-t5-xxl for inference on AWS SageMaker, exactly following the SageMaker deploy instructions on the model page.

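import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# IAM role for the endpoint (assumption: the code runs inside SageMaker)
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': json.dumps(4)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
)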

In parallel, I deployed the AWS JumpStart flan-t5-xxl and compared the answers.
All the answers from the Hugging Face 'google/flan-t5-xxl' endpoint were truncated compared to those from the AWS JumpStart flan-t5-xxl, even with the max_length parameter set to 300.
I see the same behavior with 'google/flan-t5-large' and 'google/flan-t5-xl'.

Could you please advise what I am missing here?

Thanks,
Hai

lysandre

Google org Oct 3, 2023

Maybe of interest to @philschmid

philschmid

Google org Oct 4, 2023

You can customize the length by adding parameters to the request, see: https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model
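A minimal sketch of what such a request could look like, assuming the payload shape used in the linked blog post (an "inputs" string plus a "parameters" dict carrying max_new_tokens). The helper name build_payload and the prompt text are made up for illustration; predictor refers to the object returned by deploy() in the original post, so that call is left as a comment because it needs a live endpoint.

```python
import json


def build_payload(prompt, max_new_tokens=300):
    """Build a request body for the Hugging Face LLM container.

    Assumption: generation settings go under "parameters" and the output
    length is controlled by max_new_tokens, as in the linked blog post.
    build_payload is a hypothetical helper, not part of the sagemaker SDK.
    """
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }


payload = build_payload(
    "Summarize the plot of Romeo and Juliet in a few sentences.",
    max_new_tokens=300,
)
print(json.dumps(payload, indent=2))

# With a deployed endpoint, the payload would be sent like this:
# response = predictor.predict(payload)
```

The point of the sketch is that the length limit is sent with each request under "parameters" as max_new_tokens, rather than as max_length.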

haizamir

Oct 4, 2023

Thank you very much @philschmid and @lysandre
We will try to set the max_new_tokens and test it.

Thanks!!!!


Answers are truncated when deploying google/flan-t5-xxl for inference on AWS SageMaker.
