Answers are truncated when using google/flan-t5-xxl for inference deployed on AWS SageMaker

#62
by haizamir - opened

Hi Huggingface community,

I deployed google/flan-t5-xxl for inference on AWS SageMaker, exactly following the SageMaker deploy instructions on the model page.

import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# IAM role with SageMaker permissions (works inside a SageMaker notebook)
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': json.dumps(4)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
)

In parallel, I deployed the AWS JumpStart flan-t5-xxl and compared the answers.
All the answers from the Hugging Face 'google/flan-t5-xxl' endpoint were truncated compared to those from the AWS JumpStart flan-t5-xxl, even when setting the max_length parameter to 300.
I see the same behavior with 'google/flan-t5-large' and 'google/flan-t5-xl'.

Could you please advise what I am missing here?

Thanks,
Hai


Maybe of interest to @philschmid

You can customize the length by adding parameters to the request, see: https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model
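As a rough illustration of what "adding parameters to the request" can look like, here is a minimal sketch. It assumes the endpoint runs the Hugging Face LLM container from the deployment snippet above, which takes the generation length from `max_new_tokens` inside a `parameters` object. The helper name `build_request` and the example prompt are hypothetical and only used for illustration.

```python
import json


def build_request(prompt: str, max_new_tokens: int = 300) -> dict:
    """Build a request body with generation parameters.

    Hypothetical helper: the layout ("inputs" plus "parameters") follows
    the request format described in the linked blog post.
    """
    return {
        "inputs": prompt,
        "parameters": {
            # Upper bound on the number of generated tokens
            "max_new_tokens": max_new_tokens,
        },
    }


payload = build_request(
    "Explain what Amazon SageMaker is in three sentences.",
    max_new_tokens=300,
)
print(json.dumps(payload, indent=2))

# With the `predictor` returned by huggingface_model.deploy(...):
# response = predictor.predict(payload)
```

Other generation parameters described in the blog post can be added to the same `parameters` dictionary.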

Thank you very much @philschmid and @lysandre
We will try setting max_new_tokens and test it.

Thanks!!!!
