Answers are truncated when deploying google/flan-t5-xxl for inference on AWS SageMaker
#62, opened by haizamir
Hi Huggingface community,
I deployed google/flan-t5-xxl for inference on AWS SageMaker, exactly following the SageMaker deploy instructions on the model page.
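import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# IAM role with SageMaker permissions (assumes this runs inside SageMaker)
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': json.dumps(4),  # one shard per GPU on ml.g5.12xlarge
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
)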
In parallel, I deployed the AWS JumpStart flan-t5-xxl and compared the answers.
All the answers from the Hugging Face 'google/flan-t5-xxl' endpoint were truncated compared to those from the AWS JumpStart flan-t5-xxl, even with the max_length parameter set to 300.
I see the same behavior with the large and XL variants ('google/flan-t5-large', 'google/flan-t5-xl').
Could you please advise what I am missing here?
Thanks,
Hai
Maybe of interest to @philschmid
You can customize the output length by adding generation parameters to the request, see: https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model
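As a minimal sketch of what that looks like: the Hugging Face LLM container takes a JSON body with an "inputs" string and an optional "parameters" object, and the generated length is controlled by max_new_tokens rather than max_length. The helper name build_payload and the prompt below are only illustrative, and the commented-out call assumes the predictor object from the deployment code in the original post is still live.

```python
import json


def build_payload(prompt: str, max_new_tokens: int = 300) -> dict:
    """Build a request body for the Hugging Face LLM container (hypothetical helper)."""
    return {
        "inputs": prompt,
        "parameters": {
            # Upper bound on the number of tokens generated for the answer.
            "max_new_tokens": max_new_tokens,
        },
    }


payload = build_payload("Explain what Amazon SageMaker is in three sentences.")
print(json.dumps(payload, indent=2))

# Sending the request needs a deployed endpoint, so it is left commented out:
# response = predictor.predict(payload)
# print(response[0]["generated_text"])
```

If the answers are still cut short after this, it may be worth comparing against the parameters the JumpStart endpoint is called with, since the two endpoints do not necessarily share parameter names or defaults.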
Thank you very much @philschmid and @lysandre
We will try setting max_new_tokens and test it.
Thanks!!!!