ValueError: sharded is not supported for AutoModel ERROR

#68 opened by peyers

Using the latest revision of falcon-40b-instruct, there is a problem when running on SageMaker: the endpoint cannot be started when following these instructions: https://github.com/marshmellow77/falcon-document-chatbot/blob/main/deploy-falcon-40b-instruct.ipynb

Yesterday everything worked fine.

The error message is the following:
ValueError: sharded is not supported for AutoModel

The current workaround is to pin the last revision that works:

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        'HF_MODEL_ID': hf_model_id,
        'HF_MODEL_REVISION': "1e7fdcc9f45d13704f3826e99937917e007cd975",
        # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
        'SM_NUM_GPUS': json.dumps(number_of_gpu),
        'MAX_INPUT_LENGTH': json.dumps(1900),  # Max length of input text
        'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
    }
)
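
For completeness, this is roughly how the model above is then deployed, following the linked notebook (a sketch; the instance type and timeout values are assumptions, adjust them for your setup):

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",               # assumed multi-GPU instance; match SM_NUM_GPUS to its GPU count
    container_startup_health_check_timeout=600,   # falcon-40b needs a generous startup timeout
)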

This could be because of this change: https://huggingface.co/tiiuae/falcon-40b/commit/f1ba7d328c06aa6fbb4a8afd3c756f46d7e6b232 together with this line in text-generation-inference:
https://github.com/huggingface/text-generation-inference/blob/b7327205a6f2f2c6349e75b8ea484e1e2823075a/server/text_generation_server/models/__init__.py#L233
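
For context, the error is raised by TGI's model dispatch: when the repo's config no longer maps to the dedicated Falcon (RefinedWeb) implementation, the server falls back to the generic AutoModel path, which cannot be sharded across GPUs. Roughly (a simplified paraphrase, not the exact TGI source; the helper names are illustrative):

def get_model(model_id, revision, sharded, quantize):
    config = fetch_model_config(model_id, revision)      # illustrative helper reading config.json
    if config.model_type == "RefinedWeb":                # dedicated Falcon implementation, shardable
        return load_falcon(model_id, sharded=sharded, quantize=quantize)   # illustrative helper
    # Unrecognized model types fall back to the generic AutoModel path,
    # which has no tensor-parallel sharding support:
    if sharded:
        raise ValueError("sharded is not supported for AutoModel")
    return load_auto_model(model_id, quantize=quantize)  # illustrative helper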

This is exactly what I'm running into when trying to make this work. I thought this was an issue with the HF inference server, thanks for pointing this out!

The problematic change has been reverted with https://huggingface.co/tiiuae/falcon-40b-instruct/commit/ca78eac0ed45bf64445ff0687fabba1598daebf3; everything works as before with the files currently on main.

Hello,

I am still running into the same issue with the 7b-instruct version, even when explicitly pointing to the commit that reverts the change:

config = {
  'HF_MODEL_ID': "tiiuae/falcon-7b-instruct", # model_id from hf.co/models
  'HF_MODEL_REVISION': "eb410fb6ffa9028e97adb801f0d6ec46d02f8b07",
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

I am running on ml.g5.48xlarge with number_of_gpu = 8 in the config above.

Any ideas what could be wrong in my setup?

Still having the issue...

Still having the issue...

It turns out sharding is not supported for the 7B variants. Make sure to either choose an instance with a single GPU or explicitly set number_of_gpu = 1 in your config.
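
In other words, for falcon-7b-instruct the config posted above would look like this (a minimal sketch; the 7B model no longer needs a revision pin):

config = {
  'HF_MODEL_ID': "tiiuae/falcon-7b-instruct",
  'SM_NUM_GPUS': json.dumps(1),            # 7B variants must run unsharded on a single GPU
  'MAX_INPUT_LENGTH': json.dumps(1024),    # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),    # Max length of the generation (including input text)
}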

Hello, I re-opened this discussion because the issue reappeared after the latest commit.

Here is my configuration:
{
  "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
  "SM_NUM_GPUS": "4",
  "HF_MODEL_QUANTIZE": "bitsandbytes",
  "MAX_INPUT_LENGTH": "1024",
  "MAX_TOTAL_TOKENS": "2048"
}

With this configuration on an ml.g5.12xlarge instance I get the error: ValueError: sharded is not supported for AutoModel

Adding "HF_MODEL_REVISION": "ca78eac0ed45bf64445ff0687fabba1598daebf3" to deploy the previous commit works perfectly fine.

The issue was reintroduced by the latest commit uploaded: ecb78d97ac356d098e79f0db222c9ce7c5d9ee5f
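
For reference, this is the configuration above with the revision pinned; the added HF_MODEL_REVISION line is the only change:

{
  "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
  "HF_MODEL_REVISION": "ca78eac0ed45bf64445ff0687fabba1598daebf3",
  "SM_NUM_GPUS": "4",
  "HF_MODEL_QUANTIZE": "bitsandbytes",
  "MAX_INPUT_LENGTH": "1024",
  "MAX_TOTAL_TOKENS": "2048"
}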

I ran into the same issue today.
Changing the revision as mentioned by valenlopez3 above works fine for me.
