Inference endpoint fails to deploy

#13
by dragosmc - opened

Hi,

The HF inference endpoint fails to deploy with

 get_model(File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 317, in get_model
 raise NotImplementedError(\n\nNotImplementedError: Mixtral models requires flash attention v2, stk and megablocks\n"}

Any thoughts on this?
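Side note on that first error: FlashAttention v2 only runs on GPUs with CUDA compute capability 8.0 or higher (Ampere and newer, e.g. A10G/A100), so older cards such as the T4 will raise exactly this NotImplementedError. A minimal sketch to check the card, assuming you can run Python with PyTorch on the instance in question:

```python
# Sketch: verify the GPU meets FlashAttention v2's compute capability
# requirement (8.0+). Assumes PyTorch with CUDA support is installed.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("FlashAttention v2 is not supported on this GPU; "
          "choose an Ampere-or-newer instance (A10G, A100, ...).")
```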

Later edit: another attempt fails with

raise NotImplementedError(\"Mixtral does not support weight quantization yet.\")\n\nNotImplementedError: Mixtral does not support weight quantization yet.\n"}

What instance type, container, and config did you use? The default config should work with 2x A100 80GB, or use this link: https://ui.endpoints.huggingface.co/new?repository=mistralai%2FMixtral-8x7B-Instruct-v0.1&vendor=aws&region=us-east-1&accelerator=gpu&instance_size=2xlarge&task=text-generation&no_suggested_compute=true&tgi=true&tgi_max_batch_total_tokens=1024000&tgi_max_total_tokens=32000

Gotcha, thanks for the info. I was following the UI and tried with the first available instance type that didn't say "Low Memory". Will try with 2x A100 once I get access to it. Thanks.

Got access to the 2x A100 and now it doesn't seem to get past this point:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/whoami-v2 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f88d9b97b80>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"))

Anything else you reckon I should try?
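For what it's worth, that traceback is the server's startup token check against the Hub failing at the DNS level: the container cannot resolve huggingface.co at all, which points to a networking/egress problem rather than a model or config one. A quick sketch reproducing the exact call, assuming your token is in an HF_TOKEN environment variable:

```python
# Sketch: reproduce the Hub token-validation request from the log above.
# Assumes HF_TOKEN is set; any DNS/egress problem in the container
# surfaces here as the same NameResolutionError.
import os
import requests

resp = requests.get(
    "https://huggingface.co/api/whoami-v2",
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
print("Authenticated as:", resp.json()["name"])
```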

dragosmc changed discussion status to closed
