ValueError: Unsupported model type mlp_speculator using TGI server

#2 opened by rishabh-simpplr

I tried to start the server using:

model=ibm-fms/llama3-8b-accelerator
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
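
(For context, once the server is up I would query it with the standard TGI generate request below; the prompt and parameters are just placeholders, and 8080 matches the port mapping above:)

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'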

Instead, I am getting the following error; can someone please help?

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/llama-acc ghcr.io/huggingface/text-generation-inference:latest --model-id $model
2024-05-16T21:39:16.633017Z INFO text_generation_launcher: Args { model_id: "ibm-fms/llama3-8b-accelerator", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "acfcba2db2e7", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-05-16T21:39:16.637684Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-16T21:39:16.798269Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-05-16T21:39:16.798286Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-05-16T21:39:16.798288Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-05-16T21:39:16.798290Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-16T21:39:16.798354Z INFO download: text_generation_launcher: Starting download process.
2024-05-16T21:39:29.882549Z INFO text_generation_launcher: Download file: model-00001-of-00002.safetensors
2024-05-16T21:39:36.553663Z INFO text_generation_launcher: Downloaded /data/models--ibm-fms--llama3-8b-accelerator/snapshots/7a6c81324d70c9e214687d021481c5c25c8fa694/model-00001-of-00002.safetensors in 0:00:06.
2024-05-16T21:39:36.553759Z INFO text_generation_launcher: Download: [1/2] -- ETA: 0:00:06
2024-05-16T21:39:36.554059Z INFO text_generation_launcher: Download file: model-00002-of-00002.safetensors
2024-05-16T21:39:39.242592Z INFO text_generation_launcher: Downloaded /data/models--ibm-fms--llama3-8b-accelerator/snapshots/7a6c81324d70c9e214687d021481c5c25c8fa694/model-00002-of-00002.safetensors in 0:00:02.
2024-05-16T21:39:39.242670Z INFO text_generation_launcher: Download: [2/2] -- ETA: 0
2024-05-16T21:39:43.567122Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-16T21:39:43.584374Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-16T21:39:53.605036Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-05-16T21:40:02.298874Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 648, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type mlp_speculator

2024-05-16T21:40:03.512684Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 648, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type mlp_speculator
rank=0
2024-05-16T21:40:03.587086Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-16T21:40:03.587105Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

ibm-ai-platform org (edited May 17)

Can you pull the latest image with docker pull ghcr.io/huggingface/text-generation-inference:latest? If you already have a :latest image cached locally, Docker keeps using it, so the cached image may predate mlp_speculator support. This just worked for me with no issues. (I had downloaded the Llama 3 weights into my volume beforehand.)
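
Concretely, the same commands as in the question, just forcing a fresh pull first:

docker pull ghcr.io/huggingface/text-generation-inference:latest

model=ibm-fms/llama3-8b-accelerator
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model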
