Hosting using TGI : "ValueError: Unsupported model type phi3"

#38
by nagarjunrajen - opened

I tried to host Phi-3 using the TGI server, but it fails with the error below. It looks like support for hosting Phi-3 with TGI is not available yet. Any leads?

I am using this TGI Server image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-hf-tgi-serve:20240220_0936_RC01

I am trying to host it in a Kubernetes cluster on Google Cloud.

Error:

2024-04-25T20:37:04.979957Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-25T20:37:04.981095Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-25T20:37:15.030108Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-25T20:37:25.039480Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-25T20:37:35.049359Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-25T20:37:37.784256Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 475, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3

2024-04-25T20:37:38.656203Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 475, in get_model
    raise ValueError(f"Unsupported model type {model_type}")

ValueError: Unsupported model type phi3
rank=0
2024-04-25T20:37:38.717605Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-25T20:37:38.717631Z INFO text_generation_launcher: Shutting down shards

nagarjunrajen changed discussion title from Hosting suing TGI : "ValueError: Unsupported model type phi3" to Hosting using TGI : "ValueError: Unsupported model type phi3"

Looking at the commit history, it doesn't appear that TGI supported Phi-3 until a couple of days ago (when Phi-3 was released). TGI hasn't cut another release since then, so you'd need to grab an automated build, e.g.: docker pull ghcr.io/huggingface/text-generation-inference:sha-ee47973

Note: Phi-3-mini-4k-instruct on TGI works fine for me (see the launch sketch below). Phi-3-mini-128k-instruct has some unusual rope-scaling factors, and I haven't gotten it to work.
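
For reference, here is roughly how I launch it with that build; a minimal sketch where the model id, port, and volume path are illustrative, so adjust for your setup:

# Launch the nightly (sha-tagged) TGI build with the 4k text model
model=microsoft/Phi-3-mini-4k-instruct
volume=$PWD/data    # cache downloaded weights between runs
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:sha-ee47973 \
  --model-id $model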

Microsoft org

You can pass --trust-remote-code when initializing the TGI container. By default, TGI falls back to the transformers implementation if a model is not natively supported.
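
For example, a minimal sketch: the flag is an argument to the launcher, so it goes after the image name (image tag, model id, and volume path here are illustrative):

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:sha-ee47973 \
  --model-id microsoft/Phi-3-mini-4k-instruct \
  --trust-remote-code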

gugarosa changed discussion status to closed

Dear @gugarosa,

I am facing the same rope-factor issue when starting the TGI container for Phi-3-mini-128k-instruct; surprisingly, there are no issues with Phi-3-mini-4k-instruct, as mentioned by @writerflether. Here is the error log:

tgi-container-1  | 2024-05-06T13:15:53.544574Z  INFO download: text_generation_launcher: Successfully downloaded weights.
tgi-container-1  | 2024-05-06T13:15:53.544838Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-container-1  | 2024-05-06T13:15:55.366489Z ERROR text_generation_launcher: Error when initializing model
tgi-container-1  | Traceback (most recent call last):
tgi-container-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-container-1  |     sys.exit(app())
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-container-1  |     return get_command(self)(*args, **kwargs)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-container-1  |     return self.main(*args, **kwargs)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-container-1  |     return _main(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-container-1  |     rv = self.invoke(ctx)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-container-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-container-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-container-1  |     return __callback(*args, **kwargs)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-container-1  |     return callback(**use_params)  # type: ignore
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-container-1  |     server.serve(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
tgi-container-1  |     asyncio.run(
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-container-1  |     return loop.run_until_complete(main)
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-container-1  |     self.run_forever()
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-container-1  |     self._run_once()
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-container-1  |     handle._run()
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-container-1  |     self._context.run(self._callback, *self._args)
tgi-container-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
tgi-container-1  |     model = get_model(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
tgi-container-1  |     return FlashLlama(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
tgi-container-1  |     model = FlashLlamaForCausalLM(prefix, config, weights)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
tgi-container-1  |     self.model = FlashLlamaModel(prefix, config, weights)
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
tgi-container-1  |     [
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
tgi-container-1  |     FlashLlamaLayer(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
tgi-container-1  |     self.self_attn = FlashLlamaAttention(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
tgi-container-1  |     self.rotary_emb = PositionRotaryEmbedding.static(
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
tgi-container-1  |     scaling_factor = rope_scaling["factor"]
tgi-container-1  | KeyError: 'factor'
tgi-container-1  | 
tgi-container-1  | 2024-05-06T13:15:55.847905Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
tgi-container-1  | 
tgi-container-1  | The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
tgi-container-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
tgi-container-1  | /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
tgi-container-1  |   warnings.warn(
tgi-container-1  | Traceback (most recent call last):
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-container-1  |     sys.exit(app())
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-container-1  |     server.serve(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
tgi-container-1  |     asyncio.run(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-container-1  |     return loop.run_until_complete(main)
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
tgi-container-1  |     return future.result()
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
tgi-container-1  |     model = get_model(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
tgi-container-1  |     return FlashLlama(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
tgi-container-1  |     model = FlashLlamaForCausalLM(prefix, config, weights)
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
tgi-container-1  |     self.model = FlashLlamaModel(prefix, config, weights)
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
tgi-container-1  |     [
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
tgi-container-1  |     FlashLlamaLayer(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
tgi-container-1  |     self.self_attn = FlashLlamaAttention(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
tgi-container-1  |     self.rotary_emb = PositionRotaryEmbedding.static(
tgi-container-1  | 
tgi-container-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
tgi-container-1  |     scaling_factor = rope_scaling["factor"]
tgi-container-1  | 
tgi-container-1  | KeyError: 'factor'
tgi-container-1  |  rank=0
tgi-container-1  | Error: ShardCannotStart
tgi-container-1  | 2024-05-06T13:15:55.947161Z ERROR text_generation_launcher: Shard 0 failed to start
tgi-container-1  | 2024-05-06T13:15:55.947187Z  INFO text_generation_launcher: Shutting down shards
tgi-container-1 exited with code 1
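
For context, my best guess at the cause: the 128k model's config.json defines rope_scaling with long_factor/short_factor keys rather than the single "factor" key that this TGI build reads in layers.py, which would explain the KeyError above. A quick way to check, assuming curl and jq are available:

# Inspect the rope_scaling block of the 128k config straight from the Hub
curl -s https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/raw/main/config.json \
  | jq '.rope_scaling | keys'
# Expect something like ["long_factor", "short_factor", "type"]: there is no
# "factor" key, yet layers.py line 1032 does rope_scaling["factor"].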

Were you able to fix it? I am getting similar errors:

2024-05-22T02:27:07.993760Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-22T02:27:07.993970Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-22T02:27:11.749336Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2024-05-22T02:27:12.082919Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 661, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3_v

2024-05-22T02:27:12.697919Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 661, in get_model
    raise ValueError(f"Unsupported model type {model_type}")

ValueError: Unsupported model type phi3_v
 rank=0
2024-05-22T02:27:12.797090Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-22T02:27:12.797112Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Here is how I am running it:


token=token
model=microsoft/Phi-3-vision-128k-instruct
volume=$PWD/phi3/data

docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:sha-ee47973 \
  --model-id $model
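
Note that phi3_v is the architecture tag of the vision model, and judging by the error it is not in this build's supported list, so presumably only the text Phi-3 variants will start. As a rough sanity check against a text model (a sketch reusing the same image, token, and volume from above; values are illustrative):

model=microsoft/Phi-3-mini-4k-instruct
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$token -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:sha-ee47973 \
  --model-id $model

# Once the shard reports ready, from another shell:
curl 127.0.0.1:8080/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Hello","parameters":{"max_new_tokens":20}}'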
