TGI model serving errors

#4
by wennycooper - opened

I tried to use TGI to serve the model, but I got the following errors.
Any comments would be appreciated.

2024-07-02T01:16:56.299481Z  INFO text_generation_launcher: Args {
    model_id: "yentinglin/Llama-3-Taiwan-8B-Instruct-128k",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,   
    quantize: Some(
        Bitsandbytes,  
    ),
    speculate: None,   
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        96000,
    ),
    max_total_tokens: Some(
        128000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        96000,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None, 
    hostname: "423c2839e036",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None, 
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,  
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-07-02T01:16:56.299797Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-07-02T01:16:56.710693Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-07-02T01:16:56.711086Z  INFO download: text_generation_launcher: Starting download process.
2024-07-02T01:17:03.588211Z  INFO text_generation_launcher: Download file: model-00001-of-00004.safetensors
2024-07-02T01:19:13.452334Z  INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00001-of-00004.safetensors in 0:02:09.
2024-07-02T01:19:13.452806Z  INFO text_generation_launcher: Download: [1/5] -- ETA: 0:08:36
2024-07-02T01:19:13.454338Z  INFO text_generation_launcher: Download file: model-00002-of-00004.safetensors
2024-07-02T01:21:25.292526Z  INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00002-of-00004.safetensors in 0:02:11.
2024-07-02T01:21:25.292664Z  INFO text_generation_launcher: Download: [2/5] -- ETA: 0:06:31.500000
2024-07-02T01:21:25.293379Z  INFO text_generation_launcher: Download file: model-00003-of-00004.safetensors
2024-07-02T01:23:39.538011Z  INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00003-of-00004.safetensors in 0:02:14.
2024-07-02T01:23:39.538489Z  INFO text_generation_launcher: Download: [3/5] -- ETA: 0:04:23.333334
2024-07-02T01:23:39.540114Z  INFO text_generation_launcher: Download file: model-00004-of-00004.safetensors
2024-07-02T01:24:11.750936Z  INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors in 0:00:32.
2024-07-02T01:24:11.751393Z  INFO text_generation_launcher: Download: [4/5] -- ETA: 0:01:47
2024-07-02T01:24:11.752106Z  INFO text_generation_launcher: Download file: model.safetensors
5003f467faff/model.safetensors in 0:00:00.
2024-07-02T01:24:12.290813Z  INFO text_generation_launcher: Download: [5/5] -- ETA: 0
2024-07-02T01:24:13.297441Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-02T01:24:13.297892Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-02T01:24:18.628661Z  INFO text_generation_launcher: Detected system cuda
2024-07-02T01:24:23.315694Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-02T01:24:24.011563Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever() 
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()   
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
    model = get_model( 
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 591, in get_model
    return FlashLlama( 
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in __init__
    raise RuntimeError(
RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
2024-07-02T01:24:25.118582Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 591, in get_model
    return FlashLlama( 

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in __init__
    raise RuntimeError(

RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
 rank=0
2024-07-02T01:24:25.214480Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-02T01:24:25.214519Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

This is the command I used to start TGI:

model=yentinglin/Llama-3-Taiwan-8B-Instruct-128k
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run -e HF_TOKEN='hf_xxxx' --gpus '"device=1"' --shm-size 1g -p 8081:80 -v $volume:/data --name Llama-3-Taiwan-8B-Instruct-128k ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes --max-input-length=96000 --max-total-tokens=128000 --max-batch-prefill-tokens 96000
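For reference, a minimal request to test the server once it actually starts could look like this (a sketch; it assumes the 8081:80 port mapping above and TGI's standard /generate endpoint):

import requests

# Sketch: send one generation request to the running TGI container.
# Port 8081 comes from the -p 8081:80 mapping in the docker command above.
resp = requests.post(
    "http://127.0.0.1:8081/generate",
    json={"inputs": "你好，請自我介紹", "parameters": {"max_new_tokens": 64}},
    timeout=120,
)
print(resp.json()["generated_text"])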
wennycooper changed discussion title from TGI loading errors to TGI model serving errors


I've encountered a similar bug, but it happened when downloading the model. I use AutoModelForCausalLM to download the model from Hugging Face, and the error message indicates a parameter size mismatch. Below is the code to reproduce the bug; my transformers version is 4.40.0.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('yentinglin/Llama-3-Taiwan-8B-Instruct-128k')

Here is the error message I got:
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128258, 4096]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
.
.
.
size mismatch for model.layers.31.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.31.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.31.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.31.self_attn.o_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.31.mlp.gate_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.31.mlp.up_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.31.mlp.down_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128258, 4096]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

Could the transformers version be causing this issue? I assumed that as long as the transformers version is new enough to run Llama 3, it would be new enough for this model, since it can be viewed as another fine-tuned version of Llama 3. Everything works fine in the same environment, using the same procedure, with your non-128k version ('yentinglin/Llama-3-Taiwan-8B-Instruct'). I also tried adding ignore_mismatched_sizes=True when loading, but the generation based on the snippet code in the model card is bad, so I think there must be a bug somewhere in the loading process.
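For reference, the workaround I tried was roughly the following; it only suppresses the size-mismatch errors rather than recovering the missing weights, which would explain the bad generations:

from transformers import AutoModelForCausalLM

# Load with the flag suggested in the error message above. This skips the
# mismatched (empty) tensors instead of fixing them, so the resulting model
# generates poorly.
model = AutoModelForCausalLM.from_pretrained(
    'yentinglin/Llama-3-Taiwan-8B-Instruct-128k',
    ignore_mismatched_sizes=True,
)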

It's weird. I will check and hopefully I didn't mess up anything...

@yentinglin, it seems like there is a model.safetensors file in the repo, which indicates it contains the full model. Alongside it, there are sharded files.

It seems like the issue here is that model.safetensors isn't the full version of the model: it's a 562 kB file. The inference engines try to load that file, but it's incomplete compared to the sharded files.
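A quick way to confirm the overlap is to list the tensor keys in both files (a sketch; the snapshot path is taken from the TGI log above):

from safetensors import safe_open

# Sketch: list the tensors stored in each file to confirm that lm_head.weight
# (and other keys) appear in both, which is what trips TGI's Weights loader.
snapshot = ("/data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/"
            "9b899f970a1b613c7b0516d4674e5003f467faff")

for name in ("model.safetensors", "model-00004-of-00004.safetensors"):
    with safe_open(f"{snapshot}/{name}", framework="pt") as f:
        keys = list(f.keys())
        print(f"{name}: {len(keys)} tensors, lm_head.weight present: {'lm_head.weight' in keys}")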

Would it be possible to remove that model.safetensors file?

Sure, it's removed now.

It works. Thank you!

wennycooper changed discussion status to closed
