Serving Issue with 'text-generation-inference' Models, Including 'unsloth/phi-4-unsloth-bnb-4bit'
The model is tagged as 'text-generation-inference' on the Hub, but it cannot be served with text-generation-inference: the server crashes during warmup. I have also tested 'unsloth/phi-4-bnb-4bit' and hit exactly the same error. Similar models quantized with GPTQ load and serve fine. The command used for 'unsloth/phi-4-bnb-4bit' was:
docker run --gpus all --shm-size 1g -p 8085:80 -v tgi_data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id unsloth/phi-4-bnb-4bit --quantize bitsandbytes-nf4
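To help isolate whether this is specific to TGI's bitsandbytes path, one comparison would be loading the pre-quantized checkpoint directly with transformers + bitsandbytes outside the container. The sketch below is only that comparison idea, not part of the run above; it assumes the checkpoint's bundled quantization_config is picked up automatically, which is the standard transformers behaviour for bnb-4bit uploads:

# Untested comparison sketch: load the pre-quantized bnb-4bit checkpoint
# directly with transformers, outside of TGI.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/phi-4-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place the 4-bit weights on the GPU
    torch_dtype=torch.bfloat16, # compute dtype for the non-quantized parts
)

# Simple smoke test of the loaded model.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))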
Steps to Reproduce:
root@ai:~# docker run --gpus all --shm-size 2g -p 8085:80 -v tgi_data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id unsloth/phi-4-unsloth-bnb-4bit --quantize bitsandbytes-nf4
2025-02-25T15:21:18.103062Z INFO text_generation_launcher: Args {
model_id: "unsloth/phi-4-unsloth-bnb-4bit",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: Some(
BitsandbytesNf4,
),
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "e8373c22ac8f",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
}
2025-02-25T15:21:18.103098Z INFO hf_hub: Token file not found "/data/token"
2025-02-25T15:21:18.790265Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-02-25T15:21:18.790286Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-02-25T15:21:18.790290Z WARN text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2025-02-25T15:21:18.790383Z INFO download: text_generation_launcher: Starting check and download process for unsloth/phi-4-unsloth-bnb-4bit
2025-02-25T15:21:20.625640Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-02-25T15:21:21.007315Z INFO download: text_generation_launcher: Successfully downloaded weights for unsloth/phi-4-unsloth-bnb-4bit
2025-02-25T15:21:21.007486Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-02-25T15:21:22.525058Z INFO text_generation_launcher: Using prefix caching = True
2025-02-25T15:21:22.525079Z INFO text_generation_launcher: Using Attention = flashinfer
2025-02-25T15:21:31.027385Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-02-25T15:21:34.872717Z INFO text_generation_launcher: Using experimental prefill chunking = False
2025-02-25T15:21:35.204125Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-02-25T15:21:35.230165Z INFO shard-manager: text_generation_launcher: Shard ready in 14.210727348s rank=0
2025-02-25T15:21:35.315943Z INFO text_generation_launcher: Starting Webserver
2025-02-25T15:21:35.345041Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-02-25T15:21:35.356209Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-02-25T15:21:36.399804Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1540, in warmup
_, _batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1909, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1804, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 688, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 611, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 477, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 219, in forward
qkv = self.query_key_value(hidden_states, adapter_data)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/lora.py", line 187, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/tensor_parallel.py", line 36, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/bnb.py", line 118, in forward
out = bnb.matmul_4bit(
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 509, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x19660800)
2025-02-25T15:21:36.400096Z ERROR warmup{max_input_length=None max_prefill_tokens=4096 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x19660800)
Error: Backend(Warmup(Generation("mat1 and mat2 shapes cannot be multiplied (4096x5120 and 1x19660800)")))
2025-02-25T15:21:36.438814Z ERROR text_generation_launcher: Webserver Crashed
2025-02-25T15:21:36.438829Z INFO text_generation_launcher: Shutting down shards
2025-02-25T15:21:36.530880Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-02-25T15:21:36.531046Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-02-25T15:21:36.731610Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
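If I am reading the shapes in the error correctly, mat1 (4096x5120) is the 4096 warmup tokens times the model's hidden size, while mat2 (1x19660800) looks like the still-packed nf4 byte buffer of the fused QKV projection being treated as if it were the unquantized weight. That would suggest the bitsandbytes path in TGI expects a plain fp16/bf16 checkpoint that it quantizes on the fly, and does not understand checkpoints that are already stored in bnb-4bit format. A quick back-of-the-envelope check (phi-4 dimensions taken from the model's config.json; this arithmetic is mine, not from the TGI code):

# Sanity check of the 1x19660800 operand in the error, assuming phi-4's
# published config: hidden_size=5120, 40 query heads, 10 KV heads, head_dim=128.
hidden_size = 5120
q_dim = 40 * 128               # 5120
kv_dim = 10 * 128              # 1280 each for K and V
qkv_out = q_dim + 2 * kv_dim   # 7680: output dim of the fused QKV projection

elements = hidden_size * qkv_out   # 39_321_600 weight elements
packed_bytes = elements // 2       # 19_660_800: two 4-bit values per byte

print(packed_bytes)  # 19660800, matching the second operand in the error

So the failing matmul appears to hit the raw packed uint8 tensor of the QKV weight. GPTQ checkpoints presumably work because TGI has a dedicated loader for that format.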