Unable to start TGI service for TheBloke/Mixtral-8x7B-v0.1-GPTQ with num_shard as 4

#9
by swapnil3597 - opened

System Info

Text Generation Inference Details:
TGI Docker Version: text-generation-launcher 1.3.4

Issue Details:
I'm able to start the TGI server for TheBloke/Mixtral-8x7B-v0.1-GPTQ with num-shard set to 1 or 2, but with 4 shards the launch fails with the error below.

Command used to start TGI server:

docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8080:80 -v /data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Mixtral-8x7B-v0.1-GPTQ --quantize gptq --num-shard 4

The following error appears in the terminal:

2024-01-19T07:13:11.235812Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-v0.1-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: Some(4), quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "fc04d78daed4", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-01-19T07:13:11.235844Z  INFO text_generation_launcher: Sharding model on 4 processes
2024-01-19T07:13:11.235941Z  INFO download: text_generation_launcher: Starting download process.
2024-01-19T07:13:19.270520Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-01-19T07:13:20.646805Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-01-19T07:13:20.647234Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-01-19T07:13:20.648012Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-01-19T07:13:20.648628Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-01-19T07:13:20.648646Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-01-19T07:13:25.313705Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2024-01-19T07:13:25.325550Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2024-01-19T07:13:25.342082Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2024-01-19T07:13:25.350426Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2024-01-19T07:13:30.660947Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-01-19T07:13:30.661904Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-01-19T07:13:30.662078Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-01-19T07:13:30.662083Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-01-19T07:13:37.771495Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address). rank=2
2024-01-19T07:13:37.771535Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 7 rank=2
2024-01-19T07:13:37.867562Z ERROR text_generation_launcher: Shard 2 failed to start
2024-01-19T07:13:37.867594Z  INFO text_generation_launcher: Shutting down shards
2024-01-19T07:13:38.066300Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=3
2024-01-19T07:13:38.128980Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
2024-01-19T07:13:39.191216Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
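For context on the shutdown message above: on Linux, signal 7 is SIGBUS, which in Docker containers is commonly caused by a process mapping more shared memory than the container's /dev/shm allows (Docker's default is only 64 MB, and NCCL uses shared memory for multi-GPU communication). A quick, hedged way to check both facts from inside the container:

```shell
# Signal 7 on Linux corresponds to SIGBUS (bus error).
kill -l 7

# Inspect the shared-memory size available to the container;
# the Docker default of 64M is often too small for multi-shard NCCL.
df -h /dev/shm
```

If /dev/shm shows the 64M default, raising it via `docker run --shm-size 1g ...` is a plausible first thing to try; this is an assumption based on the SIGBUS signal, not a confirmed root cause for this issue.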

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the Docker command on a 4-GPU server:

export HUGGING_FACE_HUB_TOKEN="<your HF token>"
docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8080:80 -v /data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Mixtral-8x7B-v0.1-GPTQ --quantize gptq --num-shard 4
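A possible variant of the command worth trying: the TGI documentation recommends passing `--shm-size 1g` when sharding across GPUs, since NCCL needs more shared memory than Docker's 64 MB default, and the shard failure above ends in signal 7 (SIGBUS), which is consistent with a shared-memory shortfall. The sketch below is a suggestion, not a confirmed fix; the `$PWD/data:/data` mount and the `1.3.4` tag (pinned to match the reported version) are assumptions.

```shell
export HUGGING_FACE_HUB_TOKEN="<your HF token>"
# --shm-size 1g: give NCCL enough shared memory for inter-shard communication.
# -v $PWD/data:/data: persist the weights cache on the host (assumed path).
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3.4 \
  --model-id TheBloke/Mixtral-8x7B-v0.1-GPTQ --quantize gptq --num-shard 4
```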

Text Generation Inference Docker version details:
TGI Docker Version: text-generation-launcher 1.3.4

Expected behavior

The TGI server should start successfully with --num-shard 4, as it does with 1 or 2 shards.
