Deployment on Azure with 4 x A100 (80 GB) hangs

#23
by hugging-face-infrax - opened

Hi abacusai team,

We're currently trying to deploy the abacusai/Smaug-72B-v0.1 model in Azure with the following specifications:

GPU: 4 x A100 (80 GB)
CPU: 96 vCPU
Memory: 880 GB
NVIDIA CUDA: 12.3
NVIDIA driver version: 545.23.08
Text Generation Inference: v1.4.2

The Docker Compose file:

services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:1.4.2
    container_name: text-generation-inference
    command: >
      --model-id abacusai/Smaug-72B-v0.1
      --max-total-tokens 16384
      --max-input-length 8192
      --num-shard 4
      --sharded true
      --max-top-n-tokens 1
      --max-best-of 1
      --disable-custom-kernels
      --trust-remote-code
      --max-stop-sequences 1
      --validation-workers 1
      --waiting-served-ratio 0
      --max-batch-total-tokens 16384
      --max-waiting-tokens 8192
      --cuda-memory-fraction 0.8
      --max-concurrent-requests 512
      --max-batch-prefill-tokens 16384
      --json-output
    volumes:
      - ./data:/data
    ports:
      - 8080:80
    shm_size: '1gb'
    restart: always
    env_file:
      - .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 30s
      timeout: 45s
      start_period: 180s
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
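Since the model is sharded across four GPUs, inter-GPU communication (NCCL) is a common place for a silent hang. A sketch of diagnostic overrides we could layer on top of the compose file above — the values here are assumptions for debugging, not a confirmed fix:

```yaml
# docker-compose.override.yml — diagnostic overrides (sketch, not a confirmed fix)
services:
  text-generation-inference:
    shm_size: '2gb'   # sharded NCCL communicates via /dev/shm; give it extra headroom
    environment:
      - NCCL_DEBUG=INFO   # print NCCL init/topology logs so a stalled rank is visible
```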

We expected the model to serve requests smoothly with these specs, but the container hangs instead. Could you recommend some troubleshooting steps?
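As a first pass, it may help to narrow down whether the hang happens during weight download, shard initialization, or request serving. A sketch of the commands we are using to check (container name and port taken from the compose file above; the prompt text is illustrative):

```shell
# Follow the launcher logs: a hang during download points at storage/network,
# while a hang after the shards start usually points at NCCL/shared memory.
docker logs -f text-generation-inference

# Check GPU state from the host: all four GPUs should show memory allocated;
# one idle GPU suggests a stalled shard.
nvidia-smi

# Once /health returns 200, send a minimal request to the mapped port.
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 16}}'
```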
