Deploy this model using TGI


Hi,

I'd like to deploy this model on 2 L4 GPUs, which should be possible given that this setup gives you 48 GB of VRAM (2 × 24 GB): the model has 35B parameters, so at 4 bits per parameter the weights take roughly 35/2 = 17.5 GB.
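For reference, here's the back-of-the-envelope math I'm doing. The per-token KV-cache figure is a sketch based on my reading of the model config (40 layers, hidden size 8192, no GQA) and assumes the KV cache stays in fp16, so treat those numbers as assumptions:

```python
# Back-of-the-envelope VRAM budget (my assumptions, not measured).
params = 35e9                                   # 35B parameters
weight_bytes = params * 0.5                     # 4-bit -> 0.5 bytes per parameter
print(f"weights: {weight_bytes / 1e9:.1f} GB")  # ~17.5 GB

# Per-token KV cache, assuming fp16 KV and (if I read the config right)
# 40 layers with hidden size 8192 and full multi-head attention (no GQA).
layers, hidden, bytes_per_val = 40, 8192, 2
kv_per_token = 2 * layers * hidden * bytes_per_val      # K and V
print(f"KV cache: {kv_per_token / 1e6:.2f} MB/token")   # ~1.31 MB/token
```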

I'm following this guide, except that I'm deploying this model instead of Mistral-7B on 2 L4 GPUs. Here's my TGI configuration:

```yaml
env:
  - name: MODEL_ID
    value: CohereForAI/c4ai-command-r-v01-4bit
  - name: PORT
    value: "8080"
  - name: QUANTIZE
    value: bitsandbytes-nf4
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /data
    name: data
```

This fails with:

"Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":1024,"max_prefill_tokens":4096,"max_total_tokens":2048,"name":"warmup"

Shouldn't this work given the 48 GB of VRAM? Ideally I'd like to use as large a context window as possible.
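For what it's worth, here's how I'm estimating the largest context that could fit, under the same (assumed) numbers as above:

```python
# Rough upper bound on context length (same assumptions as above).
vram = 48e9              # 2 x L4 = 48 GB total
weights = 17.5e9         # 4-bit weights
kv_per_token = 1.31e6    # fp16 KV cache, no GQA (assumption)
headroom = 0.8           # keep ~20% free for activations/fragmentation (a guess)
max_tokens = (vram * headroom - weights) / kv_per_token
print(f"~{max_tokens:,.0f} tokens")  # on the order of 16k
```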
