Batched inference on multi-GPUs

#56 opened by d-i-o-n

What's the most efficient way to run batched inference on a multi-GPU machine at the moment? The script below is fairly slow.

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # shards one copy of the model across all visible GPUs
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Llama 3 Instruct uses <|eot_id|> as an extra stop token; pad_token_id must be set for batched generation
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
pipeline.tokenizer.pad_token_id = pipeline.model.config.eos_token_id
outputs = pipeline(
    256*[prompt],
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=1.,
    top_p=0.9,
    batch_size=256
)

There isn't really an off-the-shelf way to accelerate batched inference further if you already have the optimal setup, especially for a 7B-class model. Fortunately, you don't have the optimal setup. First, make sure the KV cache is used if you have enough GPU memory. Second, device_map="auto" shards a single copy of the model's parameters across all GPUs, which is probably the bottleneck in your situation. My suggestion is data parallelism instead: keep a full copy of the model on each device. With such a large batch size, the memory taken by the extra model copies is far less than the memory used by the KV cache, so it should work.
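Here is a minimal sketch of that data-parallel approach, assuming the 8B model in bf16 fits on each single GPU. The process-per-GPU layout via torch.multiprocessing, the worker function, and the per-rank output files are illustrative choices rather than anything from transformers itself; the prompt construction and generation arguments simply mirror the snippet above.

import json

import torch
import torch.multiprocessing as mp
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"


def worker(rank, world_size, prompts):
    # One process per GPU, each holding a full copy of the model (data parallelism).
    pipe = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device=rank,
    )
    pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id
    terminators = [
        pipe.tokenizer.eos_token_id,
        pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    shard = prompts[rank::world_size]  # this GPU's slice of the 256 prompts
    outputs = pipe(
        shard,
        max_new_tokens=2048,
        eos_token_id=terminators,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        batch_size=len(shard),  # the KV cache (use_cache) is already on by default
    )
    # Hypothetical output path; gather the per-rank files however you like.
    with open(f"outputs_rank{rank}.json", "w") as f:
        json.dump(outputs, f)


if __name__ == "__main__":
    world_size = torch.cuda.device_count()

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    prompts = 256 * [prompt]

    mp.spawn(worker, args=(world_size, prompts), nprocs=world_size, join=True)

Each rank generates its slice of the batch independently with its own KV cache, so there is no cross-GPU communication during decoding; if a full shard at max_new_tokens=2048 runs out of memory, lower batch_size inside the worker.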
