DeepSpeed inference tensor parallelism: per-GPU memory footprint doesn't decrease as tp_size increases.

#92
by jiangtaozh - opened
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Both values are set by the deepspeed launcher
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

def load_generator():
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    model = AutoModelForCausalLM.from_pretrained(model_name)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Set pad_token to eos_token

    model.eval()

    # Shard the model across GPUs with DeepSpeed tensor parallelism
    ds_engine = deepspeed.init_inference(model,
                                         tensor_parallel={"tp_size": world_size},
                                         #dtype=torch.float32,
                                         dtype=torch.float16,
                                         replace_with_kernel_inject=True)

    model = ds_engine.module
    model.eval()

    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator

According to the official tutorial (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/), the memory allocated on each GPU should decrease as the TP size increases. However, for the Mistral model it doesn't work. I tested GPT-J 6B and Llama 7B with exactly the same code, and both behave as expected (per-GPU memory allocation decreases as TP size increases). What's wrong with the Mistral model? I also tried DeepSpeed-MII, and the conclusion is the same.
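For reference, a minimal sketch of how the per-GPU footprint can be checked on each rank after init_inference (this assumes local_rank comes from the launcher environment; the numbers should roughly match nvidia-smi):

import os
import torch

# LOCAL_RANK is assumed to be set by the deepspeed launcher
local_rank = int(os.getenv("LOCAL_RANK", "0"))
torch.cuda.synchronize()
allocated_gb = torch.cuda.memory_allocated(local_rank) / 1024**3
reserved_gb = torch.cuda.memory_reserved(local_rank) / 1024**3
print(f"[rank {local_rank}] allocated: {allocated_gb:.2f} GiB, reserved: {reserved_gb:.2f} GiB")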

Can anyone shed some light on this?

Mistral models are not supported by the old inference engine, so you should try the latest inference engine, DeepSpeed-MII. Here's an example of running a Mistral model:

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)
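As far as I know, the non-persistent pipeline picks up tensor parallelism from the launcher rather than from an explicit tp_size argument, e.g. `deepspeed --num_gpus 2 example.py` shards the model across two GPUs.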

Under the hood, MII is still backed by the DeepSpeed inference engine. I tried it, and the result is the same: no memory footprint reduction with tensor parallelism, using the following code.

import argparse
import mii

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
    parser.add_argument("--tensor-parallel", type=int, default=1)
    args = parser.parse_args()
    # Start a persistent MII deployment sharded over the requested number of GPUs
    mii.serve(args.model, tensor_parallel=args.tensor_parallel)
    print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
    print(f"Run `python client.py --model {args.model}` to connect.")
    print(f"Run `python terminate.py --model {args.model}` to terminate.")

if __name__ == "__main__":
    main()
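The client.py and terminate.py scripts referenced above aren't shown; a minimal sketch of what they can look like, assuming the mii.client API and the default Mistral model name used above:

import mii

# Connect to the deployment previously started with mii.serve
# (the model name is assumed to match the one passed to serve)
client = mii.client("mistralai/Mistral-7B-v0.1")

# Query the deployed model
response = client.generate(["DeepSpeed is"], max_new_tokens=128)
print(response)

# Shut the deployment down when finished
client.terminate_server()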
