DeepSpeed inference tensor parallelism: per-GPU memory footprint doesn't decrease as tp_size increases.

#92
by jiangtaozh - opened
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Both values are set by the deepspeed launcher
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

def load_generator():
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    model = AutoModelForCausalLM.from_pretrained(model_name)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Set pad_token to eos_token

    model.eval()

    # Shard the model across GPUs with DeepSpeed tensor parallelism
    ds_engine = deepspeed.init_inference(model,
                                         tensor_parallel={"tp_size": world_size},
                                         #dtype=torch.float32,
                                         dtype=torch.float16,
                                         replace_with_kernel_inject=True)

    model = ds_engine.module
    model.eval()

    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator

According to the official tutorial (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/), the memory allocated on each GPU should decrease as the TP size increases. However, for the Mistral model it doesn't work. I tested GPT-J 6B and Llama 7B with exactly the same code, and both behave as expected (per-GPU memory allocation decreases as TP size increases). What's wrong with the Mistral model? I also tried DeepSpeed-MII, and the conclusion is the same.
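For reference, a minimal sketch of how the per-GPU footprint can be checked on each rank after init_inference (this assumes local_rank comes from the launcher environment; the numbers should roughly match nvidia-smi):

import os
import torch

# LOCAL_RANK is assumed to be set by the deepspeed launcher
local_rank = int(os.getenv("LOCAL_RANK", "0"))
torch.cuda.synchronize()
allocated_gb = torch.cuda.memory_allocated(local_rank) / 1024**3
reserved_gb = torch.cuda.memory_reserved(local_rank) / 1024**3
print(f"[rank {local_rank}] allocated: {allocated_gb:.2f} GiB, reserved: {reserved_gb:.2f} GiB")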

Can anyone shed some light on this?

Mistral models are not supported by the old inference engine, so you should try the latest inference engine, DeepSpeed-MII. Here's an example of running a Mistral model:

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)
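As far as I know, the non-persistent pipeline picks up tensor parallelism from the launcher rather than from an explicit tp_size argument, e.g. `deepspeed --num_gpus 2 example.py` shards the model across two GPUs.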

Under the hood, MII is still backed by the DeepSpeed inference engine. I tried it, and the result is the same: no memory footprint reduction with tensor parallelism, using the following code.

import argparse
import mii

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
    parser.add_argument("--tensor-parallel", type=int, default=1)
    args = parser.parse_args()
    # Start a persistent MII deployment sharded over the requested number of GPUs
    mii.serve(args.model, tensor_parallel=args.tensor_parallel)
    print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
    print(f"Run `python client.py --model {args.model}` to connect.")
    print(f"Run `python terminate.py --model {args.model}` to terminate.")

if __name__ == "__main__":
    main()
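The client.py and terminate.py scripts referenced above aren't shown; a minimal sketch of what they can look like, assuming the mii.client API and the default Mistral model name used above:

import mii

# Connect to the deployment previously started with mii.serve
# (the model name is assumed to match the one passed to serve)
client = mii.client("mistralai/Mistral-7B-v0.1")

# Query the deployed model
response = client.generate(["DeepSpeed is"], max_new_tokens=128)
print(response)

# Shut the deployment down when finished
client.terminate_server()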
