Running on multiple GPUs

#134
by kmukeshreddy - opened

I have 4 GPUs.
I know that we can run the model on multiple GPUs using device_map="auto", but how do I get the input tokens loaded onto multiple GPUs?

This way we can only load onto one GPU:
inputs = inputs.to("cuda") [inputs will be on cuda:0]

I want to load them on all GPUs.
Example:
cuda: 0, 1, 2, 3

Hi @kmukeshreddy

Thanks for the issue! It is unclear to me what the motivation behind this is. When you load the model across multiple GPUs with device_map="auto", instead of having one replica of the model on each GPU, the model is sharded across all of them: e.g. the first layer is loaded on GPU 0, the second on GPU 1, and so on. To perform text generation with such a model, you need to make sure your input is on the same device as the first layers of the model, hence inputs = inputs.to("cuda") placing it on cuda:0. The computation is done sequentially, meaning that while one GPU is being used, all the other GPUs are kept idle.
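For reference, a minimal sketch of that sharded setup (the half-precision dtype and the toy prompt are my own assumptions here, not requirements):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard the layers across all visible GPUs
    torch_dtype=torch.float16,  # assumption: half precision to reduce memory
)

# the inputs only need to live on the device holding the first layers (cuda:0 here)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))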

If you want to parallelize text generation by, say, loading one copy of the model per GPU, you can pass device_map={"": PartialState().process_index} (after importing PartialState from accelerate). That way the model will be entirely loaded on the device PartialState().process_index, which should correspond to the index of the current GPU. After that you just need to move your input to that device:

device_index = PartialState().process_index
inputs = inputs.to(device_index)
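Here is a fuller sketch of that data-parallel pattern, assuming you launch the script with accelerate launch --num_processes 4 script.py and that the checkpoint actually fits on a single GPU:

from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # would need quantization to fit on a single 40 GB GPU
device_index = PartialState().process_index        # 0, 1, 2 or 3: one process per GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map={"": device_index},  # load the whole model on this process's GPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(device_index)
outputs = model.generate(**inputs, max_new_tokens=20)

Each process then generates independently on its own GPU, so you get four concurrent copies instead of one sharded model.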

However, I doubt Mixtral will fit on a single GPU unless you use a 2-bit version of the model, e.g. https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch from @BlackSamorez (you need to pip install aqlm and install transformers from source until the next release).
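Loading that 2-bit checkpoint should then look roughly like this (the exact kwargs are an assumption on my part):

# assumes: pip install aqlm, plus a transformers version with AQLM support
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_id = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch"
tokenizer = AutoTokenizer.from_pretrained(quantized_id)
model = AutoModelForCausalLM.from_pretrained(
    quantized_id,
    device_map={"": PartialState().process_index},  # one full copy per GPU / process
)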

Hi @ybelkada, apologies for the delayed response.

I will rephrase the question; my assumption about GPU allocation in the context above was wrong. Apologies for the confusion.

Updated question:

I have 4 GPUs, each with ~40 GB:
GPU 0: 40 GB
GPU 1: 40 GB
GPU 2: 40 GB
GPU 3: 40 GB

As per the Hugging Face docs on loading large models here: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference

"balanced_low_0" evenly splits the model on all GPUs except the first one, and only puts on GPU 0 what does not fit on the others. This option is great when you need to use GPU 0 for some processing of the outputs, like when using the generate function for Transformers models

Using balanced_low_0 for text generation, I loaded Mixtral onto all GPUs except GPU 0 [i.e. GPU 0 is kept free for the model.generate() function].

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="balanced_low_0")

GPU Utilization:

[screenshot: per-GPU memory utilization]

Here we can see that all of GPU 0 and part of GPU 3 have memory left.

Now, when I feed a long text input (fewer than 32k tokens) to model.generate(), a CUDA out-of-memory error appears, but this error only says there is not enough space on GPU 0.

inputs = tokenizer(long_text_message, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)  # CUDA error is raised here

CUDA Error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 22.80 GiB (GPU 0; 39.59 GiB total capacity; 28.81 GiB already allocated; 6.46 GiB free; 31.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON

My expectation is to utilize all of GPU 0 and the free part of GPU 3 for model.generate() to process long texts.

Could you please let me know how I can utilize GPU 3 on top of GPU 0 for model.generate()? (Or utilize all the remaining GPU resources for model.generate()?)

Hi @kmukeshreddy
Hmm, interesting, I see. My guess here is that the text you're passing is so large that the hidden states computed on the first GPU exceed 40 GB. Can you try again, reducing the size of the model by loading it in half precision or 8-bit / 4-bit precision?
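Something along these lines, where the exact kwargs are only a sketch (pick one of the two options):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Option 1: half precision
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="balanced_low_0", torch_dtype=torch.float16
)

# Option 2: 4-bit quantization (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced_low_0",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)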

Hi @ybelkada, thank you for the follow-up comment.
The issue with the quantized version is that there is a significant performance drop for my task, so I was looking for a workaround to use the model as-is.
