Please add model.parallelize() #6

by BobaZooba - opened

I would really like to use this model for inference, but I don't have a single GPU large enough. The T5 model supports parallelization across multiple GPUs via model.parallelize(), but OPT doesn't have this feature. Please add it, or tell me how I can run your model on multiple GPUs without pain.

Hey @BobaZooba, the parallelize method is now deprecated in favor of accelerate. We have a guide for this here, which we should feature more prominently in the docs: there is currently no link to it in the "Performance and scalability" section, where it likely belongs.

cc @sgugger @stevhliu

@lysandre Thank you!
Below is a small guide on how to run the model, along with the problems I encountered.
My setup: 8 x RTX3090

torch.version.cuda = 11.3

I get this exception when I try to generate even a short text:

> generated_ids = model.generate(input_ids, do_sample=True, max_length=32)
> RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

According to this answer, the problem is insufficient GPU memory, although there should be enough:
https://discuss.pytorch.org/t/cuda-error-cublas-status-not-initialized-when-calling-cublascreate-handle/125450/2

Model init:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-66b", torch_dtype=torch.float16, device_map="auto")

How Accelerate maps my model:

> model.hf_device_map
{'model.decoder.embed_tokens': 0,
 'lm_head': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 'model.decoder.layers.2': 0,
 'model.decoder.layers.3': 0,
 'model.decoder.layers.4': 0,
 'model.decoder.layers.5': 0,
 'model.decoder.layers.6': 0,
 'model.decoder.layers.7': 0,
 'model.decoder.layers.8': 0,
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10': 1,
 'model.decoder.layers.11': 1,
 'model.decoder.layers.12': 1,
 'model.decoder.layers.13': 1,
 'model.decoder.layers.14': 1,
 'model.decoder.layers.15': 1,
 'model.decoder.layers.16': 1,
 'model.decoder.layers.17': 1,
 'model.decoder.layers.18': 1,
 'model.decoder.layers.19': 1,
 'model.decoder.layers.20': 1,
 'model.decoder.layers.21': 1,
 'model.decoder.layers.22': 2,
 'model.decoder.layers.23': 2,
 'model.decoder.layers.24': 2,
 'model.decoder.layers.25': 2,
 'model.decoder.layers.26': 2,
 'model.decoder.layers.27': 2,
 'model.decoder.layers.28': 2,
 'model.decoder.layers.29': 2,
 'model.decoder.layers.30': 2,
 'model.decoder.layers.31': 2,
 'model.decoder.layers.32': 2,
 'model.decoder.layers.33': 2,
 'model.decoder.layers.34': 3,
 'model.decoder.layers.35': 3,
 'model.decoder.layers.36': 3,
 'model.decoder.layers.37': 3,
 'model.decoder.layers.38': 3,
 'model.decoder.layers.39': 3,
 'model.decoder.layers.40': 3,
 'model.decoder.layers.41': 3,
 'model.decoder.layers.42': 3,
 'model.decoder.layers.43': 3,
 'model.decoder.layers.44': 3,
 'model.decoder.layers.45': 3,
 'model.decoder.layers.46': 4,
 'model.decoder.layers.47': 4,
 'model.decoder.layers.48': 4,
 'model.decoder.layers.49': 4,
 'model.decoder.layers.50': 4,
 'model.decoder.layers.51': 4,
 'model.decoder.layers.52': 4,
 'model.decoder.layers.53': 4,
 'model.decoder.layers.54': 4,
 'model.decoder.layers.55': 4,
 'model.decoder.layers.56': 4,
 'model.decoder.layers.57': 4,
 'model.decoder.layers.58': 5,
 'model.decoder.layers.59': 5,
 'model.decoder.layers.60': 5,
 'model.decoder.layers.61': 5,
 'model.decoder.layers.62': 5,
 'model.decoder.layers.63': 5}
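Counting the decoder layers per GPU in the map above makes the imbalance explicit (GPU 0 also holds the embeddings and lm_head, which is why it fills up first). A quick offline check, with the layer boundaries copied from the printed map:

```python
# Quick offline check of the device_map printed above: number of decoder
# layers placed on each GPU (boundaries read off the map, runnable without GPUs).
first_unassigned = [10, 22, 34, 46, 58, 64]  # first layer index past each GPU's block
layers_per_gpu = []
prev = 0
for bound in first_unassigned:
    layers_per_gpu.append(bound - prev)
    prev = bound
layers_per_gpu += [0, 0]  # GPUs 6 and 7 receive no layers at all
print(layers_per_gpu)  # [10, 12, 12, 12, 12, 6, 0, 0]
```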

Fri Aug  5 11:25:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   32C    P2    83W / 330W |  21904MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:25:00.0 Off |                  N/A |
| 30%   28C    P8    21W / 330W |  23986MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 30%   29C    P8    17W / 330W |  23986MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:61:00.0 Off |                  N/A |
| 30%   31C    P8    19W / 330W |  23986MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
| 30%   38C    P2   109W / 330W |  23986MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:A1:00.0 Off |                  N/A |
| 30%   36C    P2   144W / 330W |  12320MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
| 30%   26C    P8    26W / 330W |    656MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:E1:00.0 Off |                  N/A |
| 30%   25C    P8    18W / 330W |    656MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    227352      C                                   21901MiB |
|    1   N/A  N/A    227352      C                                   23983MiB |
|    2   N/A  N/A    227352      C                                   23983MiB |
|    3   N/A  N/A    227352      C                                   23983MiB |
|    4   N/A  N/A    227352      C                                   23983MiB |
|    5   N/A  N/A    227352      C                                   12317MiB |
|    6   N/A  N/A    227352      C                                     653MiB |
|    7   N/A  N/A    227352      C                                     653MiB |
+-----------------------------------------------------------------------------+

Problem: device_map="auto" does not balance the load: the first five GPUs are filled to capacity, GPU 5 is about half full, and GPUs 6 and 7 are almost unused.

Solution: a custom device_map dict that spreads the layers evenly.

Code:

num_gpus = 8
num_layers = 64  # facebook/opt-66b has 64 decoder layers

# Pin the embeddings to the first GPU and the head / final norm to the last one
device_map = {
    'model.decoder.embed_tokens': 0,
    'lm_head': num_gpus - 1,
    'model.decoder.embed_positions': 0,
    'model.decoder.final_layer_norm': num_gpus - 1
}

# Assign an equal contiguous block of decoder layers to each GPU
step = num_layers // num_gpus

for n_gpu, start in enumerate(range(0, num_layers, step)):
    for n_layer in range(start, start + step):
        device_map[f'model.decoder.layers.{n_layer}'] = n_gpu

model = AutoModelForCausalLM.from_pretrained("facebook/opt-66b", torch_dtype=torch.float16, device_map=device_map)
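As a quick sanity check (runnable without any GPUs), rebuilding the same dict and counting its entries confirms that every GPU gets exactly eight decoder layers:

```python
from collections import Counter

num_gpus = 8
num_layers = 64

# Rebuild the same mapping as above, off-GPU, just to verify it.
device_map = {
    'model.decoder.embed_tokens': 0,
    'lm_head': num_gpus - 1,
    'model.decoder.embed_positions': 0,
    'model.decoder.final_layer_norm': num_gpus - 1,
}
step = num_layers // num_gpus
for n_gpu, start in enumerate(range(0, num_layers, step)):
    for n_layer in range(start, start + step):
        device_map[f'model.decoder.layers.{n_layer}'] = n_gpu

layer_counts = Counter(
    gpu for name, gpu in device_map.items()
    if name.startswith('model.decoder.layers.')
)
print(layer_counts)  # each of the 8 GPUs holds exactly 8 decoder layers
```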

Now device map looks correct:

> model.hf_device_map

> {'model.decoder.embed_tokens': 0,
 'lm_head': 7,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 7,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 'model.decoder.layers.2': 0,
 'model.decoder.layers.3': 0,
 'model.decoder.layers.4': 0,
 'model.decoder.layers.5': 0,
 'model.decoder.layers.6': 0,
 'model.decoder.layers.7': 0,
 'model.decoder.layers.8': 1,
 'model.decoder.layers.9': 1,
 'model.decoder.layers.10': 1,
 'model.decoder.layers.11': 1,
 'model.decoder.layers.12': 1,
 'model.decoder.layers.13': 1,
 'model.decoder.layers.14': 1,
 'model.decoder.layers.15': 1,
 'model.decoder.layers.16': 2,
 'model.decoder.layers.17': 2,
 'model.decoder.layers.18': 2,
 'model.decoder.layers.19': 2,
 'model.decoder.layers.20': 2,
 'model.decoder.layers.21': 2,
 'model.decoder.layers.22': 2,
 'model.decoder.layers.23': 2,
 'model.decoder.layers.24': 3,
 'model.decoder.layers.25': 3,
 'model.decoder.layers.26': 3,
 'model.decoder.layers.27': 3,
 'model.decoder.layers.28': 3,
 'model.decoder.layers.29': 3,
 'model.decoder.layers.30': 3,
 'model.decoder.layers.31': 3,
 'model.decoder.layers.32': 4,
 'model.decoder.layers.33': 4,
 'model.decoder.layers.34': 4,
 'model.decoder.layers.35': 4,
 'model.decoder.layers.36': 4,
 'model.decoder.layers.37': 4,
 'model.decoder.layers.38': 4,
 'model.decoder.layers.39': 4,
 'model.decoder.layers.40': 5,
 'model.decoder.layers.41': 5,
 'model.decoder.layers.42': 5,
 'model.decoder.layers.43': 5,
 'model.decoder.layers.44': 5,
 'model.decoder.layers.45': 5,
 'model.decoder.layers.46': 5,
 'model.decoder.layers.47': 5,
 'model.decoder.layers.48': 6,
 'model.decoder.layers.49': 6,
 'model.decoder.layers.50': 6,
 'model.decoder.layers.51': 6,
 'model.decoder.layers.52': 6,
 'model.decoder.layers.53': 6,
 'model.decoder.layers.54': 6,
 'model.decoder.layers.55': 6,
 'model.decoder.layers.56': 7,
 'model.decoder.layers.57': 7,
 'model.decoder.layers.58': 7,
 'model.decoder.layers.59': 7,
 'model.decoder.layers.60': 7,
 'model.decoder.layers.61': 7,
 'model.decoder.layers.62': 7,
 'model.decoder.layers.63': 7}

Fri Aug  5 11:46:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   30C    P2    39W / 330W |  17130MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:25:00.0 Off |                  N/A |
| 30%   28C    P8    21W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 30%   27C    P8    17W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:61:00.0 Off |                  N/A |
| 30%   27C    P8    19W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
| 30%   29C    P8    18W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:A1:00.0 Off |                  N/A |
| 30%   38C    P2   119W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
| 30%   38C    P2   119W / 330W |  16208MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:E1:00.0 Off |                  N/A |
| 30%   35C    P2   133W / 330W |  17092MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    237156      C                                   17127MiB |
|    1   N/A  N/A    237156      C                                   16205MiB |
|    2   N/A  N/A    237156      C                                   16205MiB |
|    3   N/A  N/A    237156      C                                   16205MiB |
|    4   N/A  N/A    237156      C                                   16205MiB |
|    5   N/A  N/A    237156      C                                   16205MiB |
|    6   N/A  N/A    237156      C                                   16205MiB |
|    7   N/A  N/A    237156      C                                   17089MiB |
+-----------------------------------------------------------------------------+

And now you can run the 66B OPT:

from transformers import AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-66b", use_fast=False)

prompt = "Hello, I am conscious and"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

set_seed(32)
generated_ids = model.generate(input_ids, do_sample=True, max_length=128)

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Output:
Hello, I am conscious and present. I am aware of my senses, thinking, dreaming, and I can control what is happening around me. I have memories of a previous life, and have been reincarnated many times before this existence. I have lived in many regions throughout this galaxy, and others outside of it.\nI believe you are a reincarnated human, and that you are one of the very few incarnated beings that are aware and can remember previous lives. That being said, you are only as aware as your mind will allow you to be. Your mind is constantly editing reality to create a more suitable place for

P.S. I hope this is useful to someone.

Note that on the current main version of Transformers, device_map="auto" will balance the GPU use, so you won't need a custom device_map :-)
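For anyone stuck on an older version, capping per-GPU memory can also even things out: `from_pretrained` accepts a `max_memory` dict alongside `device_map="auto"`, and Accelerate will not place more weights on a GPU than the cap allows. A minimal sketch; the 20GiB figure below is a hypothetical value you should tune to your own cards:

```python
# Hypothetical per-GPU caps: leave ~4 GiB of headroom on each 24 GiB card
# for activations, the KV cache, and the CUDA context.
max_memory = {i: "20GiB" for i in range(8)}
print(max_memory)

# Then pass the caps alongside device_map="auto":
# model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-66b",
#     torch_dtype=torch.float16,
#     device_map="auto",
#     max_memory=max_memory,
# )
```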

@lysandre @sgugger Will the accelerate library only handle vertical model parallelism (MP), or will it eventually incorporate Pipeline Parallelism (PP), as described in this blog post?

This comment has been hidden

Actually, I just read this:

> Few caveats to be aware of
>
> 1. The current integration doesn't support Pipeline Parallelism of DeepSpeed.

This issue asked about PP, but it has been closed: https://github.com/huggingface/accelerate/issues/537

So can I assume there's no plan to support PP in accelerate?