Inefficient BLOOM 176B inference on 8*A100

#118
by bver - opened

Hello,

I have tested the 176B BLOOM model on an 8*A100 GPU setup. It should be possible to run it in fp16, without the load_in_8bit quantization feature.
Although the model was initialized successfully, there were inference issues.

Code:

from transformers import BloomForCausalLM, BloomTokenizerFast

model = BloomForCausalLM.from_pretrained(
    '.',  # the model path (downloaded locally)
    device_map='balanced_low_0',
    offload_folder='/mnt/bloom/cache'
)
tokenizer = BloomTokenizerFast.from_pretrained('.')  # (downloaded locally)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(inputs["input_ids"], ...)  # generation kwargs omitted
text = tokenizer.decode(output[0])

Observations:

  1. Model initialization took 10-20 mins.
  2. After initialization with device_map='balanced_low_0', the allocated GPU memory was 313898 MiB (i.e. 47.9% of the total 81920 MiB * 8 = 655360 MiB); no model modules were offloaded to disk.
  3. The max_memory map recommended in the accelerate docs for BLOOM-176B on an 8x80GB A100 setup resulted in tensors being offloaded to disk.
  4. Inference is very slow (several minutes for ~100 tokens). Only a single GPU (cuda:1) shows 15-41% utilization; the other 7 GPUs stay at 0% the whole time.

Suspicions:

  1. Some operations are performed on the CPU, with inefficient RAM/GPU communication.
  2. The missing preload_module_classes=['BloomBlock'] argument is crucial.
  3. The inputs should be moved to a different device, or the tokenizer output should be placed on some other device (which one?).
  4. The GPU driver or CUDA version is not sufficient for 8*A100.

Questions:

  1. What is the expected speed of inference on 8*A100 GPUs if everything works normally?
  2. Is there any important detail missing?

I would appreciate comments from those who have experience with this huge model.

Thank you in advance!

PS:
There are related discussions, e.g.:

Versions:

  • transformers 4.22.2
  • accelerate 0.12.0
  • tensorboard 2.10.1
  • pytorch 1.12.1
  • GPU Driver Version: 510.73.08
  • CUDA Version: 11.6
BigScience Workshop org
edited Sep 29, 2022

Hi @bver!
Thanks for the great summary! Indeed, I suspect that some weights are offloaded to disk / CPU, which could explain the inference speed you are observing.
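One quick way to confirm that (just a sketch; it assumes the model was loaded through accelerate with a device_map, which sets the hf_device_map attribute) is to look for "cpu" or "disk" entries:

# Sketch: list any modules that accelerate placed on the CPU or offloaded to disk.
offloaded = {name: device for name, device in model.hf_device_map.items()
             if device in ("cpu", "disk")}
print(offloaded or "all modules are on GPU")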
One thing I am curious about is whether you can try:

model = BloomForCausalLM.from_pretrained(
    '.',  # the model path (downloaded locally)
    device_map='auto',
)

This should work fine, I think, since it is what we are using. But most importantly, the script should not throw an error, as I suspect it did when you tried:

model = BloomForCausalLM.from_pretrained(
    '.',  # the model path (downloaded locally)
    device_map='balanced_low_0',
)

(The command above should raise an error along the lines of Please specify an offload folder...)

If this is the case then I guess we can dig further ;)

PS:
Another thing I suspect is that the model might be loaded in fp32, in which case the device map created by accelerate accounts for fp32 weights! To avoid this, make sure you load it in bf16 by doing:

model = BloomForCausalLM.from_pretrained(
    '.',  # the model path (downloaded locally)
    device_map='auto',
    torch_dtype="auto",
)

or

import torch

model = BloomForCausalLM.from_pretrained(
    '.',  # the model path (downloaded locally)
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
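
Once loaded, you can quickly double-check what dtype you actually got (a small sketch; get_memory_footprint should be available in your transformers version, but model.dtype alone is enough):

# Sanity check: confirm the parameters are really bf16 and look at the total weight size.
print(model.dtype)                                       # expect torch.bfloat16
print(next(model.parameters()).dtype)                    # same check on a raw parameter tensor
print(f"{model.get_memory_footprint() / 1e9:.0f} GB")    # roughly 352 GB for BLOOM-176B in bf16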

@ybelkada,
thank you very much for your help!

I have tried:

model = BloomForCausalLM.from_pretrained(
    '.',
    device_map='auto',
    torch_dtype='auto'
)

and it fixed the model initialization issues (i.e. modules being offloaded to the CPU). My guess is that torch_dtype was the missing piece: the documentation says that by default torch_dtype uses the PyTorch default type (fp32) rather than the model's native type, which doubles the device memory requirements in this case.
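
For reference, a back-of-the-envelope calculation (my own sketch, using the nominal 176B parameter count) shows why fp32 weights cannot fit on this setup without offloading:

# Rough arithmetic only: nominal parameter count vs. the 8 x 80 GB GPU pool.
n_params = 176e9
gpu_pool = 8 * 80e9                                # bytes, using decimal GB for simplicity

print(f"GPU pool:  {gpu_pool / 1e9:.0f} GB")       # 640 GB across 8 GPUs
print(f"bf16/fp16: {n_params * 2 / 1e9:.0f} GB")   # ~352 GB -> fits
print(f"fp32:      {n_params * 4 / 1e9:.0f} GB")   # ~704 GB -> does not fit, so accelerate
                                                   # has to offload modules to CPU/disk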

The total GPU memory allocated during inference was ~53% of the total available device memory. The first device was 93% full, the last one 24% (this was my motivation for using balanced_low_0 originally).

A typical inference took ~15 seconds for a few hundred tokens, which is great. However, the entire model initialization took ~45 minutes, which is less than great. I suspect the shard unpacking is inefficient.

Additional questions remain:

  1. Which GPU device should the input tokens be sent to? I guess it is the device where the transformer input (the first BloomBlocks?) is allocated, but that is probably controlled by the device map mode, such as 'auto' or 'balanced_low_0'. (I have not tried infer_auto_device_map(model) yet.)
  2. Is there any way to make the model initialization faster?

In any case, I am glad to report that my issue was solved. Thanks again!

BigScience Workshop org

Thanks a lot @bver ! Glad to hear that it worked on your side!

1- I think that you are right here; you should double-check the device map and send the input tokens to the first device, as you suggested (see the sketch below)!
2- In our experience, initialization took on average ~2-3 minutes at most. Are the weights still offloaded to disk?
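
Something along these lines should do it (just a sketch; I am assuming the embeddings ended up on a GPU, and that "transformer.word_embeddings" is the key used in your device map):

# Sketch: find the device holding the word embeddings and put the input ids on it.
first_device = model.hf_device_map["transformer.word_embeddings"]   # e.g. 0 or 1
inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{first_device}")
output = model.generate(inputs["input_ids"], max_new_tokens=100)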

Are the weights still offloaded to disk?

I do not think so; the offload_folder argument was not present, and no excessive disk or RAM allocation was observed during the experiment. I need to check for possible bottlenecks (I/O?) later.

Hi,
after a while, I was able to run more experiments and found that:

  1. the device map (model.hf_device_map) confirms that all model modules are on GPUs (no "cpu" values, just device numbers),
  2. model initialization is likely I/O-bound -- if the model sits on a tmpfs, loading is faster (~4 minutes). There is probably a way to pre-extract the tensors from the PyTorch .bin files and save them to disk (if that is the reason for the inefficiency); a timing sketch follows below.
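
A minimal way to reproduce the comparison is to time the load directly (a sketch; the tmpfs path below is a placeholder for wherever the shards are copied):

# Sketch: time from_pretrained() with the shards copied to a tmpfs (e.g. under /dev/shm).
import time
from transformers import BloomForCausalLM

start = time.time()
model = BloomForCausalLM.from_pretrained(
    '/dev/shm/bloom',        # hypothetical tmpfs copy of the checkpoint
    device_map='auto',
    torch_dtype='auto',
)
print(f"loaded in {time.time() - start:.0f} s")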
bver changed discussion status to closed
