ERROR root CUDA out of memory #6

by edmond - opened

I get an out-of-memory error during a training iteration even though:

  • my V100 GPU has 32 GB of RAM
  • I use "with torch.cuda.amp.autocast():"
  • I call torch.cuda.empty_cache(). I don't think there is a memory leak: I printed the memory usage after each batch and it stayed constant
  • my sequence (batch of size 1) is not the longest one (i.e. longer sequences didn't cause the error earlier in training)

It says this : "ERROR root CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 30.22 GiB already allocated; 61.50 MiB free; 30.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF."

Isn't 1b3 supposed to fit easily on a V100?

Hi @edmond !
Thank you very much for your question. This is weird indeed. My first guess is that you are loading everything in fp32 (the default). The weights on the Hub are stored in fp16: they take ~3.4 GB in fp16 and would take twice as much in fp32, I think. Together with the optimizer state, this can quickly fill up the GPU RAM.
Can you try to load the weights + optimizer state directly in fp16 like the following: AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b3", torch_dtype="auto") ? Let me know if this works!

Would you be able to share a minimal example? I think this should work with a V100 32Gb (including the optimizer states)
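
For reference, here is a back-of-the-envelope sketch of that estimate. It assumes the ~3.4 GB fp16 size quoted above (so about 1.7e9 parameters at 2 bytes each), that weights and gradients are all held in fp32, and that the optimizer is Adam-like with two fp32 buffers per parameter. Activations and allocator overhead are ignored, so the real footprint would be higher.

```python
GB = 1e9       # decimal gigabyte, as used for the checkpoint size
GIB = 2**30    # binary gibibyte, as used in the CUDA error message

fp16_checkpoint_bytes = 3.4 * GB        # size quoted in the reply above
n_params = fp16_checkpoint_bytes / 2    # fp16 stores 2 bytes per parameter

weights_fp32 = n_params * 4             # 4 bytes per fp32 weight
grads_fp32 = n_params * 4               # one fp32 gradient per weight
adam_state_fp32 = n_params * 4 * 2      # assumption: two fp32 buffers per weight
total = weights_fp32 + grads_fp32 + adam_state_fp32

print(f"parameters     : {n_params / 1e9:.2f} B")
print(f"fp32 weights   : {weights_fp32 / GIB:.2f} GiB")
print(f"fp32 gradients : {grads_fp32 / GIB:.2f} GiB")
print(f"optimizer state: {adam_state_fp32 / GIB:.2f} GiB")
print(f"total          : {total / GIB:.2f} GiB")
```

Under these assumptions the static training state alone comes to roughly 25 GiB out of the 31.75 GiB reported in the error message, which leaves little room for activations and temporaries.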

Hi, thanks for the quick answer. AutoModelWithLMHead.from_pretrained("bigscience/bloom-1b3", torch_dtype="auto") leads to "ERROR root Attempting to unscale FP16 gradients."

Here is a minimal example :
import sys, torch, time, logging
import pandas as pd
import numpy as np
from transformers import AutoModelWithLMHead, AutoTokenizer
from sklearn.model_selection import KFold

df = pd.DataFrame()
df['tokens'] = [[i+100 for i in range(200)] for _ in range(10000)]

with torch.cuda.amp.autocast():
    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b3")
    model = AutoModelWithLMHead.from_pretrained("bigscience/bloom-1b3")

    df['tokens'] = df['tokens'].map(lambda x: x + [tokenizer.eos_token_id])

    scaler = torch.cuda.amp.GradScaler()

    model.opt = torch.optim.NAdam(model.parameters(), lr=0.000007)
    device = 'cuda'

    kf = KFold(n_splits=len(df), shuffle=True)
    for batch_nb, (_, batch_indexes) in enumerate(kf.split(df)):
        batch = df.iloc[batch_indexes]
        batch = np.array(batch['tokens'].values[0])
        batch = torch.from_numpy(batch)[None, :].to(model.device)
        batch = batch[:, :1024]

        pred = model.forward(batch[:, :-1]).logits[0]
        loss = torch.nn.functional.cross_entropy(pred, batch[0, 1:])

        # The rest of the loop body is cut off in the post; the traceback
        # below shows the optimizer step that follows:
        model.opt.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(model.opt)
        scaler.update()


It returns:
/workspace/.miniconda3/lib/python3.9/site-packages/transformers/models/auto/ FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models.
torch.Size([1, 201])
torch.Size([1, 201])
torch.Size([1, 201])
torch.Size([1, 201])
torch.Size([1, 201])

RuntimeError Traceback (most recent call last)
Input In [1], in <cell line: 10>()
34 model.opt.zero_grad()
35 scaler.scale(loss).backward()
---> 36 scaler.step(model.opt)
37 scaler.update()

File ~/.miniconda3/lib/python3.9/site-packages/torch/cuda/amp/, in GradScaler.step(self, optimizer, *args, **kwargs)
334 self.unscale_(optimizer)
336 assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
--> 338 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
340 optimizer_state["stage"] = OptState.STEPPED
342 return retval

File ~/.miniconda3/lib/python3.9/site-packages/torch/cuda/amp/, in GradScaler._maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs)
283 retval = None
284 if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
--> 285 retval = optimizer.step(*args, **kwargs)
286 return retval

File ~/.miniconda3/lib/python3.9/site-packages/torch/optim/, in Optimizer._hook_for_profile.<locals>.profile_hook_step.<locals>.wrapper(*args, **kwargs)
86 profile_name = "Optimizer.step#{}.step".format(
87 with torch.autograd.profiler.record_function(profile_name):
---> 88 return func(*args, **kwargs)

File ~/.miniconda3/lib/python3.9/site-packages/torch/autograd/, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
25 @functools.wraps(func)
26 def decorate_context(*args, **kwargs):
27 with self.__class__():
---> 28 return func(*args, **kwargs)

File ~/.miniconda3/lib/python3.9/site-packages/torch/optim/, in NAdam.step(self, closure)
116 # record the step after step update
117 state_steps.append(state['step'])
--> 119 F.nadam(params_with_grad,
120 grads,
121 exp_avgs,
122 exp_avg_sqs,
123 mu_products,
124 state_steps,
125 beta1=beta1,
126 beta2=beta2,
127 lr=group['lr'],
128 weight_decay=group['weight_decay'],
129 momentum_decay=group['momentum_decay'],
130 eps=group['eps'])
132 # update mu_product
133 for p, mu_product in zip(params_with_grad, mu_products):

File ~/.miniconda3/lib/python3.9/site-packages/torch/optim/, in nadam(params, grads, exp_avgs, exp_avg_sqs, mu_products, state_steps, beta1, beta2, lr, weight_decay, momentum_decay, eps)
399 exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
400 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
--> 402 denom = exp_avg_sq.div(bias_correction2).sqrt().add_(eps)
403 param.addcdiv_(grad, denom, value=-lr * (1. - mu) / (1. - mu_product))
404 param.addcdiv_(exp_avg, denom, value=-lr * mu_next / (1. - mu_product_next))

RuntimeError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 31.75 GiB total capacity; 28.97 GiB already allocated; 999.50 MiB free; 29.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
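
A side note on the "Tried to allocate 1.91 GiB" figure: the failing line builds denom, a temporary with the same shape as the parameter being updated. The quick check below suggests this is the size of one fp32 copy of the word-embedding matrix. It assumes a vocabulary size of 250880 and a hidden size of 2048; these values are not stated in the thread, so treat them as assumptions and read them from model.config to confirm.

```python
VOCAB_SIZE = 250880    # assumption: not stated in the thread, check model.config
HIDDEN_SIZE = 2048     # assumption: not stated in the thread, check model.config
BYTES_PER_FP32 = 4

# One fp32 tensor shaped like the word-embedding matrix.
embedding_bytes = VOCAB_SIZE * HIDDEN_SIZE * BYTES_PER_FP32
embedding_gib = embedding_bytes / 2**30

print(f"{embedding_gib:.2f} GiB")  # prints "1.91 GiB"
```

If that reading is right, the optimizer step needs almost 2 GiB of scratch space on top of the ~29 GiB already allocated, and only 999.50 MiB was free.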

edmond changed discussion status to closed
edmond changed discussion status to open

Anyone?
I tried Adam but it still doesn't fit. It only fits with SGD, or if I freeze many layers. Is that normal? :(
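
A rough way to see why SGD or freezing layers could fit where Adam does not is to count the fp32 tensors kept per parameter. The sketch below assumes about 1.7e9 parameters (derived from the ~3.4 GB fp16 checkpoint size mentioned earlier in the thread), fp32 everywhere, and no activations. training_gib is a hypothetical helper, and the 10% trainable fraction is only an illustrative value, not a measurement.

```python
N_PARAMS = 1.7e9       # assumption: ~3.4 GB fp16 checkpoint / 2 bytes per parameter
BYTES_PER_FP32 = 4

# Extra fp32 buffers each optimizer keeps per trainable parameter.
EXTRA_BUFFERS = {
    "SGD (no momentum)": 0,
    "SGD with momentum": 1,
    "Adam / NAdam": 2,
}


def training_gib(n_params, extra_buffers, trainable_fraction=1.0):
    """Estimate weights + gradients + optimizer state in GiB (hypothetical helper)."""
    weights = n_params * BYTES_PER_FP32
    trainable = n_params * trainable_fraction
    grads = trainable * BYTES_PER_FP32                  # frozen weights get no gradient
    state = trainable * BYTES_PER_FP32 * extra_buffers  # nor any optimizer state
    return (weights + grads + state) / 2**30


for name, buffers in EXTRA_BUFFERS.items():
    print(f"{name:<20} {training_gib(N_PARAMS, buffers):6.2f} GiB")

# Illustrative only: Adam with 90% of the parameters frozen.
print(f"{'Adam, 10% trainable':<20} {training_gib(N_PARAMS, 2, 0.1):6.2f} GiB")
```

With these assumptions Adam needs about twice the static memory of plain SGD, and freezing most layers removes the gradients and optimizer buffers for those layers, which is consistent with what is described above.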

Hey! Sorry I'm currently busy on another project. I'll try your minimal script as soon as I have some bandwidth. I'll have to double check a few things:

  • Is 1b3 the total number of parameters with or without the vocabulary? The vocab is actually quite big in our case, but I don't think it should really tip the scale on a V100 32 GB.
  • I would suggest removing auto, as it's for larger models that require CPU offloading; you should easily fit the weights on a single device.
  • Can you just load the weights + optimizer and check the memory footprint? I want to understand whether the issue is loading the model or running it.

Sorry for the very late reply.

Also can you check if removing GradScaler helps?
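
For the "load weights + optimizer and check memory footprint" check, here is a minimal sketch of one way to do it. It uses a tiny stand-in model on CPU so that it runs anywhere; tensor_bytes and footprint are hypothetical helper names, and for the real case you would pass the loaded model and its optimizer after one step.

```python
import torch


def tensor_bytes(tensors):
    """Total size in bytes of an iterable of tensors."""
    return sum(t.numel() * t.element_size() for t in tensors)


def footprint(model, optimizer):
    """Bytes held by parameters, gradients and optimizer state (hypothetical helper)."""
    params = tensor_bytes(model.parameters())
    grads = tensor_bytes(p.grad for p in model.parameters() if p.grad is not None)
    state = tensor_bytes(
        value
        for per_param in optimizer.state.values()
        for value in per_param.values()
        if torch.is_tensor(value)
    )
    return {"params": params, "grads": grads, "optimizer_state": state}


# Tiny stand-in model; the optimizer state only exists after the first step.
model = torch.nn.Linear(1000, 1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(2, 1000)).sum().backward()
optimizer.step()

report = footprint(model, optimizer)
for key, value in report.items():
    print(f"{key:<16} {value / 2**20:8.2f} MiB")
```

On a GPU, torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() give the allocator's own view, which also includes activations and temporaries that this tally does not count.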

Hi, no worries about the delay, I just thought you guys had forgotten me. I will be a bit more patient next time ^^.

  • About the number of parameters, I am not sure, but I tried this (screenshots):
    Capture d’écran 2022-07-01 à 6.28.11 PM.png
    Capture d’écran 2022-07-01 à 6.27.59 PM.png

  • I removed auto, and the model runs if I use SGD or freeze more layers, like this:
    for name, param in model.named_parameters():
        # Keep only parameters whose names match these layer-norm / embedding
        # patterns trainable; freeze everything else.
        lowered = name.lower()
        param.requires_grad = any(
            key in lowered
            for key in ('ln', 'norm', 'wpe', 'wte', 'position_embeddings', 'pos_drop')
        )

  • Sure. For the footprint, this is what I run just before the exception is raised (screenshot):
    Capture d’écran 2022-07-01 à 6.45.34 PM.png

  • Without the GradScaler it is even worse: not a single backprop step succeeds, it crashes on the first one. I think it is somewhat random, though: when I then also removed "with torch.cuda.amp.autocast():", it crashed on the 2nd iteration.

Thanks !