Memory leak.

#2
by Yurkoff - opened

There is a memory leak somewhere. After the calculations are completed, the model does not return the used memory to the pool.

[screenshot: GPU memory usage after the calculations finish]

This is what it looks like when the model boots up:
[screenshot: GPU memory usage at model startup]
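
One way to tell whether this is a real leak or just PyTorch's caching allocator holding on to blocks (nvidia-smi only shows the reserved total) is to compare allocated vs. reserved memory after a run. A minimal diagnostic sketch:

import torch

# Memory actually occupied by live tensors vs. memory held by the caching allocator.
# nvidia-smi reports roughly the reserved figure, which normally stays high after inference.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")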

That's a code problem, nothing to do with the model itself, and no one can help with it without seeing the code being run.


Might want to report this to the dev of whatever app/module/etc you are using.

My code:

import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM


tokenizer = LlamaTokenizerFast.from_pretrained(model_dir)
model = LlamaForCausalLM.from_pretrained(model_dir,
                                         load_in_8bit=True,
                                         device_map='sequential',
                                         torch_dtype=torch.float16,
                                         low_cpu_mem_usage=True,
                                         )


# Tokenize the prompt(s) and run generation on the model's device
inputs = tokenizer(prompts)
output_ids = model.generate(torch.as_tensor(inputs.input_ids).to(model.device),
                            do_sample=True,
                            temperature=0.8,
                            max_new_tokens=512,
                            top_p=0.95,
                            # synced_gpus=True,
                            )
results = tokenizer.batch_decode(output_ids,
                                 skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)[0]

Versions of my packages:

torch==2.0.1+cu118; sys_platform == 'linux'
torchvision==0.15.2+cu118; sys_platform == 'linux'
torchtext==0.15.2; sys_platform == 'linux'
torchaudio==2.0.2+cu118; sys_platform == 'linux'
psutil==5.9.5
requests==2.31.0
captum==0.6.0
packaging==23.1
pynvml==11.4.1
pyyaml==6.0
nvgpu
cython==0.29.34
wheel==0.40.0
pillow==9.3.0
numpy==1.24.3
torchtext==0.15.2
torchserve==0.7.1
torch-model-archiver==0.7.1
transformers==4.31.0
tokenizers==0.13.3
sentencepiece==0.1.99
bitsandbytes==0.41.1
accelerate==0.21.0
scipy==1.10.1

I solved the problem. After each inference I call:

import gc

gc.collect()               # drop unreachable Python objects that still reference GPU tensors
torch.cuda.empty_cache()   # release unused cached blocks so nvidia-smi shows the memory as freed
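
For completeness, here is a minimal sketch of how that cleanup can be wired around the generation call from above (the generate_answer wrapper and its prompts argument are illustrative, not part of the original handler):

import gc
import torch

def generate_answer(prompts):
    # Tokenize and generate exactly as in the code above
    inputs = tokenizer(prompts)
    output_ids = model.generate(torch.as_tensor(inputs.input_ids).to(model.device),
                                do_sample=True,
                                temperature=0.8,
                                max_new_tokens=512,
                                top_p=0.95)
    results = tokenizer.batch_decode(output_ids,
                                     skip_special_tokens=True,
                                     clean_up_tokenization_spaces=False)[0]
    # Drop references to the GPU tensors, then free unused blocks from the caching allocator
    del output_ids
    gc.collect()
    torch.cuda.empty_cache()
    return results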

https://github.com/huggingface/transformers/issues/25690

Yurkoff changed discussion status to closed
