Possible memory leak issue?

#46
by Tieni - opened

When performing batch processing with the 128k model for long-context (>10k token) reasoning, GPU memory usage keeps rising across batches until it runs out of memory (OOM). To work around this, I have to add "torch.cuda.empty_cache()" after the pipeline call. However, I'm uncertain whether this is expected behavior or whether there is something else I should do to address the problem.
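For context, a minimal sketch of the setup and workaround described above; the model name, prompts, and generation settings are illustrative assumptions, not taken from the thread:

```python
import torch
from transformers import pipeline

# Assumed model and settings, for illustration only.
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder batches; in practice each prompt is a >10k-token context.
prompt_batches = [
    ["long document 1 ...", "long document 2 ..."],
    ["long document 3 ...", "long document 4 ..."],
]

for batch in prompt_batches:
    outputs = pipe(batch, batch_size=len(batch), max_new_tokens=512)
    # Workaround from the question: release cached allocator blocks so
    # memory from the previous long-context batch is not held across
    # iterations.
    torch.cuda.empty_cache()
```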

Microsoft org

Thanks for raising this issue. We are not aware of any memory impact from the LongRoPE extension.

Could it be that, during batch processing, every generation in the batch is extended to the longest generated length within that batch, and that with such long contexts this fills up GPU memory?
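One way to check this hypothesis (a sketch continuing the snippet above, reusing `pipe` and a single `batch`; the token caps are arbitrary) is to compare peak memory under a tight and a loose generation-length cap:

```python
import torch

# If peak memory scales with the generated-length cap, the growth is
# driven by generation length inside the batch rather than a leak.
for cap in (64, 512):
    torch.cuda.reset_peak_memory_stats()
    pipe(batch, batch_size=len(batch), max_new_tokens=cap)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"max_new_tokens={cap}: peak allocated {peak_gib:.1f} GiB")
    torch.cuda.empty_cache()
```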

nguyenbh changed discussion status to closed
