Out of CPU memory in pipeline

#2
by dmiche - opened

Hi!
The model files themselves are small and should fit in 24 GB of GPU memory.
But if I try the suggested pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="relaxml/Llama-2-70b-E8P-2Bit")

CPU memory usage grows until it reaches 128 GB and the process is then killed on OOM.
Can I avoid this memory allocation?
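
For reference, the usual Transformers knobs for keeping CPU RAM down during loading look like the sketch below. The arguments shown (torch_dtype, device_map, low_cpu_mem_usage) are standard pipeline/from_pretrained options, but as the reply below explains, they do not make this checkpoint's custom quantized layers load; this is only what one would try for a stock model.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="relaxml/Llama-2-70b-E8P-2Bit",
    torch_dtype=torch.float16,  # avoid materializing weights in fp32 on CPU
    device_map="auto",          # dispatch weights to GPU as they load (requires accelerate)
    model_kwargs={"low_cpu_mem_usage": True},  # load the state dict incrementally
)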

RelaxML org

I'm confused. What suggested pipeline? I don't think we have any code in our codebase that uses a pipeline() call.

Hm... it's on the "Model card" tab, the rightmost button above the "Downloads" chart: "Use in Transformers" :)

Do you have a working code example?

RelaxML org

Yes, in our GitHub repo: https://github.com/Cornell-RelaxML/quip-sharp. We use a modified version of the modeling_llama.py file to handle our quantized linear layers, which is why calling the default pipeline() without going through our repo will not work.
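
For concreteness, here is a minimal sketch of what generation through the repo might look like, run from inside a clone of quip-sharp with its requirements installed. The loader name and signature below (model_from_hf_path from the repo-internal lib.utils module) are an assumption based on the repo's eval scripts and may have changed; check the repo README for the current entry point.

import torch
from transformers import AutoTokenizer

# Assumption: quip-sharp exposes a loader that builds the modified Llama
# with quantized linear layers and returns the base-model name for the
# tokenizer. The name/signature may differ in the current repo.
from lib.utils.unsafe_import import model_from_hf_path

model, model_str = model_from_hf_path("relaxml/Llama-2-70b-E8P-2Bit")
tokenizer = AutoTokenizer.from_pretrained(model_str)

prompt = "It is a truth universally acknowledged that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))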

Thank you!

dmiche changed discussion status to closed

@at676, I would appreciate it if you could provide a Colab link for QuIP# inference.
