I'm running out of memory while generating on an RTX A5000 (24 GB)

#83
opened by xsanskarx

It runs out of memory every time

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.callbacks import StdOutCallbackHandler
from langchain.chains import RetrievalQA

model_name_or_path = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1048,
    return_full_text=False,
    temperature=0.3,
    do_sample=True,
)

llm = HuggingFacePipeline(pipeline=pipe)
torch.cuda.empty_cache()

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,  # ensemble_retriever and custom_prompt are defined elsewhere in my code
    callbacks=[handler],
    chain_type_kwargs={"prompt": custom_prompt},
    return_source_documents=True,
)

Install flash-attn
!pip install flash-attn --no-build-isolation

Add 'attn_implementation="flash_attention_2"' to your AutoModelForCausalLM.from_pretrained arguments.
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True, attn_implementation="flash_attention_2")

Flash Attention helps reduce memory usage; it cut my VRAM usage by about 10 GB when quantizing with GPTQ.
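
For reference, a rough sketch of quantizing with GPTQ while keeping Flash Attention 2 enabled (the bits/dataset values here are illustrative, and on-the-fly quantization also needs the auto-gptq and optimum packages installed):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name_or_path = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Illustrative settings: 4-bit GPTQ with a c4 calibration set.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    quantization_config=gptq_config,
)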

I'm also facing a similar issue with a V100 GPU.

I tried attn_implementation="flash_attention_2" as suggested, but I am getting the following error:

ValueError: The current flash attention version does not support sliding window attention.

Based on my research, you are supposed to install flash-attn separately, but I already did that (and restarted my kernel) and am still getting the error.

Name: flash-attn
Version: 2.5.9.post1

from transformers.utils import is_flash_attn_2_available
is_flash_attn_2_available()
True

Uninstall and reinstall flash-attn:

pip list | grep flash
pip uninstall flash-attn
pip install flash-attn --no-build-isolation
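
Then confirm the reinstalled wheel actually imports in the current kernel (the module name is flash_attn):

python -c "import flash_attn; print(flash_attn.__version__)"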

Looking at the Phi3 modeling implementation in transformers, the ValueError is raised when the installed flash-attn build does not support sliding-window attention (the same code path also notes that Phi3FlashAttention2 does not support output_attentions):

# Phi3FlashAttention2 attention does not support output_attentions

        if not _flash_supports_window_size:
            logger.warning_once(
                "The current flash attention version does not support sliding window attention. Please use `attn_implementation='eager'` or upgrade flash-attn library."
            )
            raise ValueError("The current flash attention version does not support sliding window attention.")
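
You can reproduce that check outside of transformers to confirm whether your installed flash-attn build is the problem; a quick diagnostic sketch (assuming flash-attn imports cleanly):

import inspect
from flash_attn import flash_attn_func

# Mirrors the sliding-window check in the modeling code: support is detected by
# the presence of a `window_size` argument on flash_attn_func.
print("window_size" in inspect.signature(flash_attn_func).parameters)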

Are you trying to run Phi3 causally or with some other method, like Seq2Seq?

Thank you. I loaded an entirely new kernel and got that part resolved, but then discovered that my NVIDIA V100 GPU is not supported by Flash Attention.

I am using Phi3 causally for this. Well, I guess I will continue researching on my own and perhaps open a new conversation, as I don't want to hijack the OP's post.
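
For anyone else who lands here with a pre-Ampere card like the V100: the fallback named in the warning above, attn_implementation="eager", loads without flash-attn. A minimal sketch, with float16 as an illustrative dtype since the default load dtype is float32:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "microsoft/Phi-3-mini-128k-instruct"

# Eager attention does not require flash-attn or an Ampere-class GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="eager",
    torch_dtype=torch.float16,  # illustrative; avoids loading weights in float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)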
