Is it possible to run inference on MiniMonkey without using Flash attention?

#2
by prashjeev - opened

I have an old GPU, and I keep getting the following error:
"FlashAttention only supports Ampere GPUs or newer."
I can usually bypass this by setting
attn_implementation='eager' or use_flash_attn=False
but neither of these has worked so far. I am looking at the code in detail, but before I spend time going through it, I would like to know whether it is possible at all to run inference on this model without using Flash Attention.
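For reference, what I normally do for other models looks roughly like the sketch below (a minimal illustration from memory, not MiniMonkey-specific; which keyword a given remote-code repo honors varies), and it did not help here:

import torch
from transformers import AutoModel

# Usual workaround on pre-Ampere GPUs: request the eager attention
# implementation directly in from_pretrained (some remote-code repos
# expect a use_flash_attn=False keyword instead).
model = AutoModel.from_pretrained(
    'mx262/MiniMonkey',
    attn_implementation='eager',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()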

Owner

Hi~, you can set this line to 'eager' to run inference on the model without using Flash Attention.

Thank you for the suggestion. I did what you suggested, and it turns out you also have to change use_flash_attn to False.
Below is the code that worked for me. I hope it helps anyone else who runs into a similar error.

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

path = 'mx262/MiniMonkey'
config, unused_kwargs = AutoConfig.from_pretrained(path, trust_remote_code=True, return_unused_kwargs=True)

# Modify 'attn_implementation' inside 'llm_config' and disable Flash Attention
# in the vision tower before loading the weights.
config.llm_config.attn_implementation = "eager"
config.vision_config.use_flash_attn = False

model = AutoModel.from_pretrained(
    path,
    config=config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
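As an optional sanity check, you can print the settings back from the loaded model (this assumes the llm_config and vision_config sub-configs, and the attributes set above, are kept as-is on model.config, which should be the case since the config object is passed in directly):

# Sanity check: confirm the attention settings actually took effect
# (attribute names assumed to survive loading unchanged).
print(model.config.llm_config.attn_implementation)  # expect 'eager'
print(model.config.vision_config.use_flash_attn)    # expect False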

prashjeev changed discussion status to closed
