FA Increases possible context length @Q4

#1 opened by saishf

Using the Flash Attention implementation in KoboldCPP, it is possible to fit 16K context @Q4_K_M into 8GB of VRAM.
When running the display on an iGPU, I can fit 16K @Q5_K_S with FA and a 512 batch size into 8GB.

For the usual use case of a monitor running on the GPU, it's still possible; this was tested with one monitor on my GPU at 16K context.
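For anyone wanting to reproduce this, a launch along these lines should give the same setup. A minimal sketch, assuming a recent KoboldCPP build that exposes the --flashattention flag; the GGUF path and --gpulayers value are placeholders:

```python
# Sketch: launching KoboldCPP with flash attention and 16K context via its CLI.
# Assumes a recent build with the --flashattention flag; the model path,
# --gpulayers count and other values are placeholders for your own setup.
import subprocess

cmd = [
    "python", "koboldcpp.py",
    "--model", "model-Q4_K_M.gguf",  # placeholder GGUF path
    "--usecublas",                   # CUDA backend
    "--gpulayers", "99",             # offload as many layers as fit
    "--contextsize", "16384",        # 16K context
    "--blasbatchsize", "512",        # the 512 batch size mentioned above
    "--flashattention",              # enable flash attention
]
subprocess.run(cmd, check=True)
```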

And FA support for cards without tensor cores is coming: https://github.com/LostRuins/koboldcpp/issues/844

That's great news, hurray. I'll keep my usual recommendation for now because of the Tensor Core requirement, but if that's lifted I'll add it as an additional recommendation, provided speeds are good.

Just adding a small data point: with KoboldCPP compiled with this, running a Q8_K 11B model on a 2x 1080 Ti (Pascal) setup, I get:

~20.2 T/s avg (proc + gen) with FP32 FA enabled.
~13.4 T/s avg (proc + gen) with FP32 FA disabled.
So a significant improvement in my case, whereas with FP16 FA I saw a decrease. It definitely has utility for a subset of users.

This and the PR graphs look very promising!

Using Nexesenex's KCPP since it already merged this; things look good, performance is good, and it seems to work well.

I've only seen a slight increase in processing speed with FA, from about 1K T/s to 1.1K T/s when ingesting 8K context (Turing).
I imagine it'll be a big deal for Pascal users though.
I'm trying out how well it squishes Phi-3's context now.

16K without FA: (screenshot)
16K with FA: (screenshot)
Phi-3 is cursed with insane memory usage; it's worse than Llama 3 and somehow uses about 2GB of extra VRAM.

Does phi-3 have GQA?

I guess this might be the cause?
Hardware

Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

NVIDIA A100
NVIDIA A6000
NVIDIA H100

If you want to run the model on:

NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with attn_implementation="eager"
Optimized inference on GPU, CPU, and Mobile: use the ONNX models 128K
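
In code, the fallback the card describes would look roughly like this. A sketch only: the model ID, dtype, and device_map are the usual transformers boilerplate assumed here, not something stated in this thread; the only part the card specifies is attn_implementation="eager".

```python
# Sketch: loading Phi-3-mini without flash attention on older GPUs (e.g. V100),
# following the quoted model-card instructions. Model ID and dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # plain attention, no flash-attention kernels
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```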

PS: I don't know how to check for GQA. There's nothing stated on the model card.
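
One way to check is to compare head counts in the model's config: if num_key_value_heads is smaller than num_attention_heads, the model uses GQA and the KV cache shrinks by that ratio. A sketch, assuming the usual transformers config field names (some architectures name or omit them differently) and an fp16 KV cache; gated models like Llama 3 need an accepted license/token on the Hub:

```python
# Sketch: check for GQA and estimate KV-cache size from a model's config.
# Assumes the common transformers field names and fp16 (2-byte) KV entries.
from transformers import AutoConfig

def kv_cache_report(model_id: str, ctx: int = 16384, bytes_per_elem: int = 2):
    cfg = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    n_heads = cfg.num_attention_heads
    n_kv = getattr(cfg, "num_key_value_heads", n_heads)  # missing field -> full MHA
    head_dim = cfg.hidden_size // n_heads
    # 2 tensors (K and V) x layers x KV heads x head dim x context x bytes
    kv_bytes = 2 * cfg.num_hidden_layers * n_kv * head_dim * ctx * bytes_per_elem
    print(f"{model_id}: {n_heads} heads / {n_kv} KV heads "
          f"({'GQA' if n_kv < n_heads else 'no GQA'}), "
          f"~{kv_bytes / 2**30:.1f} GiB KV cache at {ctx} context")

kv_cache_report("microsoft/Phi-3-mini-128k-instruct")
kv_cache_report("meta-llama/Meta-Llama-3-8B-Instruct")  # gated; needs an HF token
```

If the KV-head count comes back equal to the attention-head count, that alone would explain Phi-3 needing noticeably more VRAM at the same context than a GQA model like Llama 3.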

I'm unsure if this is caused by Flash Attention F32, but Llama 3 is suddenly running at 50+ T/s?
These are fresh responses too, not regens.

It's kinda insane 😭

I noticed some higher token numbers too, but didn't compare directly to get accurate measurements. It's at least not worse, and bigger context for the same amount of VRAM is a win-win if the quality remains the same.

I can at least continue to act smug over the EXL2 users and cope that LlamaCpp is the best thing to ever exist.

I haven't noticed any degradation in context quality, and no issues relating to context with FA have been opened on the official KoboldCPP repo.

I used to get about 35 T/s when Llama 3 first came out, and that was at 8K context. So there have been major improvements somewhere? I've made no hardware changes at all.

Is it possible that the gains are also from CUDA 12?

Or did you test against CUDA 12 koboldcpp?

Old testing was done with the CUDA 12 Nexesenex forks (their forks have been on cuBLAS 12+ since around v1.58?)
New testing uses the Nexesenex forks too, with cuBLAS 12.2.

Looks like I gotta compile this and test, just to see how my speeds are. I'm on Pascal, so once I read about it I was super excited.

@Ardvark123 You can test quickly using Nexesenex's KoboldCpp; it's good news indeed.

I was using my Pascal with CUDA 12.2 before and it was good, but these additional PRs for speedups are great. I'll try later, but if it's even faster that's crazy.

We're eating good, boys.

@saishf , I'll open a new discussion in LLM-Discussions for this topic, to keep things organized.

Since this seems quite relevant I'll move things to here so it's better documented:

https://huggingface.co/LWDCLS/LLM-Discussions/discussions/11

Lewdiculous changed discussion status to closed
