sr-rai 
posted an update 4 days ago
ExLlamaV3 is out. And it introduces EXL3 - a new SOTA quantization format!

"The conversion process is designed to be simple and efficient and requires only an input model (in HF format) and a target bitrate. By computing Hessians on the fly and thanks to a fused Viterbi kernel, the quantizer can convert a model in a single step, taking a couple of minutes for smaller models, up to a few hours for larger ones (70B+) (on a single RTX 4090 or equivalent GPU.)"

Repo: https://github.com/turboderp-org/exllamav3



I must try it out right now. Downloading my first model with:

huggingface-cli download turboderp/Qwen2.5-14B-Instruct-exl3 --local-dir turboderp/Qwen2.5-14B-Instruct-exl3 --revision=6.0bpw
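For scripted workflows, the same snapshot can be fetched from Python with `huggingface_hub.snapshot_download` instead of the CLI. This is a minimal sketch using the same repo, revision, and local directory as the command above; the actual download is kept behind a main guard because it pulls several GB of weights:

```python
# Same repo/revision as the CLI command above; the 6.0bpw EXL3
# weights live on a separate branch of the model repo.
REPO_ID = "turboderp/Qwen2.5-14B-Instruct-exl3"
REVISION = "6.0bpw"
LOCAL_DIR = REPO_ID  # mirror the CLI's --local-dir layout


def download_kwargs():
    """Collect the snapshot_download arguments in one place."""
    return {
        "repo_id": REPO_ID,
        "revision": REVISION,
        "local_dir": LOCAL_DIR,
    }


if __name__ == "__main__":
    # Import here so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # This actually downloads the model (several GB) -- run deliberately.
    path = snapshot_download(**download_kwargs())
    print(f"Model downloaded to {path}")
```

Both routes put the files in the same place, so the CLI and the Python API are interchangeable here.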

But I don't see any server yet, and I need an API endpoint.

How does the speed compare to llama.cpp?
