kernelpool/LongCat-2.0-3bit-UVMAX

Mixed-precision (UVMAX) quantization of meituan-longcat/LongCat-2.0-FP8, converted with mlx-lm.

What is UVMAX?

UVMAX is a mixed-precision scheme: bit widths are assigned per tensor class from measured round-trip quantization error, rather than uniformly. All classes use group size 64.
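The card does not say how UVMAX measures round-trip error or what threshold it applies. As a rough illustration only, the sketch below makes several assumptions. It uses a per-group min/max affine quantizer, which is similar in spirit to MLX's affine mode but is not the mlx-lm code. It measures relative RMS error and picks the smallest bit width (from the 3, 6 and 8 used in the table) whose error fits a hypothetical budget `max_error`. All function names are hypothetical.

```python
import math
import random


def quantize_roundtrip(weights, bits, group_size=64):
    """Quantize, then dequantize, a flat list using per-group min/max affine quantization."""
    levels = 2 ** bits - 1
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        for w in group:
            q = round((w - lo) / scale)      # integer code in [0, levels]
            restored.append(lo + q * scale)  # dequantized approximation of w
    return restored


def relative_rms_error(original, restored):
    """RMS of the round-trip error, relative to the RMS of the original weights."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored))
    den = sum(a ** 2 for a in original)
    return math.sqrt(num / den)


def choose_bits(weights, candidates=(3, 6, 8), max_error=0.05, group_size=64):
    """Smallest candidate bit width whose round-trip error fits the (hypothetical) budget."""
    for bits in sorted(candidates):
        restored = quantize_roundtrip(weights, bits, group_size)
        if relative_rms_error(weights, restored) <= max_error:
            return bits
    return max(candidates)  # nothing fits the budget: fall back to the widest width


# Synthetic stand-in for one tensor class (not real LongCat weights).
random.seed(0)
tensor = [random.gauss(0.0, 0.02) for _ in range(4096)]
for bits in (3, 6, 8):
    err = relative_rms_error(tensor, quantize_roundtrip(tensor, bits))
    print(f"{bits}-bit relative RMS error: {err:.4f}")
print("chosen bit width:", choose_bits(tensor))
```

On synthetic Gaussian weights the error shrinks as the bit width grows, so a looser `max_error` selects fewer bits. A real pipeline would presumably aggregate the error over every tensor in a class rather than over a single synthetic tensor.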

| Tensor class | Bits | Parameters | Size | Share of total size |
|---|---|---|---|---|
| Expert FFNs | 3 | 1.47 T | 598.5 GiB | 88.0% |
| N-gram embedding tables | 3 | 135 B | 55.0 GiB | 8.1% |
| Attention, dense MLPs | 6 | 31.4 B | 22.9 GiB | 3.4% |
| Embeddings, lm_head | 6 | 2.8 B | 2.1 GiB | 0.3% |
| DSA indexer, MoE routers | 8 | 0.6 B | 0.6 GiB | 0.1% |
| Norms, correction biases | unquantized | | 0.9 GiB | 0.1% |
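As a rough sanity check on the Size column, a group-quantized weight costs about `bits + 32/64 = bits + 0.5` bits. This assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias, which is roughly how MLX's affine mode stores BF16 weights. Multiplying by the parameter counts lands within about 1 GiB of each listed size. The counts in the table are rounded, so exact agreement is not expected. The helper name `estimated_gib` is hypothetical.

```python
GiB = 2 ** 30

# (bits, parameter count) per quantized tensor class, copied from the table
classes = {
    "Expert FFNs": (3, 1.47e12),
    "N-gram embedding tables": (3, 135e9),
    "Attention, dense MLPs": (6, 31.4e9),
    "Embeddings, lm_head": (6, 2.8e9),
    "DSA indexer, MoE routers": (8, 0.6e9),
}


def estimated_gib(bits, params, group_size=64, scale_bits=16, bias_bits=16):
    """Packed weights plus one scale and one bias per group, expressed in GiB."""
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return params * bits_per_weight / 8 / GiB


for name, (bits, params) in classes.items():
    print(f"{name:26s} {bits} bits  ~{estimated_gib(bits, params):6.1f} GiB")
```

The per-group overhead is why 3-bit experts cost closer to 3.5 bits per weight on disk. A larger group size would shrink that overhead, at the cost of coarser scales.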

Use with mlx

This model requires LongCat-2.0 support from mlx-lm PR #1464, which has not yet been merged. Until it is included in an mlx-lm release, install mlx-lm from the PR branch:

```shell
pip install git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1464/head
```

```python
from mlx_lm import load, generate

model, tokenizer = load("kernelpool/LongCat-2.0-3bit-UVMAX")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
Safetensors · Model size: 1.6T params · Tensor types: BF16, U32, F32 · Library: MLX