Lance LLM (understanding path) — MLX 4-bit

MLX-format 4-bit quantization of the understanding-path language model extracted from bytedance-research/Lance. Runs on Apple Silicon (M1/M2/M3/M4) via mlx-lm.

What's quantized

Lance ships a custom modified Qwen2.5-VL with Mixture-of-Tasks routing: understanding tokens flow through one set of layer weights, generation tokens flow through _moe_gen siblings. This checkpoint contains only the understanding path weights, re-packaged as a standard Qwen2 LLM so mlx-lm accepts it.

That means:

  • ✓ Text generation, instruction-following, VQA on text-only (use vision via the original Lance pipeline)
  • ✗ Image/video generation, which lives in the _moe_gen path (separate quantization, not in this repo)

The Lance team's actual full inference loop is the canonical way to use this model — these MLX weights are most useful for text-decoder-only experimentation and for benchmarking the LLM half independently.

Variants in this repo family

Repo Format Group size Bits/weight DWQ refined
…-MLX-4bit affine INT4 64 4.50 no
…-MLX-4bit-DWQ affine INT4 (distilled) 64 4.50 yes
…-MLX-NVFP4 NVFP4 (E2M1) 16 4.50 no

DWQ (Distillation-aware Weight Quantization) optimises the per-group scales/biases via KL-divergence distillation from the bf16 teacher. Typically recovers ~0.6 bits-per-weight of quality vs plain post-training quantization at the same bit budget.

Usage

from mlx_lm import load, generate

model, tokenizer = load("Reza2kn/Lance-3B-Video-und-MLX-4bit-DWQ")
print(generate(
    model, tokenizer,
    prompt="What is the capital of France?",
    max_tokens=64, verbose=True,
))

Or via CLI:

mlx_lm.generate --model Reza2kn/Lance-3B-Video-und-MLX-4bit-DWQ \
    --prompt "Describe Persian cuisine in one paragraph."

Extraction recipe

# scripts/extract_und_to_qwen.py from https://github.com/Reza2kn/lance-quant
python extract_und_to_qwen.py \
    --src downloads/Lance_3B_Video/model.safetensors \
    --llm_config downloads/Lance_3B_Video/llm_config.json \
    --tokenizer_src downloads/Lance_3B_Video \
    --out Lance_3B_Video-und-qwen \
    --variant und

# Then drop the qk_norm weights (mlx-lm's Qwen2 doesn't have them) and
# convert to MLX 4-bit
mlx_lm.convert --hf-path Lance_3B_Video-und-qwen \
    --mlx-path Lance_3B_Video-und-MLX-4bit \
    -q --q-bits 4 --q-group-size 64

# Optional: DWQ refinement
mlx_lm.dwq --model Lance_3B_Video-und-qwen \
    --quantized-model Lance_3B_Video-und-MLX-4bit \
    --mlx-path Lance_3B_Video-und-MLX-4bit-DWQ \
    --bits 4 --group-size 64 --num-samples 256

Limitations

  • Only the understanding path is quantized here. Image/video generation uses _moe_gen weights which aren't in this checkpoint.
  • The qk_norm weights from the original Lance modified-Qwen2.5-VL had to be dropped (mlx-lm's qwen2 model class doesn't define them). Small but measurable quality cost vs the original FP32.
  • Vision encoding (ViT, Wan VAE) must come from the original Lance pipeline.

For full multimodal use, see the AWQ INT4 / NVFP4 sibling repos which preserve the entire Lance architecture:

Reproduction toolkit: https://github.com/Reza2kn/lance-quant

License

Apache 2.0, inherited from the base model.

Downloads last month
77
Safetensors
Model size
0.5B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Reza2kn/Lance-3B-Video-und-MLX-4bit

Quantized
(16)
this model