Command-A-Plus-Lite (int2 experts / int4 resident)

Pre-quantized weights for running Cohere's Command-A-Plus (218B-parameter Mixture-of-Experts, 25B active) on a single 24GB GPU.

Component	Precision	Where
Routed experts (128/layer)	int2, group-wise (g=64)	CPU RAM, streamed per active expert
Attention q/k/v/o + shared experts + embedding	int4, group-wise (g=64)	GPU-resident
Router gate / layernorms	fp16	GPU-resident
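
The "group-wise (g=64)" entries in the table above can be illustrated with a minimal sketch of round-to-nearest quantization applied to one group of 64 weights at a time. This assumes an asymmetric scheme with one scale and one minimum per group; the card does not say whether the runtime uses symmetric or asymmetric codes or how it packs them, so the function names and layout here are hypothetical. It is written in dependency-free Python for clarity, not speed.

```python
import math

def quantize_group_rtn(values, bits):
    """Round-to-nearest quantization of one group (assumed asymmetric min/max scheme)."""
    qmax = (1 << bits) - 1                      # 3 for int2, 15 for int4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Map each value to the nearest integer code and clamp to the valid range.
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def quantize_rtn(weights, bits, group_size=64):
    """Split a flat weight list into groups and quantize each independently."""
    return [quantize_group_rtn(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

def dequantize(groups):
    """Reconstruct approximate weights from (codes, scale, lo) triples."""
    out = []
    for codes, scale, lo in groups:
        out.extend(c * scale + lo for c in codes)
    return out

def max_abs_error(weights, groups):
    return max(abs(w - r) for w, r in zip(weights, dequantize(groups)))

# Toy weights: two groups of 64 values each.
weights = [math.sin(i) for i in range(128)]
groups_int2 = quantize_rtn(weights, bits=2)
groups_int4 = quantize_rtn(weights, bits=4)
err_int2 = max_abs_error(weights, groups_int2)
err_int4 = max_abs_error(weights, groups_int4)
```

Within each group the reconstruction error is at most half the group's scale, which is why int2 (4 levels per group) loses noticeably more precision than int4 (16 levels per group) on the same weights.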

weights on disk      ~67 GB
resident VRAM        ~8.4 GB
host RAM (pinned)    ~61 GB   (peaks ~108 GB during load)
decode speed         ~0.3 tok/s   (single 24GB GPU, --pin --gemlite)

Decode is transfer-bound (CPU→GPU expert streaming dominates), so this is a capacity play — fitting a 218B model on one 24GB card — not a throughput one.
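
As a rough sanity check on the storage figures above, the effective bits per weight under group-wise quantization can be estimated from the code width plus per-group metadata. The sketch below assumes one fp16 scale and one fp16 zero-point per group of 64; the card does not state the metadata format, so these defaults are assumptions, and treating all 218B parameters as int2 is a simplification (the resident tensors are int4 or fp16).

```python
def bits_per_weight(bits, group_size=64, scale_bits=16, zero_bits=16):
    """Effective storage per weight: code bits plus amortized per-group metadata.

    scale_bits and zero_bits are assumed values, not taken from the runtime.
    """
    return bits + (scale_bits + zero_bits) / group_size

def size_gb(num_params, bits, **kwargs):
    """Approximate size in GB (10^9 bytes) for num_params quantized weights."""
    return num_params * bits_per_weight(bits, **kwargs) / 8 / 1e9

int2_bpw = bits_per_weight(2)      # 2 + 32/64 = 2.5 bits per weight
int4_bpw = bits_per_weight(4)      # 4 + 32/64 = 4.5 bits per weight
all_int2_gb = size_gb(218e9, 2)    # 218e9 * 2.5 / 8 / 1e9 = 68.125
```

Under these assumptions the estimate lands near the ~67 GB listed for the weights on disk, though the closeness should not be read as confirming the assumed metadata layout.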

Usage

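Clone the runtime from https://github.com/kizuna-intelligence/Command-A-Plus-Lite and install it in editable mode from the repository root:

git clone https://github.com/kizuna-intelligence/Command-A-Plus-Lite
cd Command-A-Plus-Lite
pip install -e ".[gemlite]"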
hf download kizuna-intelligence/Command-A-Plus-Lite --local-dir ./cmda_int4

import torch
from command_a_plus_lite import load_quantized

model = load_quantized("./cmda_int4", device="cuda:0", dtype=torch.float16,
                       pin_experts=True, use_gemlite=True)

The tokenizer is not included here — use the one from the base model CohereLabs/command-a-plus-05-2026.
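
A minimal sketch of loading that tokenizer with the Hugging Face transformers library follows. AutoTokenizer.from_pretrained is the standard transformers call; the helper names are hypothetical, and how the encoded tensors are passed to the model returned by load_quantized is not documented in this card, so that step is left out.

```python
# Repository id taken from this card; access may require accepting the base model's terms.
BASE_TOKENIZER_REPO = "CohereLabs/command-a-plus-05-2026"

def load_base_tokenizer(repo_id=BASE_TOKENIZER_REPO):
    """Fetch the base model's tokenizer (hypothetical helper, not part of the runtime)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)

def encode_prompt(tokenizer, prompt):
    """Tokenize a prompt into PyTorch tensors, ready to be moved to the model's device."""
    return tokenizer(prompt, return_tensors="pt")
```

Typical use would be `tok = load_base_tokenizer()` followed by `batch = encode_prompt(tok, "Hello")`; consult the runtime repository for the generation entry point.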

License

The model weights are governed by Cohere's license for Command-A-Plus. The runtime code is MIT-licensed (see the GitHub repository). The int2 routed experts are quantized with blind round-to-nearest (RTN), with no calibration data, so quality is below the bf16 original.

