VibeThinker-3B β€” LiteRT-LM (blockwise int4)

WeiboAI/VibeThinker-3B converted to the LiteRT-LM (.litertlm) format for on-device inference with Google's LiteRT-LM runtime (the engine behind the official litert-community/* models).

VibeThinker-3B is a dense 3B math/reasoning model (Qwen2ForCausalLM, 36 layers) β€” it solves problems with an inline chain-of-thought and is strong at arithmetic and math word problems. Standard Qwen2 architecture, so it rides the existing converter and runtime directly.

File model.litertlm β€” int4 block 32 (~1.9 GB)
Quantization int4 weights (symmetric) + OCTAV optimal-clipping; embeddings INT8 (externalized section)
Compute integer
Context (KV cache) 4096
Base model WeiboAI/VibeThinker-3B
Decode speed iPhone 17 Pro (GPU) Β· ~87–93 tok/s (Mac M-series, GPU)

⚠️ It's a reasoning model β€” give it room to think

VibeThinker solves with a step-by-step chain-of-thought, then a \boxed{} answer. Run it with max_tokens β‰₯ 2048 β€” at a short limit it gets cut off before the answer. (All quality numbers below were measured at 2048.)

Quality β€” GSM8K parity

Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought, max_tokens 2048, identical prompt and answer-extraction for every row).

Configuration GSM8K
bf16 (reference) 97.0%
LiteRT int4 β€” block 32 90.0% (βˆ’7 pt)

int4 (block 32) is at parity (βˆ’7 pt) and still 90% β€” strong for an on-device math model. bf16's 97% reflects this model's math specialization.

Why block 32 (not block 128)? This is a precision-sensitive math model: the coarser block-128 int4 (ΒΌ the dequant scales) collapsed to 64% (βˆ’33 pt) on GSM8K, while block 32 holds at 90%. So only the block-32 build is published. (Note: for general-purpose 4B reasoning models the opposite holds β€” block 128 is fine and faster β€” but exact arithmetic needs the finer block-32 grid.)

Usage

# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "A bat and a ball cost \$1.10. The bat costs \$1.00 more than the ball. How much is the ball?"

The .litertlm bundle carries the tokenizer and prompt template (Qwen2 ChatML β€” <|im_start|>role\n…<|im_end|>), so no separate tokenizer files are needed.

Run on Android

The official Google AI Edge Gallery app runs .litertlm models on-device:

  1. Install a recent Gallery (package com.google.ai.edge.gallery, 1.0.15+ supports .litertlm).
  2. Download model.litertlm and push it: adb push model.litertlm /sdcard/Download/
  3. In the app tap +, pick the file, choose the GPU backend, and raise the max-tokens setting (β‰₯2048).
  4. Chat β€” the bundle already carries the tokenizer and Qwen2 chat template.

Run on iPhone

Verified on iPhone 17 Pro (LiteRT-LM Swift runtime): the block-32 build (1.62 GiB section, under the iOS limit) loads and generates correct answers.

Conversion

Converted with the official litert-torch converter β€” a standard Qwen2ForCausalLM, no custom graph code. Recipe: blockwise-32 int4 + OCTAV (INT4 weights, block 32, symmetric, OCTAV optimal-clipping), embeddings INT8, KV cache 4096.

from litert_torch.generative.export_hf.export import export
export(
    model="WeiboAI/VibeThinker-3B",
    output_dir="out",
    quantization_recipe="qwen3_int4_block32_octav.json",  # blockwise-32 int4 + OCTAV, int8 embeddings
    cache_length=4096,
    externalize_embedder=True,
)

License

MIT, inherited from the base model WeiboAI/VibeThinker-3B.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for litert-community/VibeThinker-3B

Base model

Qwen/Qwen2.5-3B
Finetuned
(21)
this model