How to use litert-community/VibeThinker-3B with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=litert-community/VibeThinker-3B \
  model.litertlm \
  --prompt="Write me a poem"
VibeThinker-3B: LiteRT-LM (blockwise int4)
WeiboAI/VibeThinker-3B converted to the
LiteRT-LM (.litertlm) format for on-device inference with Google's
LiteRT-LM runtime (the engine behind the official
litert-community/* models).
VibeThinker-3B is a dense 3B math/reasoning model (Qwen2ForCausalLM, 36 layers). It solves
problems with an inline chain-of-thought and is strong at arithmetic and math word problems. It uses the standard
Qwen2 architecture, so it works with the existing converter and runtime without custom changes.
| Property | Value |
|---|---|
| File | model.litertlm, int4 block 32 (~1.9 GB) |
| Quantization | int4 weights (symmetric) + OCTAV optimal-clipping; embeddings INT8 (externalized section) |
| Compute | integer |
| Context (KV cache) | 4096 |
| Base model | WeiboAI/VibeThinker-3B |
| Decode speed | iPhone 17 Pro (GPU) · ~87–93 tok/s (Mac M-series, GPU) |
⚠️ It's a reasoning model: give it room to think
VibeThinker works through a step-by-step chain-of-thought, then gives a \boxed{} answer. Run it with
max_tokens ≥ 2048; with a shorter limit the output gets cut off before the answer. (All quality numbers
below were measured at 2048.)
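Because the final answer arrives inside \boxed{} at the end of the reasoning, a small helper can pull it out and also flag truncated generations. This is a minimal sketch, not the extraction code used for the numbers in this card; the function name `extract_boxed_answer` is hypothetical.

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    None means either no \\boxed{ marker was found or the last one was
    never closed, which is what a generation cut off by a low
    max_tokens limit typically looks like.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced: output was truncated


if __name__ == "__main__":
    sample = "The ball costs x, the bat x + 1.00, so 2x = 0.10. \\boxed{0.05}"
    print(extract_boxed_answer(sample))
```

A `None` result is a useful signal to retry with a larger token limit rather than to score the sample as wrong.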
Quality: GSM8K parity
Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought, max_tokens 2048, identical prompt and answer-extraction for every row).
| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 97.0% |
| LiteRT int4, block 32 | 90.0% (−7 pt) |
int4 (block 32) is close to parity: it lands 7 points below bf16 and still reaches 90%, which is strong for an on-device math model. bf16's 97% reflects this model's math specialization.
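With n=100, each point in the table corresponds to a single problem, so the scores carry noticeable sampling uncertainty. As a rough aid for reading them, the sketch below computes a 95% Wilson score interval for each row. The interval method is our choice for illustration and is not part of the original evaluation; the counts 97 and 90 are derived from the percentages and n=100 stated above.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Counts derived from the table: 97.0% and 90.0% of n=100 problems.
    for name, correct in [("bf16", 97), ("int4 block 32", 90)]:
        lo, hi = wilson_interval(correct, 100)
        print(f"{name}: {correct}/100, 95% interval {lo:.3f} to {hi:.3f}")
```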
Why block 32 (not block 128)? This is a precision-sensitive math model: the coarser block-128 int4 (¼ the dequant scales) collapsed to 64% (−33 pt) on GSM8K, while block 32 holds at 90%. So only the block-32 build is published. (Note: for general-purpose 4B reasoning models the opposite holds, and block 128 is fine and faster, but exact arithmetic needs the finer block-32 grid.)
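To illustrate why a finer block can help, here is a minimal pure-Python sketch of symmetric blockwise int4 quantization with max-abs scaling. It is not the litert-torch implementation and omits OCTAV clipping; the function names and the toy weight vector (small values plus one outlier) are made up for illustration.

```python
from typing import List, Tuple


def quantize_blockwise_int4(weights: List[float], block_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric int4 quantization with one max-abs scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize(codes: List[int], scales: List[float], block_size: int) -> List[float]:
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy data: 128 small weights with a single large outlier in the first block.
weights = [((i % 16) - 8) / 100.0 for i in range(128)]
weights[5] = 8.0

codes32, scales32 = quantize_blockwise_int4(weights, 32)
codes128, scales128 = quantize_blockwise_int4(weights, 128)
err32 = mse(weights, dequantize(codes32, scales32, 32))
err128 = mse(weights, dequantize(codes128, scales128, 128))

print(f"block 32:  {len(scales32)} scales, MSE {err32:.6f}")
print(f"block 128: {len(scales128)} scales, MSE {err128:.6f}")
```

With block 128 the single outlier sets the scale for all 128 weights, so the small values round to zero. With block 32 the outlier only affects its own block, at the cost of storing four times as many scales.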
Usage
# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
--model_path model.litertlm \
--backend gpu \
--input_prompt "A bat and a ball cost \$1.10. The bat costs \$1.00 more than the ball. How much is the ball?"
The .litertlm bundle carries the tokenizer and prompt template (Qwen2 ChatML:
<|im_start|>role\n…<|im_end|>), so no separate tokenizer files are needed.
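For readers who want to see what the ChatML layout looks like once expanded, the sketch below renders a message list into that format. It is illustrative only: the runtime applies the bundled template for you, the exact template in the bundle may differ (for example, it may add a default system message), and `format_chatml` is a hypothetical helper name.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages as <|im_start|>role\\ncontent<|im_end|> turns."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = format_chatml([
        {"role": "user", "content": "How much is 17 * 23?"},
    ])
    print(prompt)
```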
Run on Android
The official Google AI Edge Gallery app runs
.litertlm models on-device:
- Install a recent Gallery (package com.google.ai.edge.gallery; 1.0.15+ supports .litertlm).
- Download model.litertlm and push it: adb push model.litertlm /sdcard/Download/
- In the app tap +, pick the file, choose the GPU backend, and raise the max-tokens setting (≥2048).
- Chat. The bundle already carries the tokenizer and Qwen2 chat template.
Run on iPhone
Verified on iPhone 17 Pro (LiteRT-LM Swift runtime): the block-32 build (1.62 GiB section, under the iOS limit) loads and generates correct answers.
Conversion
Converted with the official litert-torch
converter as a standard Qwen2ForCausalLM, with no custom graph code. Recipe: blockwise-32 int4 + OCTAV
(INT4 weights, block 32, symmetric, OCTAV optimal-clipping), embeddings INT8, KV cache 4096.
from litert_torch.generative.export_hf.export import export
export(
model="WeiboAI/VibeThinker-3B",
output_dir="out",
quantization_recipe="qwen3_int4_block32_octav.json", # blockwise-32 int4 + OCTAV, int8 embeddings
cache_length=4096,
externalize_embedder=True,
)
License
MIT, inherited from the base model WeiboAI/VibeThinker-3B.