VibeThinker-3B — LiteRT-LM

A LiteRT-LM (.litertlm) conversion of WeiboAI/VibeThinker-3B, a reasoning model built on the Qwen2.5-3B architecture, for on-device and edge inference with Google AI Edge LiteRT-LM.

Files

file	size	notes
`vibethinker3b_q8_ekv8192_lora16.litertlm`	~3.4 GB	prefill+decode, int8 weights, 8192 ctx, runtime-swappable LoRA (rank 16)

Conversion details

Source: WeiboAI/VibeThinker-3B (Qwen2.5-3B architecture: 36 layers, hidden size 2048, 16 query heads / 2 KV heads, vocab 151936)
Tool: litert-torch 0.9.0 generative converter (examples.qwen.convert_to_tflite)
Quantization: dynamic_int8 (int8 weights / fp32 activations)
Context / KV cache: 8192 tokens (chosen for long reasoning traces)
Signatures: prefill_256, decode, plus LoRA-enabled prefill_256_lora_r16, decode_lora_r16
Metadata: model type qwen2p5, HF tokenizer, Qwen2.5 chat template embedded, stop tokens <|im_end|> (151645) / <|endoftext|> (151643)
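
For a rough sense of what the 8192-token KV cache costs in memory, the dimensions listed above can be plugged into the usual KV-cache size formula. This is a back-of-the-envelope sketch only: head_dim = 2048 / 16 = 128 is derived from the listed dimensions, and the element widths are assumptions, since the cache dtype used by the runtime is not stated here.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem):
    """Size of a full KV cache: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Dimensions taken from the conversion details above.
LAYERS, HIDDEN, Q_HEADS, KV_HEADS, CTX = 36, 2048, 16, 2, 8192
HEAD_DIM = HIDDEN // Q_HEADS  # 128

# Assumed element widths; the actual cache dtype is not documented here.
for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    mib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, width) / 2**20
    print(f"{name}: {mib:.0f} MiB")
```

With only 2 KV heads, a completely filled cache works out to 576 MiB at 4 bytes per element, and proportionally less at narrower widths.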

LoRA note: the rank-16 LoRA signatures target the q/k/v/o projections and let the LiteRT-LM runtime load/swap a fine-tuned adapter at init (EngineSettings::SetScopedLoraFile). Exporting these required fixing a grouped-query-attention out-dim bug in litert-torch's lora.py (reported upstream: litert-torch#1066).
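
To illustrate why grouped-query attention matters for adapter shapes, the sketch below computes the rank-16 LoRA matrix shapes for each targeted projection from the model dimensions given in this card. It is not the litert-torch code and does not reproduce the upstream bug; the (rank, in_dim) / (out_dim, rank) layout for the A and B matrices is an assumed convention.

```python
def lora_shapes(hidden, q_heads, kv_heads, rank):
    """Return {projection: (A_shape, B_shape)} for q/k/v/o under GQA."""
    head_dim = hidden // q_heads
    q_out = q_heads * head_dim    # full width: 16 * 128 = 2048
    kv_out = kv_heads * head_dim  # narrower under GQA: 2 * 128 = 256
    dims = {
        "q": (hidden, q_out),
        "k": (hidden, kv_out),
        "v": (hidden, kv_out),
        "o": (q_out, hidden),
    }
    # A maps in_dim -> rank, B maps rank -> out_dim (assumed layout).
    return {name: ((rank, i), (o, rank)) for name, (i, o) in dims.items()}

shapes = lora_shapes(hidden=2048, q_heads=16, kv_heads=2, rank=16)
for name, (a, b) in shapes.items():
    print(name, "A:", a, "B:", b)

# Total adapter parameters across all 36 layers.
per_layer = sum(a[0] * a[1] + b[0] * b[1] for a, b in shapes.values())
print("adapter params:", per_layer * 36)
```

The k and v projections have an output width of 256 rather than 2048, so an exporter that assumes every projection's out-dim equals the hidden size would produce mismatched B matrices for them.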

Usage

Run with the LiteRT-LM runtime (litert_lm_main / engine API):

litert_lm_main --backend=cpu --model_path=vibethinker3b_q8_ekv8192_lora16.litertlm
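
When scripting runs, the same invocation can be assembled in Python. This is a minimal sketch: build_command is a hypothetical helper, and the --input_prompt flag is an assumption about the litert_lm_main build in use, so check the binary's help output before relying on it.

```python
import shlex

def build_command(model_path, backend="cpu", prompt=None, binary="litert_lm_main"):
    """Assemble an argv list for the litert_lm_main binary (hypothetical helper)."""
    args = [binary, f"--backend={backend}", f"--model_path={model_path}"]
    if prompt is not None:
        # Assumption: the binary accepts --input_prompt for a one-shot prompt.
        args.append(f"--input_prompt={prompt}")
    return args

cmd = build_command(
    "vibethinker3b_q8_ekv8192_lora16.litertlm",
    prompt="What is 17 * 23?",
)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or special characters.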

To attach a fine-tuned rank-16 LoRA adapter, convert it with litert_torch's LoRA.from_safetensors(...).to_tflite() and load the resulting file via the runtime's scoped-LoRA API.

License

MIT, inherited from the base model WeiboAI/VibeThinker-3B.
