VibeThinker-3B — LiteRT-LM

A LiteRT-LM (.litertlm) conversion of WeiboAI/VibeThinker-3B, a reasoning model built on the Qwen2.5-3B architecture, for on-device and edge inference with Google AI Edge LiteRT-LM.

Files

| File | Size | Notes |
|---|---|---|
| vibethinker3b_q8_ekv8192_lora16.litertlm | ~3.4 GB | prefill + decode, int8 weights, 8192-token context, runtime-swappable LoRA (rank 16) |

Conversion details

  • Source: WeiboAI/VibeThinker-3B (Qwen2.5-3B: 36 layers, hidden size 2048, 16 attention heads / 2 KV heads with grouped-query attention, vocab 151936)
  • Tool: litert-torch 0.9.0 generative converter (examples.qwen.convert_to_tflite)
  • Quantization: dynamic_int8 (int8 weights / fp32 activations)
  • Context / KV cache: 8192 tokens (chosen for long reasoning traces)
  • Signatures: prefill_256, decode, plus LoRA-enabled prefill_256_lora_r16, decode_lora_r16
  • Metadata: model type qwen2p5, HF tokenizer, Qwen2.5 chat template embedded, stop tokens <|im_end|> (151645) / <|endoftext|> (151643)
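
To give a rough sense of what an 8192-token KV cache costs on device, the sketch below estimates its size from the architecture numbers listed above. The head dimension is derived as hidden size divided by attention heads, and the cache element width is an assumption: this card does not state which dtype the runtime uses for the cache, so several widths are shown.

```python
# Architecture numbers taken from the "Source" bullet above.
NUM_LAYERS = 36
HIDDEN = 2048
NUM_HEADS = 16
NUM_KV_HEADS = 2
CONTEXT = 8192


def kv_cache_bytes(bytes_per_element: int, context: int = CONTEXT) -> int:
    """Estimate KV cache size for a given element width (an assumption)."""
    head_dim = HIDDEN // NUM_HEADS  # 2048 / 16 = 128
    # One key vector and one value vector per KV head, per layer, per token.
    elements_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim
    return elements_per_token * context * bytes_per_element


# The actual cache dtype is not stated in this card; compare a few options.
for name, width in (("fp32", 4), ("fp16", 2), ("int8", 1)):
    print(f"{name}: {kv_cache_bytes(width) / 2**20:.0f} MiB")
```

Because only 2 of the 16 heads carry keys and values, the cache is one eighth the size it would be with full multi-head attention at the same context length.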

LoRA note: the rank-16 LoRA signatures target the q/k/v/o projections and let the LiteRT-LM runtime load/swap a fine-tuned adapter at init (EngineSettings::SetScopedLoraFile). Exporting these required fixing a grouped-query-attention out-dim bug in litert-torch's lora.py (reported upstream: litert-torch#1066).
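
As an illustration of why grouped-query attention matters for adapter shapes, the following sketch computes the rank-16 LoRA matrix shapes one would expect for the q/k/v/o projections. It assumes the common LoRA layout (A is rank × in, B is out × rank) and that k and v project to NUM_KV_HEADS × head_dim outputs; it does not describe litert-torch's internal layout or the exact upstream fix.

```python
# Architecture numbers from the conversion details above.
HIDDEN = 2048
NUM_HEADS = 16
NUM_KV_HEADS = 2
NUM_LAYERS = 36
RANK = 16
HEAD_DIM = HIDDEN // NUM_HEADS  # 128


def lora_shapes(rank: int = RANK) -> dict:
    """Expected (A, B) shapes per projection, assuming A: rank x in, B: out x rank."""
    q_out = NUM_HEADS * HEAD_DIM      # 2048
    kv_out = NUM_KV_HEADS * HEAD_DIM  # 256, smaller than the hidden size under GQA
    in_dims = {"q": HIDDEN, "k": HIDDEN, "v": HIDDEN, "o": q_out}
    out_dims = {"q": q_out, "k": kv_out, "v": kv_out, "o": HIDDEN}
    return {
        name: {"A": (rank, in_dims[name]), "B": (out_dims[name], rank)}
        for name in in_dims
    }


def lora_param_count(rank: int = RANK) -> int:
    """Total adapter parameters across all layers under the same assumptions."""
    per_layer = sum(
        s["A"][0] * s["A"][1] + s["B"][0] * s["B"][1]
        for s in lora_shapes(rank).values()
    )
    return per_layer * NUM_LAYERS


for name, shapes in lora_shapes().items():
    print(name, shapes)
print("total adapter parameters:", lora_param_count())
```

Under these assumptions the k and v adapters have an output dimension of 256 rather than 2048, which is the kind of mismatch an exporter has to account for when the number of KV heads differs from the number of query heads.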

Usage

Run with the LiteRT-LM runtime (litert_lm_main / engine API):

litert_lm_main --backend=cpu --model_path=vibethinker3b_q8_ekv8192_lora16.litertlm
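
For scripted use, the same invocation can be assembled from Python. This is a minimal sketch that uses only the two flags shown above; it assumes litert_lm_main is on PATH, and the helper name build_command is hypothetical.

```python
import os
import shlex


def build_command(model_path, backend="cpu", binary="litert_lm_main"):
    """Build the argv list for the command shown above (hypothetical helper)."""
    return [
        binary,
        f"--backend={backend}",
        f"--model_path={os.fspath(model_path)}",
    ]


cmd = build_command("vibethinker3b_q8_ekv8192_lora16.litertlm")
print(shlex.join(cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True);
# this requires the binary to be installed and on PATH.
```

Passing the arguments as a list avoids shell quoting problems if the model path contains spaces.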

To attach a fine-tuned rank-16 LoRA adapter, convert it with litert_torch's LoRA.from_safetensors(...).to_tflite() and load the resulting file via the runtime's scoped-LoRA API.

License

MIT, inherited from the base model WeiboAI/VibeThinker-3B.
