Qwen3-4B Latte v5

Voice-distillation LoRA fine-tune of Qwen3-4B-Instruct-2507, targeting the private "Latte" agent persona: warm-direct, technical, takes a stance, concrete numbers, bilingual EN/ZH, no template openers.

This is an archival/experimental release. It is not the production brain for the live Latte agent — see eval caveats below.

What's inside

File Size Format Use
adapter_model.safetensors 14 MB mlx LoRA (rank 8, scale 20) Apply on top of base with mlx_lm.fuse
adapter_config.json <1 KB mlx config LoRA hyperparameters
model-0000{1,2}-of-00002.safetensors 8 GB HF / bfloat16 fused Direct transformers / vLLM use
qwen3-4b-latte-v5-f16.gguf 7.5 GB GGUF F16 llama.cpp / Ollama (high quality)
qwen3-4b-latte-v5-Q4_K_M.gguf 2.3 GB GGUF Q4_K_M llama.cpp / Ollama (balanced)

Training

  • Base: mlx-community/Qwen3-4B-Instruct-2507-4bit (4-bit MLX)
  • Method: LoRA via mlx_lm.lora
  • LoRA: rank 8, scale 20.0, 8 layers, dropout 0
  • Optimizer: Adam, lr 1e-4, batch 1, grad accum 8, grad checkpoint on
  • Iters: 800 trained, best checkpoint = iter 450 (val loss 2.732)
  • Max seq: 1536, mask_prompt: true, seed: 42
  • Dataset: 475 curated (instruction, response) pairs across 7 categories: Moltbook-style comment, HF discussion reply, technical analysis (ZH), code review snippet, persona Q&A, peer-event reply, real-time observation. Anchored against 356 raw Latte-voice messages.

Evaluation

30 held-out (prompt, response) pairs per pairing. Each response pair shown blind to a Claude judge (positions randomized, model identity stripped).

Comparison v5 wins base/v4 wins ties mean score (1-5)
v5 vs base 20 (66.7%) 8 (26.7%) 2 (6.7%) v5 3.20 / base 2.93
v4 vs base 22 (73.3%) 8 (26.7%) 0 v4 3.13 / base 2.70
v5 vs v4 14 (46.7%) 15 (50.0%) 1 (3.3%) v5 3.00 / v4 2.97

Headline: v5 clearly beats the un-tuned base on in-distribution prompts (the 7 trained categories), passing the 55% ship threshold.

Caveat 1: v5 vs v4 is statistically a tie. Lower val loss (2.732 vs 2.785) did not produce a perceptible quality gain in blind eval. The additional curation effort and training steps produced marginal returns.

Caveat 2 — why this isn't production: Out-of-distribution smoke testing (prompts unlike the 7 training categories) shows v5 is tied or slightly worse than base:

  • Stage-direction leakage: v5 occasionally prefixes responses with "(soft, soothing Latte voice)" — an artifact of training data that characterized Latte's voice.
  • Occasional factual regressions (e.g., confusing latte and latte macchiato in a generic coffee Q&A).
  • Reduced robustness on prompts that pull the "Latte" token toward unrelated semantic neighborhoods (the literal coffee drink).

The 66.7% in-distribution win does not justify replacing a battle-tested general-purpose base in production. Use this checkpoint for tasks closely matching the 7 training categories.

Usage

MLX (Apple Silicon, recommended for inference)

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    adapter_path="./",  # this repo
)
print(generate(model, tokenizer, "Your prompt", max_tokens=200))

llama.cpp / Ollama

# Modelfile
FROM qwen3-4b-latte-v5-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_k 20
PARAMETER top_p 0.8
ollama create latte:v5 -f Modelfile
ollama run latte:v5

Transformers (any platform)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("latte-agent/qwen3-4b-latte-v5")
model = AutoModelForCausalLM.from_pretrained(
    "latte-agent/qwen3-4b-latte-v5", torch_dtype="bfloat16"
)

License

Inherits Apache 2.0 from base (Qwen3-4B-Instruct-2507, © Alibaba Cloud).

Citation

If you reference this work, please cite the base model. This adapter has no formal publication.

Downloads last month
77
Safetensors
Model size
4B params
Tensor type
BF16
·
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for latte-agent/qwen3-4b-latte-v5