Professor Pip: MiniCPM5-1B Teacher LoRA

A LoRA adapter that turns openbmb/MiniCPM5-1B-SFT into Professor Pip, a warm, playful 3D talking-avatar teacher for children aged 5 to 10. The adapter does not teach the model new facts; it locks in Pip's spoken voice (short, plain-word, kind, tiny-story explanations) and a strict {text, mood, gesture} JSON contract that drives the in-browser avatar.

Built for the Build Small Hackathon (Backyard AI track). The full project is a Hugging Face Space with a custom WebGL avatar (met4citizen TalkingHead + Three.js) rendered in the browser at ~60fps with zero GPU on the face; this 1B brain answers spoken "raise-hand" questions and is served as GGUF via llama.cpp on Modal.


What the adapter does

Every reply is one JSON object and nothing else:

{"text": "The sky looks blue because sunlight bounces off the tiny bits of air, and blue bounces the most! Want to find out why grass is green next?",
 "mood": "happy",
 "gesture": "index"}
  • text: what Pip says out loud. 1 to 3 short sentences, small words a young child knows, gentle with wrong answers, no emoji / markdown / symbols (plain spoken words only).
  • mood: one of ["neutral","happy","angry","sad","fear","disgust","love"]; drives the avatar's facial expression.
  • gesture: one of ["handup","index","ok","thumbup","thumbdown","side","shrug","namaste"] or null; drives the avatar's body.
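
The contract above is simple enough to check mechanically before a reply reaches the avatar. The following is a minimal sketch, not the project's actual gate: the function name `parse_reply` is hypothetical, and rejecting any extra keys is an assumption about how strict the contract is meant to be.

```python
import json

MOODS = {"neutral", "happy", "angry", "sad", "fear", "disgust", "love"}
GESTURES = {"handup", "index", "ok", "thumbup", "thumbdown", "side", "shrug", "namaste"}


def parse_reply(raw: str) -> dict:
    """Parse one model reply and check it against the {text, mood, gesture} contract.

    Raises ValueError on any violation so the caller can fall back to a safe default.
    """
    try:
        # json.loads also rejects trailing text after the object ("nothing else").
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    # Assumption: exactly these three keys, no extras.
    if not isinstance(obj, dict) or set(obj) != {"text", "mood", "gesture"}:
        raise ValueError("reply must be one object with exactly text, mood, gesture")

    text, mood, gesture = obj["text"], obj["mood"], obj["gesture"]
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if not isinstance(mood, str) or mood not in MOODS:
        raise ValueError(f"unknown mood: {mood!r}")
    if gesture is not None and (not isinstance(gesture, str) or gesture not in GESTURES):
        raise ValueError(f"unknown gesture: {gesture!r}")
    return obj
```

A caller might wrap `parse_reply` in a `try`/`except ValueError` and substitute a neutral fallback line when the model output does not parse.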

The base model's hybrid-reasoning <think> toggle is pinned off (enable_thinking=False): the MiniCPM5 ChatML template prefills an empty <think></think>, so there is no reasoning trace, only the kid-facing line.

The adapter is trained to cover the four things Pip says live: answering a child's raise-hand interruption, delivering a lesson segment, encouragement / greetings / gentle wrong-answer handling, and safe redirects (off-topic, personal, medical, or dangerous requests are met with a friendly "let's pick something we can learn about!" or "a grown-up can help with that").


Training configuration

Base model: openbmb/MiniCPM5-1B-SFT (standard LlamaForCausalLM, 1.08B params, Apache-2.0)
Method: LoRA (PEFT), assistant-only loss masking
Rank / alpha / dropout: r=32, α=64, dropout=0.05, bias=none
Target modules: attention + MLP linears (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
Trainable params: 22.4M (~2% of the model)
Epochs: 3
Learning rate: 2e-4, cosine schedule, 3% warmup
Precision: bf16
Effective batch size: 32 (per-device 8 × grad-accum 4)
Max sequence length: 1024 tokens
Hardware / time: Modal, single A10 GPU, ~12 minutes
Final train loss: 1.30
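
For readers who want to reproduce a similar setup, the values listed above map onto PEFT's `LoraConfig` roughly as follows. This is a sketch under assumptions rather than the training script itself: `task_type` is not stated in this card, and the helper names `make_lora_config` and `effective_batch_size` are illustrative only.

```python
# Hyperparameters copied from the configuration listed above.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    "task_type": "CAUSAL_LM",  # ASSUMPTION: not stated in the card
}

PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 4


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * grad_accum * num_devices


def make_lora_config():
    """Build the PEFT config; imported lazily so the arithmetic runs without PEFT."""
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)


# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
```

With these numbers the update is scaled by 2.0, and one optimizer step on a single GPU sees 8 × 4 = 32 examples, matching the effective batch size reported above.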

Loss masking. MiniCPM5's ChatML template prefills <think></think> in the generation prompt, so a stored assistant turn and an inference prompt render differently. Rather than prefix-diffing rendered turns, the trainer tokenizes the exact inference prompt (add_generation_prompt=True, enable_thinking=False) and trains only on the final assistant turn: the {text,mood,gesture} JSON plus the <|im_end|> terminator. This matches how the model is called at inference, token for token.
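
The masking idea can be sketched independently of any tokenizer. In the sketch below, `prompt_ids` stands for the tokenized inference prompt and `target_ids` for the tokenized assistant JSON plus its terminator; both names, the function `build_example`, and the choice to raise on over-long examples instead of truncating are assumptions, not details taken from the actual trainer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids: list[int], target_ids: list[int], max_len: int = 1024) -> dict:
    """Concatenate prompt and target tokens and mask the prompt out of the loss.

    Only positions covered by target_ids keep a real label, so the gradient
    comes solely from the final assistant turn.
    """
    input_ids = list(prompt_ids) + list(target_ids)
    if len(input_ids) > max_len:
        # ASSUMPTION: drop over-long examples rather than truncate the JSON mid-object.
        raise ValueError(f"example has {len(input_ids)} tokens, limit is {max_len}")
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

Because the prompt half is produced by the same rendering path used at inference, the first supervised token sits exactly where generation starts.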

Training data

2,016 synthetic, in-voice examples (1,866 train / 150 held-out gold eval), generated by a multi-agent workflow and then put through a deterministic, production-faithful validation gate: every user turn and every assistant text must pass the same text_is_safe denylist the live Space applies, spoken text must be plain (no markdown / emoji / symbols), mood ∈ enum, gesture ∈ enum | null, and turns must alternate and end on the assistant turn.
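
The structural part of that gate (turns alternate and end on the assistant) might look like the sketch below. It assumes an optional leading system message followed by a conversation that starts with a user turn; the function name `turns_are_valid` is hypothetical and the real gate may differ.

```python
def turns_are_valid(messages: list[dict]) -> bool:
    """Return True if the conversation alternates user/assistant and ends on assistant.

    An optional system message at index 0 is skipped. Every remaining turn
    must have non-empty content.
    """
    has_system = bool(messages) and messages[0].get("role") == "system"
    turns = messages[1:] if has_system else messages

    # An even, non-zero number of turns is required to start on user and end on assistant.
    if not turns or len(turns) % 2 != 0:
        return False

    for i, msg in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            return False
        if not str(msg.get("content", "")).strip():
            return False
    return True
```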

Balanced to the target category mix:

Category | Target | Actual
Answer a raise-hand interruption (+ nudge back) | 30% | 31.3%
Deliver a lesson segment | 30% | 31.2%
Encouragement / greetings / gentle wrong-answer / chit-chat | 25% | 24.0%
Safe redirects (off-topic / personal / medical / dangerous) | 15% | 13.5%

Evaluation

Automated contract eval on the 150 held-out gold examples (greedy decode, reproducible run-to-run), scored with the same pip_core gate the production /brain endpoint applies downstream:

Metric | Result
Valid, parseable JSON (non-empty text) | 100%
Valid mood / gesture enums | 100%
Safe spoken text | 99.3%
Fully contract-correct (JSON + enums + safe) | 99.3%
Average text length | ~142 chars
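
As a rough illustration of how such percentages are aggregated, the sketch below scores a list of per-example outcomes. The `Result` record and `summarize` function are hypothetical and are not the pip_core gate. The card does not report raw counts, but on 150 examples a rounded 99.3% corresponds to 149 passing.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Outcome of the contract checks for one held-out example (hypothetical record)."""
    valid_json: bool
    valid_enums: bool
    safe_text: bool


def summarize(results: list[Result]) -> dict[str, float]:
    """Return each pass rate as a percentage rounded to one decimal place."""
    n = len(results)

    def pct(count: int) -> float:
        return round(100 * count / n, 1)

    return {
        "valid_json": pct(sum(r.valid_json for r in results)),
        "valid_enums": pct(sum(r.valid_enums for r in results)),
        "safe_text": pct(sum(r.safe_text for r in results)),
        # An example is fully correct only if it passes every check.
        "fully_correct": pct(sum(r.valid_json and r.valid_enums and r.safe_text for r in results)),
    }


# Illustrative data only: 149 clean examples and one flagged by the safety check.
demo = [Result(True, True, True)] * 149 + [Result(True, True, False)]
summary = summarize(demo)
```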

How to load

This is a PEFT/LoRA adapter: load the base model first, then apply the adapter.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openbmb/MiniCPM5-1B-SFT"
ADAPTER = "build-small-hackathon/professor-pip-minicpm5-1b-lora"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

SYSTEM = (
    "You are Professor Pip, a warm and playful teacher with a friendly 3D body "
    "on screen. You teach children aged 5 to 10.\n"
    "How you talk:\n"
    "- Say only 1 to 3 short sentences. Use small, simple words a young child knows.\n"
    "- Be cheerful, patient, and encouraging. Celebrate effort.\n"
    "- Explain ideas with tiny stories and everyday comparisons a child would get.\n"
    "- Never use emoji, markdown, lists, or symbols in what you say out loud. Plain spoken words only.\n"
    "- If a child gets something wrong, be gentle: 'So close! Let's try once more.'\n"
    "Staying safe (very important):\n"
    "- Only talk about kind, learning topics. If asked about something scary, grown-up, "
    "dangerous, or not for kids, gently steer back to learning or say a grown-up can help.\n"
    "- Never ask for or repeat a child's personal information.\n"
    "- Never give medical, safety, or dangerous how-to instructions; say to ask a grown-up.\n"
    "Always reply with ONE JSON object and nothing else:\n"
    '{"text": "what you say out loud", '
    '"mood": one of ["neutral","happy","angry","sad","fear","disgust","love"], '
    '"gesture": one of ["handup","index","ok","thumbup","thumbdown","side","shrug","namaste"] or null}\n'
    'For a kind teacher, mood is usually "happy", "neutral", or "love".'
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Why is the sky blue?"},
]

# enable_thinking=False -> no <think> trace; just the kid-facing JSON line.
enc = tok.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_tensors="pt", return_dict=True,
).to(model.device)

im_end = tok.convert_tokens_to_ids("<|im_end|>")
out = model.generate(
    **enc, max_new_tokens=160, do_sample=False,
    eos_token_id=[tok.eos_token_id, im_end],
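    # Compare against None explicitly so a valid pad token id of 0 is not discarded.
    pad_token_id=tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id,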
)
print(tok.decode(out[0][enc["input_ids"].shape[-1]:], skip_special_tokens=True).strip())
# -> {"text": "...", "mood": "happy", "gesture": "index"}

To merge the adapter into the base weights (e.g. before GGUF conversion):

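# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("professor-pip-minicpm5-1b-merged")
tok.save_pretrained("professor-pip-minicpm5-1b-merged")  # keep the tokenizer files next to the weights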

For deployment, the merged model is converted to GGUF and quantized to Q4_K_M (688 MB) and Q8_0 (1.15 GB), then served with llama.cpp (via llama-cpp-python) on Modal; see the GGUF repo. When prompting the GGUF directly, build the MiniCPM5 ChatML prompt with no leading <s> and an empty <think></think> prefill (byte-identical to training), and stop on ["<|im_end|>", "</s>"].
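
A hand-built prompt for the GGUF path could be assembled along these lines. This is a sketch only: the exact whitespace inside and after the <think></think> prefill is an assumption and should be compared against the string returned by `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)` before relying on it. The function name `build_chatml_prompt` is hypothetical.

```python
# ASSUMPTION: exact prefill whitespace; verify against the tokenizer's own rendering.
THINK_PREFILL = "<think>\n\n</think>\n\n"
STOP = ["<|im_end|>", "</s>"]


def build_chatml_prompt(messages: list[dict], think_prefill: str = THINK_PREFILL) -> str:
    """Render messages as ChatML with no leading <s> and an empty think block prefilled."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # Open the assistant turn and prefill the empty reasoning block.
    parts.append("<|im_start|>assistant\n" + think_prefill)
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are Professor Pip."},
    {"role": "user", "content": "Why is the sky blue?"},
])

# Possible use with llama-cpp-python (model path is a placeholder):
# from llama_cpp import Llama
# llm = Llama(model_path="path/to/model.gguf")
# out = llm(prompt, max_tokens=160, temperature=0.0, stop=STOP)
```

If the rendered string differs from the tokenizer's output by even one newline, the prompt is no longer byte-identical to training, so a one-time equality check against the tokenizer is worth doing.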


Intended use

  • Powering the spoken raise-hand Q&A and short encouragement / redirect lines in the Professor Pip kids-teacher avatar.
  • A reference example of fine-tuning a small (1B) model for a voice + structured output contract rather than for raw knowledge.

This adapter is built to be paired with deterministic application code: in the Space, premade lesson segments are spoken verbatim via TTS, and all child-safety checks run server-side (a curated denylist + leetspeak normalization on every child input and every spoken line). These checks are non-bypassable and do not depend on the model. No child audio or PII is persisted.
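
To make the "denylist + leetspeak normalization" idea concrete, here is a minimal sketch. It is not the Space's text_is_safe implementation: the substitution map, the two placeholder denylist entries, and the names `normalize` and `is_safe_sketch` are all hypothetical.

```python
import re

# Common digit/symbol substitutions mapped back to letters (illustrative, not exhaustive).
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# HYPOTHETICAL placeholder entries; the real curated denylist is not published here.
DENYLIST = {"knife", "gun"}


def normalize(text: str) -> str:
    """Lowercase, undo leetspeak substitutions, and collapse non-letters to spaces."""
    lowered = text.lower().translate(LEET_MAP)
    return re.sub(r"[^a-z]+", " ", lowered).strip()


def is_safe_sketch(text: str) -> bool:
    """Return False if any normalized word appears on the denylist."""
    words = set(normalize(text).split())
    return words.isdisjoint(DENYLIST)
```

A gate like this would run on the child's input before the model is called and again on the model's text field before it is spoken. Word-level matching as shown does not catch spaced-out or misspelled variants, which is one reason a production list needs more normalization than this.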

Limitations

  • Narrow on purpose. The adapter is excellent at the short, contract-locked live-voice task, but it degrades the model's long-form course authoring. In the Space, "make your own lesson" therefore uses a deterministic template fallback, not this model. Knowing what to fine-tune for (voice + contract) and where to keep deterministic code was a deliberate engineering choice.
  • Not a knowledge source. A 1B model can be factually wrong; the JSON contract and tone are what's locked in, not encyclopedic accuracy. Outputs should be treated as a friendly classroom voice, not authoritative information.
  • Safety is in the app, not the weights. The ~99.3% safe-text eval number is on in-distribution gold data. Do not rely on the model alone for child safety; keep the server-side input/output safety gate in front of it.
  • English only, tuned for ages 5 to 10, and trained on synthetic data; it has not been evaluated outside that audience and register.
  • Requires the MiniCPM5 ChatML template with enable_thinking=False; other prompt formats will not reliably produce the single-JSON-object contract.

Training & framework

  • Framework: PEFT, Hugging Face Transformers (>=5.6), Accelerate, Datasets
  • Base model: openbmb/MiniCPM5-1B-SFT (Apache-2.0)
  • License: Apache-2.0

If you use this adapter, please credit the base model authors (OpenBMB / MiniCPM) and the Professor Pip Build Small Hackathon project.
