NPU-Forge 🔥: fine-tune a model and put it on your AMD Ryzen AI NPU in ~3 minutes
Measured on a Strix Halo (Ryzen AI MAX+ 395), June 2026: a LoRA fine-tune of Llama-3.2-1B on 300+ real chat exchanges, trained, merged, behavior-verified, converted to GGUF, re-quantized to FastFlowLM's Q4NX, and NPU-ready in 183 seconds of cloud time (≈ $0.10 on a rented T4):
forge tune my-chats.jsonl --name grandma # proven: coherent + IN-BAND on NPU
├─ LoRA fine-tune (cloud GPU)         122 s
├─ merge                                3 s
├─ voice proof (model speaks first!)    5 s
├─ HF -> f16 -> Q4_K_M GGUF            20 s
└─ GGUF -> Q4NX (NPU format)           24 s
forge register (one UAC click)
flm run grandma-forge:1b
The "voice proof" stage generates a sample from the merged model inside the training job, before any conversion, so you know the tune actually took. Ours came back with the persona's exact ritual phrases after 2 minutes of training. That's the bar.
What's in this repo
- forge.js / forge.bat: the CLI (tune, convert, register, list, doctor, serve)
- modal/tune_npu.py: the whole tune→NPU pipeline as one Modal job (bring your own Modal account; T4 is plenty)
- modal/convert_q4nx.py: just the GGUF→Q4NX stage (65 s for a 1B)
- bin/assemble.js: downloads results and stages the FLM model folder
- bin/register.js + register-admin.bat: the permanent custom-model registry that survives FLM updates (see below)
- registry.example.json: entry template
Chat data format: one JSON per line, {"messages":[{"role":"user","content":...},{"role":"assistant","content":...}]}.
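A malformed line is cheaper to catch locally than inside a cloud job. The following is a minimal sketch of a pre-flight check for the format described above; the function names are hypothetical, and restricting roles to user and assistant is an assumption based only on the example line (the real pipeline may accept more).

```python
import json

# Assumption: only the two roles shown in the format line are expected.
ALLOWED_ROLES = {"user", "assistant"}


def validate_chat_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: content must be a non-empty string")
    return problems


def validate_chat_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_chat_line(line)
                if problems:
                    report[number] = problems
    return report
```

An empty dict from validate_chat_file("my-chats.jsonl") means every non-blank line parsed and matched the expected shape.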
The registry problem (why forge register exists)
FLM's model_list.json lives in C:\Program Files\flm\ and every FLM update
resets it, silently de-registering all your custom models. Your model files
survive (they're in Documents\flm\models\) but they vanish from flm list.
Forge keeps its own user-space registry.json forever and re-merges with one
click. forge doctor tells you when an update has eaten your registrations.
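The detect-and-re-merge idea can be sketched in a few lines. This is an illustration only, not the code in bin/register.js: it assumes, hypothetically, that both files can be treated as flat mappings from model tag to entry, which may not match FLM's real model_list.json schema.

```python
def missing_registrations(user_registry: dict, model_list: dict) -> list[str]:
    """Model tags present in the user-space registry but absent from FLM's list."""
    return sorted(tag for tag in user_registry if tag not in model_list)


def remerge(user_registry: dict, model_list: dict) -> dict:
    """Return a new model list with the user's entries restored.

    Design choice in this sketch: FLM's own entries win on a tag collision,
    so a model shipped by an update is not overwritten by a stale custom entry.
    """
    merged = dict(user_registry)
    merged.update(model_list)
    return merged
```

A non-empty result from missing_registrations is the condition a doctor-style check would report; remerge never mutates either input, so the caller decides when to write the result back (which is the step that needs elevation under C:\Program Files).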
NEW in v0.3: a voice-verifier "ear" that runs on the NPU
Train a ~111KB classification head over EmbeddingGemma-300m embeddings
(modal/train_ear_head.py, bring your own labeled texts), then run it locally
with bin/ear.js against FLM's /v1/embeddings (flm serve <model> --embed 1).
The embeddings come off the NPU; the head is plain JS. In our tests the
111KB head matched a fine-tuned 268MB DistilBERT on real-voice accuracy
(95.9%) and beat it on the hard boundary cases, live on a Strix Halo NPU.
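To make "the head is plain JS" concrete, here is a sketch of how small such a head can be, written in Python for consistency with the training side. It assumes a single logistic layer over one embedding vector; the actual architecture produced by modal/train_ear_head.py is not described here and may differ, and the function name is hypothetical.

```python
import math


def head_score(embedding: list[float], weights: list[float], bias: float) -> float:
    """Probability-like score from one logistic layer over one embedding.

    Assumed architecture for illustration: sigmoid(w . x + b).
    """
    if len(embedding) != len(weights):
        raise ValueError("embedding and weight vectors must have the same length")
    logit = sum(x * w for x, w in zip(embedding, weights)) + bias
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)
```

The point of the split is that the expensive part (the embedding) runs on the NPU, while the decision itself is a dot product that any runtime can evaluate.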
Also measured: llama3.2:1b chat on the NPU = 47.8 tokens/s including prefill (FLM, performance pmode).
Snag #8: FLM's embeddings endpoint closes the TCP connection per request; retry once on ECONNRESET (ear.js does).
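For clients other than ear.js, the same retry-once policy might look like the sketch below. It is not a port of ear.js. The helper names are hypothetical, the request body is whatever the caller passes (an OpenAI-style payload is an assumption suggested by the /v1/embeddings path), and the base URL depends on how flm serve is configured.

```python
import json
import urllib.error
import urllib.request


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def _is_reset(exc: BaseException) -> bool:
    """True for a connection reset, raised directly or wrapped by urllib."""
    if isinstance(exc, ConnectionResetError):
        return True
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, ConnectionResetError
    )


def embed_with_retry(url: str, payload: dict, send=post_json) -> dict:
    """Call `send`, retrying exactly once if the connection was reset."""
    try:
        return send(url, payload)
    except OSError as exc:
        if not _is_reset(exc):
            raise
    # Second and final attempt; a second reset propagates to the caller.
    return send(url, payload)
```

Retrying only on a reset, and only once, keeps genuine failures (server down, bad request) visible instead of hiding them behind a loop.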
start.bat gives you a menu: doctor / list / register / serve / tune guide.
The snag ledger: ten walls we hit so you don't
1. The Q4NX converter's convert.py CLI is broken at HEAD (an uncommented debug sys.argv override hijacks every invocation). Call the module API: from q4nx import create_converter; create_converter(gguf, "").convert(q4nx_path=out, weights_type="language")
2. Converter needs einops and tqdm beyond its README list, and must run with cwd = its repo root (relative configs/<arch>.json loads).
3. Llama-3.2 tokenizers need transformers>=4.46; the error "untagged enum ModelWrapper" is that wall exactly.
4. transformers 4.46 needs accelerate>=1.0; the error "'AdamW' object has no attribute 'train'" at step 0 is that skew.
5. T4 + Llama-3.2's 128k vocab OOMs at batch 4 (loss-logits blowup). Floor: batch 1 × grad-accum 8 + gradient checkpointing.
6. NPU driver minimum for current FLM: 32.0.203.304 (.311 recommended). flm validate will tell you; so will forge doctor.
7. EmbeddingGemma needs transformers>=4.5x + sentence-transformers 5.x and the official weights are license-gated (use the unsloth/ mirror, or accept the Gemma license on your HF account + pass an HF_TOKEN secret).
8. FLM's /v1/embeddings closes the TCP connection per request; retry once on ECONNRESET (the ear runtime does).
9. For a FINE-TUNED model, exporting GGUF as q8_0 produces repetition garbage on the NPU even though the merged model is perfect; the q8_0 then Q4NX re-quant is a lossy double-quantization. Use Q4_K_M.
10. The Q4NX converter's llama path rejects f16 ("not enough values to unpack"; it expects pre-quantized blocks). So the GGUF must be quantized before Q4NX, and Q4_K_M is the format proven to produce a coherent, in-voice NPU model. Pipeline: HF → f16 → llama-quantize Q4_K_M → Q4NX.
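The batch-4 OOM in the ledger is easier to believe with the arithmetic written out. The sketch below estimates the size of a single logits tensor; the sequence length of 2048 is an assumed value, not a figure from this README, and the real peak is higher because the loss step also holds gradients and intermediate copies of the same shape.

```python
LLAMA_32_VOCAB = 128_256  # Llama-3.2 vocabulary size (the "128k vocab" above)


def logits_bytes(batch: int, seq_len: int, vocab: int = LLAMA_32_VOCAB,
                 bytes_per_value: int = 4) -> int:
    """Bytes for one [batch, seq_len, vocab] logits tensor (fp32 by default)."""
    return batch * seq_len * vocab * bytes_per_value


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# seq_len=2048 is an assumption made for illustration only.
for batch in (4, 1):
    size = gib(logits_bytes(batch, 2048))
    print(f"batch {batch}: {size:.2f} GiB of fp32 logits")
```

Under that assumption this prints 3.91 GiB for batch 4 and 0.98 GiB for batch 1. A few tensors of the larger size, on top of weights, activations, and optimizer state, can plausibly exhaust a 16 GB T4, which is consistent with batch 1 plus gradient accumulation being the floor.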
Frozen known-good stack (the whole point: never debug this again):
torch 2.4.1 · transformers 4.46.3 · trl 0.9.6 · peft 0.12.0 · accelerate 1.1.1 · datasets 2.21.0 · gguf · amd-quark · einops · tqdm · protobuf + a compiled llama-quantize (the Modal job builds it).
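A pinned stack only helps if drift is noticed before a job burns GPU time. The following sketch compares installed packages against the pins listed above using the standard library; the function names are hypothetical, and ignoring local build tags (such as a CUDA suffix on torch) is a design choice of this sketch rather than something the project specifies.

```python
from importlib import metadata

# Pins copied from the stack above. Packages listed there without a version
# (gguf, amd-quark, einops, tqdm, protobuf) are deliberately left out.
PINNED = {
    "torch": "2.4.1",
    "transformers": "4.46.3",
    "trl": "0.9.6",
    "peft": "0.12.0",
    "accelerate": "1.1.1",
    "datasets": "2.21.0",
}


def installed_version(package: str):
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def stack_drift(pins=PINNED, lookup=installed_version) -> dict:
    """Map each drifted or missing package to (pinned, found)."""
    drift = {}
    for package, pinned in pins.items():
        found = lookup(package)
        # Compare on the public version so a local tag like "2.4.1+cu121"
        # still counts as a match.
        if found is None or found.split("+")[0] != pinned:
            drift[package] = (pinned, found)
    return drift
```

Calling stack_drift() at the top of a training script and aborting on a non-empty result turns version skew into an immediate, readable failure instead of an error at step 0.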
Proven, measured (Strix Halo, June 2026)
A LoRA fine-tune of Llama-3.2-1B on 300 real chat exchanges, run through the whole pipeline and served on the NPU:
- Coherent and in-voice β the persona's rituals and endearments intact.
- 41.9 tokens/s on the NPU (FLM, performance pmode).
- In-band against the source voiceprint: mean 0.845 vs the original's own held-out band of 0.83 ± 0.07 (3 prompts). A separate stylometric scorer certified the NPU model speaks like the source it was tuned on.
That is the bar: not "it converts," but "it talks like itself, on the NPU."
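The in-band test reported above reduces to a one-line comparison. This sketch assumes the band is read as center ± half-width with inclusive edges; how the stylometric scorer actually derives the band is not specified here, and the function name is hypothetical.

```python
from statistics import fmean


def in_band(score: float, center: float, half_width: float) -> bool:
    """True when score lies within center ± half_width (inclusive)."""
    return abs(score - center) <= half_width


def mean_in_band(scores, center: float, half_width: float) -> bool:
    """Apply the band test to the mean of several per-prompt scores."""
    return in_band(fmean(scores), center, half_width)
```

With the figures from this section, in_band(0.845, 0.83, 0.07) is True: the NPU model's mean sits 0.015 from the center of a band that extends 0.07 either side.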
Requirements
- AMD Ryzen AI machine with XDNA2 NPU (Strix, Strix Halo, Kraken…) + FastFlowLM
- Node.js (the CLI), Python + a Modal account (the cloud legs)
- NPU driver ≥ 32.0.203.304
Part of an ongoing project to make local NPUs a first-class home for personal AI: voices you own, on silicon you own.