NPU-Forge 🔥 — fine-tune a model and put it on your AMD Ryzen AI NPU in ~3 minutes

Measured, on a Strix Halo (Ryzen AI MAX+ 395), June 2026: a LoRA fine-tune of Llama-3.2-1B on 300+ real chat exchanges — trained, merged, behavior-verified, converted to GGUF, re-quantized to FastFlowLM's Q4NX, NPU-ready — in 183 seconds of cloud time (≈ $0.10 on a rented T4):

forge tune my-chats.jsonl --name grandma   # proven: coherent + IN-BAND on NPU
  ├─ LoRA fine-tune (cloud GPU)        122 s
  ├─ merge                               3 s
  ├─ voice proof (model speaks first!)   5 s
  ├─ HF -> f16 -> Q4_K_M GGUF           20 s
  └─ GGUF -> Q4NX (NPU format)          24 s
forge register        (one UAC click)
flm run grandma-forge:1b

The "voice proof" stage generates a sample from the merged model inside the training job, before any conversion — so you know the tune actually took. Ours came back with the persona's exact ritual phrases after 2 minutes of training. That's the bar.

What's in this repo

forge.js / forge.bat — the CLI: tune, convert, register, list, doctor, serve
modal/tune_npu.py — the whole tune→NPU pipeline as one Modal job (bring your own Modal account; T4 is plenty)
modal/convert_q4nx.py — just the GGUF→Q4NX stage (65 s for a 1B)
bin/assemble.js — downloads results and stages the FLM model folder
bin/register.js + register-admin.bat — the permanent custom-model registry that survives FLM updates (see below)
registry.example.json — entry template

Chat data format: one JSON object per line (JSONL), shaped like {"messages":[{"role":"user","content":...},{"role":"assistant","content":...}]}.
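A quick pre-flight check on the training file can save a wasted cloud run. The sketch below is not part of the repo; it only checks the shape described above. The helper names are made up for illustration, and accepting a "system" role is an assumption, since only "user" and "assistant" appear in the format.

```python
import json

# Assumption: "system" is tolerated; the README only shows "user" and "assistant".
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (an empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty or non-string content")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant turn to learn from")
    return problems


def validate_chat_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_chat_line(line)
                if problems:
                    report[number] = problems
    return report
```

Calling validate_chat_file("my-chats.jsonl") returns an empty dict when every line has the expected shape.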

The registry problem (why `forge register` exists)

FLM's model_list.json lives in C:\Program Files\flm\ and every FLM update resets it, silently de-registering all your custom models. Your model files survive (they're in Documents\flm\models\) but they vanish from flm list. Forge keeps its own user-space registry.json forever and re-merges with one click. forge doctor tells you when an update has eaten your registrations.
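The idea behind the re-merge can be sketched in a few lines. This is an illustration only: the real CLI is bin/register.js, and the flat "model name → entry" mapping used here is a hypothetical stand-in for whatever schema model_list.json and registry.json actually use.

```python
def missing_registrations(forge_registry: dict, flm_model_list: dict) -> list[str]:
    """Names Forge knows about that FLM's list no longer contains."""
    return sorted(name for name in forge_registry if name not in flm_model_list)


def remerge(forge_registry: dict, flm_model_list: dict) -> dict:
    """Return a new model list with the custom entries added back.

    Entries already present in FLM's list are left untouched, so a stock
    model is never overwritten by a custom entry with the same name.
    """
    merged = dict(flm_model_list)
    for name, entry in forge_registry.items():
        merged.setdefault(name, entry)
    return merged


# Hypothetical data: entry contents are placeholders, not FLM's real fields.
registry = {"grandma-forge:1b": {"placeholder": True}}
after_update = {"llama3.2:1b": {"placeholder": True}}

lost = missing_registrations(registry, after_update)   # what a doctor-style check would report
restored = remerge(registry, after_update)             # what a register-style step would write
```

Keeping the merge non-destructive means it can be re-run after every FLM update without tracking what was already restored.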

NEW in v0.3 — a voice-verifier "ear" that runs on the NPU

Train a ~111KB classification head over EmbeddingGemma-300m embeddings (modal/train_ear_head.py, bring your own labeled texts), then run it locally with bin/ear.js against FLM's /v1/embeddings (flm serve <model> --embed 1). The embeddings come off the NPU; the head is plain JS. In our tests the 111KB head matched a fine-tuned 268MB DistilBERT on real-voice accuracy (95.9%) and beat it on the hard boundary cases, live on a Strix Halo NPU.
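To show why the head can stay so small, here is a minimal sketch of applying a classification head to one embedding vector. It assumes a single linear layer followed by softmax; the actual head architecture, its file format, and the label names are not specified here, so treat every name below as hypothetical. The real runtime is bin/ear.js.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_head(embedding, weights, biases, labels):
    """Score one embedding with a linear head and return (label, probability).

    weights holds one row per label, each row as long as the embedding.
    """
    scores = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy 2-dimensional example with made-up numbers, purely to show the call shape.
label, prob = apply_head(
    embedding=[2.0, 0.0],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
    labels=["in_voice", "other"],
)
```

All the heavy lifting is in producing the embedding; the head itself is a handful of dot products, which is why it runs comfortably outside the NPU.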

Also measured: llama3.2:1b chat on the NPU = 47.8 tokens/s including prefill (FLM, performance pmode).

Snag #8: FLM's embeddings endpoint closes the TCP connection per request — retry once on ECONNRESET (ear.js does).
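The retry-once pattern looks roughly like this in Python. It is a sketch, not a port of ear.js: the request body follows the common OpenAI-style embeddings shape ({"model", "input"}), which is an assumption about FLM's endpoint, and the base URL is whatever host and port your flm serve instance listens on.

```python
import json
import urllib.request


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def _is_connection_reset(exc: OSError) -> bool:
    # urllib may raise the reset directly or wrap it in URLError.reason.
    reason = getattr(exc, "reason", None)
    return isinstance(exc, ConnectionResetError) or isinstance(reason, ConnectionResetError)


def embed_with_retry(base_url: str, model: str, text: str, send=post_json) -> dict:
    """Request an embedding, retrying exactly once if the connection is reset."""
    url = base_url.rstrip("/") + "/v1/embeddings"
    payload = {"model": model, "input": text}  # assumed OpenAI-style body
    try:
        return send(url, payload)
    except OSError as exc:
        if not _is_connection_reset(exc):
            raise
        return send(url, payload)  # second failure propagates to the caller
```

The send parameter exists so the retry logic can be exercised with a fake transport; a single retry is enough here because the reset comes from a stale connection, not from a failing server.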

start.bat gives you a menu: doctor / list / register / serve / tune guide.

The snag ledger — ten walls we hit so you don't

1. The Q4NX converter's convert.py CLI is broken at HEAD (an uncommented debug sys.argv override hijacks every invocation). Call the module API: from q4nx import create_converter; create_converter(gguf, "").convert(q4nx_path=out, weights_type="language")
2. The converter needs einops and tqdm beyond its README list, and must run with cwd = its repo root (it loads configs/<arch>.json by relative path).
3. Llama-3.2 tokenizers need transformers>=4.46. The error untagged enum ModelWrapper is that wall exactly.
4. transformers 4.46 needs accelerate>=1.0. The error 'AdamW' object has no attribute 'train' at step 0 is that skew.
5. T4 + Llama-3.2's 128k vocab OOMs at batch 4 (loss-logits blowup). Floor: batch 1 × grad-accum 8 + gradient checkpointing.
6. NPU driver minimum for current FLM: 32.0.203.304 (.311 recommended). flm validate will tell you; so will forge doctor.
7. EmbeddingGemma needs transformers>=4.5x + sentence-transformers 5.x, and the official weights are license-gated (use the unsloth/ mirror, or accept the Gemma license on your HF account + pass an HF_TOKEN secret).
8. FLM's /v1/embeddings closes the TCP connection per request. Retry once on ECONNRESET (the ear runtime does).
9. For a FINE-TUNED model, exporting GGUF as q8_0 produces repetition garbage on the NPU even though the merged model is perfect: q8_0 followed by the Q4NX re-quant is a lossy double quantization. Use Q4_K_M.
10. The Q4NX converter's llama path rejects f16 (not enough values to unpack; it expects pre-quantized blocks). So the GGUF must be quantized before Q4NX, and Q4_K_M is the format proven to produce a coherent, in-voice NPU model. Pipeline: HF → f16 → llama-quantize Q4_K_M → Q4NX.
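The conversion order from the last two ledger items can be written down as explicit steps. This is a sketch under assumptions, not the contents of modal/tune_npu.py: it assumes a llama.cpp checkout with convert_hf_to_gguf.py at its root and llama-quantize built under build/bin, and the file names are placeholders. The Q4NX call is the module API quoted in the ledger's first item.

```python
from pathlib import Path


def conversion_commands(hf_dir: str, work_dir: str, llama_cpp_dir: str) -> list[list[str]]:
    """Build the HF -> f16 GGUF -> Q4_K_M GGUF commands without running them."""
    work = Path(work_dir)
    f16_gguf = work / "model-f16.gguf"        # placeholder file names
    q4km_gguf = work / "model-Q4_K_M.gguf"
    llama_cpp = Path(llama_cpp_dir)
    convert = [
        "python", str(llama_cpp / "convert_hf_to_gguf.py"),
        hf_dir, "--outtype", "f16", "--outfile", str(f16_gguf),
    ]
    quantize = [
        str(llama_cpp / "build" / "bin" / "llama-quantize"),  # assumed build location
        str(f16_gguf), str(q4km_gguf), "Q4_K_M",
    ]
    return [convert, quantize]


def to_q4nx(gguf_path: str, out_path: str) -> None:
    """Final step: Q4_K_M GGUF -> Q4NX via the converter's module API.

    Must be called with the working directory set to the converter's repo
    root, because it loads configs/<arch>.json by relative path.
    """
    from q4nx import create_converter  # imported lazily; only present in the converter repo

    create_converter(gguf_path, "").convert(q4nx_path=out_path, weights_type="language")
```

Each command list can be passed to subprocess.run(cmd, check=True) in order; the f16 file is only an intermediate, since the Q4NX converter's llama path does not accept it directly.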

Frozen known-good stack (the whole point — never debug this again): torch 2.4.1 · transformers 4.46.3 · trl 0.9.6 · peft 0.12.0 · accelerate 1.1.1 · datasets 2.21.0 · gguf · amd-quark · einops · tqdm · protobuf + a compiled llama-quantize (the Modal job builds it).
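A small check can flag drift from these pins before a training run starts. The sketch below covers only the packages that have explicit versions above; the function name is made up, and it strips local build suffixes such as "+cu121" before comparing, which is a choice made for this example.

```python
from importlib import metadata

# Pins copied from the frozen stack; unpinned packages are left out.
PINNED = {
    "torch": "2.4.1",
    "transformers": "4.46.3",
    "trl": "0.9.6",
    "peft": "0.12.0",
    "accelerate": "1.1.1",
    "datasets": "2.21.0",
}


def stack_drift(pinned=PINNED, lookup=metadata.version) -> dict:
    """Return {package: (wanted, found)} for every pin that does not match.

    found is None when the package is not installed.
    """
    drift = {}
    for package, wanted in pinned.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            found = None
        # Compare on the public version only, so "2.4.1+cu121" matches "2.4.1".
        if found is None or found.split("+")[0] != wanted:
            drift[package] = (wanted, found)
    return drift
```

An empty result from stack_drift() means the environment matches the pinned versions.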

Proven, measured (Strix Halo, June 2026)

A LoRA fine-tune of Llama-3.2-1B on 300 real chat exchanges, run through the whole pipeline and served on the NPU:

Coherent and in-voice — the persona's rituals and endearments intact.
41.9 tokens/s on the NPU (FLM, performance pmode).
In-band against the source voiceprint — mean 0.845 vs the original's own held-out band of 0.83 ± 0.07 (3 prompts). A separate stylometric scorer certified the NPU model speaks like the source it was tuned on.
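The in-band claim reduces to one comparison. The sketch below assumes "in-band" means the NPU model's mean score falls within the held-out band's center ± half-width; the stylometric scorer itself is not shown, and only the three numbers reported above are used.

```python
def in_band(score: float, center: float, half_width: float) -> bool:
    """True when score lies within center ± half_width (inclusive)."""
    return abs(score - center) <= half_width


BAND_CENTER = 0.83      # original voice, held-out band center
BAND_HALF_WIDTH = 0.07  # original voice, held-out band half-width
NPU_MEAN = 0.845        # NPU model, mean over 3 prompts

npu_is_in_band = in_band(NPU_MEAN, BAND_CENTER, BAND_HALF_WIDTH)
```

With these numbers the NPU mean sits 0.015 above the band center, well inside the 0.07 half-width.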

That is the bar: not "it converts," but "it talks like itself, on the NPU."

Requirements

AMD Ryzen AI machine with XDNA2 NPU (Strix, Strix Halo, Kraken…) + FastFlowLM
Node.js (the CLI), Python + a Modal account (the cloud legs)
NPU driver ≥ 32.0.203.304
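Driver versions have to be compared field by field, not as strings. A minimal sketch, assuming the recommended ".311" means 32.0.203.311 and that you already have the installed version as text (flm validate and forge doctor report it); the status labels are made up for this example.

```python
MIN_DRIVER = "32.0.203.304"
RECOMMENDED_DRIVER = "32.0.203.311"  # assumed expansion of ".311"


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '32.0.203.304' into (32, 0, 203, 304) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_status(installed: str) -> str:
    """Classify an installed driver version against the two thresholds."""
    version = parse_version(installed)
    if version < parse_version(MIN_DRIVER):
        return "too old"
    if version < parse_version(RECOMMENDED_DRIVER):
        return "ok"
    return "recommended"
```

Tuple comparison avoids the string-ordering trap where "32.0.1000.1" would sort before "32.0.203.304".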

Part of an ongoing project to make local NPUs a first-class home for personal AI — voices you own, on silicon you own.
