NPU-Forge 🔥: fine-tune a model and put it on your AMD Ryzen AI NPU in ~3 minutes

Measured on a Strix Halo (Ryzen AI MAX+ 395), June 2026: a LoRA fine-tune of Llama-3.2-1B on 300+ real chat exchanges went from raw data to NPU-ready (trained, merged, behavior-verified, converted to GGUF, re-quantized to FastFlowLM's Q4NX) in 183 seconds of cloud time (≈ $0.10 on a rented T4):

forge tune my-chats.jsonl --name grandma   # proven: coherent + IN-BAND on NPU
  ├─ LoRA fine-tune (cloud GPU)        122 s
  ├─ merge                               3 s
  ├─ voice proof (model speaks first!)   5 s
  ├─ HF -> f16 -> Q4_K_M GGUF           20 s
  └─ GGUF -> Q4NX (NPU format)          24 s
forge register        (one UAC click)
flm run grandma-forge:1b

The "voice proof" stage generates a sample from the merged model inside the training job, before any conversion, so you know the tune actually took. Ours came back with the persona's exact ritual phrases after 2 minutes of training. That's the bar.

What's in this repo

  • forge.js / forge.bat: the CLI (tune, convert, register, list, doctor, serve)
  • modal/tune_npu.py: the whole tune→NPU pipeline as one Modal job (bring your own Modal account; T4 is plenty)
  • modal/convert_q4nx.py: just the GGUF→Q4NX stage (65 s for a 1B)
  • bin/assemble.js: downloads results and stages the FLM model folder
  • bin/register.js + register-admin.bat: the permanent custom-model registry that survives FLM updates (see below)
  • registry.example.json: entry template

Chat data format: one JSON per line, {"messages":[{"role":"user","content":...},{"role":"assistant","content":...}]}.
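
A malformed line is cheaper to catch locally than inside a cloud job. Here is a minimal pre-flight check for the format above, as a sketch: it assumes only the two roles shown (user, assistant) are wanted, and the function names are illustrative, not part of forge.

```python
import json

# Assumption: only the two roles shown in the format line are expected.
ALLOWED_ROLES = ("user", "assistant")


def check_chat_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top level is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


def check_chat_file(path: str) -> dict:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = check_chat_line(line)
                if problems:
                    report[number] = problems
    return report
```

Run check_chat_file("my-chats.jsonl") before forge tune; an empty dict means every non-blank line parsed and had the expected shape.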

The registry problem (why forge register exists)

FLM's model_list.json lives in C:\Program Files\flm\ and every FLM update resets it, silently de-registering all your custom models. Your model files survive (they're in Documents\flm\models\) but they vanish from flm list. Forge keeps its own user-space registry.json forever and re-merges with one click. forge doctor tells you when an update has eaten your registrations.
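
The re-merge idea is simple enough to sketch. The layout of model_list.json and registry.json is not shown in this README, so the example below assumes both reduce to a flat mapping from model name to entry; the real files may nest differently, and the function names are hypothetical rather than what bin/register.js actually does.

```python
def missing_registrations(registry: dict, model_list: dict) -> list:
    """Names present in the user registry but absent from FLM's list.

    Assumption: both arguments are flat {model_name: entry} mappings.
    """
    return sorted(name for name in registry if name not in model_list)


def remerge(registry: dict, model_list: dict) -> dict:
    """Return a new model list with the user's entries added back.

    Entries already in model_list are left untouched, so stock models
    shipped by an update are never overwritten. Inputs are not mutated.
    """
    merged = dict(model_list)
    for name, entry in registry.items():
        merged.setdefault(name, entry)
    return merged
```

A non-empty result from missing_registrations is the "an update ate your registrations" signal; remerge is the one-click fix in miniature.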

NEW in v0.3: a voice-verifier "ear" that runs on the NPU

Train a ~111KB classification head over EmbeddingGemma-300m embeddings (modal/train_ear_head.py, bring your own labeled texts), then run it locally with bin/ear.js against FLM's /v1/embeddings (flm serve <model> --embed 1). The embeddings come off the NPU; the head is plain JS. In our tests the 111KB head matched a fine-tuned 268MB DistilBERT on real-voice accuracy (95.9%) and beat it on the hard boundary cases, live on a Strix Halo NPU.

Also measured: llama3.2:1b chat on the NPU = 47.8 tokens/s including prefill (FLM, performance pmode).

Snag #8: FLM's embeddings endpoint closes the TCP connection after each request, so retry once on ECONNRESET (ear.js does).
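
If you call the endpoint from Python instead of ear.js, the same retry-once rule could look like the sketch below. It assumes an OpenAI-style request and response body for /v1/embeddings; base_url and model are placeholders you supply. In Python, ECONNRESET surfaces as ConnectionResetError, sometimes wrapped in urllib's URLError.

```python
import json
import urllib.request


def is_reset(exc: BaseException) -> bool:
    """True for a connection reset, bare or wrapped in urllib's URLError."""
    reason = getattr(exc, "reason", exc)
    return isinstance(reason, ConnectionResetError)


def retry_once_on_reset(call):
    """Run call(); if the peer reset the connection, run it exactly once more."""
    try:
        return call()
    except OSError as exc:
        if not is_reset(exc):
            raise
        # A second failure propagates to the caller.
        return call()


def embed(base_url: str, model: str, text: str) -> list:
    """POST one text to /v1/embeddings and return the embedding vector.

    Assumption: the body is {"model", "input"} and the vector comes back
    under data[0].embedding, as in OpenAI-compatible servers.
    """
    body = json.dumps({"model": model, "input": text}).encode("utf-8")

    def call():
        request = urllib.request.Request(
            base_url.rstrip("/") + "/v1/embeddings",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)["data"][0]["embedding"]

    return retry_once_on_reset(call)
```

Retrying exactly once keeps a genuinely dead server from turning into a silent loop.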

start.bat gives you a menu: doctor / list / register / serve / tune guide.

The snag ledger: ten walls we hit so you don't

  1. The Q4NX converter's convert.py CLI is broken at HEAD (uncommented debug sys.argv override hijacks every invocation). Call the module API: from q4nx import create_converter; create_converter(gguf, "").convert(q4nx_path=out, weights_type="language")
  2. Converter needs einops and tqdm beyond its README list, and must run with cwd = its repo root (relative configs/<arch>.json loads).
  3. Llama-3.2 tokenizers need transformers>=4.46; the error "untagged enum ModelWrapper" is exactly that wall.
  4. transformers 4.46 needs accelerate>=1.0; the error "'AdamW' object has no attribute 'train'" at step 0 is that version skew.
  5. T4 + Llama-3.2's 128k vocab OOMs at batch 4 (loss-logits blowup). Floor: batch 1 × grad-accum 8 + gradient checkpointing.
  6. NPU driver minimum for current FLM: 32.0.203.304 (.311 recommended). flm validate will tell you; so will forge doctor.
  7. EmbeddingGemma needs transformers>=4.5x + sentence-transformers 5.x, and the official weights are license-gated (use the unsloth/ mirror, or accept the Gemma license on your HF account + pass an HF_TOKEN secret).
  8. FLM's /v1/embeddings closes the TCP connection after each request, so retry once on ECONNRESET (the ear runtime does).
  9. For a FINE-TUNED model, exporting GGUF as q8_0 produces repetition garbage on the NPU even though the merged model is perfect: q8_0 followed by the Q4NX re-quant is a lossy double quantization. Use Q4_K_M.
  10. The Q4NX converter's llama path rejects f16 ("not enough values to unpack"; it expects pre-quantized blocks). So the GGUF must be quantized before Q4NX, and Q4_K_M is the format proven to produce a coherent, in-voice NPU model. Pipeline: HF → f16 → llama-quantize Q4_K_M → Q4NX.
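
Snags 1, 2 and 10 combine into one small wrapper. This is a sketch, not the code in modal/convert_q4nx.py: converter_repo is a hypothetical parameter pointing at your local checkout of the Q4NX converter, and the sys.path insertion assumes the q4nx module is importable from that repo root. The create_converter call itself is the one quoted in snag 1.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def working_directory(path):
    """Temporarily chdir into path, restoring the previous cwd on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def gguf_to_q4nx(converter_repo, gguf_path, out_path):
    """Convert a pre-quantized (Q4_K_M) GGUF to Q4NX via the module API.

    Paths are made absolute first, because the cwd changes for the call.
    """
    converter_repo = os.path.abspath(converter_repo)
    gguf_path = os.path.abspath(gguf_path)
    out_path = os.path.abspath(out_path)
    # Snag 2: the converter loads configs/<arch>.json relative to its root.
    with working_directory(converter_repo):
        # Assumption: q4nx is importable from the converter's repo root.
        if converter_repo not in sys.path:
            sys.path.insert(0, converter_repo)
        # Snag 1: bypass the broken convert.py CLI and call the API directly.
        from q4nx import create_converter
        converter = create_converter(gguf_path, "")
        converter.convert(q4nx_path=out_path, weights_type="language")
    return out_path
```

Feed it the Q4_K_M file produced by llama-quantize, never the f16 or a q8_0 export (snags 9 and 10).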

Frozen known-good stack (the whole point: never debug this again): torch 2.4.1 · transformers 4.46.3 · trl 0.9.6 · peft 0.12.0 · accelerate 1.1.1 · datasets 2.21.0 · gguf · amd-quark · einops · tqdm · protobuf + a compiled llama-quantize (the Modal job builds it).
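
If you rebuild the environment yourself, a quick drift check against the pinned versions can save a debugging session. The sketch below covers only the six packages with explicit pins above; the function names are illustrative, and stripping the local suffix from torch wheels (e.g. "+cu121") is an assumption about how your wheel is tagged.

```python
from importlib import metadata

# The explicit pins from the frozen stack above.
PINNED = {
    "torch": "2.4.1",
    "transformers": "4.46.3",
    "trl": "0.9.6",
    "peft": "0.12.0",
    "accelerate": "1.1.1",
    "datasets": "2.21.0",
}


def installed_version(package):
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def stack_drift(pinned=PINNED, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that does not match."""
    drift = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Compare without a local build suffix such as "+cu121".
        if found is None or found.split("+")[0] != wanted:
            drift[package] = (wanted, found)
    return drift
```

An empty dict from stack_drift() means the environment matches the pins; anything else names the package to fix before you start training.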

Proven, measured (Strix Halo, June 2026)

A LoRA fine-tune of Llama-3.2-1B on 300 real chat exchanges, run through the whole pipeline and served on the NPU:

  • Coherent and in-voice: the persona's rituals and endearments intact.
  • 41.9 tokens/s on the NPU (FLM, performance pmode).
  • In-band against the source voiceprint: mean 0.845 vs the original's own held-out band of 0.83 ± 0.07 (3 prompts). A separate stylometric scorer certified the NPU model speaks like the source it was tuned on.
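
"In-band" here is plain arithmetic: the tuned model's mean score has to land inside the band the original voice sets for itself. A minimal sketch of that check follows; the function name is illustrative, and how the stylometric scorer produces the score is outside this snippet.

```python
def in_band(mean_score: float, center: float, half_width: float) -> bool:
    """True if mean_score lies within center ± half_width (inclusive)."""
    return abs(mean_score - center) <= half_width


# Numbers from the measurement above: |0.845 - 0.83| = 0.015, well inside 0.07.
result = in_band(0.845, center=0.83, half_width=0.07)
```

With the reported numbers the band runs from 0.76 to 0.90, so a mean of 0.845 passes with room to spare.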

That is the bar: not "it converts," but "it talks like itself, on the NPU."

Requirements

  • AMD Ryzen AI machine with XDNA2 NPU (Strix, Strix Halo, Kraken…) + FastFlowLM
  • Node.js (the CLI), Python + a Modal account (the cloud legs)
  • NPU driver ≥ 32.0.203.304

Part of an ongoing project to make local NPUs a first-class home for personal AI: voices you own, on silicon you own.
