Nanbeige4.1-3B (text decoder) - Core AI
Plain-Llama dense decoder (Nanbeige LLM Lab): 32 layers, GQA 20 q / 4 kv heads, head_dim 128,
hidden 2560, SwiGLU intermediate 10496, vocab 166144 (untied lm_head), RMSNorm eps 1e-5, RoPE
θ=70M, context 262144; no QK-norm, no qkv/mlp bias (the textbook Llama shape; model_type: "llama",
3.93B total / ~3B non-embedding backbone). Source: Nanbeige/Nanbeige4.1-3B (Apache-2.0). A reasoning /
agentic model whose first-party card claims it beats Qwen3-4B and rivals Qwen3-32B / Qwen3-30B-A3B
(LiveCodeBench-Pro-Easy 81.4 vs 40.2, AIME 2026-I 87.4, GPQA 83.8): a 32B-class reasoner at 3.93B,
running on an iPhone.
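The parameter totals quoted above can be cross-checked from the architecture numbers alone. The following is a minimal sketch, assuming a standard bias-free Llama layer layout (q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm vectors per layer, one final norm); the function name `count_params` is illustrative and not part of the conversion code.

```python
# Architecture numbers copied from the description above.
HIDDEN = 2560
LAYERS = 32
Q_HEADS, KV_HEADS, HEAD_DIM = 20, 4, 128
INTERMEDIATE = 10496
VOCAB = 166144


def count_params() -> dict:
    """Count weights for a bias-free Llama-style decoder with an untied head.

    Assumes the textbook layout; not read from the actual checkpoint.
    """
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # k_proj + v_proj (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, down (SwiGLU)
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    backbone = LAYERS * per_layer + HIDDEN  # + final RMSNorm
    embed = VOCAB * HIDDEN
    lm_head = VOCAB * HIDDEN  # untied: stored separately from the embedding
    return {
        "backbone": backbone,
        "embed": embed,
        "lm_head": lm_head,
        "total": backbone + embed + lm_head,
    }


if __name__ == "__main__":
    for name, n in count_params().items():
        print(f"{name:9s} {n / 1e9:.2f}B")
```

Under these assumptions the count comes to about 3.93B total with a backbone of roughly 3.08B, which agrees with the "3.93B total / ~3B non-embedding backbone" figure.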
⬇️ Converted .aimodel bundle (ready to run):
mlboydaisuke/Nanbeige4.1-3B-CoreAI →
gpu-pipelined/nanbeige4_1_3b_decode_int8hu_block32_sym_s1/ (full LanguageBundle incl. tokenizer).
The first plain-Llama model on the pipelined-engine fast path:
it reuses qwen3.py MINUS the q/k-norm (qwen3 already has a bias-free fused QKV), so the body is the
existing overlay with one norm removed; see models/macos/llama.py. Pure-attention, KV-only state
(no conv / recurrent), so it needs no engine patch beyond the base stack.
Measured (macOS + iOS 27 beta, release builds, p=128 g=256, COREAI_CHUNK_THRESHOLD=1)
| config | bundle | prefill tok/s | decode tok/s | numerics |
|---|---|---|---|---|
| int8hu --head-sym (ship), M4 Max | 4.3 GB | 114.9 | 114.5 | engine ≡ fp32-HF oracle (raw greedy → "Paris"); reasoning coherent (trick "17 sheep, all but 9" → reasons to 9) |
| int8hu --head-sym --static-ids (_s1, ship), iPhone 17 Pro | 4.3 GB | 16.6 | 15.9 | nat 24/24 + oracle 24/24, token-identical to the M4 Max GPU reference (Paris / Tokyo + full continuation) |
- Loads on iPhone 17 Pro: cold GPU specialization, engine ready 53.5 s, device free 51 GB, no jetsam / no std::bad_alloc; the largest bundle we have run on the pipelined bench (4.58 GB payload).
- --static-ids is REQUIRED for the device. The generic dynamic-input_ids export is fast on the Mac, but on the iPhone pipelined engine (chunkThreshold=1, every step S=1) it pays a per-step input_ids re-specialization that is pathological on a 4.3 GB model (~37 s/step cold; the 900 s probe never finished the first 24-token run). Fixing input_ids at [1,1] (the qwen3.5 loop-free device pattern; --static-ids → _s1 bundle) eliminates it (chunkThreshold=1 feeds S=1 anyway, so no prefill loss), and the device numerics complete 24/24.
- The untied 166144-vocab head is ~0.85 GB; quantize it absmax per-block-32 int8 (--head-sym, plain symmetric). symmetric_with_clipping craters big-vocab heads (the documented qwen lever).
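As an illustration of what "absmax per-block-32 int8, plain symmetric" means, here is a minimal sketch in pure Python. It is not the exporter's implementation: the function names are hypothetical, and the real conversion operates on tensors rather than lists. Each block of 32 weights gets one scale equal to its largest magnitude divided by 127, with no clipping of outliers.

```python
def quantize_block_sym(block):
    """Symmetric absmax int8 for one block: scale = max|w| / 127."""
    absmax = max(abs(w) for w in block)
    scale = absmax / 127.0 if absmax > 0 else 1.0  # all-zero block: any scale works
    # Round to the nearest integer; the clamp only guards float edge cases.
    q = [max(-127, min(127, round(w / scale))) for w in block]
    return q, scale


def quantize_row(row, block_size=32):
    """Split one weight row into blocks and quantize each independently."""
    return [
        quantize_block_sym(row[i:i + block_size])
        for i in range(0, len(row), block_size)
    ]


def dequantize_row(blocks):
    """Reconstruct approximate float weights from (int8 values, scale) pairs."""
    return [q * scale for qs, scale in blocks for q in qs]


if __name__ == "__main__":
    import math

    # Toy row of 64 weights (two blocks); values are made up for illustration.
    row = [math.sin(i) * (1 + i / 10) for i in range(64)]
    blocks = quantize_row(row)
    recon = dequantize_row(blocks)
    worst = max(abs(a - b) for a, b in zip(row, recon))
    print(f"blocks={len(blocks)} worst abs error={worst:.5f}")
```

Because the scale is taken from the true per-block maximum, the largest weight in each block is represented without clipping and every other weight is off by at most half a quantization step.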
int4: NO-GO, int8 is this reasoning model's floor
int4hu (body int4 per-block-32 + int8 head) is 2.9 GB and 169 tok/s on the Mac, and its raw
single-token greedy still returns "Paris", but multi-token reasoning CRATERS: the same
"17 sheep, all but 9 run away" trick collapses to a wrong "17" with a repetition loop and Chinese
drift. The single-token probe is misleading for a reasoning model; you must check multi-token.
This is the non-QAT-int4 structural cliff (same wall as qwen3.5 / LFM2.5; needs QAT). Palettized
(k-means) int4 does not rescue it either: for non-QAT weights the cliff is a scheme-independent
property, and on the GPU-pipelined path the LUT dequant is slower than linear besides. int8hu ships.
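The gap between a passing single-token probe and a failing multi-token run can be made mechanical. The sketch below is an assumption about how such a check could look, not the check used for the results above; the helper names and the toy token ids are invented for illustration.

```python
def matching_prefix(candidate, reference):
    """Number of leading tokens on which candidate and reference agree."""
    n = 0
    for c, r in zip(candidate, reference):
        if c != r:
            break
        n += 1
    return n


def has_repetition_loop(tokens, max_period=8, min_repeats=3):
    """True if the sequence ends in at least min_repeats copies of a short cycle."""
    for period in range(1, max_period + 1):
        need = period * min_repeats
        if len(tokens) < need:
            continue
        tail = tokens[-need:]
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(need)):
            return True
    return False


if __name__ == "__main__":
    # Made-up token ids: the candidate matches at first, then loops.
    reference = [11, 12, 13, 14, 15, 16, 17, 18]
    candidate = [11, 12, 40, 41, 40, 41, 40, 41]
    print("first token ok:", candidate[0] == reference[0])
    print("matching prefix:", matching_prefix(candidate, reference))
    print("repetition loop:", has_repetition_loop(candidate))
```

In this toy case the first-token probe passes while the sequence-level checks flag both the early divergence and the loop, which is the failure shape described for int4hu.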
ANE: right architecture class, wrong size
Plain-dense is the one class that could ride the ANE (where the LUT-friendly palettized weights run native-fast, unlike on the GPU). But the ANE sweet spot is the ~0.6–1B rung (tied head): a 0.6B fully-palettized model rides the ANE blazing. At 3.93B + a 166144 untied head Nanbeige overruns the ANE working set, so it ships GPU-pipelined like the rest of the dense line. The ANE-blazing target is a 1B plain-dense model, not this one.
Numerics gating
- Parity ladder (fp32 eager vs native HF LlamaForCausalLM oracle, no trust_remote_code): teacher-forced top-1 24/24, cosine 1.000000, max-abs-logit Δ = 0 (_smoke/test_nanbeige_parity.py, USE_HF_IMPL=true).
- Engine gate: raw-token greedy on the int8hu bundle reproduces the fp32 oracle's first token ("Paris"); reasoning output coherent and correct.
- Device gate: iPhone greedy sequences 24/24 token-identical to the Mac reference on both fixed prompts (_smoke/gen_nanbeige_device_ref_tokens.py). Reasoning models drift on a bare prompt after the answer; the first token is the anchor (Paris 9965 / Tokyo 20150) and the full 24 still matched here.
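The three parity-ladder numbers (top-1 agreement, cosine, max-abs-logit Δ) can be computed from two sets of per-step logits. The following is a minimal sketch of those metrics in pure Python; it is not the contents of _smoke/test_nanbeige_parity.py, and `parity_report` is a hypothetical name. It assumes each logit vector is non-zero.

```python
import math


def argmax(xs):
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)


def parity_report(candidate_logits, oracle_logits):
    """Compare per-step logit vectors from a candidate against an oracle.

    Both arguments are lists of equal-length float lists, one per
    teacher-forced step. Returns top-1 matches, the worst (minimum)
    cosine similarity, and the largest absolute logit difference.
    """
    top1 = 0
    min_cos = 1.0
    max_abs = 0.0
    for cand, ref in zip(candidate_logits, oracle_logits):
        top1 += argmax(cand) == argmax(ref)
        dot = sum(a * b for a, b in zip(cand, ref))
        norm = math.sqrt(sum(a * a for a in cand)) * math.sqrt(sum(b * b for b in ref))
        min_cos = min(min_cos, dot / norm)  # assumes non-zero vectors
        max_abs = max(max_abs, max(abs(a - b) for a, b in zip(cand, ref)))
    return {"top1": top1, "steps": len(oracle_logits), "min_cosine": min_cos, "max_abs_delta": max_abs}


if __name__ == "__main__":
    # Made-up 3-step, 4-token-vocab logits; the candidate is an exact copy.
    oracle = [[0.1, 2.0, -1.0, 0.3], [1.5, 0.2, 0.0, -0.4], [-0.2, 0.1, 3.0, 0.5]]
    candidate = [list(step) for step in oracle]
    print(parity_report(candidate, oracle))
```

A bit-exact candidate yields full top-1 agreement and a max-abs Δ of exactly 0, matching the shape of the 24/24, cosine 1.000000, Δ = 0 result reported for the fp32 eager path.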
Convert it yourself
cd coreai-models # with the plain-Llama overlay (models/macos/llama.py) in place
# device ship (REQUIRED static [1,1] for fast iPhone decode):
.venv/bin/python ../coreai-models-community/conversion/export_nanbeige41_decode_pipelined.py \
int8hu --head-sym --static-ids
COREAI_CHUNK_THRESHOLD=1 ./.build/out/Products/Release/llm-benchmark \
--model exports/nanbeige4_1_3b_decode_int8hu_block32_sym_s1 -p 128 -g 256 -n 3
Run contract: COREAI_CHUNK_THRESHOLD=1 before engine creation; the bundle's input_ids is static
[1,1], so every prefill token is fed as an S=1 step (never call engine.warmup(); warm with a
1-token generate; llm-runner needs --warmup exact --warmup-length 1).
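The control flow implied by a static [1,1] input can be sketched as follows. This is only an illustration of the S=1 feeding pattern: `step` is a hypothetical stand-in for one engine call that consumes a single token id and returns the greedy next id, and it does not correspond to any real engine API.

```python
def generate_s1(step, prompt_ids, max_new_tokens):
    """Feed every prompt token as its own S=1 step, then decode greedily.

    step(token_id) -> next_token_id is a HYPOTHETICAL callable standing in
    for one engine invocation on a static [1,1] input_ids tensor.
    """
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    next_id = None
    for tok in prompt_ids:  # "prefill" is just len(prompt_ids) single-token steps
        next_id = step(tok)
    generated = []
    while len(generated) < max_new_tokens:
        generated.append(next_id)
        if len(generated) < max_new_tokens:  # skip the step whose output is unused
            next_id = step(next_id)
    return generated


if __name__ == "__main__":
    calls = []

    def toy_step(tok):
        """Toy model: records the call and predicts tok + 1."""
        calls.append(tok)
        return tok + 1

    print(generate_s1(toy_step, [5, 6, 7], 3), "engine steps:", len(calls))
```

With a prompt of p tokens and g generated tokens this loop makes p + g - 1 single-token calls, which is why prefill and decode throughput land so close together in the measurements when chunkThreshold=1.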
License
Model weights: Apache-2.0 (Nanbeige LLM Lab upstream). The conversion code in this repo is BSD-3-Clause. Redistribution retains the upstream notices.