Granite 4.0-H 1B / 350M β€” Apple Core AI (.aimodel)

IBM Granite 4.0-H (Mamba2 + attention hybrid; 1b: 36 Mamba2 mixers + 4 NoPE GQA attention layers) converted to Apple Core AI for iOS 27 / macOS 27 (beta), riding Apple's coreai-pipelined GPU engine via the decode-only loop-free export β€” async encode, on-GPU argmax sampling, on-device KV growth, zero custom kernels.

The first SSM-scan architecture on this path: at S=1 the Mamba2 selective scan is a single recurrence step (no while_loop in the graph), and the conv/SSM states ride as two fixed-shape extra states β€” the same shape-class as Qwen3.5's GDN, so no engine changes beyond the existing patch stack.

surface bundle prefill (S=1) decode
1b int8hu (int8 head), iPhone 17 Pro (one-shot runner) β€” device ship 1.79 GB 35.1–37.0 35.4–37.1 tok/s
1b int8hu (int8 head), M4 Max 1.79 GB 134.9 134.2 tok/s
1b int8lin, M4 Max (release llm-benchmark, p=128 g=256) β€” Mac ship 1.63 GB 136.7 136.5 tok/s
1b int8lin, iPhone 17 Pro (one-shot runner) 1.63 GB 30.1–32.2 30.2–31.3 tok/s
350m fp16, M4 Max 0.66 GB 193.2 191.1 tok/s

Numerics: 16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded decode step (the zoo ship gate, on an oracle whose top-2 margin is β‰₯ 0.1 at every position), and the iPhone greedy sequences are 24/24 token-identical to the Mac GPU on both fixed prompts, both runs.

Bundles

  • gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/ β€” int8lin + the tied lm_head untied and quantized absmax per-block-32 int8 (symmetric, no clipping β€” clipping corrupts big-vocab heads), 1.79 GB. The device ship: +17–21% decode on iPhone (the head was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat, the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics 24/24 ≑ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0 int8 is broken on the current beta GPU delegate β€” garbage logits.)
  • gpu-pipelined/granite_4_0_h_1b_decode_int8lin/ β€” full LanguageBundle (metadata.json + tokenizer/ + .aimodel), int8 linear per-block-32 (scale-multiply dequant, no LUT), fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2) and the lighter iPhone alternative.
  • gpu-pipelined/granite_4_0_h_350m_decode_fp16/ β€” the 350m as fp16, 0.66 GB. At this size the model is overhead-bound, not bandwidth-bound: int8 measured slower than fp16 (185.8 vs 191.1) and fails the oracle gate (shared_mlp.output_linear is block-32-sensitive), so the 350m ships fp16.

input_ids is STATIC [1,1] (the selective scan at S=1 ≑ one loop-free recurrence step); position_ids + KV seq stay dynamic, so EngineFactory classifies the bundle dynamic β†’ pipelined engine. States: growing KV (4 attention layers) + conv columns [36,1,conv_dim,3]

  • SSM state [36,1,48,64,128] (fixed shape, carried by the extra-states patch).

Run

Needs the engine patch stack from the zoo (apps/coreai-shared-product.patch β†’ apps/coreai-pipelined-extra-states.patch; Apple's repo is issues-only, so capabilities ship as patches), then:

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model granite_4_0_h_1b_decode_int8lin -p 128 -g 256 -n 3
  • COREAI_CHUNK_THRESHOLD=1 before engine creation β€” prefill runs as pipelined S=1 steps (prompt tok/s β‰ˆ decode tok/s).
  • Never call engine.warmup() β€” it warms query length 256 and the static [1,1] graph rejects it. A 1-token generate after load is the warmup (llm-runner needs --warmup exact --warmup-length 1).
  • Benchmark Release builds only (a Debug engine measures ~3Γ— slow).

iPhone

The 1b int8lin runs at ~31 tok/s β‰ˆ ~84% of the naive bandwidth ceiling (~60 GB/s Γ· 1.6 GB/token β‰ˆ 37) on an iPhone 17 Pro β€” effectively memory-bandwidth saturated; the SSM scan costs nothing extra at S=1. Cold GPU specialization 5.7 s, warm loads 1.9 s (content-keyed cache β€” no AOT compile needed, and at 1.6 GB no increased-memory entitlement is required, unlike 2B-class bundles).

Reproduce

Conversion script (self-contained) + method page in the zoo: conversion/export_granite4h_decode_pipelined.py (int8lin, or fp16 --hf-id ibm-granite/granite-4.0-h-350m) Β· zoo/granite-4.0-h.md (includes the 350m int8 post-mortem and the oracle-margin rule) Β· knowledge/pipelined-engine.md

License

Model weights: Apache-2.0 (IBM Granite; LICENSE included). Conversion code: BSD-3-Clause (see the zoo).

Downloads last month
12
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mlboydaisuke/granite-4.0-h-CoreAI

Finetuned
(10)
this model