A13 / 3rd-gen Neural Engine: text_encoder + vector_estimator plan build fails; 8-step too slow, 4-step quality drops

by shlaikov - opened 2 days ago

Device: iPhone 11 Pro Max, A13 Bionic (3rd-gen ANE), iOS 26.0.1.
Models: Reza2kn/supertonic-3-coreml/fp16, compiled to .mlmodelc via xcrun coremlc compile. coremltools 8.3.0.

What we observed

We tried a compute_units fallback ladder [.all → .cpuAndGPU → .cpuOnly] for each stage. The first setting that loads and runs wins:

| Stage | `.all` (ANE+GPU+CPU) | `.cpuAndGPU` | `.cpuOnly` |
|---|---|---|---|
| duration_predictor | ok ~1.1s | - | - |
| text_encoder | FAIL "Error in building plan." (~5s) | FAIL same (~5s) | ok ~1s |
| vector_estimator | (skipped, planner can wedge for minutes) | ok ~3.6s | ok |
| vocoder | ok ~2.3s | - | - |

So on A13: duration_predictor (DP) and vocoder (Voc) run on ANE, text_encoder (TE) on scalar CPU, and vector_estimator (VE) on the Metal GPU.
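For anyone trying to reproduce this, the ladder logic can be sketched as follows. This is a minimal, language-neutral illustration in Python, not our app code: `load_with_fallback`, `LADDER`, and `fake_text_encoder_loader` are hypothetical names, and the fake loader only mimics the text_encoder row of the table. On device, the compute-unit choice would go through `MLModelConfiguration.computeUnits`.

```python
from typing import Callable, Sequence, Tuple, TypeVar

M = TypeVar("M")

# Compute-unit names mirror the ladder in the post; they are plain strings here.
LADDER = (".all", ".cpuAndGPU", ".cpuOnly")


def load_with_fallback(
    load: Callable[[str], M],
    ladder: Sequence[str] = LADDER,
    skip: Sequence[str] = (),
) -> Tuple[str, M]:
    """Try each compute-unit setting in order; return the first that loads."""
    errors = []
    for units in ladder:
        if units in skip:
            # e.g. skip ".all" for a stage where the planner can hang
            continue
        try:
            return units, load(units)
        except Exception as exc:  # a plan-build failure surfaces as a load error
            errors.append(f"{units}: {exc}")
    raise RuntimeError("no compute unit worked: " + "; ".join(errors))


# Hypothetical loader that reproduces the text_encoder row of the table.
def fake_text_encoder_loader(units: str) -> str:
    if units != ".cpuOnly":
        raise RuntimeError("Error in building plan.")
    return "text_encoder"


chosen, model = load_with_fallback(fake_text_encoder_loader)
print(chosen)
```

The `skip` argument is how a stage like vector_estimator could bypass `.all` entirely instead of waiting on a planner that may not return.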

Per-chunk timing (T=L=320, FP16, voice F1, 8 Euler steps)

DP: 15–60 ms · TE: 17–50 ms · Voc: 250–400 ms
VE: 10–14 s per chunk (~1.3 s/step on A13 GPU; cold 10 s, throttles to 14 s after ~80 s sustained)
Total per chunk: 10.5–14 s.
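The budget above can be re-derived with a few lines of arithmetic. This is a rough sketch that assumes VE time scales linearly with the number of Euler steps and that the other stages do not depend on step count. The constants are copied from the ranges above; `chunk_budget` is a hypothetical helper name.

```python
from typing import Tuple

# Ranges in seconds, copied from the measurements above.
STAGES = {
    "DP": (0.015, 0.060),
    "TE": (0.017, 0.050),
    "Voc": (0.250, 0.400),
}
VE_TOTAL_8_STEPS = (10.0, 14.0)  # (cold, throttled) seconds for 8 steps
MEASURED_STEPS = 8


def chunk_budget(steps: int) -> Tuple[float, float]:
    """Estimate (best, worst) seconds per chunk.

    Assumption: VE cost is linear in the step count.
    """
    fixed_best = sum(best for best, _ in STAGES.values())
    fixed_worst = sum(worst for _, worst in STAGES.values())
    per_step_best = VE_TOTAL_8_STEPS[0] / MEASURED_STEPS
    per_step_worst = VE_TOTAL_8_STEPS[1] / MEASURED_STEPS
    return (
        fixed_best + steps * per_step_best,
        fixed_worst + steps * per_step_worst,
    )


for n in (8, 4):
    best, worst = chunk_budget(n)
    print(f"{n} steps: {best:.2f}-{worst:.2f} s per chunk")
```

VE accounts for nearly all of the budget at either step count, so speeding up DP, TE, or Voc would barely change the total.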

Trade-off we hit

8 steps → above timings: too slow for our use.
4 steps → VE roughly halves to ~6.3 s, but audio quality drops audibly on Russian (RU) text: grainy sustained vowels, smudged plosives, and artifact bursts at chunk boundaries. Not usable.
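As background for the step-count trade-off: a fixed-step Euler sampler is first-order, so on a smooth velocity field halving the step count roughly doubles the discretization error. The toy sketch below illustrates only that general property. The velocity field dx/dt = x is an assumption chosen because its exact solution is known; it has nothing to do with the actual vector_estimator, and whether this effect explains the audible artifacts is not something the sketch can show.

```python
import math
from typing import Callable


def euler_integrate(
    velocity: Callable[[float, float], float], x0: float, steps: int
) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


# Toy velocity field (an assumption, unrelated to the real model):
# dx/dt = x, whose exact solution at t=1 is x0 * e.
def toy_velocity(x: float, t: float) -> float:
    return x


exact = math.e
errors = {}
for n in (4, 8, 16):
    errors[n] = abs(exact - euler_integrate(toy_velocity, 1.0, n))
    print(n, errors[n])
```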

Questions

Is the ANE planner rejection on A13 expected? Could a re-export using op patterns that the 3rd-gen Neural Engine accepts (e.g. unfused rotary cos/sin) light up ANE for TE/VE?
Any INT8 or step-distilled variant you're considering for older devices?
Is the 4-step quality drop expected for flow-matching at this scale, or is it FP16/voice-specific?
