A13 / 3rd-gen Neural Engine: text_encoder + vector_estimator plan build fails; 8-step too slow, 4-step quality drops

#1
by shlaikov - opened

Device: iPhone 11 Pro Max, A13 Bionic (3rd-gen ANE), iOS 26.0.1.
Models: Reza2kn/supertonic-3-coreml/fp16, compiled to .mlmodelc via xcrun coremlc compile. CoreMLTools 8.3.0.

What we observed

Tried compute_units ladder [.all → .cpuAndGPU → .cpuOnly] per stage. First success wins:

| Stage | .all (ANE+GPU+CPU) | .cpuAndGPU | .cpuOnly |
|---|---|---|---|
| duration_predictor | ok ~1.1s | - | - |
| text_encoder | FAIL "Error in building plan." (~5s) | FAIL same (~5s) | ok ~1s |
| vector_estimator | (skipped, planner can wedge for minutes) | ok ~3.6s | ok |
| vocoder | ok ~2.3s | - | - |

So on A13: DP and Voc run on ANE, TE on scalar CPU, VE on Metal GPU.
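
For anyone reproducing this, the ladder logic is roughly the sketch below. It is written in Python so it runs anywhere; `load_with_fallback`, the `skip` parameter, and the stand-in loader are hypothetical names for illustration, not part of Core ML. On device, the `load` callable would presumably wrap model loading with the matching compute-units setting.

```python
from typing import Callable, Iterable, Tuple, TypeVar

M = TypeVar("M")

# Ladder from the post, most to least accelerated.
LADDER = ("all", "cpuAndGPU", "cpuOnly")


def load_with_fallback(
    load: Callable[[str], M],
    ladder: Iterable[str] = LADDER,
    skip: Iterable[str] = (),
) -> Tuple[str, M]:
    """Return (compute_units, model) for the first rung that loads."""
    skipped = set(skip)
    errors = []
    for units in ladder:
        if units in skipped:
            continue  # e.g. a rung where the planner is known to wedge
        try:
            return units, load(units)
        except Exception as exc:  # a failed plan build surfaces as an error
            errors.append(f"{units}: {exc}")
    raise RuntimeError("no compute unit worked: " + "; ".join(errors))


# Stand-in loader (assumption) that mimics the text_encoder row of the table.
def fake_text_encoder_loader(units: str) -> str:
    if units != "cpuOnly":
        raise RuntimeError("Error in building plan.")
    return f"text_encoder@{units}"


chosen_units, model = load_with_fallback(fake_text_encoder_loader)
print(chosen_units)  # prints "cpuOnly"
```

For vector_estimator we effectively passed the equivalent of `skip=("all",)`, since a planner that hangs never raises and so cannot be caught by the fallback.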

Per-chunk timing (T=L=320, FP16, voice F1, 8 Euler steps)

  • DP: 15–60 ms · TE: 17–50 ms · Voc: 250–400 ms
  • VE: 10–14 s per chunk (~1.3 s/step on A13 GPU; cold 10 s, throttles to 14 s after ~80 s sustained)
  • Total per chunk: 10.5–14 s.
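
To make the budget explicit, here is a small sketch that adds up the per-stage ranges listed above. The only inputs are the numbers from this post; summing all the minimums and all the maximums is a simplifying assumption, since the stages will not necessarily hit their extremes on the same chunk.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
STAGES_MS = {
    "DP": (15, 60),
    "TE": (17, 50),
    "VE": (10_000, 14_000),
    "Voc": (250, 400),
}

# Bound the per-chunk total by summing all minimums and all maximums.
best_ms = sum(lo for lo, _ in STAGES_MS.values())
worst_ms = sum(hi for _, hi in STAGES_MS.values())

# Fraction of the chunk time spent in the vector estimator.
ve_share_best = STAGES_MS["VE"][0] / best_ms
ve_share_worst = STAGES_MS["VE"][1] / worst_ms

print(f"total: {best_ms / 1000:.2f} to {worst_ms / 1000:.2f} s")
print(f"VE share: {ve_share_worst:.1%} to {ve_share_best:.1%}")
```

The bounds come out at about 10.3 to 14.5 s, roughly in line with the measured total, and VE accounts for more than 96% of it either way. Speeding up the other three stages would not change the picture.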

Trade-off we hit

  • 8 steps → above timings: too slow for our use.
  • 4 steps → VE halves to ~6.3 s, but audio quality drops audibly on RU: grainy on sustained vowels, smudged plosives, artifact bursts at chunk boundaries. Not usable.
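
As a generic illustration of why halving the step count can hurt, the sketch below runs fixed-step Euler integration on a toy ODE with a known solution. The velocity field `dx/dt = x` is an assumption chosen for the example and has nothing to do with the actual vector_estimator; it only shows how Euler discretization error grows when the same interval is covered in fewer steps.

```python
import math


def euler_integrate(velocity, x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += dt * velocity(x, i * dt)
    return x


# Toy velocity field (assumption): dx/dt = x, so x(1) = e * x(0).
def toy_velocity(x: float, t: float) -> float:
    return x


exact = math.e
errors = {n: abs(euler_integrate(toy_velocity, 1.0, n) - exact) for n in (4, 8)}
for n, err in errors.items():
    print(f"{n} steps: abs error {err:.3f}")
# prints "4 steps: abs error 0.277" then "8 steps: abs error 0.152"
```

On this toy problem the 4-step error is a bit under twice the 8-step error. Whether the audible drop we hear comes from this kind of discretization error or from something FP16- or voice-specific is exactly what we are unsure about.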

Questions

  1. Is the ANE planner rejection on A13 expected? Could a re-export with op patterns that the 3rd-gen Neural Engine accepts (e.g. unfused rotary cos/sin) light up ANE for TE/VE?
  2. Are you considering an INT8 or step-distilled variant for older devices?
  3. Is the 4-step quality drop expected for flow-matching at this scale, or is it FP16/voice-specific?
