GigaAM-v3 e2e RNNT - Core ML (fp16)

fp16 Core ML (mlprogram) export of the v3_e2e_rnnt variant of ai-sage/GigaAM-v3 - a 240M-parameter Conformer transducer for Russian speech recognition that emits punctuated, capitalized, normalized text end-to-end. It runs fully on-device on Apple Silicon (CPU / GPU / Neural Engine) and does not require Python at inference time.

The model is split into three packages (mobius-style): the encoder runs once per audio window, while the decoder and joint run in a host-side greedy loop. The mel front-end stays outside Core ML.

Files

file	size	role
`GigaAMv3Encoder.mlpackage`	422 MB	Conformer encoder, fixed 30 s window
`GigaAMv3DecoderStep.mlpackage`	2.2 MB	prediction net, one LSTM step with explicit state I/O
`GigaAMv3JointStep.mlpackage`	1.3 MB	joint network, raw logits (log_softmax stripped - argmax-invariant)
`tokens.json`	12 KB	1024 SentencePiece pieces (`▁` = word boundary)
`tokenizer.model`	250 KB	the original SentencePiece model (canonical detokenization, byte fallback)
`v3_e2e_rnnt.yaml`	-	full upstream model config extracted from the checkpoint
`model_info.json`	-	dims, blank id, mel parameters
`convert_info.json`	-	window length in mel frames
`example_infer.py`	-	runnable reference inference (the snippet below as a file)

The repo is self-contained: everything needed for on-device inference is here, no calls to other repos.

I/O contracts

Encoder: features float32 [1, 64, 2999] (log-mel, 30 s window, zero-padded right), length int32 [1] (true mel frames) -> encoded [1, 768, 750], encoded_len (float32 with an exact integer value - cast at the call site).

Decoder step: token int32 [1, 1], h_in/c_in float32 [1, 1, 320] -> dec_out [1, 320], h_out, c_out. Start state: token = 1024 (blank), zero h/c - the blank embedding row is zeros, which reproduces the reference fresh start exactly.

Joint step: enc_t [1, 768] (one encoder frame), dec_t [1, 320] -> logits [1, 1025]. blank_id = 1024, the 1024 non-blank ids index tokens.json.
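
Because the joint package returns raw logits, greedy argmax decoding is unaffected (log_softmax subtracts the same constant from every entry), but anything that needs probabilities, such as confidence scores or beam search, has to reapply it on the host. A minimal NumPy sketch, assuming a logits array shaped [1, 1025] as in the contract; the helper name log_softmax and the random stand-in logits are illustrative and not part of the repo:

```python
import numpy as np

BLANK = 1024  # blank id from the joint contract

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    x = logits.astype(np.float64)
    x = x - x.max(axis=-1, keepdims=True)  # shift so the largest entry is 0
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Stand-in for joint.predict(...)["logits"]; random values, not model output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, BLANK + 1)).astype(np.float32)

log_probs = log_softmax(logits)
k = int(np.argmax(logits))                   # greedy choice on raw logits
confidence = float(np.exp(log_probs[0, k]))  # probability of that choice
```

The argmax of log_probs is the same index k, so the greedy loop can keep working on raw logits and only call log_softmax when a score is actually needed.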

Mel front-end (must match exactly): 16 kHz mono, n_fft = 320, win_length = 320, hop_length = 160, center = False, 64 mel bins, HTK mel scale, no filterbank norm, then log(clamp(x, 1e-9, 1e9)). No mean/variance normalization. Frames for n samples = (n - 320) // 160 + 1. (These match torchaudio MelSpectrogram defaults plus the explicit values; the full upstream config is in v3_e2e_rnnt.yaml.)
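
The frame-count formula can be checked with a few lines of plain Python. This sketch uses only the values given in the front-end spec; returning 0 for clips shorter than one analysis window is an assumption added here, since the formula itself goes negative in that case:

```python
SR, WIN, HOP = 16000, 320, 160  # sample rate, win_length, hop_length from the spec

def mel_frames(n_samples: int) -> int:
    """Frames produced with center=False: (n - 320) // 160 + 1."""
    if n_samples < WIN:  # assumption: a clip shorter than one window yields no frames
        return 0
    return (n_samples - WIN) // HOP + 1

full_window = mel_frames(30 * SR)  # 2999, the encoder's fixed input width
one_second = mel_frames(1 * SR)    # 99
```

A full 30 s window gives 2999 frames, which is where the [1, 64, 2999] encoder input shape comes from.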

Usage (Python, coremltools)

Runnable version: python example_infer.py audio.wav (deps: coremltools, torch, torchaudio, numpy + the ffmpeg binary; audio is loaded via an ffmpeg pipe, mirroring gigaam.preprocess.load_audio, so torchcodec is not needed). Detokenization below joins tokens.json pieces; for canonical SentencePiece decoding load tokenizer.model instead.

import json
import subprocess
import coremltools as ct
import numpy as np
import torch
import torchaudio

CU = ct.ComputeUnit.CPU_AND_GPU  # token-exact; see "Conversion fidelity"
enc = ct.models.MLModel("GigaAMv3Encoder.mlpackage", compute_units=CU)
dec = ct.models.MLModel("GigaAMv3DecoderStep.mlpackage", compute_units=CU)
joint = ct.models.MLModel("GigaAMv3JointStep.mlpackage", compute_units=CU)
with open("tokens.json", encoding="utf-8") as f:  # pieces contain non-ASCII characters
    pieces = json.load(f)
BLANK, SR, WINDOW = 1024, 16000, 30 * 16000

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_mels=64, win_length=320, hop_length=160,
    n_fft=320, center=False, mel_scale="htk", norm=None,
)

raw = subprocess.run(
    ["ffmpeg", "-nostdin", "-i", "audio.wav", "-f", "s16le", "-ac", "1",
     "-acodec", "pcm_s16le", "-ar", str(SR), "-"],
    capture_output=True, check=True,
).stdout
wav = torch.frombuffer(bytearray(raw), dtype=torch.int16).float() / 32768.0
n = min(wav.shape[-1], WINDOW)  # <= 30 s per window
padded = torch.zeros(1, WINDOW)
padded[0, :n] = wav[:n]
feats = mel(padded).clamp(1e-9, 1e9).log().numpy().astype(np.float32)
feat_len = (n - 320) // 160 + 1

out = enc.predict({"features": feats, "length": np.array([feat_len], dtype=np.int32)})
encoded = out["encoded"]
enc_len = int(np.array(out["encoded_len"]).reshape(-1)[0])

ids, h = [], np.zeros((1, 1, 320), dtype=np.float32)
c, last = h.copy(), BLANK
for t in range(enc_len):
    enc_t = encoded[:, :, t].astype(np.float32)
    for _ in range(10):  # max symbols per frame
        d = dec.predict({"token": np.array([[last]], dtype=np.int32), "h_in": h, "c_in": c})
        logits = joint.predict({"enc_t": enc_t, "dec_t": d["dec_out"].astype(np.float32)})["logits"]
        k = int(np.argmax(logits))
        if k == BLANK:
            break
        ids.append(k)
        h, c, last = d["h_out"].astype(np.float32), d["c_out"].astype(np.float32), k

print("".join(pieces[i] for i in ids).replace("▁", " ").strip())
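
The two detokenization paths mentioned earlier can be sketched side by side. join_pieces repeats what the final print line does; decode_canonical assumes the sentencepiece Python package (not in the dependency list above) and assumes that the ids in tokens.json line up with the ids in tokenizer.model, which the file descriptions suggest but which is worth verifying. Both function names are illustrative:

```python
from typing import Sequence

def join_pieces(pieces: Sequence[str], ids: Sequence[int]) -> str:
    """Lightweight detokenization: concatenate pieces, turn '▁' into spaces."""
    return "".join(pieces[i] for i in ids).replace("▁", " ").strip()

def decode_canonical(ids: Sequence[int], model_path: str = "tokenizer.model") -> str:
    """Canonical decoding via SentencePiece (also resolves byte-fallback pieces)."""
    import sentencepiece as spm  # imported lazily: optional extra dependency
    sp = spm.SentencePieceProcessor(model_file=model_path)
    return sp.decode(list(ids))

# Toy vocabulary for illustration only; the real one comes from tokens.json.
toy = ["▁При", "вет", ",", "▁мир"]
text = join_pieces(toy, [0, 1, 2, 3])
```

The simple join is enough when every emitted piece is ordinary text; the SentencePiece path matters when byte-fallback pieces appear in the output.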

Longer audio: chunk into <= 30 s windows (VAD or energy boundaries), reset h/c/last per window - this mirrors the reference transcribe_longform. Token timestamps: the emission frame index t maps to seconds as t * 4 * 160 / 16000 (4x subsampling, 10 ms hop) plus the window offset.
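
A rough sketch of both ideas, assuming a 1-D float waveform at 16 kHz. The splitter cuts at the quietest 20 ms block inside the last 5 s of each full window; this is a crude stand-in for VAD chosen for illustration, not the logic of the reference transcribe_longform, and the names split_windows and frame_to_seconds are hypothetical:

```python
import numpy as np

SR, WINDOW = 16000, 30 * 16000
SUBSAMPLING, HOP = 4, 160

def split_windows(wav: np.ndarray, search: int = 5 * SR, frame: int = 320):
    """Return contiguous (start, end) sample ranges, each at most WINDOW long.

    Assumes frame <= search < WINDOW. Each cut lands on the lowest-energy
    `frame`-sample block within the last `search` samples of a full window.
    """
    bounds, start, n = [], 0, len(wav)
    usable = search // frame * frame  # whole blocks only
    while n - start > WINDOW:
        lo = start + WINDOW - search
        blocks = wav[lo:lo + usable].reshape(-1, frame)
        cut = lo + int(np.argmin((blocks ** 2).sum(axis=1))) * frame
        bounds.append((start, cut))
        start = cut
    bounds.append((start, n))
    return bounds

def frame_to_seconds(t: int, window_start: int = 0) -> float:
    """Encoder frame index t -> seconds from the start of the recording."""
    return t * SUBSAMPLING * HOP / SR + window_start / SR

# Synthetic 70 s signal with one silent 20 ms gap at 28 s.
wav = np.ones(70 * SR, dtype=np.float32)
wav[28 * SR : 28 * SR + 320] = 0.0
windows = split_windows(wav)
```

Each (start, end) pair is then zero-padded to 30 s and decoded with fresh h/c/last, and frame_to_seconds(t, start) places an emission on the timeline of the whole recording.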

Conversion fidelity

Verified against the PyTorch reference (gigaam package, fp32 CPU) on real Russian call recordings, decoding the same padded 30 s window:

compute units	encoder mean abs diff	transcript
`CPU_ONLY`	0.005	token-exact (CER 0.0000)
`CPU_AND_GPU`	0.0004	token-exact (CER 0.0000)
`CPU_AND_NE`	0.021-0.025	rare borderline-emission flips (CER 0.017 on one of two clips)

The ANE runs the encoder in higher-noise fp16 accumulation; on borderline frames a low-confidence emission can flip to blank. Both observed variants were linguistically valid. If transcripts must match the reference exactly, use CPU_AND_GPU.
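
For readers who want to repeat this kind of check on their own clips, the two metrics in the table can be computed as follows. The card does not say how CER was normalized (case, punctuation, whitespace), so this sketch uses the plain definition, character-level edit distance divided by reference length, and all function names are illustrative:

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference transcript."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def mean_abs_diff(a, b) -> float:
    """Mean absolute element-wise difference between two same-shaped arrays."""
    a64, b64 = np.asarray(a, np.float64), np.asarray(b, np.float64)
    return float(np.mean(np.abs(a64 - b64)))
```

Here mean_abs_diff would be applied to the Core ML and PyTorch encoder outputs for the same padded window, and cer to the two decoded transcripts.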

Latency (Apple M4, 30 s window, median of 10)

compute units	first load	encoder latency	RTF	decoder + joint per step
`CPU_ONLY`	1.2 s	0.157 s	0.0052	0.09 ms
`CPU_AND_GPU`	1.3 s	0.136 s	0.0045	0.09 ms
`CPU_AND_NE`	9.2 s (one-time ANE compile)	0.076 s	0.0025	0.09 ms

Every unit is 190-400x faster than real time; the greedy loop is never the bottleneck (~11k steps/s).
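
The derived figures follow from the table by simple arithmetic. The RTF column is consistent with encoder latency divided by the 30 s window; that reading is inferred from the numbers rather than stated, so treat this as a sanity check, not a description of the benchmark script:

```python
WINDOW_S = 30.0
encoder_s = {"CPU_ONLY": 0.157, "CPU_AND_GPU": 0.136, "CPU_AND_NE": 0.076}

rtf = {unit: t / WINDOW_S for unit, t in encoder_s.items()}      # real-time factor
speedup = {unit: WINDOW_S / t for unit, t in encoder_s.items()}  # times real time
steps_per_s = 1000.0 / 0.09  # 0.09 ms per decoder + joint step

for unit in encoder_s:
    print(f"{unit}: RTF {rtf[unit]:.4f}, {speedup[unit]:.0f}x real time")
```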

Base model quality (from the GigaAM-v3 card)

V3_RNNT averages 8.4% WER across Russian domains vs 25.1% for Whisper-large-v3; in end-to-end comparisons with LLM-as-a-judge the e2e models win over Whisper-large-v3 by ~70:30. The e2e variants trade a few WER points for built-in punctuation and normalization. See the base card for the full table and methodology.

Conversion notes

Toolchain: torch 2.12.0 -> torch.jit.trace(strict=False) -> coremltools 9.0 (mlprogram, FLOAT16, macOS15). Three things needed care. First, the rotary positional-encoding cache is built lazily, so the encoder must be warmed up with one forward pass before tracing. Second, torch 2.12 emits aten::Int patterns that crash coremltools 9.0 under numpy 2; a scoped register_torch_op override fixes this. Third, encoded_len must be recomputed with integer ops in the export wrapper because fp16 cannot represent 2999 exactly, so the float path returns an off-by-one length.
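
The fp16 point can be reproduced with NumPy alone. This only illustrates the rounding behaviour of half precision; it is not the export wrapper itself:

```python
import numpy as np

mel_len = 2999                          # true frame count of a full 30 s window
as_fp16 = np.float16(mel_len)           # nearest representable half-precision value
spacing = np.spacing(np.float16(2048))  # gap between adjacent fp16 values from 2048 up

print(float(as_fp16), float(spacing))   # 3000.0 2.0
```

From 2048 upward fp16 can only hold even integers, so 2999 rounds to 3000 and any length arithmetic that passes through fp16 starts one frame off. Keeping that computation in integer ops avoids the rounding entirely.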

License and attribution

MIT, inherited from ai-sage/GigaAM-v3 (© GigaChat Team / SaluteDevices, see LICENSE). Paper: GigaAM: Efficient Self-Supervised Learner for Speech Recognition (Interspeech 2025, arXiv:2506.01192). Reference implementation: salute-developers/GigaAM.
