Parakeet Realtime EOU 120M v1: cache-aware ONNX (FP16)

This is an ONNX export of NVIDIA's parakeet_realtime_eou_120m-v1 FastConformer RNN-T model with the cache-aware encoder I/O ports preserved, so it can be used for streaming inference under ONNX Runtime without mid-word corruption at chunk boundaries.

It is intended as a drop-in replacement for the community export at ysdede/parakeet-realtime-eou-120m-v1-onnx for users who need the cache-aware streaming surface.

Why this export exists

The encoder is a 17-layer FastConformer with attention context [70, 1] (5.6 s left, 80 ms right). When that left-context state is not threaded across chunks, the encoder warms up mid-word at every chunk boundary, producing artifacts like talk / ovilla / c for the word "talkative". The community ONNX export strips the cache I/O ports, which is fine for full-utterance offline use but breaks streaming.
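The context figures above follow from the frame timing. A minimal sketch of that arithmetic, assuming a 10 ms mel hop and 8x encoder subsampling (inferred from 128 mel frames producing 16 encoded frames in the I/O contract); the constant and function names are illustrative only:

```python
# Assumed frame timing: 10 ms mel hop and 8x encoder subsampling
# (128 mel frames in -> 16 encoded frames out, per the I/O contract).
MEL_HOP_MS = 10
SUBSAMPLING = 8
ENCODER_FRAME_MS = MEL_HOP_MS * SUBSAMPLING  # 80 ms per encoder frame


def context_ms(att_context):
    """Convert [left, right] attention context in encoder frames to milliseconds."""
    left, right = att_context
    return left * ENCODER_FRAME_MS, right * ENCODER_FRAME_MS


left_ms, right_ms = context_ms([70, 1])
print(f"left context: {left_ms / 1000} s, right context: {right_ms} ms")
```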

This export wraps cache_aware_stream_step() so all four cache tensors are exposed as explicit ONNX inputs and outputs, allowing the caller to thread state across chunks. The wrapper pattern is taken from FluidAudio's convert_streaming_encoder.py, translated from CoreML to ONNX.

Files

| File | Size | Purpose |
|---|---|---|
| streaming_encoder.fp16.onnx | 222 MB | Cache-aware FastConformer encoder (FP16 weights, FP32 I/O) |
| decoder_joint-model.int8.onnx | 5.2 MB | RNN-T prediction + joint network (INT8, reused unchanged from ysdede) |
| vocab.txt | 8 KB | 1027-line tokenizer vocabulary (1024 BPE + EOU + EOB + blank) |
| config.json | <1 KB | Model metadata (vocab size, special token ids, subsampling) |
| streaming_encoder.meta.json | <1 KB | Cache shapes + I/O contract for the cache-aware encoder |

Cache-aware I/O contract

The streaming encoder takes six inputs and produces six outputs (cache shapes are for batch=1):

Inputs

| Tensor | Shape | dtype | Purpose |
|---|---|---|---|
| audio_signal | [1, 128, 128] | float32 | Log-mel features, 128 mels × 128 frames (= 1.28 s @ 10 ms hop) |
| audio_length | [1] | int32 | Number of valid mel frames in audio_signal |
| pre_cache | [1, 128, 16] | float32 | Last 16 mel frames from the previous chunk (zero on first call) |
| cache_last_channel | [17, 1, 70, 512] | float32 | Per-layer attention KV cache (17 layers × 70 frames × 512 hidden) |
| cache_last_time | [17, 1, 512, 8] | float32 | Per-layer time-conv cache (17 layers × 512 hidden × 8 frames) |
| cache_last_channel_len | [1] | int32 | How many valid frames the channel cache currently holds |

Outputs

| Tensor | Shape | dtype | Purpose |
|---|---|---|---|
| encoded_output | [1, 512, 16] | float32 | Encoded features, ready to feed into decoder+joint |
| encoded_length | [1] | int32 | Number of valid encoded frames |
| new_pre_cache | [1, 128, 16] | float32 | Refreshed mel pre-cache for the next chunk |
| new_cache_last_channel | [17, 1, 70, 512] | float32 | Refreshed attention KV cache |
| new_cache_last_time | [17, 1, 512, 8] | float32 | Refreshed time-conv cache |
| new_cache_last_channel_len | [1] | int32 | Refreshed channel cache length counter |
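The cache shapes in the contract imply a fixed amount of state that the caller holds per stream, which matters when serving many streams at once. A rough sketch of that calculation, assuming 4 bytes per element (float32 and int32) and batch=1; the helper name is illustrative:

```python
from math import prod

# Cache shapes copied from the I/O contract (batch=1).
CACHE_SHAPES = {
    "pre_cache": (1, 128, 16),
    "cache_last_channel": (17, 1, 70, 512),
    "cache_last_time": (17, 1, 512, 8),
    "cache_last_channel_len": (1,),
}
BYTES_PER_ELEMENT = 4  # float32 and int32 are both 4 bytes wide


def cache_bytes(shapes=CACHE_SHAPES):
    """Return the byte size of each cache tensor for one stream."""
    return {name: prod(shape) * BYTES_PER_ELEMENT for name, shape in shapes.items()}


sizes = cache_bytes()
total_mib = sum(sizes.values()) / 2**20
print(sizes)
print(f"total per stream: {total_mib:.2f} MiB")
```

The attention cache dominates; the other three tensors together are roughly a tenth of the total.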

On the first chunk of an utterance, all four cache tensors are zero-initialized. On every subsequent chunk, the caller passes the previous step's new_* outputs back in as the corresponding inputs.
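One way to express this threading rule is by tensor name rather than by output position, which avoids depending on the order in which the session returns its outputs. A small sketch under that assumption; `next_cache_feeds` is an illustrative helper, not part of this export:

```python
# Cache input names from the I/O contract; each has a matching new_* output.
CACHE_INPUTS = (
    "pre_cache",
    "cache_last_channel",
    "cache_last_time",
    "cache_last_channel_len",
)


def next_cache_feeds(outputs_by_name):
    """Map each new_* output of one step to the cache input of the next step."""
    return {name: outputs_by_name["new_" + name] for name in CACHE_INPUTS}
```

With ONNX Runtime, `outputs_by_name` could be built as `dict(zip([o.name for o in sess.get_outputs()], sess.run(None, feeds)))`, and the returned dict merged into the next step's feeds alongside audio_signal and audio_length.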

The decoder + joint network holds no encoder cache of its own. Its only state is the prediction network's LSTM state, which is already threaded through its existing input_states_1 / input_states_2 inputs, so the ysdede decoder ONNX is reused unchanged.
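To show how the decoder state carries across chunks, here is a generic greedy RNN-T decoding sketch. It is not taken from this export: `step_fn` is a hypothetical wrapper around one decoder+joint call (which would pass the LSTM state through input_states_1 / input_states_2), `blank_id` would come from config.json, and EOU/EOB handling is left out.

```python
def greedy_decode_chunk(frames, step_fn, last_token, state, blank_id, max_symbols=5):
    """Greedy RNN-T decode over one chunk of encoder frames.

    step_fn(frame, last_token, state) -> (logits, new_state) is a hypothetical
    wrapper around one decoder+joint call. Returns the emitted tokens plus the
    (last_token, state) pair to pass into the next chunk.
    """
    tokens = []
    for frame in frames:
        for _ in range(max_symbols):  # cap emissions per frame to avoid looping forever
            logits, new_state = step_fn(frame, last_token, state)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next frame, keep the old state
            tokens.append(best)
            last_token, state = best, new_state  # commit state only on emission
    return tokens, last_token, state
```

The returned `last_token` and `state` are fed into the call for the next chunk, in the same way the encoder caches are threaded.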

Numerical fidelity

Compared to NVIDIA's PyTorch reference (cache_aware_stream_step()), running 11 sequential cache-threaded chunks of real audio:

  • FP32 export → PyTorch: max absolute difference 3.6 × 10⁻⁵ (~30× tighter than the 1 × 10⁻³ tolerance from the parity check).
  • FP16 export → FP32 export: max relative error ~0.5% on intermediate activations; downstream RNN-T greedy decoding produces character-identical output for the validation clips.
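For readers who want to run a similar parity check on their own outputs, the two metrics can be sketched as below. This is one common definition, operating on flattened sequences of floats; the exact formula used for the figures above is not stated, and the `eps` guard against division by zero is an assumption.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def max_rel_error(reference, candidate, eps=1e-6):
    """Largest element-wise error relative to the reference magnitude."""
    return max(abs(r - c) / max(abs(r), eps) for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, atol=1e-3):
    """True when the candidate stays inside the absolute tolerance."""
    return max_abs_diff(reference, candidate) < atol
```

With NumPy arrays, the inputs could be passed as `array.ravel().tolist()`, or the same formulas written directly with `np.abs` and `np.max`.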

End-to-end validation

Running the cache-aware encoder against two real desktop-recorded clips and comparing to the same model loaded with the cache I/O stripped:

| Clip | Stateless output (broken) | Cache-aware output (this export) |
|---|---|---|
| 13 s clip | we can talk / ovilla / c / is a specation for / and theya | we can talk a litt / also has now introduced / enough whether anyone / shipped any of it yet but like / re is a specation for |
| 4 s clip | there some some some / that these / ven / ans do but you know for the most / ate | there are some are some are some are / that you can use that / these extensions do but you know for |

Note: extensions survives as one whole word in the cache-aware output, where the stateless export splits it as these / ven / ans at chunk boundaries.

Usage

A minimal cache-threading loop in Python:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("streaming_encoder.fp16.onnx", providers=["CPUExecutionProvider"])

# Initial caches: all zeros.
pre_cache = np.zeros((1, 128, 16), dtype=np.float32)
cache_lc = np.zeros((17, 1, 70, 512), dtype=np.float32)
cache_lt = np.zeros((17, 1, 512, 8), dtype=np.float32)
cache_len = np.zeros((1,), dtype=np.int32)

for mel_chunk in stream_of_mel_features:           # each is [1, 128, 128]
    feeds = {
        "audio_signal": mel_chunk,
        "audio_length": np.asarray([128], dtype=np.int32),
        "pre_cache": pre_cache,
        "cache_last_channel": cache_lc,
        "cache_last_time": cache_lt,
        "cache_last_channel_len": cache_len,
    }
    encoded, enc_len, pre_cache, cache_lc, cache_lt, cache_len = sess.run(None, feeds)
    # ...feed `encoded` (shape [1, 512, 16]) into the decoder+joint network...
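The loop above assumes the mel features already arrive as 128-frame chunks. A small sketch of how chunk boundaries could be planned for a longer feature matrix; `plan_chunks` is an illustrative helper, and zero-padding a short final chunk while passing its true length as audio_length is an assumption that has not been verified against this export.

```python
def plan_chunks(total_frames, chunk_frames=128):
    """Yield (start, stop, valid_frames) for consecutive fixed-size chunks.

    The final chunk may be shorter than chunk_frames; valid_frames reports
    how many real mel frames it holds.
    """
    for start in range(0, total_frames, chunk_frames):
        stop = min(start + chunk_frames, total_frames)
        yield start, stop, stop - start


for start, stop, valid in plan_chunks(300):
    print(start, stop, valid)
```

Each planned chunk would be sliced as `mel[:, :, start:stop]`, padded along the last axis up to 128 frames if needed, and fed with `valid` as audio_length.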

License and attribution

This export is derived from NVIDIA's parakeet_realtime_eou_120m-v1 checkpoint and is therefore released under the NVIDIA Open Model License (June 2024). The wrapper pattern is adapted from FluidAudio's convert_streaming_encoder.py, also under the NVIDIA Open Model License.
