Parakeet Realtime EOU 120M v1 — cache-aware ONNX (FP16)

This is an ONNX export of NVIDIA's parakeet_realtime_eou_120m-v1 FastConformer RNN-T model with the cache-aware encoder I/O ports preserved, so it can be used for streaming inference under ONNX Runtime without mid-word corruption at chunk boundaries.

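It is intended as a replacement for the community export at ysdede/parakeet-realtime-eou-120m-v1-onnx for users who need the cache-aware streaming surface. The decoder file is unchanged, but callers must supply the encoder's cache tensors and thread them across chunks.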

Why this export exists

The encoder is a 17-layer FastConformer with attention context [70, 1] (5.6 s left, 80 ms right). When that left-context state is not threaded across chunks, the encoder warms up mid-word at every chunk boundary, producing artifacts like talk / ovilla / c for the word "talkative". The community ONNX export strips the cache I/O ports, which is fine for full-utterance offline use but breaks streaming.
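The context figures above follow from simple frame arithmetic. The sketch below reproduces them, assuming 8× subsampling (inferred from the 128 mel frames in and 16 encoder frames out listed in the I/O contract, not stated explicitly by the model card); all constant names are illustrative.

```python
# Frame arithmetic behind the [70, 1] attention context.
# Assumption: 8x subsampling, inferred from 128 mel frames in -> 16 encoder frames out.
MEL_HOP_MS = 10
SUBSAMPLING = 8
ENCODER_FRAME_MS = MEL_HOP_MS * SUBSAMPLING          # 80 ms per encoder frame

LEFT_CONTEXT_FRAMES, RIGHT_CONTEXT_FRAMES = 70, 1
left_context_s = LEFT_CONTEXT_FRAMES * ENCODER_FRAME_MS / 1000
right_context_ms = RIGHT_CONTEXT_FRAMES * ENCODER_FRAME_MS

CHUNK_MEL_FRAMES = 128
chunk_s = CHUNK_MEL_FRAMES * MEL_HOP_MS / 1000
encoder_frames_per_chunk = CHUNK_MEL_FRAMES // SUBSAMPLING

print(left_context_s, right_context_ms, chunk_s, encoder_frames_per_chunk)
```

The left context (5.6 s) is several times longer than one chunk (1.28 s), which is why dropping the cache costs so much: each chunk would otherwise start with none of the history the attention layers expect.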

This export wraps cache_aware_stream_step() so all four cache tensors are exposed as explicit ONNX inputs and outputs, allowing the caller to thread state across chunks. The wrapper pattern is taken from FluidAudio's convert_streaming_encoder.py, translated from CoreML to ONNX.

Files

File	Size	Purpose
`streaming_encoder.fp16.onnx`	222 MB	Cache-aware FastConformer encoder (FP16 weights, FP32 I/O)
`decoder_joint-model.int8.onnx`	5.2 MB	RNN-T prediction + joint network (INT8, reused unchanged from `ysdede`)
`vocab.txt`	8 KB	1027-line tokenizer vocabulary (1024 BPE + EOU + EOB + blank)
`config.json`	<1 KB	Model metadata (vocab size, special token ids, subsampling)
`streaming_encoder.meta.json`	<1 KB	Cache shapes + I/O contract for the cache-aware encoder

Cache-aware I/O contract

The streaming encoder takes six inputs and produces six outputs (cache shapes are for batch=1):

Tensor	Shape	dtype	Purpose
Inputs
`audio_signal`	`[1, 128, 128]`	float32	Log-mel features, 128 mels × 128 frames (= 1.28 s @ 10 ms hop)
`audio_length`	`[1]`	int32	Number of valid mel frames in `audio_signal`
`pre_cache`	`[1, 128, 16]`	float32	Last 16 mel frames from the previous chunk (zero on first call)
`cache_last_channel`	`[17, 1, 70, 512]`	float32	Per-layer attention KV cache (17 layers × 70 frames × 512 hidden)
`cache_last_time`	`[17, 1, 512, 8]`	float32	Per-layer time-conv cache (17 layers × 512 hidden × 8 frames)
`cache_last_channel_len`	`[1]`	int32	How many valid frames the channel cache currently holds
Outputs
`encoded_output`	`[1, 512, 16]`	float32	Encoded features, ready to feed into decoder+joint
`encoded_length`	`[1]`	int32	Number of valid encoded frames
`new_pre_cache`	`[1, 128, 16]`	float32	Refreshed mel pre-cache for the next chunk
`new_cache_last_channel`	`[17, 1, 70, 512]`	float32	Refreshed attention KV cache
`new_cache_last_time`	`[17, 1, 512, 8]`	float32	Refreshed time-conv cache
`new_cache_last_channel_len`	`[1]`	int32	Refreshed channel cache length counter

On the first chunk of an utterance, all four cache tensors are zero-initialized. On every subsequent chunk, the caller passes the previous step's new_* outputs back in as the corresponding inputs.
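The contract above can be captured in a small helper that builds the zero caches and checks the refreshed tensors before they are fed back in. This is a minimal sketch: `CACHE_SPEC`, `initial_caches`, and `thread_caches` are hypothetical names, and it assumes the session returns its six outputs in the order listed in the table.

```python
import numpy as np

# Shapes and dtypes copied from the I/O contract table (batch = 1).
CACHE_SPEC = {
    "pre_cache": ((1, 128, 16), np.float32),
    "cache_last_channel": ((17, 1, 70, 512), np.float32),
    "cache_last_time": ((17, 1, 512, 8), np.float32),
    "cache_last_channel_len": ((1,), np.int32),
}

def initial_caches():
    """Return zero-filled caches for the first chunk of an utterance."""
    return {name: np.zeros(shape, dtype=dtype)
            for name, (shape, dtype) in CACHE_SPEC.items()}

def thread_caches(outputs):
    """Map the encoder's six outputs to the next step's cache inputs.

    outputs[0] and outputs[1] are encoded_output / encoded_length; the
    remaining four are the new_* tensors, in the same order as CACHE_SPEC.
    """
    if len(outputs) != 6:
        raise ValueError(f"expected 6 outputs, got {len(outputs)}")
    new_caches = dict(zip(CACHE_SPEC, outputs[2:]))
    for name, (shape, dtype) in CACHE_SPEC.items():
        tensor = new_caches[name]
        if tensor.shape != shape or tensor.dtype != np.dtype(dtype):
            raise ValueError(
                f"{name}: got {tensor.shape}/{tensor.dtype}, "
                f"expected {shape}/{np.dtype(dtype)}"
            )
    return new_caches
```

Checking shapes on every step is cheap and catches the most likely integration mistake, which is feeding an output back into the wrong input.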

The decoder + joint network holds no encoder-side state. Its only recurrent state is the prediction-network LSTM state, which is already threaded through the existing input_states_1 / input_states_2 inputs, so the ysdede decoder ONNX is reused unchanged.
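To illustrate how the LSTM state is carried alongside the encoder caches, here is a generic greedy RNN-T loop over one encoded chunk. It is a sketch under stated assumptions: the model card names only the input_states_1 / input_states_2 inputs, so the decoder call is hidden behind a hypothetical `joint_step` callable, and `blank_id` and `max_symbols` are parameters the caller must supply from config.json or their own settings.

```python
import numpy as np

def greedy_rnnt_decode(encoded, joint_step, state, last_token, blank_id, max_symbols=10):
    """Greedy RNN-T decoding over one encoder chunk.

    encoded:    [1, hidden, T] encoder output for this chunk.
    joint_step: hypothetical callable (frame, last_token, state) -> (logits, new_state)
                wrapping the decoder+joint ONNX session.
    state:      prediction-network LSTM state, carried across chunks by the caller.
    Returns the emitted token ids plus the (last_token, state) to carry forward.
    """
    tokens = []
    for t in range(encoded.shape[2]):
        frame = encoded[:, :, t]
        for _ in range(max_symbols):
            logits, new_state = joint_step(frame, last_token, state)
            token = int(np.argmax(logits))
            if token == blank_id:
                break  # blank: move to the next encoder frame, keep the old state
            tokens.append(token)
            last_token, state = token, new_state
    return tokens, last_token, state
```

The returned `last_token` and `state` are passed into the next chunk's call, in the same way the encoder's new_* outputs are passed back as cache inputs.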

Numerical fidelity

Compared to NVIDIA's PyTorch reference (cache_aware_stream_step()), running 11 sequential cache-threaded chunks of real audio:

FP32 export vs. PyTorch: max absolute difference 3.6 × 10⁻⁵ (~30× tighter than the 1 × 10⁻³ tolerance used in the parity check).
FP16 export vs. FP32 export: max relative error ~0.5% on intermediate activations; downstream RNN-T greedy decoding produces character-identical output for the validation clips.
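A parity check of this kind can be sketched with two small NumPy helpers. The model card does not say exactly how the relative error was computed, so the `eps` guard against division by zero and the function names are assumptions.

```python
import numpy as np

def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference, computed in float64."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.max(np.abs(ref - cand)))

def max_rel_error(reference, candidate, eps=1e-6):
    """Largest element-wise relative error; eps avoids division by zero."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.max(np.abs(ref - cand) / (np.abs(ref) + eps)))

def within_tolerance(reference, candidate, atol=1e-3):
    """True when the export matches the reference to the given absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol
```

For a streaming model, the comparison should be run on every chunk of a cache-threaded sequence, not only the first, since errors in the cache outputs compound across steps.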

End-to-end validation

The cache-aware encoder was run on two real desktop-recorded clips and compared with the same model loaded with the cache I/O stripped:

Clip	Stateless output (broken)	Cache-aware output (this export)
13 s clip	`we can talk` / `ovilla` / `c` / `is a specation for` / `and theya`	`we can talk a litt` / `also has now introduced` / `enough whether anyone` / `shipped any of it yet but like` / `re is a specation for`
4 s clip	`there some some some` / `that these` / `ven` / `ans do but you know for the most` / `ate`	`there are some are some are some are` / `that you can use that` / `these extensions do but you know for`

Note: "extensions" survives as one whole word in the cache-aware output, whereas the stateless export fragments "these extensions" into these / ven / ans at chunk boundaries.

Usage

A minimal cache-threading loop in Python:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("streaming_encoder.fp16.onnx", providers=["CPUExecutionProvider"])

# Initial caches: all zeros.
pre_cache = np.zeros((1, 128, 16), dtype=np.float32)
cache_lc = np.zeros((17, 1, 70, 512), dtype=np.float32)
cache_lt = np.zeros((17, 1, 512, 8), dtype=np.float32)
cache_len = np.zeros((1,), dtype=np.int32)

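# Placeholder input: replace with real log-mel chunks, each shaped [1, 128, 128].
stream_of_mel_features = [np.zeros((1, 128, 128), dtype=np.float32)]

for mel_chunk in stream_of_mel_features: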
    feeds = {
        "audio_signal": mel_chunk,
        "audio_length": np.asarray([128], dtype=np.int32),
        "pre_cache": pre_cache,
        "cache_last_channel": cache_lc,
        "cache_last_time": cache_lt,
        "cache_last_channel_len": cache_len,
    }
    encoded, enc_len, pre_cache, cache_lc, cache_lt, cache_len = sess.run(None, feeds)
    # ...feed `encoded` (shape [1, 512, 16]) into the decoder+joint network...
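The loop above assumes `stream_of_mel_features` already yields fixed-size chunks. One way to produce them from a full log-mel array is sketched below. The helper name `mel_chunks`, the non-overlapping split, and the zero-padding of a short final chunk are all assumptions, since the model card does not specify how a partial last chunk should be handled.

```python
import numpy as np

def mel_chunks(mel, chunk_frames=128):
    """Yield (chunk, valid_frames) pairs from a [1, n_mels, T] log-mel array.

    The final chunk is zero-padded to chunk_frames (an assumption); valid_frames
    is the value that would be passed as audio_length for that chunk.
    """
    total = mel.shape[2]
    for start in range(0, total, chunk_frames):
        chunk = mel[:, :, start:start + chunk_frames]
        valid = chunk.shape[2]
        if valid < chunk_frames:
            pad = np.zeros(mel.shape[:2] + (chunk_frames - valid,), dtype=mel.dtype)
            chunk = np.concatenate([chunk, pad], axis=2)
        yield (np.ascontiguousarray(chunk, dtype=np.float32),
               np.asarray([valid], dtype=np.int32))
```

With this helper, the loop would iterate `for mel_chunk, audio_length in mel_chunks(mel):` and pass `audio_length` in place of the hard-coded 128.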

License and attribution

This export is derived from NVIDIA's parakeet_realtime_eou_120m-v1 checkpoint and is therefore released under the NVIDIA Open Model License (June 2024). The wrapper pattern is adapted from FluidAudio's convert_streaming_encoder.py, also under the NVIDIA Open Model License.

Original model: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
Stateless community ONNX (decoder reused from this): https://huggingface.co/ysdede/parakeet-realtime-eou-120m-v1-onnx
CoreML cache-aware export and wrapper code: https://huggingface.co/FluidInference/parakeet-realtime-eou-120m-coreml
License: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
