Parakeet Realtime EOU 120M v1: cache-aware ONNX (FP16)
This is an ONNX export of NVIDIA's parakeet_realtime_eou_120m-v1 FastConformer RNN-T model with the cache-aware encoder I/O ports preserved, so it can be used for streaming inference under ONNX Runtime without mid-word corruption at chunk boundaries.
It is intended as a drop-in replacement for the community export at ysdede/parakeet-realtime-eou-120m-v1-onnx for users who need the cache-aware streaming surface.
Why this export exists
The encoder is a 17-layer FastConformer with attention context [70, 1] (5.6 s left, 80 ms right). When that left-context state is not threaded across chunks, the encoder warms up mid-word at every chunk boundary, producing artifacts like talk / ovilla / c for the word "talkative". The community ONNX export strips the cache I/O ports, which is fine for full-utterance offline use but breaks streaming.
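The context durations quoted above follow from simple frame-rate arithmetic. The sketch below reproduces them, assuming the 10 ms mel hop stated in the I/O contract and an 8× encoder subsampling factor (inferred from 128 mel frames mapping to 16 encoded frames); the constant names are illustrative only.

```python
MEL_HOP_MS = 10        # mel hop, as stated in the I/O contract
SUBSAMPLING = 8        # assumption: 128 mel frames -> 16 encoded frames
ATT_CONTEXT = (70, 1)  # [left, right] attention context, in encoder frames

# One encoder frame covers SUBSAMPLING mel hops.
frame_ms = MEL_HOP_MS * SUBSAMPLING
left_ms = ATT_CONTEXT[0] * frame_ms
right_ms = ATT_CONTEXT[1] * frame_ms

print(f"encoder frame: {frame_ms} ms, left: {left_ms / 1000} s, right: {right_ms} ms")
```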
This export wraps cache_aware_stream_step() so all four cache tensors are exposed as explicit ONNX inputs and outputs, allowing the caller to thread state across chunks. The wrapper pattern is taken from FluidAudio's convert_streaming_encoder.py, translated from CoreML to ONNX.
Files
| File | Size | Purpose |
|---|---|---|
| `streaming_encoder.fp16.onnx` | 222 MB | Cache-aware FastConformer encoder (FP16 weights, FP32 I/O) |
| `decoder_joint-model.int8.onnx` | 5.2 MB | RNN-T prediction + joint network (INT8, reused unchanged from ysdede) |
| `vocab.txt` | 8 KB | 1027-line tokenizer vocabulary (1024 BPE + EOU + EOB + blank) |
| `config.json` | <1 KB | Model metadata (vocab size, special token ids, subsampling) |
| `streaming_encoder.meta.json` | <1 KB | Cache shapes + I/O contract for the cache-aware encoder |
Cache-aware I/O contract
The streaming encoder takes six inputs and produces six outputs (cache shapes are for batch=1):
| Tensor | Shape | dtype | Purpose |
|---|---|---|---|
| Inputs | | | |
| `audio_signal` | `[1, 128, 128]` | float32 | Log-mel features, 128 mels × 128 frames (= 1.28 s @ 10 ms hop) |
| `audio_length` | `[1]` | int32 | Number of valid mel frames in `audio_signal` |
| `pre_cache` | `[1, 128, 16]` | float32 | Last 16 mel frames from the previous chunk (zero on first call) |
| `cache_last_channel` | `[17, 1, 70, 512]` | float32 | Per-layer attention KV cache (17 layers × 70 frames × 512 hidden) |
| `cache_last_time` | `[17, 1, 512, 8]` | float32 | Per-layer time-conv cache (17 layers × 512 hidden × 8 frames) |
| `cache_last_channel_len` | `[1]` | int32 | How many valid frames the channel cache currently holds |
| Outputs | | | |
| `encoded_output` | `[1, 512, 16]` | float32 | Encoded features, ready to feed into decoder+joint |
| `encoded_length` | `[1]` | int32 | Number of valid encoded frames |
| `new_pre_cache` | `[1, 128, 16]` | float32 | Refreshed mel pre-cache for the next chunk |
| `new_cache_last_channel` | `[17, 1, 70, 512]` | float32 | Refreshed attention KV cache |
| `new_cache_last_time` | `[17, 1, 512, 8]` | float32 | Refreshed time-conv cache |
| `new_cache_last_channel_len` | `[1]` | int32 | Refreshed channel cache length counter |
On the first chunk of an utterance, all four cache tensors are zero-initialized. On every subsequent chunk, the caller passes the previous step's new_* outputs back in as the corresponding inputs.
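Because a shape or dtype mismatch in any cache tensor fails only at run time, it can help to check feeds against the contract before calling the session. The following is a minimal sketch of such a check plus the cache hand-off described above; `INPUT_CONTRACT`, `check_feeds`, and `next_caches` are hypothetical helper names, not part of this repository, and the shapes are copied from the contract table for batch=1.

```python
import numpy as np

# Shapes and dtypes copied from the I/O contract table (batch = 1).
INPUT_CONTRACT = {
    "audio_signal": ((1, 128, 128), np.float32),
    "audio_length": ((1,), np.int32),
    "pre_cache": ((1, 128, 16), np.float32),
    "cache_last_channel": ((17, 1, 70, 512), np.float32),
    "cache_last_time": ((17, 1, 512, 8), np.float32),
    "cache_last_channel_len": ((1,), np.int32),
}

# The four tensors that must be threaded from one chunk to the next.
CACHE_NAMES = ("pre_cache", "cache_last_channel",
               "cache_last_time", "cache_last_channel_len")


def check_feeds(feeds):
    """Return a list of contract violations (empty when the feeds are valid)."""
    problems = []
    for name, (shape, dtype) in INPUT_CONTRACT.items():
        if name not in feeds:
            problems.append(f"missing input: {name}")
            continue
        arr = feeds[name]
        if tuple(arr.shape) != shape:
            problems.append(f"{name}: shape {tuple(arr.shape)} != {shape}")
        if arr.dtype != np.dtype(dtype):
            problems.append(f"{name}: dtype {arr.dtype} != {np.dtype(dtype)}")
    return problems


def next_caches(outputs_by_name):
    """Map each new_* output back to the input name it feeds on the next chunk."""
    return {name: outputs_by_name["new_" + name] for name in CACHE_NAMES}
```

Here `outputs_by_name` is assumed to be a dict the caller builds by pairing the session's output names with the returned arrays.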
The decoder + joint network needs no encoder-side cache: its only state is the LSTM state, which is already threaded through its existing input_states_1 / input_states_2 inputs. The ysdede decoder ONNX is therefore reused unchanged.
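To illustrate how the decoder state is carried, here is a sketch of standard greedy RNN-T decoding over one chunk of encoded frames. It is not the code used to produce this export. The `step` callable is hypothetical: it stands in for one call to the decoder+joint ONNX and is expected to return `(logits, new_state)`. The blank id must come from the caller (for example from config.json), and starting from the blank token is an assumption based on common RNN-T practice. EOU/EOB handling is deliberately left out.

```python
def greedy_rnnt_decode(frames, step, blank_id, last_token=None, state=None,
                       max_symbols=10):
    """Greedy RNN-T decoding of one chunk.

    frames:      iterable of encoder frames (one per time step)
    step:        hypothetical callable (frame, last_token, state) -> (logits, new_state)
    blank_id:    index of the blank token in the logits
    last_token:  last emitted token from the previous chunk (None on first chunk)
    state:       decoder LSTM state from the previous chunk (None on first chunk)
    max_symbols: cap on tokens emitted per frame, to avoid infinite loops
    """
    if last_token is None:
        last_token = blank_id  # assumption: blank doubles as start-of-sequence
    tokens = []
    for frame in frames:
        for _ in range(max_symbols):
            logits, new_state = step(frame, last_token, state)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: move to the next frame, keep the old state
            # Non-blank: emit the token and commit the new decoder state.
            tokens.append(best)
            last_token, state = best, new_state
    # Return the carry so the next chunk continues where this one stopped.
    return tokens, last_token, state
```

Passing the returned `last_token` and `state` into the next call is what keeps the prediction network continuous across chunk boundaries, mirroring the cache threading on the encoder side.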
Numerical fidelity
Compared to NVIDIA's PyTorch reference (cache_aware_stream_step()), running 11 sequential cache-threaded chunks of real audio:
- FP32 export vs. PyTorch: max absolute difference 3.6 × 10⁻⁵ (~30× tighter than the 1 × 10⁻³ tolerance from the parity check).
- FP16 export vs. FP32 export: max relative error ~0.5% on intermediate activations; downstream RNN-T greedy decoding produces character-identical output for the validation clips.
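For readers who want to run a similar comparison, the two metrics can be computed as sketched below. The exact definitions used for the figures above are not stated here, so the formulas are assumptions, in particular the `eps` guard in the relative error. The function names are illustrative.

```python
import numpy as np


def max_abs_diff(ref, test):
    """Largest element-wise absolute difference between two arrays."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return float(np.max(np.abs(ref - test)))


def max_rel_err(ref, test, eps=1e-6):
    """Largest element-wise relative error; eps guards against division by zero."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return float(np.max(np.abs(ref - test) / (np.abs(ref) + eps)))


# Headroom of the reported FP32 difference against the parity tolerance:
# roughly 27.8, which the text rounds to "~30x".
headroom = 1e-3 / 3.6e-5
```

In a cache-threaded comparison, these would be applied to every output tensor of every chunk, since an error in a cache output propagates into all later chunks.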
End-to-end validation
Running the cache-aware encoder against two real desktop-recorded clips and comparing to the same model loaded with the cache I/O stripped:
| Clip | Stateless output (broken) | Cache-aware output (this export) |
|---|---|---|
| 13 s clip | we can talk / ovilla / c / is a specation for / and theya | we can talk a litt / also has now introduced / enough whether anyone / shipped any of it yet but like / re is a specation for |
| 4 s clip | there some some some / that these / ven / ans do but you know for the most / ate | there are some are some are some are / that you can use that / these extensions do but you know for |
Note: "extensions" survives as one whole word in the cache-aware output, whereas the stateless export splits it into these / ven / ans at chunk boundaries.
Usage
A minimal cache-threading loop in Python:
```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("streaming_encoder.fp16.onnx", providers=["CPUExecutionProvider"])

# Initial caches: all zeros.
pre_cache = np.zeros((1, 128, 16), dtype=np.float32)
cache_lc = np.zeros((17, 1, 70, 512), dtype=np.float32)
cache_lt = np.zeros((17, 1, 512, 8), dtype=np.float32)
cache_len = np.zeros((1,), dtype=np.int32)

# `stream_of_mel_features` is a placeholder for your own iterable of float32 log-mel chunks.
for mel_chunk in stream_of_mel_features:  # each is [1, 128, 128]
    feeds = {
        "audio_signal": mel_chunk,
        "audio_length": np.asarray([128], dtype=np.int32),
        "pre_cache": pre_cache,
        "cache_last_channel": cache_lc,
        "cache_last_time": cache_lt,
        "cache_last_channel_len": cache_len,
    }
    # sess.run(None, ...) returns all outputs in the model's declared order.
    # Rebinding the cache variables threads the state into the next iteration.
    encoded, enc_len, pre_cache, cache_lc, cache_lt, cache_len = sess.run(None, feeds)
    # ...feed `encoded` (shape [1, 512, 16]) into the decoder+joint network...
```
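The loop above leaves `stream_of_mel_features` undefined. One possible way to produce it from a precomputed log-mel matrix is sketched below. Zero-padding the final partial chunk and reporting the true frame count through `audio_length` is an assumption based on the I/O contract, not a confirmed recipe, and `iter_mel_chunks` is a hypothetical helper name.

```python
import numpy as np


def iter_mel_chunks(mel, chunk_frames=128):
    """Yield (chunk, valid_frames) pairs from a [n_mels, T] log-mel matrix.

    chunk has shape [1, n_mels, chunk_frames] (float32); the last chunk is
    zero-padded, and valid_frames (int32, shape [1]) counts the real frames.
    """
    n_mels, total = mel.shape
    for start in range(0, total, chunk_frames):
        piece = mel[:, start:start + chunk_frames]
        valid = piece.shape[1]
        chunk = np.zeros((1, n_mels, chunk_frames), dtype=np.float32)
        chunk[0, :, :valid] = piece
        yield chunk, np.asarray([valid], dtype=np.int32)
```

With this helper, the loop would iterate `for mel_chunk, audio_length in iter_mel_chunks(mel):` and pass `audio_length` instead of the hard-coded 128.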
License and attribution
This export is derived from NVIDIA's parakeet_realtime_eou_120m-v1 checkpoint and is therefore released under the NVIDIA Open Model License (June 2024). The wrapper pattern is adapted from FluidAudio's convert_streaming_encoder.py, also under the NVIDIA Open Model License.
- Original model: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
- Stateless community ONNX (decoder reused from this): https://huggingface.co/ysdede/parakeet-realtime-eou-120m-v1-onnx
- CoreML cache-aware export and wrapper code: https://huggingface.co/FluidInference/parakeet-realtime-eou-120m-coreml
- License: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf