parakeet-tdt-0.6b-v2 ONNX (split decoder/joint)

Re-export of nvidia/parakeet-tdt-0.6b-v2 to four separate ONNX files, int8 dynamic-quantized for CPU / DirectML inference:

| File | Inputs | Outputs |
|---|---|---|
| `preprocessor.int8.onnx` | `audio_signal [1, S] f32`, `audio_length [1] i32` | `mel [1, 128, F] f32`, `mel_length [1] i64` |
| `encoder.int8.onnx` | `mel [1, 128, F] f32`, `mel_length [1] i32` | `encoder [1, 1024, T] f32`, `encoder_length [1] i64` |
| `decoder.int8.onnx` | `targets [1, U] i32`, `target_length [1] i32`, `h_in [2, 1, 640] f32`, `c_in [2, 1, 640] f32` | `decoder [1, 640, 2] f32`, `h_out`, `c_out` |
| `joint_decision.int8.onnx` | `encoder [1, 1024, T] f32`, `decoder [1, 640, U] f32` | `token_id [1, T, U] i32`, `token_prob [1, T, U] f32`, `duration [1, T, U] i32` |

joint_decision fuses the joint network with the decision head (argmax over token logits + argmax over duration logits + gather for token probability).
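As a rough illustration of what the fused decision head computes for a single `(t, u)` cell, here is a minimal pure-Python sketch. It assumes the joint output is one logit vector with the token logits (blank included) first and the duration logits after them, and that `token_prob` is the softmax probability of the winning token. Neither detail is confirmed by this card; `decide` and `num_token_logits` are hypothetical names used only for the example.

```python
import math
from typing import Sequence, Tuple


def decide(logits: Sequence[float], num_token_logits: int) -> Tuple[int, float, int]:
    """Return (token_id, token_prob, duration_index) for one joint output vector.

    Assumed layout: logits[:num_token_logits] are token logits,
    the remainder are duration logits.
    """
    token_logits = logits[:num_token_logits]
    duration_logits = logits[num_token_logits:]

    # Argmax over token logits and over duration logits.
    token_id = max(range(len(token_logits)), key=token_logits.__getitem__)
    duration_index = max(range(len(duration_logits)), key=duration_logits.__getitem__)

    # "Gather": softmax probability of the winning token only.
    # Subtracting the peak logit keeps exp() numerically stable.
    peak = token_logits[token_id]
    token_prob = 1.0 / sum(math.exp(x - peak) for x in token_logits)
    return token_id, token_prob, duration_index


# Toy vector: 3 token logits followed by 3 duration logits.
token_id, token_prob, duration_index = decide([0.0, 2.0, 1.0, 0.5, 3.0, 0.1], 3)
```

The duration output here is an index into the model's duration table. If that table is `[0, 1, 2, 3, 4]` (an assumption about this checkpoint), the index and the number of frames to skip coincide.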

Why split?

NeMo's own asr_model.export() and istupakov/parakeet-tdt-0.6b-v2-onnx fuse the decoder and joint network into a single ONNX file. That's fine for inference engines that call the full TDT decoder loop in one go, but it doesn't fit pipelines that drive the loop themselves and need the sub-graphs callable independently (e.g. the talat Rust inference layer, which mirrors FluidAudio's macOS CoreML 4-file decomposition).
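To make "driving the loop yourself" concrete, the sketch below shows one way a greedy TDT loop could be structured around two callables. It is not the talat or FluidAudio implementation. `decoder_step` and `joint_step` are hypothetical wrappers that would each run one of the ONNX sub-graphs; feeding the blank id as the start-of-sequence token, the `max_symbols_per_frame` guard, and the blank id `1024` in the demo are assumptions.

```python
from typing import Any, Callable, List, Optional, Tuple

# decoder_step(token, state) -> (decoder_output, new_state)
DecoderStep = Callable[[int, Optional[Any]], Tuple[Any, Any]]
# joint_step(frame_index, decoder_output) -> (token_id, token_prob, duration)
JointStep = Callable[[int, Any], Tuple[int, float, int]]


def greedy_tdt_decode(
    num_frames: int,
    decoder_step: DecoderStep,
    joint_step: JointStep,
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    tokens: List[int] = []
    # Prime the prediction network (assumption: blank doubles as start token).
    dec_out, state = decoder_step(blank_id, None)
    t = 0
    emitted_at_t = 0
    while t < num_frames:
        token, _prob, duration = joint_step(t, dec_out)
        if token != blank_id:
            tokens.append(token)
            # Only non-blank tokens advance the prediction network.
            dec_out, state = decoder_step(token, state)
            emitted_at_t += 1
        # Force progress if the model predicts "stay" on a blank,
        # or keeps emitting on the same frame for too long.
        if duration == 0 and (token == blank_id or emitted_at_t >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            t += duration
            emitted_at_t = 0
    return tokens


# Stub sub-graphs so the control flow can be exercised without the model files.
BLANK = 1024  # assumed blank id, for the demo only


def fake_decoder(token: int, state: Optional[Any]) -> Tuple[int, Optional[Any]]:
    return token, state  # "decoder output" is just the last token fed in


script = {
    (0, BLANK): (5, 0.9, 1),
    (1, 5): (BLANK, 0.8, 2),
    (3, 5): (7, 0.9, 0),  # duration 0: stay on frame 3 and query again
    (3, 7): (8, 0.9, 2),
    (5, 8): (BLANK, 0.9, 0),  # blank with duration 0 is forced forward
}


def fake_joint(t: int, dec_out: int) -> Tuple[int, float, int]:
    return script[(t, dec_out)]


demo_tokens = greedy_tdt_decode(6, fake_decoder, fake_joint, BLANK)
print(demo_tokens)  # [5, 7, 8]
```

In a real pipeline the two stubs would be replaced by calls into `decoder.int8.onnx` (carrying `h_in`/`c_in` as the state) and `joint_decision.int8.onnx`. With a fused decoder+joint graph, the caller could not interleave the two calls this way.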

The PyTorch wrappers used to extract the four sub-graphs are adapted from FluidInference/mobius (Apache 2.0).

Quantization

Per-channel int8 weight-only quantization via onnxruntime.quantization.quantize_dynamic. Activations remain fp32 at runtime, which keeps the int8 path stable across the CPU EP and DirectML without needing a calibration dataset. The encoder shrinks by about 3.7× (2.3 GB fp32 → 625 MB int8). Token-level accuracy is unchanged versus the fp32 baseline.
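A minimal sketch of how one fp32 sub-graph might be quantized this way is shown below. `quantize_dynamic`, `per_channel` and `QuantType` are real ONNX Runtime names. The fp32 file name, the helper names `int8_path` and `quantize_file`, and the choice of `QuantType.QInt8` over `QUInt8` are assumptions, since the card does not show the exact call that was used.

```python
from pathlib import Path


def int8_path(fp32_path: Path) -> Path:
    """Map e.g. encoder.onnx -> encoder.int8.onnx in the same directory."""
    return fp32_path.with_name(f"{fp32_path.stem}.int8.onnx")


def quantize_file(fp32_path: Path) -> Path:
    """Quantize one exported fp32 graph; returns the path of the int8 file."""
    # Imported lazily so the path helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_path(fp32_path)
    quantize_dynamic(
        model_input=str(fp32_path),
        model_output=str(out_path),
        per_channel=True,             # one scale per output channel
        weight_type=QuantType.QInt8,  # assumption: signed int8 weights
    )
    return out_path


# Hypothetical usage, assuming an fp32 export named encoder.onnx exists:
# quantize_file(Path("encoder.onnx"))
```

No calibration data appears anywhere in the call, which is the property the paragraph above relies on.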

License

Inherits NVIDIA Parakeet TDT v2's license (CC-BY-4.0).
