VisualEars FastConformer Persian ASR Full A+B

Persian/Farsi ASR fine-tune for the small/fast VisualEars model, starting from `nvidia/stt_fa_fastconformer_hybrid_large` and trained on the full A+B training mix.

Main Checkpoint

fa_fastconformer_ab_final.nemo: final NeMo FastConformer hybrid RNNT/CTC checkpoint from the full A+B run.

Runtime Exports

Canonical runtime exports live in separate derivative model repos so Hugging Face can attach them to this fine-tune as quantized/export variants:

| Repo | Format | Validation |
| --- | --- | --- |
| `visualears-fastconformer-fa-full-ab-onnx-fp` | ONNX FP fixed CTC core | 100.00% CTC argmax parity |
| `visualears-fastconformer-fa-full-ab-onnx-w4` | ONNX Runtime weight-only 4-bit, asymmetric block-32 | 98.61% CTC argmax parity |
| `visualears-fastconformer-fa-full-ab-coreml-fp16` | CoreML FP16 fixed CTC core | 99.85% CTC argmax parity |
| `visualears-fastconformer-fa-full-ab-coreml-w4` | CoreML 4-bit k-means palettized, compressed variant | 98.06% CTC argmax parity |
| `visualears-fastconformer-fa-full-ab-coreml-w4-quality` | CoreML 4-bit k-means palettized, quality-first variant | 99.65% CTC argmax parity |
| `visualears-fastconformer-fa-full-ab-litert-fp` | LiteRT/TFLite FP fixed CTC core | 100.00% CTC argmax parity; 100.00% transcript parity on 16 calibration items |
| `visualears-fastconformer-fa-full-ab-litert-w4` | LiteRT/TFLite selected fully-connected weight-only 4-bit | 98.23% frame CTC argmax parity; failed transcript parity at 37.5% on 16 calibration items |
| `visualears-fastconformer-fa-full-ab-fp16` | NeMo FP16 reduced-precision checkpoint | 98.0% exact transcript match vs FP base on 200 FLEURS-fa eval clips |
| `visualears-fastconformer-fa-full-ab-fp8` | NeMo FP8 via NVIDIA ModelOpt | 18.48% WER / 6.69% CER on 200 FLEURS-fa eval clips; 99.47% WER retention vs FP base |
| `visualears-fastconformer-fa-full-ab-nvfp4` | NeMo NVFP4 W4A4 via NVIDIA ModelOpt | 20.33% WER / 7.38% CER on 200 FLEURS-fa eval clips |
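
The "CTC argmax parity" figures can be read as the share of frames on which an exported model and the reference model pick the same top-scoring token. The sketch below assumes that frame-by-frame definition; the evaluation script itself is not part of this card, and the function name and toy scores are hypothetical.

```python
from typing import Sequence


def ctc_argmax_parity(
    ref_logits: Sequence[Sequence[float]],
    test_logits: Sequence[Sequence[float]],
) -> float:
    """Return the percentage of frames whose argmax token matches.

    Both inputs are [frames][vocab] score matrices for the same audio.
    Assumption: parity is counted frame by frame; this card does not
    document the exact procedure.
    """
    if len(ref_logits) != len(test_logits):
        raise ValueError("both outputs must have the same number of frames")
    if not ref_logits:
        raise ValueError("need at least one frame")

    def argmax(row: Sequence[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(
        argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits)
    )
    return 100.0 * matches / len(ref_logits)


# Toy data: 4 frames, 3 tokens. The second model disagrees on the third frame.
reference = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [2.2, 0.4, 0.1]]
quantized = [[0.2, 1.8, 0.3], [1.4, 0.3, 0.1], [0.0, 2.9, 0.1], [2.0, 0.5, 0.2]]
parity = ctc_argmax_parity(reference, quantized)
print(f"{parity:.2f}% CTC argmax parity")
```

High frame parity does not guarantee identical transcripts, which is consistent with the LiteRT W4 row: a few disagreeing frames can change the decoded text.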

The export repos contain fixed-frame acoustic CTC-core artifacts. They take precomputed log-mel features as the `processed_signal` input and are not full raw-audio-to-text pipelines on their own: feature extraction and text decoding happen outside the exported graph.
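
Since the exports end at the CTC core, the caller has to turn per-frame predictions into tokens. A minimal sketch of greedy CTC decoding (collapse consecutive repeats, then drop blanks) follows. It assumes the blank token is the last vocabulary index, which should be checked against the actual export; the four-character vocabulary is a hypothetical stand-in for the model's subword tokenizer.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Collapse consecutive repeats, then drop blank tokens."""
    decoded: List[int] = []
    previous: Optional[int] = None
    for token in frame_ids:
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# Hypothetical toy vocabulary; the real model uses a subword tokenizer.
vocab = ["س", "ل", "ا", "م"]
blank_id = len(vocab)  # assumption: blank is the last index

# Per-frame argmax ids as they might come out of an exported CTC core.
frames = [4, 0, 0, 4, 1, 1, 4, 4, 2, 3, 3, 4]
token_ids = ctc_greedy_decode(frames, blank_id)
text = "".join(vocab[i] for i in token_ids)
print(token_ids, text)
```

A blank between two identical ids keeps both, so `[1, 4, 1]` decodes to two tokens rather than one.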

Training Snapshot

Train manifest: 6,231,918 rows
Validation manifest: 31,424 rows
Final train step: 48,687
NeMo architecture: FastConformer hybrid RNNT/CTC

Benchmarks

External benchmark snapshot from June 10, 2026:

| Decoder | Golha gold-69 WER (%) | FLEURS fa WER (%) | FLEURS fa CER (%) |
| --- | --- | --- | --- |
| RNNT greedy | 25.29 | 15.73 | 5.25 |
| CTC + 4-gram LM, alpha=0.2 beta=-1.0 beam=50 | 25.96 | 13.60 | 5.39 |

The LM setting was calibrated on a FLEURS-256 slice and helped FLEURS WER, but did not improve Golha in this snapshot.
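
To make the roles of alpha and beta concrete, the sketch below ranks beam candidates with the common shallow-fusion form `score = log P_ctc + alpha * log P_lm + beta * word_count`. That scoring rule is an assumption about the decoder, not something this card states, and the candidate strings and log-probabilities are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    ctc_logprob: float  # acoustic log-probability from the CTC head
    lm_logprob: float   # log-probability from the n-gram LM


def fused_score(hyp: Hypothesis, alpha: float, beta: float) -> float:
    """Shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    word_count = len(hyp.text.split())
    return hyp.ctc_logprob + alpha * hyp.lm_logprob + beta * word_count


def rescore(hyps: List[Hypothesis], alpha: float, beta: float) -> List[Hypothesis]:
    """Return hypotheses sorted from best to worst fused score."""
    return sorted(hyps, key=lambda h: fused_score(h, alpha, beta), reverse=True)


# Made-up candidates: the first splits one word in two, the second does not.
beam = [
    Hypothesis("به کتاب خانه رفتم", ctc_logprob=-4.0, lm_logprob=-12.0),
    Hypothesis("به کتابخانه رفتم", ctc_logprob=-4.3, lm_logprob=-9.0),
]

best_with_lm = rescore(beam, alpha=0.2, beta=-1.0)[0]
best_acoustic_only = rescore(beam, alpha=0.0, beta=0.0)[0]
print(best_with_lm.text)
print(best_acoustic_only.text)
```

Under this form, a negative beta such as -1.0 lowers the score for every additional word, so it favors shorter segmentations. That could explain why a setting tuned on one dataset does not transfer to another.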

Notes

This is a research checkpoint. Reported WER/CER depend on text normalization and tokenization choices, so compare numbers only when both are held fixed.
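
As an illustration of how much normalization can move the numbers, the sketch below computes WER and CER with a plain Levenshtein distance, before and after mapping Arabic yeh and kaf to their Persian forms. The `normalize_fa` rules are a hypothetical example, not the normalization used for the figures in this card.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent (spaces count as characters)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def normalize_fa(text: str) -> str:
    """Hypothetical rules: Arabic yeh/kaf to Persian forms, collapse spaces."""
    table = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
    return " ".join(text.translate(table).split())


# Same two words; the hypothesis uses Arabic kaf (U+0643) and yeh (U+064A).
reference = "\u06a9\u062a\u0627\u0628 \u0639\u0644\u06cc"
hypothesis = "\u0643\u062a\u0627\u0628 \u0639\u0644\u064a"

raw_wer = wer(reference, hypothesis)
raw_cer = cer(reference, hypothesis)
norm_wer = wer(normalize_fa(reference), normalize_fa(hypothesis))
print(raw_wer, raw_cer, norm_wer)
```

In this toy case two visually near-identical strings score as fully wrong at the word level until the code points are unified.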
