VisualEars FA32M Streaming BPE1024 — LiteRT FP16 selective-weight

LiteRT/TFLite fixed-frame acoustic CTC-core export of Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024.

This is the FA32M length-aware core: it accepts precomputed NeMo-compatible log-mel features plus the real valid feature length, so short utterances do not get decoded as if all 2005 padded frames were valid.

Runtime contract

input 0 (serving_default_args_0): processed_signal float32 [1, 80, 2005]
input 1 (serving_default_args_1): processed_signal_length int64 [1] — valid log-mel frame count before zero padding
output 0 (serving_default_output_0_output): logits float32 [1, 252, 1025]
output 1 (serving_default_output_1_output): encoded_lengths int64 [1]
tokenizer blank id: 1024
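
The contract above can be exercised with a short script. This is a minimal sketch, not the author's inference code: run_model and greedy_ctc_ids are hypothetical helper names, tensors are matched by rank rather than by name (an assumption, chosen so the sketch does not depend on the exact tensor names), and the decoder is plain greedy CTC (argmax per frame, collapse repeats, drop blank id 1024) limited to the first encoded_lengths frames. Mapping the resulting ids to text depends on the layout of tokens.json, which is not shown here.

```python
def greedy_ctc_ids(logits, encoded_length, blank_id=1024):
    """Greedy CTC: argmax per frame, collapse repeats, then drop blanks.

    logits: sequence of frames, each a sequence of per-token scores
            (for this model, 252 frames of 1025 scores).
    encoded_length: number of valid frames (output 1 of the model).
    """
    ids, previous = [], None
    for frame in logits[:encoded_length]:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank_id:
            ids.append(best)
        previous = best
    return ids


def run_model(model_path, features, valid_frames):
    """Run one padded utterance; returns (logits [252, 1025], encoded_length).

    features: float32 array shaped [1, 80, 2005], zero-padded.
    valid_frames: log-mel frame count before zero padding.
    """
    # Imported lazily so greedy_ctc_ids works without these packages.
    import numpy as np
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    # Assumption: identify tensors by rank (3-D features/logits, 1-D lengths).
    for detail in interpreter.get_input_details():
        if len(detail["shape"]) == 3:
            interpreter.set_tensor(
                detail["index"], np.asarray(features, dtype=np.float32))
        else:
            interpreter.set_tensor(
                detail["index"], np.asarray([valid_frames], dtype=np.int64))
    interpreter.invoke()

    logits, encoded_length = None, None
    for detail in interpreter.get_output_details():
        tensor = interpreter.get_tensor(detail["index"])
        if tensor.ndim == 3:
            logits = tensor[0]
        else:
            encoded_length = int(tensor[0])
    return logits, encoded_length
```

A typical call would be logits, n = run_model(path, features, valid_frames) followed by greedy_ctc_ids(logits, n). Restricting decoding to the first n frames is what keeps padded frames out of the transcript.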

Artifact

File: fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite
Size: 67,222,016 bytes (60.79% of the full-precision (FP) LiteRT file)
SHA256: 9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd
Quantization: ai_edge_quantizer float-casting, bits=16, dtype=FLOAT
Quantized op family: FULLY_CONNECTED
Minimum weight elements: 250000
Activations and I/O remain float32/int64; this is selective weight storage quantization with dequantization around the quantized weights.
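
Before loading the model, a downloaded copy can be checked against the size and SHA256 listed above. The sketch below uses only the Python standard library; sha256_of and verify_artifact are hypothetical helper names, and the two constants are copied from this section.

```python
import hashlib
import os

EXPECTED_SHA256 = "9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd"
EXPECTED_SIZE = 67_222_016  # bytes


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so the whole model never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256=EXPECTED_SHA256,
                    expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the SHA256 match."""
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of(path) == expected_sha256
```

Checking the size first is a cheap way to catch truncated downloads before paying for the full hash.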

269-clip transcription parity

Source: PyTorch NeMo preprocessor + encoder + auxiliary CTC decoder in fp32, with transcripts decoded during the calibration export.
Candidate: this LiteRT/TFLite model through ai_edge_litert XNNPACK CPU.

Validation set: all 269 clips from Reza2kn/visualears-benchmark-269-gold.

Metric	Result
Exact transcript matches	269 / 269
Exact transcript parity	100.00%
Exact normalized transcript parity	100.00%
Mean character similarity	100.00%
Candidate non-empty rate	98.88%
Source non-empty rate	98.88%
Encoded length match rate	100.00%

Result: passes the >98% transcription parity gate.
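
For readers who want to reproduce metrics of this kind on their own transcripts, the following is one possible way to compute them. It is a sketch under stated assumptions, not the parity script shipped in scripts/: "normalized" is taken to mean whitespace-collapsed, and character similarity is taken to be difflib's SequenceMatcher ratio. The actual definitions used for the table may differ, and parity_report is a hypothetical name.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Assumed normalization: trim and collapse runs of whitespace."""
    return " ".join(text.split())


def parity_report(source, candidate):
    """Compare two equally long lists of transcripts, clip by clip."""
    if len(source) != len(candidate):
        raise ValueError("transcript lists must have the same length")
    if not source:
        raise ValueError("at least one clip is required")

    total = len(source)
    pairs = list(zip(source, candidate))
    exact = sum(s == c for s, c in pairs)
    exact_normalized = sum(normalize(s) == normalize(c) for s, c in pairs)
    # Assumed similarity metric: SequenceMatcher ratio in [0, 1].
    similarity = sum(SequenceMatcher(None, s, c).ratio() for s, c in pairs)

    return {
        "exact_matches": exact,
        "exact_parity": exact / total,
        "exact_normalized_parity": exact_normalized / total,
        "mean_char_similarity": similarity / total,
        "source_non_empty_rate": sum(bool(s) for s in source) / total,
        "candidate_non_empty_rate": sum(bool(c) for c in candidate) / total,
    }
```

Under this definition a pass would correspond to exact_parity above 0.98. The non-empty rates are reported separately because two empty transcripts count as an exact match.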

Feature contract

Use the sidecars preprocessor.json and mel_filters_slaney_80x257.json:

sample rate: 16 kHz mono
preemphasis: 0.97
STFT: n_fft=512, win_length=400, hop_length=160, centered with reflect padding
mel: Slaney/librosa 80-bin filterbank from sidecar
log: natural log with tiny floor
no per-bin normalization (normalize=NA)
zero-pad or truncate features to 2005 frames, and pass the valid frame count (before zero padding) as processed_signal_length
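
The feature steps listed above can be sketched in NumPy as follows. Several details are assumptions not stated in this card and should be checked against preprocessor.json and scripts/: a symmetric Hann window zero-padded to n_fft, a power (magnitude squared) spectrogram, a floor that is added before the log rather than used as a clamp, and a valid frame count equal to the number of centered STFT frames. The floor value itself is not given here, so log_floor is a required argument. log_mel_features and pad_to_contract are hypothetical names, and mel_filters is expected as an [80, 257] array loaded from the sidecar.

```python
import numpy as np

N_FFT, WIN_LENGTH, HOP_LENGTH = 512, 400, 160
N_MELS, MAX_FRAMES = 80, 2005
PREEMPHASIS = 0.97


def log_mel_features(audio, mel_filters, log_floor):
    """audio: 1-D 16 kHz mono waveform. Returns float32 features [80, T]."""
    audio = np.asarray(audio, dtype=np.float32)
    # Preemphasis: y[0] = x[0], y[t] = x[t] - 0.97 * x[t - 1].
    emphasized = np.concatenate(
        [audio[:1], audio[1:] - PREEMPHASIS * audio[:-1]])
    # Centered STFT: reflect-pad n_fft // 2 samples on each side.
    padded = np.pad(emphasized, N_FFT // 2, mode="reflect")
    # Assumption: symmetric Hann of win_length, centered inside n_fft.
    window = np.zeros(N_FFT, dtype=np.float32)
    offset = (N_FFT - WIN_LENGTH) // 2
    window[offset:offset + WIN_LENGTH] = np.hanning(WIN_LENGTH)
    # One frame every hop_length samples: shape [T, n_fft].
    frames = np.lib.stride_tricks.sliding_window_view(
        padded, N_FFT)[::HOP_LENGTH]
    spectrum = np.fft.rfft(frames * window, axis=-1)  # [T, 257]
    power = np.abs(spectrum) ** 2  # assumption: power spectrogram
    mel = np.asarray(mel_filters, dtype=np.float32) @ power.T  # [80, T]
    # Assumption: additive floor before the natural log; no normalization.
    return np.log(mel + log_floor).astype(np.float32)


def pad_to_contract(features):
    """Zero-pad or truncate [80, T] to [1, 80, 2005]; also return length."""
    valid = min(features.shape[1], MAX_FRAMES)
    padded = np.zeros((1, N_MELS, MAX_FRAMES), dtype=np.float32)
    padded[0, :, :valid] = features[:, :valid]
    return padded, np.array([valid], dtype=np.int64)
```

The two return values of pad_to_contract correspond to processed_signal and processed_signal_length. When truncation happens, the length is capped at 2005 so it never exceeds the tensor's frame axis.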

Files

fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite — LiteRT/TFLite model
tokens.json — tokenizer pieces + blank id
preprocessor.json — feature settings
mel_filters_slaney_80x257.json — browser/runtime-compatible mel filters
validation/parity_full269_litert_fp_fp16.json — full transcript parity for FP and FP16
validation/fa32m_litert_export_manifest.json — calibration/export manifest
scripts/ — export, conversion, quantization, and parity scripts

Provenance / conversion notes

Source model: Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024 / fa32m_streaming_bpe1024_final.nemo
Source SHA256: 034fb2afa19da13db8a120970a7f8d3e696987014cc62684ce50a1382d332448
Conversion: NeMo CTC encoder/auxiliary decoder → TorchScript → litert_torch → LiteRT/TFLite.
LiteRT workaround: relative positional encoding was fixed to the known 2005-frame contract to avoid dynamic scalar lowering in litert_torch; processed_signal_length remains a runtime input and drives padding/attention masking plus encoded_lengths.

FP16 vs FP exact transcript parity: 269 / 269 = 100.00%.
