VisualEars FA32M Streaming BPE1024: LiteRT FP16 selective-weight

LiteRT/TFLite fixed-frame acoustic CTC-core export of Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024.

This is the FA32M length-aware core: it accepts precomputed NeMo-compatible log-mel features plus the real valid feature length, so short utterances do not get decoded as if all 2005 padded frames were valid.

Runtime contract

  • input 0 (serving_default_args_0): processed_signal float32 [1, 80, 2005]
  • input 1 (serving_default_args_1): processed_signal_length int64 [1], the valid log-mel frame count before zero padding
  • output 0 (serving_default_output_0_output): logits float32 [1, 252, 1025]
  • output 1 (serving_default_output_1_output): encoded_lengths int64 [1]
  • tokenizer blank id: 1024
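
The contract above can be exercised with a short sketch like the following. It uses the standard LiteRT Python interpreter calls from ai_edge_litert; the helper names (find_tensor, run_model) are hypothetical, and matching tensors by a fragment of their name is an assumption, since a runtime may report names with small differences (for example a ":0" suffix). Features are expected to be already zero-padded to 2005 frames.

```python
"""Sketch: run the fixed-frame LiteRT model on already-padded features."""

MODEL_FILE = "fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite"
N_MELS, MAX_FRAMES = 80, 2005  # input shape from the runtime contract


def find_tensor(details, fragment):
    """Return the single tensor-detail dict whose name contains `fragment`."""
    hits = [d for d in details if fragment in d["name"]]
    if len(hits) != 1:
        raise LookupError(
            f"expected one tensor matching {fragment!r}, got {len(hits)}"
        )
    return hits[0]


def run_model(features, valid_frames, model_path=MODEL_FILE):
    """features: array-like [80, 2005], zero-padded; valid_frames: int.

    Returns (logits for the single batch item, encoded length as int).
    """
    # Imported lazily so find_tensor can be used without these packages.
    import numpy as np
    from ai_edge_litert.interpreter import Interpreter

    # reshape raises if the caller did not pad/truncate to 80 x 2005.
    signal = np.asarray(features, dtype=np.float32).reshape(
        1, N_MELS, MAX_FRAMES
    )
    length = np.asarray([valid_frames], dtype=np.int64)

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inputs = interpreter.get_input_details()
    outputs = interpreter.get_output_details()

    # Assumption: tensor names contain the fragments listed in the contract.
    interpreter.set_tensor(find_tensor(inputs, "args_0")["index"], signal)
    interpreter.set_tensor(find_tensor(inputs, "args_1")["index"], length)
    interpreter.invoke()

    logits = interpreter.get_tensor(find_tensor(outputs, "output_0")["index"])
    encoded = interpreter.get_tensor(find_tensor(outputs, "output_1")["index"])
    return logits[0], int(encoded[0])
```

If the name lookup fails on a given runtime, selecting tensors by dtype (float32 versus int64) is a reasonable fallback, because each side of the contract has exactly one tensor of each type.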

Artifact

  • File: fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite

  • Size: 67,222,016 bytes (60.79% of the FP LiteRT file)

  • SHA256: 9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd

  • Quantization: ai_edge_quantizer float-casting, bits=16, dtype=FLOAT

  • Quantized op family: FULLY_CONNECTED

  • Minimum weight elements: 250000

  • Activations and I/O remain float32/int64; this is selective weight-storage quantization: only the selected FULLY_CONNECTED weights are stored in FP16, with dequantization around those weights.
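
A downloaded copy can be checked against the size and SHA256 listed above. This is a minimal sketch using only the Python standard library; the function names are illustrative.

```python
import hashlib
import os

# Values copied from the Artifact list above.
EXPECTED_SHA256 = (
    "9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd"
)
EXPECTED_SIZE = 67_222_016


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the whole model is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256=EXPECTED_SHA256,
                    expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the SHA256 match."""
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256
```

Usage: verify_artifact("fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite") should return True for an intact download.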

269-clip transcription parity

Source: PyTorch NeMo preprocessor + encoder + auxiliary CTC decoder in fp32, decoded during calibration export.
Candidate: this LiteRT/TFLite model through ai_edge_litert XNNPACK CPU.

Validation set: all 269 clips from Reza2kn/visualears-benchmark-269-gold.

Metric                               Result
Exact transcript matches             269 / 269
Exact transcript parity              100.00%
Exact normalized transcript parity   100.00%
Mean character similarity            100.00%
Candidate non-empty rate             98.88%
Source non-empty rate                98.88%
Encoded length match rate            100.00%

Result: passes the >98% transcription parity gate.
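
For readers who want to recompute similar numbers on their own transcripts, here is a rough sketch of the metrics in the table. The exact definitions used by the parity scripts are not given here, so two choices are assumptions: normalization is taken to be whitespace collapsing, and character similarity is taken to be difflib's SequenceMatcher ratio. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Assumed normalization: collapse runs of whitespace and strip ends."""
    return " ".join(text.split())


def parity_report(source, candidate):
    """Compare two equal-length lists of transcripts, clip by clip."""
    if not source or len(source) != len(candidate):
        raise ValueError("need two non-empty lists of equal length")
    n = len(source)
    pairs = list(zip(source, candidate))

    exact = sum(s == c for s, c in pairs)
    exact_norm = sum(normalize(s) == normalize(c) for s, c in pairs)
    # Assumed similarity measure; two empty strings count as identical.
    similarity = sum(SequenceMatcher(None, s, c).ratio() for s, c in pairs) / n

    return {
        "exact_matches": exact,
        "exact_pct": 100.0 * exact / n,
        "exact_normalized_matches": exact_norm,
        "exact_normalized_pct": 100.0 * exact_norm / n,
        "mean_char_similarity_pct": 100.0 * similarity,
        "source_non_empty_pct": 100.0 * sum(bool(normalize(s)) for s in source) / n,
        "candidate_non_empty_pct": 100.0 * sum(bool(normalize(c)) for c in candidate) / n,
    }
```

Under this reading, a clip where both source and candidate are empty still counts as an exact match, which is how 100% exact parity and a 98.88% non-empty rate can hold at the same time.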

Feature contract

Use the sidecars preprocessor.json and mel_filters_slaney_80x257.json:

  • sample rate: 16 kHz mono
  • preemphasis: 0.97
  • STFT: n_fft=512, win_length=400, hop_length=160, centered with reflect padding
  • mel: Slaney/librosa 80-bin filterbank from sidecar
  • log: natural log with tiny floor
  • no per-bin normalization (normalize=NA)
  • zero-pad/truncate features to 2005 frames, and pass true processed_signal_length
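
Two of the steps above, preemphasis and the final pad/truncate, are simple enough to sketch in plain Python. This is an illustration under stated assumptions, not the reference preprocessor: leaving the first sample unchanged in preemphasis follows the common convention and is assumed here, and the function names are hypothetical. The STFT, mel projection and log steps are omitted and should follow the sidecar files.

```python
MAX_FRAMES = 2005  # fixed frame count from the runtime contract
N_MELS = 80


def preemphasize(samples, coeff=0.97):
    """y[0] = x[0]; y[t] = x[t] - coeff * x[t - 1] for t >= 1."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[t] - coeff * samples[t - 1] for t in range(1, len(samples))
    ]


def pad_or_truncate(features, max_frames=MAX_FRAMES):
    """Zero-pad or truncate [n_mels][n_frames] features along the frame axis.

    Returns (padded_features, valid_length), where valid_length is the
    value to pass as processed_signal_length.
    """
    n_frames = len(features[0])
    if any(len(row) != n_frames for row in features):
        raise ValueError("all mel rows must have the same number of frames")
    valid = min(n_frames, max_frames)
    padded = [
        list(row[:valid]) + [0.0] * (max_frames - valid) for row in features
    ]
    return padded, valid
```

Returning the valid length together with the padded array keeps the two model inputs from drifting apart, which matters because the length drives masking inside the model.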

Files

  • fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite: LiteRT/TFLite model
  • tokens.json: tokenizer pieces + blank id
  • preprocessor.json: feature settings
  • mel_filters_slaney_80x257.json: browser/runtime-compatible mel filters
  • validation/parity_full269_litert_fp_fp16.json: full transcript parity for FP and FP16
  • validation/fa32m_litert_export_manifest.json: calibration/export manifest
  • scripts/: export, conversion, quantization, and parity scripts
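
Turning the model's logits into text needs the tokenizer pieces and the blank id. The sketch below shows plain greedy CTC decoding: argmax per frame, collapse repeats, drop blanks, and look only at the first encoded_lengths frames. The layout of tokens.json is not described here, so the pieces are passed in as an ordinary list indexed by token id, and the use of the SentencePiece word-start marker (U+2581) is an assumption. Function names are hypothetical.

```python
BLANK_ID = 1024  # tokenizer blank id from the runtime contract


def ctc_greedy_ids(logits, encoded_length, blank_id=BLANK_ID):
    """Greedy CTC: per-frame argmax, merge repeats, remove blanks.

    logits: sequence of frames, each a sequence of scores per token id.
    Only the first `encoded_length` frames are decoded.
    """
    ids = []
    previous = None
    for frame in logits[:encoded_length]:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank_id:
            ids.append(best)
        previous = best
    return ids


def ids_to_text(ids, pieces):
    """Join pieces; assumes U+2581 marks the start of a word."""
    return "".join(pieces[i] for i in ids).replace("\u2581", " ").strip()
```

Slicing by the encoded length is the point of the length-aware export: frames past that index correspond to padding and should not contribute tokens.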

Provenance / conversion notes

  • Source model: Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024 / fa32m_streaming_bpe1024_final.nemo
  • Source SHA256: 034fb2afa19da13db8a120970a7f8d3e696987014cc62684ce50a1382d332448
  • Conversion: NeMo CTC encoder/auxiliary decoder → TorchScript → litert_torch → LiteRT/TFLite.
  • LiteRT workaround: relative positional encoding was fixed to the known 2005-frame contract to avoid dynamic scalar lowering in litert_torch; processed_signal_length remains a runtime input and drives padding/attention masking plus encoded_lengths.

FP16 vs FP exact transcript parity: 269 / 269 = 100.00%.
