VisualEars FA32M Streaming BPE1024: LiteRT FP16 selective-weight
LiteRT/TFLite fixed-frame acoustic CTC-core export of Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024.
This is the FA32M length-aware core: it accepts precomputed NeMo-compatible log-mel features plus the real valid feature length, so short utterances do not get decoded as if all 2005 padded frames were valid.
Runtime contract
- input 0 (`serving_default_args_0`): `processed_signal` float32 `[1, 80, 2005]`
- input 1 (`serving_default_args_1`): `processed_signal_length` int64 `[1]`: valid log-mel frame count before zero padding
- output 0 (`serving_default_output_0_output`): `logits` float32 `[1, 252, 1025]`
- output 1 (`serving_default_output_1_output`): `encoded_lengths` int64 `[1]`
- tokenizer blank id: 1024
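The contract above can be exercised with a short script. The following is a minimal sketch, not the repository's own inference code: it assumes the `ai_edge_litert` Python package and NumPy are installed, and the helper names `pick_by_dtype`, `ctc_greedy_ids`, and `run_core` are hypothetical. Tensors are selected by dtype rather than by name, which relies only on the contract stating one float32 and one int64 tensor on each side. The result is a list of token ids; mapping ids to text requires `tokens.json`, whose layout is not described here.

```python
import numpy as np

BLANK_ID = 1024  # tokenizer blank id from the runtime contract


def pick_by_dtype(details, dtype):
    """Return the single tensor-detail dict whose dtype matches `dtype`."""
    matches = [d for d in details if np.dtype(d["dtype"]) == np.dtype(dtype)]
    if len(matches) != 1:
        raise ValueError(f"expected one {np.dtype(dtype)} tensor, found {len(matches)}")
    return matches[0]


def ctc_greedy_ids(logits, encoded_length, blank_id=BLANK_ID):
    """Greedy CTC decode: argmax per frame, merge repeats, drop blanks.

    Only the first `encoded_length` frames are read, so frames that come
    from zero padding never contribute tokens.
    """
    best = np.argmax(logits[:encoded_length], axis=-1)
    ids, prev = [], blank_id
    for tok in best.tolist():
        if tok != prev and tok != blank_id:
            ids.append(tok)
        prev = tok
    return ids


def run_core(model_path, features, valid_frames):
    """Run the core once. `features` must already be float32 [1, 80, 2005]."""
    # Imported lazily so the pure helpers above work without the runtime.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inputs = interpreter.get_input_details()
    outputs = interpreter.get_output_details()

    interpreter.set_tensor(pick_by_dtype(inputs, np.float32)["index"],
                           np.asarray(features, dtype=np.float32))
    interpreter.set_tensor(pick_by_dtype(inputs, np.int64)["index"],
                           np.array([valid_frames], dtype=np.int64))
    interpreter.invoke()

    logits = interpreter.get_tensor(pick_by_dtype(outputs, np.float32)["index"])
    lengths = interpreter.get_tensor(pick_by_dtype(outputs, np.int64)["index"])
    return ctc_greedy_ids(logits[0], int(lengths[0]))
```

Slicing the logits to `encoded_lengths` before the argmax is what keeps short utterances from being decoded across all 252 output frames.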
Artifact
- File: `fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite`
- Size: `67,222,016` bytes (60.79% of the FP LiteRT file)
- SHA256: `9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd`
- Quantization: `ai_edge_quantizer` float-casting, `bits=16`, `dtype=FLOAT`
- Quantized op family: `FULLY_CONNECTED`
- Minimum weight elements: `250000`

Activations and I/O remain float32/int64; this is selective weight storage quantization with dequantization around the quantized weights.
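A downloaded copy can be checked against the size and SHA256 listed for this artifact. This is a minimal sketch using only the Python standard library; the function names `sha256_of` and `verify_artifact` are illustrative and not part of the repository's scripts.

```python
import hashlib
from pathlib import Path

# Values copied from the artifact description.
EXPECTED_SHA256 = "9c125de4f89d9706d65b53f0f5e469ec336d0d7cf22e7a03ffa93f36be5b16bd"
EXPECTED_SIZE = 67_222_016  # bytes


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so the whole model is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the SHA256 digest match."""
    path = Path(path)
    return path.stat().st_size == expected_size and sha256_of(path) == expected_sha256
```

Usage: `verify_artifact("fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite")` should return `True` for an intact download.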
269-clip transcription parity
Source: PyTorch NeMo preprocessor + encoder + auxiliary CTC fp32, decoded during calibration export.
Candidate: this LiteRT/TFLite model through ai_edge_litert XNNPACK CPU.
Validation set: all 269 clips from Reza2kn/visualears-benchmark-269-gold.
| Metric | Result |
|---|---|
| Exact transcript matches | 269 / 269 |
| Exact transcript parity | 100.00% |
| Exact normalized transcript parity | 100.00% |
| Mean character similarity | 100.00% |
| Candidate non-empty rate | 98.88% |
| Source non-empty rate | 98.88% |
| Encoded length match rate | 100.00% |
Result: passes the >98% transcription parity gate.
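For readers who want to run a similar comparison on their own transcripts, the table's metrics can be approximated as follows. This is a sketch under stated assumptions, not the repository's parity script: the normalization (whitespace collapsing) and the similarity measure (`difflib.SequenceMatcher` ratio) are stand-ins chosen for illustration, since the definitions used for the published numbers are not given here.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Assumed normalization: collapse whitespace runs and strip the ends."""
    return " ".join(text.split())


def parity_report(source, candidate):
    """Compare two equally long lists of transcripts, clip by clip."""
    if len(source) != len(candidate) or not source:
        raise ValueError("need two non-empty transcript lists of equal length")
    n = len(source)
    pairs = list(zip(source, candidate))
    exact = sum(s == c for s, c in pairs)
    exact_norm = sum(normalize(s) == normalize(c) for s, c in pairs)
    # SequenceMatcher.ratio() returns 1.0 when both strings are empty.
    similarity = sum(SequenceMatcher(None, s, c).ratio() for s, c in pairs)
    return {
        "exact_matches": exact,
        "exact_parity_pct": 100.0 * exact / n,
        "exact_normalized_parity_pct": 100.0 * exact_norm / n,
        "mean_char_similarity_pct": 100.0 * similarity / n,
        "candidate_non_empty_pct": 100.0 * sum(bool(c.strip()) for c in candidate) / n,
        "source_non_empty_pct": 100.0 * sum(bool(s.strip()) for s in source) / n,
    }
```

A non-empty rate of 98.88% on 269 clips is consistent with 266 non-empty transcripts, so clips that are empty in both source and candidate count as exact matches.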
Feature contract
Use the sidecars `preprocessor.json` and `mel_filters_slaney_80x257.json`:
- sample rate: 16 kHz mono
- preemphasis: `0.97`
- STFT: `n_fft=512`, `win_length=400`, `hop_length=160`, centered with reflect padding
- mel: Slaney/librosa 80-bin filterbank from sidecar
- log: natural log with tiny floor
- no per-bin normalization (`normalize=NA`)
- zero-pad/truncate features to 2005 frames, and pass true `processed_signal_length`
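The last step of the feature contract, fitting features to the fixed 2005-frame input while keeping the true length, might look like the sketch below. It assumes NumPy and log-mel features already computed as an `[80, T]` array; the function name `fit_to_contract` is hypothetical.

```python
import numpy as np

N_MELS = 80        # mel bins from the feature contract
NUM_FRAMES = 2005  # fixed frame count expected by the model


def fit_to_contract(log_mel, num_frames=NUM_FRAMES):
    """Zero-pad or truncate [80, T] features to [1, 80, num_frames].

    Returns the padded batch and an int64 [1] array holding the valid
    frame count, to be passed as processed_signal_length.
    """
    log_mel = np.asarray(log_mel, dtype=np.float32)
    if log_mel.ndim != 2 or log_mel.shape[0] != N_MELS:
        raise ValueError(f"expected shape [{N_MELS}, T], got {log_mel.shape}")
    valid = min(log_mel.shape[1], num_frames)
    padded = np.zeros((1, N_MELS, num_frames), dtype=np.float32)
    padded[0, :, :valid] = log_mel[:, :valid]
    return padded, np.array([valid], dtype=np.int64)
```

Returning the length alongside the padded array makes it harder to pass 2005 by mistake for a short clip, which is the failure the length input exists to prevent.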
Files
- `fastconformer_fa32m_ctc_fixed2005_len_fp16_fc_thr250000.tflite`: LiteRT/TFLite model
- `tokens.json`: tokenizer pieces + blank id
- `preprocessor.json`: feature settings
- `mel_filters_slaney_80x257.json`: browser/runtime-compatible mel filters
- `validation/parity_full269_litert_fp_fp16.json`: full transcript parity for FP and FP16
- `validation/fa32m_litert_export_manifest.json`: calibration/export manifest
- `scripts/`: export, conversion, quantization, and parity scripts
Provenance / conversion notes
- Source model: `Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024/fa32m_streaming_bpe1024_final.nemo`
- Source SHA256: `034fb2afa19da13db8a120970a7f8d3e696987014cc62684ce50a1382d332448`
- Conversion: NeMo CTC encoder/auxiliary decoder → TorchScript → `litert_torch` → LiteRT/TFLite.
- LiteRT workaround: relative positional encoding was fixed to the known 2005-frame contract to avoid dynamic scalar lowering in `litert_torch`; `processed_signal_length` remains a runtime input and drives padding/attention masking plus `encoded_lengths`.
FP16 vs FP exact transcript parity: 269 / 269 = 100.00%.