vhdm_whisper-large-fa-v1-NVFP4

NVFP4 (W4A4) post-training quantization of vhdm/whisper-large-fa-v1 (architecture: whisper).

  • Format: nvfp4-pack-quantized (compressed-tensors). 4-bit FP4 weights, per-block FP8 (E4M3) scales, per-tensor FP32 global scales; activations dynamically quantized to FP4.
  • Calibration: 32 Persian clips from Reza2kn/persian-asr-eval-v0 (held out from the WER eval set).
  • Hardware target: NVIDIA Blackwell tensor cores (sm_100+). Quantized on RTX 5080 Laptop (sm_120).
  • Quantized layers: all Linear modules in the encoder/decoder (CTC lm_head / proj_out left in full precision).
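
To make the format bullet concrete, here is a minimal NumPy sketch of block-scaled FP4 "fake quantization" (quantize, then dequantize). It is an illustration only, not the code path used by compressed-tensors or the Blackwell kernels. Assumptions: a block size of 16, the E2M1 value grid for FP4, and a global scale chosen so the largest block scale fits the E4M3 range. The block scales are kept in float32 rather than rounded to FP8, and the helper name `nvfp4_fake_quant` is hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign bit is stored separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0          # largest E2M1 magnitude
FP8_E4M3_MAX = 448.0   # largest finite E4M3 magnitude
BLOCK = 16             # assumed NVFP4 block size


def nvfp4_fake_quant(w):
    """Quantize-dequantize a 1-D array whose length is a multiple of BLOCK."""
    w = np.asarray(w, dtype=np.float32)
    blocks = w.reshape(-1, BLOCK)

    # Per-tensor FP32 global scale, picked so the largest block scale
    # lands at the top of the E4M3 range (an assumed convention).
    amax = float(np.abs(w).max())
    global_scale = amax / (FP4_MAX * FP8_E4M3_MAX) if amax > 0 else 1.0

    # Per-block scale relative to the global scale; a real checkpoint would
    # store this value rounded to FP8 E4M3 (rounding omitted in this sketch).
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = block_amax / FP4_MAX / global_scale

    # Effective scale per block; all-zero blocks get a dummy scale of 1.
    scale = block_scale * global_scale
    scale = np.where(scale == 0, 1.0, scale)

    # Snap each scaled magnitude to the nearest E2M1 grid point.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_E2M1_GRID[idx]

    # Dequantize back to float32.
    return (q * scale).reshape(w.shape).astype(np.float32)
```

Under the same block-size assumption, storage works out to 4 bits per weight plus one 8-bit scale per 16 weights, i.e. 4 + 8/16 = 4.5 bits per weight, with the single FP32 global scale per tensor being negligible.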

Eval: Reza2kn/persian-asr-eval-v0 (FLEURS-fa)

| Variant | WER ↓ | CER ↓ | Clips | Per-clip latency | Peak VRAM |
|---|---|---|---|---|---|
| NVFP4 (this repo) | 15.25% | 5.99% | 200 | 653 ms | 2731 MiB |

Persian text normalization for WER/CER: NFKC, ZWNJ → space, ي→ی / ك→ک, digit folding, punctuation stripping, whitespace collapse.
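
The normalization steps listed above could look roughly like the sketch below. This is not the evaluation script behind the reported numbers. Assumptions: "digit folding" maps Persian and Arabic-Indic digits to ASCII, "punctuation stripping" deletes every character in a Unicode punctuation category, and WER is computed per utterance with a plain word-level edit distance. The names `normalize_fa` and `wer` are hypothetical.

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner

# Arabic code points folded to their Persian counterparts: ي→ی and ك→ک.
CHAR_MAP = {"\u064a": "\u06cc", "\u0643": "\u06a9"}

# Assumed direction of digit folding: Persian (U+06F0..) and
# Arabic-Indic (U+0660..) digits are mapped to ASCII 0-9.
DIGIT_MAP = {chr(0x06F0 + i): str(i) for i in range(10)}
DIGIT_MAP.update({chr(0x0660 + i): str(i) for i in range(10)})

TABLE = str.maketrans({ZWNJ: " ", **CHAR_MAP, **DIGIT_MAP})


def normalize_fa(text):
    """Apply NFKC, character/digit folding, punctuation removal, space collapse."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(TABLE)
    # Drop every character whose Unicode category starts with "P" (punctuation).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()


def wer(reference, hypothesis):
    """Word error rate of one pair: word-level edit distance / reference length."""
    ref = normalize_fa(reference).split()
    hyp = normalize_fa(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

CER follows the same pattern with characters in place of words. A corpus-level score would normally sum edit distances and reference lengths over all clips before dividing, rather than averaging per-clip rates.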

Usage

import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModel

repo = "Reza2kn/vhdm_whisper-large-fa-v1-NVFP4"
processor = AutoProcessor.from_pretrained(repo)
# Load in bfloat16: NVFP4 weights decompress to bf16 inside CompressedLinear.
model = AutoModel.from_pretrained(repo, dtype=torch.bfloat16).to("cuda").eval()

(See the original vhdm/whisper-large-fa-v1 model card for arch-specific decoding boilerplate.)

How it was made

llmcompressor QuantizationModifier(targets=["Linear"], scheme="NVFP4", ignore=...) → compressed-tensors nvfp4-pack-quantized checkpoint.

License

Inherits the base model's license.
