vhdm_whisper-large-fa-v1-NVFP4

NVFP4 (W4A4) post-training quantization of vhdm/whisper-large-fa-v1. Architecture: whisper.

Format: nvfp4-pack-quantized (compressed-tensors). 4-bit FP4 weights, per-block FP8 (E4M3) scales, per-tensor FP32 global scales; activations dynamically quantized to FP4.
Calibration: 32 Persian clips from Reza2kn/persian-asr-eval-v0 (held out from the WER eval set).
Hardware target: NVIDIA Blackwell tensor cores (sm_100+). Quantized on RTX 5080 Laptop (sm_120).
Quantized layers: all Linear modules in the encoder/decoder (the output head, lm_head / proj_out, is left in full precision).
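
To make the format description above more concrete, here is a minimal pure-Python sketch of block-wise FP4 fake quantization. It is a simplification under stated assumptions, not the compressed-tensors implementation: it assumes 16-element blocks and the E2M1 value grid, keeps block scales as ordinary floats instead of rounding them to FP8 (E4M3), and omits the FP32 per-tensor global scale. The names `quantize_block` and `fake_quantize` are hypothetical.

```python
# Simplified NVFP4-style fake quantization of one weight row (pure Python).
# Assumptions: 16-element blocks, FP4 E2M1 value grid, block scales kept as
# plain floats (the real format rounds them to FP8 E4M3 and applies an FP32
# per-tensor global scale on top of that).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 magnitudes
FP4_MAX = 6.0
BLOCK = 16


def quantize_block(block):
    """Return (scale, codes), where codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = []
    for x in block:
        # Nearest grid magnitude; ties go to the smaller magnitude here.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def fake_quantize(row):
    """Quantize then dequantize a flat list whose length is a multiple of 16."""
    out = []
    for start in range(0, len(row), BLOCK):
        scale, codes = quantize_block(row[start:start + BLOCK])
        out.extend(c * scale for c in codes)
    return out


row = [0.01 * i for i in range(-16, 16)]
approx = fake_quantize(row)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(f"max abs error: {max_err:.4f}")
```

Because each block is scaled so that its largest magnitude maps to 6.0, the worst-case rounding error in this sketch is one sixth of the block's largest magnitude (half of the widest grid gap, between 4 and 6).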

Eval — `Reza2kn/persian-asr-eval-v0` (FLEURS-fa)

Variant	WER ↓	CER ↓	clips	per-clip latency	peak VRAM
NVFP4 (this repo)	15.25%	5.99%	200	653 ms	2731 MiB
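
For readers unfamiliar with the WER and CER columns, both are edit-distance rates over words and characters respectively. The sketch below shows one common way to compute them. It assumes corpus-level aggregation (total edits divided by total reference units) and counts spaces as characters for CER; the card does not say which convention or tool the evaluation used, and `edit_distance` and `error_rate` are hypothetical names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level rate: total edits / total reference units.

    unit="word" gives WER, unit="char" gives CER.
    """
    split = (lambda s: s.split()) if unit == "word" else (lambda s: list(s))
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return edits / total


refs = ["a b c d"]
hyps = ["a x c"]
print(error_rate(refs, hyps, "word"))  # one substitution + one deletion over 4 words
print(error_rate(refs, hyps, "char"))
```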

Persian text normalization for WER/CER: NFKC, ZWNJ → space, ي→ی / ك→ک, digit folding, punctuation stripping, whitespace collapse.
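
The normalization steps listed above could be sketched roughly as follows, using only the standard library. Several details are assumptions, since the card does not spell them out: digit folding is taken to mean mapping Persian and Arabic-Indic digits to ASCII, punctuation is identified by Unicode category `P*` and dropped rather than replaced by a space, and the function name `normalize_fa` is hypothetical.

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner

CHAR_MAP = {
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
}

# Assumed meaning of "digit folding": Persian (U+06F0..) and
# Arabic-Indic (U+0660..) digits are mapped to ASCII 0-9.
DIGIT_MAP = {}
for i in range(10):
    DIGIT_MAP[chr(0x06F0 + i)] = str(i)
    DIGIT_MAP[chr(0x0660 + i)] = str(i)

TRANSLATE = str.maketrans({**CHAR_MAP, **DIGIT_MAP, ZWNJ: " "})


def normalize_fa(text):
    """Apply the listed steps in order and return the normalized string."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(TRANSLATE)  # ZWNJ -> space, letter and digit folding
    # Drop every character in a Unicode punctuation category (assumption).
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()


sample = "\u0643\u062a\u0627\u0628\u200c\u0647\u0627\u060c  \u06f1\u06f2\u0663!"
print(normalize_fa(sample))
```

Applying the same function to both references and hypotheses before scoring keeps spelling variants such as ي/ی from being counted as recognition errors.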

Usage

import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModel

repo = "Reza2kn/vhdm_whisper-large-fa-v1-NVFP4"
processor = AutoProcessor.from_pretrained(repo)
# Load in bfloat16 — NVFP4 weights decompress to bf16 inside CompressedLinear.
model = AutoModel.from_pretrained(repo, dtype=torch.bfloat16).to("cuda").eval()

(See the original vhdm/whisper-large-fa-v1 model card for arch-specific decoding boilerplate.)
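
Since the card names sm_100+ as the hardware target, it may be worth checking the GPU's compute capability before loading. The following is a small sketch with a hypothetical helper, `is_blackwell_or_newer`; it only reports the capability and makes no claim about how the model behaves on older GPUs, which this card does not describe. The torch part is skipped if torch is not installed.

```python
def is_blackwell_or_newer(capability):
    """True if a (major, minor) CUDA compute capability is sm_100 or higher."""
    major, _minor = capability
    return major >= 10


try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()  # e.g. (12, 0) for sm_120
        print(f"compute capability: {cap}, sm_100+: {is_blackwell_or_newer(cap)}")
    else:
        print("No CUDA device visible.")
except ImportError:
    # torch is optional for this sketch; the helper above still works.
    pass
```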

How it was made

llmcompressor QuantizationModifier(targets=["Linear"], scheme="NVFP4", ignore=...) → compressed-tensors nvfp4-pack-quantized checkpoint.

License

Inherits the base model's license.

Safetensors

Model size: 0.5B params
Tensor type: F32, F8_E4M3

Model tree for Reza2kn/vhdm_whisper-large-fa-v1-NVFP4

Base model: openai/whisper-large-v3
Finetuned: openai/whisper-large-v3-turbo
Finetuned: vhdm/whisper-large-fa-v1
Quantized: Reza2kn/vhdm_whisper-large-fa-v1-NVFP4 (this repo)