SoulX-Singer NVFP4 (Quantized)

This is the NVIDIA FP4 (NVFP4) weight-only quantized version of the original SoulX-Singer zero-shot singing voice synthesis model.

It is produced by a one-time torchao NVFP4 quantization pass over the official fp32 base model, and is verified to be true 4-bit quantization (not a pseudo-quantization / fake-quant) by direct on-GPU forward comparison against the original fp32 weights.

Source Model

  • Base model: Soul-AILab/SoulX-Singer (fp32, 2687.54 MB)
  • Quantization format: NVIDIA FP4 weight-only (E2M1 elements + float8_e4m3fn per-block scales, 16-element blocks)
  • Tooling: torchao>=0.17.0 (NVFP4WeightOnlyConfig)
  • Hardware requirement: NVIDIA Blackwell GPU (sm100+), e.g. RTX 5060 / 5090 / B200
  • Verified on: NVIDIA GeForce RTX 5060 Laptop GPU (sm120), CUDA 13.2, torch 2.12.0
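
To make the storage format concrete, here is a minimal pure-Python emulation of block-scaled FP4 rounding. It is a sketch, not torchao's implementation: the names `E2M1_GRID`, `quantize_block`, and `quantize_row` are hypothetical, the block size of 16 is NVFP4's, and the per-block scale is kept as a plain Python float instead of being rounded to float8_e4m3fn.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one shared scale per 16 consecutive elements


def quantize_block(block):
    """Round one block to the E2M1 grid; return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Simplification: the real format would store this scale as float8_e4m3fn.
    scale = amax / E2M1_GRID[-1]
    dequantized = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        dequantized.append(math.copysign(nearest * scale, x))
    return scale, dequantized


def quantize_row(row):
    """Apply quantize_block to each 16-element block of one weight row."""
    if len(row) % BLOCK_SIZE != 0:
        raise ValueError("row length must be a multiple of 16")
    scales, values = [], []
    for start in range(0, len(row), BLOCK_SIZE):
        scale, block_values = quantize_block(row[start:start + BLOCK_SIZE])
        scales.append(scale)
        values.extend(block_values)
    return scales, values
```

Because the widest gap on the grid is between 4 and 6, the rounding error of any element in this sketch is at most one block scale, which is why layers with large weights show larger absolute errors.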

Why NVFP4

NVFP4 is the native 4-bit floating-point format introduced with the Blackwell architecture. Compared with INT4 / W4A16 pseudo-quantization, NVFP4 has dedicated hardware units on sm100+ and runs the actual 4-bit matmul on silicon, so no on-the-fly dequantization to fp16/bf16 is required inside the GEMM. This gives both real memory savings and real compute throughput on Blackwell.

Quantization Scope

Only nn.Linear layers whose weight dimensions are both divisible by 16 are quantized to NVFP4. The following tensors stay in their original precision:

  • nn.Embedding (phoneme / pitch / note-type / F0 encoders)
  • nn.Conv1d (depth-wise convolutions in ConvNeXtV2 blocks)
  • nn.LayerNorm / GRN parameters
  • All biases

Layer coverage

| Statistic | Value |
| --- | --- |
| Total nn.Linear layers in the model | 277 |
| Layers quantized to NVFP4 | 276 |
| Layers left in higher precision | 1 (single non-divisible-by-16 Linear) |
| Total state-dict entries | 587 |
| NVFP4Tensor entries | 276 |
| Regular torch.Tensor entries | 311 |

Quantized module groups

| Group | # layers | Examples |
| --- | --- | --- |
| preflow (ConvNeXtV2 blocks) | 8 | preflow.{0..3}.pwconv1/2 |
| cfm_decoder.model.cond_emb | 1 | conditioning embedding projection |
| cfm_decoder.model.diff_estimator (DiffLlama, 22 layers) | 207 | attention Q/K/V/O, MLP up/down, layernorm gates, time-step MLP |
| vocoder.model.backbone (ConvNeXtV2, 30 blocks) | 60 | convnext.{0..29}.pwconv1/2 |

File Size

| Variant | Size | Compression |
| --- | --- | --- |
| SoulX-Singer/model.pt (fp32) | 2687.54 MB | 1.00× |
| SoulX-Singer-nvfp4/model.pt (NVFP4) | 396.62 MB | 6.78× |

The achieved 6.78× compression is close to the theoretical 4-bit weight-only ratio (~8× for weights alone). The gap comes from the un-quantized embeddings, convolutions, layer-norms, and biases.
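
The arithmetic behind these figures can be checked with a few lines of Python. The sizes are taken from the table above. The `bits_per_weight` helper is hypothetical and assumes one 8-bit scale per 16-element block, which makes the per-weight cost slightly more than 4 bits.

```python
FP32_MB = 2687.54   # size of the fp32 checkpoint, from the table above
NVFP4_MB = 396.62   # size of the NVFP4 checkpoint, from the table above


def compression_ratio(original_mb, quantized_mb):
    """How many times smaller the quantized file is."""
    return original_mb / quantized_mb


def bits_per_weight(element_bits=4, scale_bits=8, block_size=16):
    """Effective storage cost of one weight, including its share of the block scale.

    Assumption: one scale of `scale_bits` bits per `block_size` elements.
    """
    return element_bits + scale_bits / block_size


measured = compression_ratio(FP32_MB, NVFP4_MB)
ideal_with_scales = 32 / bits_per_weight()

print(f"measured ratio:               {measured:.2f}x")
print(f"ideal ratio (weights+scales): {ideal_with_scales:.2f}x")
```

Under that assumption the ceiling for fully quantized weights is about 7.1×, not 8×, so the measured ratio leaves only a small share of the file to the tensors kept in higher precision.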

Precision Verification

The verification is performed by loading both the original fp32 model and the NVFP4 model on a Blackwell GPU, then running the same random fp32 input through every NVFP4-quantized nn.Linear using native NVFP4 matmul kernels (no dequantization, no fake-quant emulation). Outputs are compared layer-by-layer against the fp32 reference.

Per-group results (fp32 reference vs. native NVFP4 forward)

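| Group | # layers | Mean MSE | Mean RMSE | Mean MaxAbs | Mean RelRMSE | Mean Cosine |
| --- | --- | --- | --- | --- | --- | --- |
| cond_emb | 1 | 1.514e-03 | 0.0389 | 0.1714 | 9.449% | 0.99553 |
| diff_estimator | 207 | 7.512e-03 | 0.0828 | 0.5202 | 9.502% | 0.99539 |
| preflow | 8 | 4.911e-03 | 0.0700 | 0.3197 | 9.590% | 0.99538 |
| vocoder | 60 | 1.339e-01 | 0.3455 | 2.1242 | 9.446% | 0.99550 |
| OVERALL | 276 | 3.488e-02 | 0.1394 | 6.5048 (maximum over all layers, not a mean) | 9.492% | 0.99541 |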

Metrics:

  • MSE / RMSE: element-wise mean squared / root-mean-squared error of the layer output.
  • MaxAbs: maximum absolute element-wise error in the output.
  • RelRMSE: RMSE divided by the RMS magnitude of the fp32 reference output.
  • Cosine: mean cosine similarity over the output feature dimension.
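
The definitions above can be written out as a small, dependency-free sketch. This is not the script used to produce the tables. The function name `layer_metrics` is hypothetical, and "Cosine" is interpreted here as the cosine similarity of each output row (taken along the feature dimension), averaged over rows.

```python
import math


def layer_metrics(ref, test):
    """Compare two layer outputs given as lists of rows (batch x out_features).

    Assumes the reference output and every row are non-zero, otherwise the
    relative RMSE and cosine similarity are undefined.
    """
    flat_ref = [v for row in ref for v in row]
    flat_test = [v for row in test for v in row]
    n = len(flat_ref)

    errors = [t - r for r, t in zip(flat_ref, flat_test)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    max_abs = max(abs(e) for e in errors)

    # RelRMSE: RMSE normalised by the RMS magnitude of the reference output.
    ref_rms = math.sqrt(sum(r * r for r in flat_ref) / n)
    rel_rmse = rmse / ref_rms

    # Cosine: one similarity per row, then the mean over rows.
    cosines = []
    for ref_row, test_row in zip(ref, test):
        dot = sum(r * t for r, t in zip(ref_row, test_row))
        norm = math.sqrt(sum(r * r for r in ref_row)) * math.sqrt(sum(t * t for t in test_row))
        cosines.append(dot / norm)
    cosine = sum(cosines) / len(cosines)

    return {"mse": mse, "rmse": rmse, "max_abs": max_abs,
            "rel_rmse": rel_rmse, "cosine": cosine}
```

MSE, RMSE, and MaxAbs scale with the magnitude of the layer output, while RelRMSE and Cosine do not, which is why the vocoder rows differ from the others only in the absolute columns.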

Worst layers by max abs error

| Layer | Shape | MaxAbs | Cosine |
| --- | --- | --- | --- |
| vocoder.model.backbone.convnext.29.pwconv2 | 1024 × 4096 | 6.5048 | 0.99549 |
| vocoder.model.backbone.convnext.24.pwconv2 | 1024 × 4096 | 3.4804 | 0.99567 |
| vocoder.model.backbone.convnext.25.pwconv1 | 4096 × 1024 | 3.4157 | 0.99564 |
| vocoder.model.backbone.convnext.27.pwconv1 | 4096 × 1024 | 3.3751 | 0.99563 |
| vocoder.model.backbone.convnext.14.pwconv2 | 1024 × 4096 | 3.2466 | 0.99553 |

All worst-case layers are in the vocoder backbone, which is expected: the BigVGAN-style ConvNeXt blocks operate on large-magnitude activations and have the largest absolute weight values, so they show the largest absolute errors even though the direction (cosine) is still well-preserved.

Worst layers by cosine similarity

All 276 quantized layers stay above 0.9942 cosine similarity, with the lowest being a few layernorm gate projections inside the DiffLlama transformer. The mean over all 276 layers is 0.9954.

Verdict

| Indicator | Value | Grade |
| --- | --- | --- |
| Mean cosine similarity (all 276 layers) | 0.9954 | - |
| Mean relative RMSE | 9.49% | - |
| Max abs error (single layer) | 6.505 | - |
| Quantization grade | GOOD | ✓ Real NVFP4 |

Conclusion: The checkpoint contains genuine NVFP4-typed weight tensors (NVFP4Tensor instances), runs through native NVFP4 GEMM kernels on Blackwell, achieves a 6.78× model-size reduction, and preserves per-layer output direction to >0.995 cosine similarity on average. This is real 4-bit weight-only quantization, not a pseudo-quantization wrapper.

How to Use

Requirements

# Quote the version specifiers so the shell does not treat ">=" as a redirect.
pip install "torch>=2.12" "torchao>=0.17.0" omegaconf
# CUDA 13.x with Blackwell (sm100+) support

Loading the NVFP4 model directly (no dequantization)

import torch
import torch.nn as nn
from omegaconf import OmegaConf
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import NVFP4WeightOnlyConfig

from soulxsinger.models.soulxsinger import SoulXSinger


def nvfp4_filter(mod, fqn):
    """Select nn.Linear layers whose weight dimensions are both divisible by 16."""
    if not isinstance(mod, nn.Linear):
        return False
    N, K = mod.weight.shape  # (out_features, in_features)
    return K % 16 == 0 and N % 16 == 0


config = OmegaConf.load("soulxsinger/config/soulxsinger.yaml")
model = SoulXSinger(config)

# Convert eligible Linear weights to NVFP4Tensor wrappers.
quantize_(model, NVFP4WeightOnlyConfig(), nvfp4_filter)

# Load saved NVFP4 weights. NVFP4Tensor does NOT support copy_ via
# load_state_dict, so we assign each quantized Linear's weight directly.
# weights_only=False unpickles arbitrary objects, so only load trusted files.
ckpt = torch.load("pretrained_models/SoulX-Singer-nvfp4/model.pt",
                  map_location="cpu", weights_only=False)
assert ckpt["nvfp4_quantized"] is True
nvfp4_sd = ckpt["state_dict"]

# Snapshot the parameter list first, because the loop replaces parameters.
for name, param in list(model.named_parameters()):
    if name not in nvfp4_sd:
        continue
    saved = nvfp4_sd[name]
    if type(saved).__name__ != "NVFP4Tensor":
        param.data.copy_(saved)  # embedding / conv / layernorm / bias
        continue
    # Direct assignment keeps the weight in NVFP4Tensor format.
    parent_name = name.rsplit(".", 1)[0]
    target_mod = model.get_submodule(parent_name)
    # Inference only, so the quantized weight does not need gradients.
    target_mod.weight = nn.Parameter(saved.to(device="cuda"), requires_grad=False)

model = model.to("cuda").eval()
# Forward passes now use native NVFP4 matmul on Blackwell.

Inference example

bash example/infer.sh
# (Point the script at pretrained_models/SoulX-Singer-nvfp4/model.pt)

File Layout

SoulX-Singer-nvfp4/
├── model.pt        # NVFP4-quantized state_dict (396.62 MB)
└── README.md       # this file

The checkpoint model.pt is a regular torch.save dict with the following keys:

| Key | Type | Meaning |
| --- | --- | --- |
| state_dict | dict[str, Tensor \| NVFP4Tensor] | 587 entries (276 NVFP4Tensor + 311 Tensor) |
| nvfp4_quantized | bool | Always True; marks this as a real NVFP4 checkpoint |
| torchao_required | bool | Always True; torchao is required to load/forward |
| orig_model_path | str | Path to the fp32 source model that was quantized |
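
A quick sanity check of this layout before loading the weights into a model might look like the sketch below. The helper `summarize_checkpoint` is hypothetical and not part of this repository; it relies only on the four keys listed above.

```python
from collections import Counter

REQUIRED_KEYS = ("state_dict", "nvfp4_quantized", "torchao_required", "orig_model_path")


def summarize_checkpoint(ckpt):
    """Validate the top-level keys and count state_dict entries by type name."""
    missing = [key for key in REQUIRED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    if ckpt["nvfp4_quantized"] is not True:
        raise ValueError("checkpoint is not marked as NVFP4-quantized")
    # Counting by class name avoids importing torchao just to inspect the file.
    return Counter(type(value).__name__ for value in ckpt["state_dict"].values())


# Usage (requires torch and torchao to unpickle the file):
#   ckpt = torch.load("pretrained_models/SoulX-Singer-nvfp4/model.pt",
#                     map_location="cpu", weights_only=False)
#   print(summarize_checkpoint(ckpt))
```

According to the table above, the count for `NVFP4Tensor` should be 276 out of 587 entries.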

Limitations

  1. Hardware gated. NVFP4 native kernels require NVIDIA Blackwell (sm100+). On older GPUs (Ada / Hopper / Ampere), this checkpoint will either fail to load or silently fall back to a slow emulation path. Use the fp32 or an INT8 variant there instead.
  2. Weight-only. Activations and gradients are not quantized; only nn.Linear weights are NVFP4. Embeddings / convs / norms stay fp32.
  3. No calibration. NVFP4 weight-only uses the weights' own per-block statistics for scaling, so no representative dataset is needed. The ~9.5% mean relative RMSE is the intrinsic precision floor of 4-bit weights without K-V / activation quantization.
  4. LayerNorm gate projections in the DiffLlama show slightly higher directional error than other layers (~0.9942 cosine). If you observe quality degradation on specific singing samples, consider keeping to_weight layers in fp16 (they are small).
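
Limitations 1 and 4 can both be handled with small guards before quantizing or loading. The sketch below is illustrative only: `supports_native_nvfp4` and `should_quantize` are hypothetical helpers, and matching gate projections by the substring `to_weight` is an assumption based on the layer name mentioned in item 4.

```python
def supports_native_nvfp4(capability):
    """Return True for sm100 and newer.

    `capability` is a (major, minor) tuple such as the one returned by
    torch.cuda.get_device_capability().
    """
    major, _minor = capability
    return major >= 10


def should_quantize(fqn, weight_shape, skip_substrings=("to_weight",)):
    """Decide whether a Linear weight goes to NVFP4.

    fqn: fully qualified module name.
    weight_shape: (out_features, in_features).
    skip_substrings: name fragments of layers to keep in higher precision
    (assumption: gate projections are identifiable by "to_weight").
    """
    if any(fragment in fqn for fragment in skip_substrings):
        return False
    out_features, in_features = weight_shape
    return out_features % 16 == 0 and in_features % 16 == 0
```

Inside a torchao filter function this could be used as `isinstance(mod, nn.Linear) and should_quantize(fqn, tuple(mod.weight.shape))`. Layers skipped this way stay in their original precision, so a checkpoint would have to be re-quantized with the same filter for the saved tensors to match.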

License

Apache 2.0, inherited from the base Soul-AILab/SoulX-Singer model. See LICENSE for details.

Citation

@misc{soulxsinger,
      title={SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis},
      author={Jiale Qian and Hao Meng and Tian Zheng and Pengcheng Zhu and Haopeng Lin and Yuhang Dai and Hanke Xie and Wenxiao Cao and Ruixuan Shang and Jun Wu and Hongmei Liu and Hanlin Wen and Jian Zhao and Zhonglin Jiang and Yong Chen and Shunshun Yin and Ming Tao and Jianguo Wei and Lei Xie and Xinsheng Wang},
      year={2026},
      eprint={2602.07803},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2602.07803},
}

Contact

For questions about the NVFP4 quantization pipeline in this repository, please open an issue. For questions about the original SoulX-Singer model, contact Soul-AILab at qianjiale@soulapp.cn, menghao@soulapp.cn, or wangxinsheng@soulapp.cn.
