Sarvam-30B W8A8-Dynamic (AutoRound FP8)

Model Description

This is an FP8 (W8A8) quantized version of Sarvam-30B, a Mixture-of-Experts (MoE) model with 128 experts (6 active per token) plus 1 shared expert. The model was quantized using AutoRound with dynamic activation quantization.

| Property | Value |
|---|---|
| Base Model | Sarvam-30B |
| Architecture | SarvamMoEForCausalLM |
| Parameters (total) | ~30B |
| Layers | 19 |
| Hidden Size | 4096 |
| Attention Heads | 64 (4 KV heads, GQA) |
| Experts | 128 routed + 1 shared |
| Active Experts/Token | 6 |
| Max Context Length | 131,072 tokens |
| Quantized Size | ~37 GB |
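
The layer and KV-head counts above allow a rough estimate of KV-cache memory per sequence. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes a head dimension of hidden_size / attention_heads (64), a BF16 cache (2 bytes per element), and full attention in every layer; none of these is stated in this card, and the function name is purely illustrative.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS = 19
KV_HEADS = 4
HEAD_DIM = 4096 // 64  # ASSUMPTION: head_dim = hidden_size / attention_heads

per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
for context in (65_536, 131_072):
    gib = per_token * context / 2**30
    print(f"{context:>7} tokens: {gib:.2f} GiB of KV cache per sequence")
```

If the real head dimension or cache dtype differs, the result scales linearly with either value.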

Compression Technique

Method: AutoRound (FP8 Dynamic Quantization)

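AutoRound is a weight-rounding optimization technique. Instead of rounding every weight to the nearest representable value, it learns per-weight rounding decisions with signed gradient descent, iteratively reducing the error between the quantized and original layer outputs to preserve model accuracy.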
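
To make the idea of learned rounding concrete, here is a toy sketch in pure Python. It is not the AutoRound implementation and not the pipeline used for this model: it uses a uniform integer grid instead of FP8, a single weight vector instead of a transformer block, and a straight-through gradient estimate. All names (`learn_rounding`, `output_error`, and so on) are made up for illustration.

```python
import random


def quantize(w, v, scale):
    # Each weight gets a learned offset v in [-0.5, 0.5] before rounding,
    # so the optimizer can choose between rounding down and rounding up.
    return [scale * round(wi / scale + vi) for wi, vi in zip(w, v)]


def residuals(w, wq, xs):
    # Per-sample difference between quantized and original layer outputs.
    return [sum(x * (q - o) for x, q, o in zip(row, wq, w)) for row in xs]


def output_error(w, wq, xs):
    return sum(r * r for r in residuals(w, wq, xs))


def learn_rounding(w, xs, scale, iters=200, lr=0.005):
    v = [0.0] * len(w)  # all-zero offsets reproduce round-to-nearest
    best_v, best_err = v[:], output_error(w, quantize(w, v, scale), xs)
    for _ in range(iters):
        res = residuals(w, quantize(w, v, scale), xs)
        for j in range(len(w)):
            # Straight-through estimate: treat d(wq_j)/d(v_j) as `scale`.
            grad = 2 * scale * sum(r * row[j] for r, row in zip(res, xs))
            step = lr if grad > 0 else -lr if grad < 0 else 0.0
            # Signed gradient step, clamped to the allowed offset range.
            v[j] = min(0.5, max(-0.5, v[j] - step))
        err = output_error(w, quantize(w, v, scale), xs)
        if err < best_err:
            best_v, best_err = v[:], err
    return best_v, best_err


random.seed(0)
w = [random.uniform(-1, 1) for _ in range(16)]
xs = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
scale = 0.25

rtn_err = output_error(w, quantize(w, [0.0] * len(w), scale), xs)
_, tuned_err = learn_rounding(w, xs, scale)
print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"learned-rounding error: {tuned_err:.4f}")
```

Because the loop keeps the best offsets seen so far, starting from round-to-nearest, the learned result is never worse than plain rounding on the calibration samples.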

Quantization Configuration

| Component | Precision | Strategy | Details |
|---|---|---|---|
| Weights | FP8 (8-bit float) | Per-channel, symmetric | Static (memoryless minmax observer) |
| Input Activations | FP8 (8-bit float) | Per-token, symmetric | Dynamic quantization |
| Output Activations | Not quantized | | |
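
The configuration above can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not the kernel that vLLM runs: it assumes the E4M3 variant of FP8 (maximum finite value 448), emulates FP8 rounding in ordinary floats, and uses illustrative names (`round_to_e4m3`, `quantize_rows`, `w8a8_linear`). Weight scales are computed once per output channel, while activation scales are recomputed per token on every call.

```python
import math

FP8_E4M3_MAX = 448.0  # ASSUMPTION: E4M3 format, largest finite value is 448


def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # floor(log2(mag)), clamped to -6, the smallest normal exponent.
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits
    return sign * round(mag / step) * step


def symmetric_scale(values):
    # Symmetric min-max scale: map the largest magnitude onto the FP8 maximum.
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize_rows(matrix):
    """Quantize each row with its own scale; returns (fp8_values, scales)."""
    scales = [symmetric_scale(row) for row in matrix]
    q = [[round_to_e4m3(v / s) for v in row] for row, s in zip(matrix, scales)]
    return q, scales


def w8a8_linear(activations, q_weight, w_scales):
    # Dynamic: activation scales are computed per token at call time.
    q_act, a_scales = quantize_rows(activations)
    return [
        [a_s * w_s * sum(a * w for a, w in zip(a_row, w_row))
         for w_row, w_s in zip(q_weight, w_scales)]
        for a_row, a_s in zip(q_act, a_scales)
    ]


# Static: weight scales are computed once, one per output channel (row).
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
q_weight, w_scales = quantize_rows(weight)

activations = [[1.0, 2.0, 3.0]]  # one token with three features
print(w8a8_linear(activations, q_weight, w_scales))
```

The unquantized result for this input is [-0.75, 1.3]; the simulated FP8 output differs slightly because both operands are rounded to three mantissa bits.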

AutoRound Hyperparameters

| Parameter | Value |
|---|---|
| Iterations | 200 |
| Batch Size | 4 |
| Scheme | FP8_DYNAMIC |
| Torch Compile | Enabled |

Layers/Modules Kept at Full Precision

The following modules are not quantized (kept in original precision) to preserve model quality:

  • lm_head (output projection)
  • All self-attention layers (query_key_value, dense)
  • All shared expert layers (shared_experts.gate_proj, up_proj, down_proj)

This selective quantization strategy preserves the most sensitive components (attention and shared experts) while compressing the routed expert MLP weights and activations to FP8.
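
The selection rule above can be expressed as a simple name filter. The sketch below is illustrative only: the glob patterns are derived from the list above, but the full module paths in the example (such as `model.layers.3.attention.dense`) are hypothetical and may not match the names in the actual checkpoint.

```python
import fnmatch

# ASSUMPTION: patterns derived from the list above; real module paths may differ.
KEEP_FULL_PRECISION = [
    "lm_head",
    "*.query_key_value",
    "*.dense",
    "*.shared_experts.gate_proj",
    "*.shared_experts.up_proj",
    "*.shared_experts.down_proj",
]


def is_quantized(module_name):
    """Return True if the module would be quantized to FP8 under these patterns."""
    return not any(
        fnmatch.fnmatchcase(module_name, pattern) for pattern in KEEP_FULL_PRECISION
    )


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.layers.3.attention.query_key_value",
    "model.layers.3.attention.dense",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.mlp.experts.17.up_proj",
]
for name in examples:
    print(f"{name}: {'FP8' if is_quantized(name) else 'full precision'}")
```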

Inference

vLLM (Recommended)

vllm serve --config vllm_config.yaml

A vllm_config.yaml is included in the model root with the following settings (max_model_len caps the context at 65,536 tokens, half the model's 131,072-token maximum):

model: .
trust_remote_code: true
tensor_parallel_size: 1
gpu_memory_utilization: 0.85
max_model_len: 65536
dtype: auto
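
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default address (localhost, port 8000); the helper names are illustrative. Because the config sets `model: .`, the served model name is looked up from the server rather than hard-coded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # ASSUMPTION: vLLM default host and port


def build_completion_request(model_name, prompt, max_tokens=64):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model_name, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def served_model_name():
    """Ask the server which model name it is serving."""
    with urllib.request.urlopen(f"{BASE_URL}/models") as response:
        return json.load(response)["data"][0]["id"]


def complete(prompt, max_tokens=64):
    """Send a prompt to the running server and return the generated text."""
    request = build_completion_request(served_model_name(), prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]
```

With the server up, `print(complete("The capital of India is"))` should print a short continuation.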

License

Apache 2.0, the same license as the original Sarvam-30B model.
