You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Sarvam-30b GPTQ Quantized (W4A16)

This is a 4-bit quantized version of sarvamai/sarvam-30b, created using the LLM Compressor library.

Compression Details (Readme)

This model was quantized using Post-Training Quantization (PTQ) specifically using the GPTQ algorithm. No pruning or distillation was applied.

1. Algorithm Configuration

The following parameters were used for the GPTQ quantization process:

  • Quantization Method: GPTQ
  • Weight Precision: 4-bit (W4A16 - 4-bit weights, 16-bit activations)
  • Scheme: W4A16
  • Block Size / Group Size: 128
  • Sequential Update: True
  • Symmetry: True
  • Target Modules: All Linear layers
  • Ignored Modules: lm_head and re:.*gate.* (MoE router gates were kept in FP16 to preserve routing accuracy)

2. Calibration Dataset

  • Dataset: LinguaLift/IndicMMLU-Pro (Hindi subset)
  • Split: Validation
  • Number of Samples: 512
  • Maximum Sequence Length: 2048
  • Preprocessing: Samples were formatted using the model's chat template (User/Assistant format combining the question, options, and cot_content fields) before tokenization.

3. Software & Hardware

YAML Config File Details

The YAML block at the top of this README serves as the configuration file, containing the model weights metadata and the relevant quantization parameters required to load and interpret the model correctly. Specifically:

  • quant_method: gptq: Tells the inference engine (vLLM/Transformers) which kernel to use.
  • bits: 4 / scheme: W4A16: Defines the precision of weights and activations.
  • group_size: 128 / block_size: 128: Defines the granularity of quantization.
  • desc_act: true / sequential_update: true: Indicates the order of weight processing during quantization (improves accuracy).
  • ignore_modules: Specifies which layers were excluded from quantization (critical for MoE routing layers).

Usage

This model can be deployed efficiently using vLLM or Hugging Face Transformers.

vLLM Example (Recommended for L4/T4 GPUs):

python -m vllm.entrypoints.openai.api_server \
    --model amir22010/sarvam-30b-gptq-4bit \
    --max-model-len 2048 \
    --trust-remote-code

Transformers Example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amir22010/sarvam-30b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

How to launch

vllm serve --config vllm_config.yaml

Key decisions explained

Parameter Why
quantization: gptq_marlin Marlin is the optimized kernel for GPTQ in vLLM. It's 2-4× faster than the plain gptq backend. vLLM will automatically convert the GPTQ checkpoints to Marlin format on first load.
dtype: float16 compression was W4A16 — weights are stored as int4 but activations compute in fp16. Don't use auto here since it may pick bfloat16 and cause dtype mismatches with the GPTQ checkpoints.
tensor_parallel_size: 2 Sarvam-30b is a MoE model. At W4A16, the weight footprint is ~15-20GB, but MoE models have large KV cache demands per expert. Adjust based on your GPU VRAM.
max_model_len: 8192 calibrated at 2048, but that only affects quantization quality — serving length is independent. If you hit OOM, drop to 4096 or 2048.
kv_cache_dtype: auto On Ada (A100/H100) GPUs, switch to fp8_e5m2 to nearly double the effective KV cache capacity, which is the bottleneck for MoE models.

Troubleshooting

If Marlin fails to load

quantization: gptq    # Fallback — slower but more compatible

If you get OOM

max_model_len: 4096              # Reduce context window
gpu_memory_utilization: 0.95    # Push GPU memory usage
kv_cache_dtype: fp8_e5m2        # Only on Ada/Hopper GPUs
max_num_seqs: 32                 # Fewer concurrent requests

If CUDA graph capture crashes

enforce_eager: true   # Disables CUDA graphs (slightly slower but stable)

Verify the model loads correctly

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b-gptq-4bit",
    "prompt": "भारत की राजधानी",
    "max_tokens": 50,
    "temperature": 0.7
  }'
Downloads last month
49
Safetensors
Model size
32B params
Tensor type
I64
·
I32
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for amir22010/sarvam-30b-gptq-4bit

Quantized
(21)
this model