Sarvam-30b GPTQ Quantized (W4A16)
This is a 4-bit quantized version of sarvamai/sarvam-30b, created using the LLM Compressor library.
Compression Details (Readme)
This model was quantized with Post-Training Quantization (PTQ), specifically the GPTQ algorithm. No pruning or distillation was applied.
1. Algorithm Configuration
The following parameters were used for the GPTQ quantization process:
- Quantization Method: GPTQ
- Weight Precision: 4-bit (W4A16 - 4-bit weights, 16-bit activations)
- Scheme: W4A16
- Block Size / Group Size: 128
- Sequential Update: True
- Symmetry: True
- Target Modules: All `Linear` layers
- Ignored Modules: `lm_head` and `re:.*gate.*` (MoE router gates were kept in FP16 to preserve routing accuracy)
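Because the ignore list contains a regular expression, it can help to check which module names it would exclude. The following is a minimal sketch, assuming the `re:` prefix means the rest of the string is applied as a Python regular expression to each module name. The module names are hypothetical and are not taken from the Sarvam-30b checkpoint.

```python
import re

# Ignore list from the configuration above.
IGNORE = ["lm_head", "re:.*gate.*"]

# Hypothetical module names, for illustration only.
MODULE_NAMES = [
    "lm_head",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.down_proj",
]


def is_ignored(name: str, ignore: list[str]) -> bool:
    """Return True if `name` matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: the text after "re:" is a regex matched from the start of the name.
            if re.match(entry[3:], name):
                return True
        elif name == entry:
            return True
    return False


ignored = [n for n in MODULE_NAMES if is_ignored(n, IGNORE)]
quantized = [n for n in MODULE_NAMES if not is_ignored(n, IGNORE)]
print("ignored:  ", ignored)
print("quantized:", quantized)
```

Under these assumptions the pattern also matches any layer whose name merely contains `gate` (such as the hypothetical `gate_proj` above), so it is worth confirming against the real module names which layers stayed in FP16.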
2. Calibration Dataset
- Dataset: LinguaLift/IndicMMLU-Pro (Hindi subset)
- Split: Validation
- Number of Samples: 512
- Maximum Sequence Length: 2048
- Preprocessing: Samples were formatted using the model's chat template (User/Assistant format combining the `question`, `options`, and `cot_content` fields) before tokenization.
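The preprocessing step could look roughly like the sketch below. The exact prompt layout used for calibration is not documented here, so the lettered option format and the `format_sample` helper are assumptions made for illustration; only the three field names come from the description above.

```python
def format_sample(sample: dict) -> list[dict]:
    """Turn one dataset row into a user/assistant message pair."""
    # Label the options A, B, C, ... one per line (assumed layout).
    option_lines = [
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(sample["options"])
    ]
    user_text = sample["question"] + "\n" + "\n".join(option_lines)
    return [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": sample["cot_content"]},
    ]


# Hypothetical row containing the three fields named above.
example = {
    "question": "2 + 2 = ?",
    "options": ["3", "4", "5"],
    "cot_content": "2 + 2 equals 4, so the answer is B.",
}
messages = format_sample(example)
print(messages)

# With a loaded tokenizer, the messages would then be rendered and tokenized, e.g.:
# text = tokenizer.apply_chat_template(messages, tokenize=False)
# tokens = tokenizer(text, max_length=2048, truncation=True)
```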
3. Software & Hardware
- Framework: LLM Compressor (by Neural Magic)
- Modifier: `GPTQModifier`
- Hardware: NVIDIA H200 GPU
YAML Config File Details
The YAML block at the top of this README serves as the configuration file, containing the model weights metadata and the relevant quantization parameters required to load and interpret the model correctly. Specifically:
- `quant_method: gptq`: Tells the inference engine (vLLM/Transformers) which kernel to use.
- `bits: 4` / `scheme: W4A16`: Defines the precision of weights and activations.
- `group_size: 128` / `block_size: 128`: Defines the granularity of quantization.
- `desc_act: true` / `sequential_update: true`: Indicates the order of weight processing during quantization (improves accuracy).
- `ignore_modules`: Specifies which layers were excluded from quantization (critical for MoE routing layers).
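To make the effect of `bits` and `group_size` concrete, here is a small arithmetic sketch. It assumes symmetric quantization with one 16-bit scale per group and no stored zero points, and it ignores any extra index data from activation ordering. The 4096 × 4096 layer shape is a hypothetical example, not a dimension of Sarvam-30b.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average storage per weight: the weight itself plus its share of one group scale."""
    return bits + scale_bits / group_size


def num_scales(out_features: int, in_features: int, group_size: int) -> int:
    """Number of scale values for a weight matrix grouped along the input dimension."""
    return out_features * (in_features // group_size)


print(effective_bits_per_weight(bits=4, group_size=128))  # 4.125
print(num_scales(out_features=4096, in_features=4096, group_size=128))  # 131072
```

A smaller group size would give finer-grained scales at the cost of more scale storage per weight.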
Usage
This model can be deployed efficiently using vLLM or Hugging Face Transformers.
vLLM Example (Recommended for L4/T4 GPUs):
python -m vllm.entrypoints.openai.api_server \
  --model amir22010/sarvam-30b-gptq-4bit \
  --max-model-len 2048 \
  --trust-remote-code
Transformers Example:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "amir22010/sarvam-30b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
How to launch
vllm serve --config vllm_config.yaml
Key decisions explained
| Parameter | Why |
|---|---|
| `quantization: gptq_marlin` | Marlin is the optimized kernel for GPTQ in vLLM. It's 2-4× faster than the plain `gptq` backend. vLLM will automatically convert the GPTQ checkpoints to Marlin format on first load. |
| `dtype: float16` | Compression was W4A16: weights are stored as int4 but activations compute in fp16. Don't use `auto` here since it may pick bfloat16 and cause dtype mismatches with the GPTQ checkpoints. |
| `tensor_parallel_size: 2` | Sarvam-30b is a MoE model. At W4A16, the weight footprint is ~15-20GB, and the KV cache needs additional memory on top of that. Adjust based on your GPU VRAM. |
| `max_model_len: 8192` | The model was calibrated at 2048, but that only affects quantization quality; serving length is independent. If you hit OOM, drop to 4096 or 2048. |
| `kv_cache_dtype: auto` | On Ada or Hopper GPUs (e.g. L40S, H100), switch to `fp8_e5m2` to nearly double the effective KV cache capacity, which is often the memory bottleneck when serving. |
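The launch command refers to a `vllm_config.yaml` that is not reproduced in this README. As a sketch, the parameters from the table could be collected into such a file as shown below. The `served_model_name` and `trust_remote_code` entries are assumptions inferred from the other examples in this document, and whether a given vLLM version expects underscores or hyphens in the key names should be checked against its documentation.

```python
from pathlib import Path

# Values taken from the table above, except where marked as assumptions.
CONFIG = {
    "model": "amir22010/sarvam-30b-gptq-4bit",
    "served_model_name": "sarvam-30b-gptq-4bit",  # assumption: name used in the verification request
    "trust_remote_code": True,  # assumption: carried over from the usage examples
    "quantization": "gptq_marlin",
    "dtype": "float16",
    "tensor_parallel_size": 2,
    "max_model_len": 8192,
    "kv_cache_dtype": "auto",
}


def to_yaml(config: dict) -> str:
    """Render a flat dict as simple `key: value` YAML lines."""
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # YAML booleans are lowercase
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


Path("vllm_config.yaml").write_text(to_yaml(CONFIG), encoding="utf-8")
print(to_yaml(CONFIG))
```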
Troubleshooting
If Marlin fails to load
quantization: gptq # Fallback — slower but more compatible
If you get OOM
max_model_len: 4096 # Reduce context window
gpu_memory_utilization: 0.95 # Push GPU memory usage
kv_cache_dtype: fp8_e5m2 # Only on Ada/Hopper GPUs
max_num_seqs: 32 # Fewer concurrent requests
If CUDA graph capture crashes
enforce_eager: true # Disables CUDA graphs (slightly slower but stable)
Verify the model loads correctly
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sarvam-30b-gptq-4bit",
"prompt": "भारत की राजधानी",
"max_tokens": 50,
"temperature": 0.7
}'
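The same check can be scripted in Python using only the standard library. This is a minimal sketch that mirrors the request above; `build_completion_request` is a hypothetical helper, and the model name assumes the server exposes the model as `sarvam-30b-gptq-4bit`.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 50, "temperature": 0.7}
    # ensure_ascii=False keeps the Devanagari prompt readable; the body is sent as UTF-8.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000", "sarvam-30b-gptq-4bit", "भारत की राजधानी"
)
print(request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.load(response)["choices"][0]["text"])
```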