Instructions to use meghanamakkapati/sarvam30b_INT4_quantisation with libraries and local apps.

Libraries

How to use meghanamakkapati/sarvam30b_INT4_quantisation with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meghanamakkapati/sarvam30b_INT4_quantisation", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meghanamakkapati/sarvam30b_INT4_quantisation", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use meghanamakkapati/sarvam30b_INT4_quantisation with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meghanamakkapati/sarvam30b_INT4_quantisation" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meghanamakkapati/sarvam30b_INT4_quantisation",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/meghanamakkapati/sarvam30b_INT4_quantisation

SGLang

How to use meghanamakkapati/sarvam30b_INT4_quantisation with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meghanamakkapati/sarvam30b_INT4_quantisation" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meghanamakkapati/sarvam30b_INT4_quantisation",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meghanamakkapati/sarvam30b_INT4_quantisation" \
        --trust-remote-code \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use meghanamakkapati/sarvam30b_INT4_quantisation with Docker Model Runner:
```
docker model run hf.co/meghanamakkapati/sarvam30b_INT4_quantisation
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Sarvam-30B INT4 W4A16 Quantized Model

Base Model

Base model: sarvamai/sarvam-30b

This is an INT4 / W4A16 quantized version of Sarvam-30B: weights are stored as 4-bit integers, while activations stay in 16-bit precision.

Quantization Method

Method: GPTQ using llmcompressor
Scheme: W4A16
Source model dtype during quantization: BF16
Calibration samples: 128
Calibration sequence length: 2048
Saved format: Hugging Face save_pretrained format with compressed safetensors
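
The settings above map fairly directly onto an llmcompressor one-shot GPTQ run. The following is a minimal sketch under stated assumptions, not the script used for this model: the function names are hypothetical, the calibration dataset is not named in this card and must be supplied by the caller (already tokenized), and the import path for oneshot differs between llmcompressor releases (older versions expose it from llmcompressor.transformers).

```python
# Settings stated in this model card.
SCHEME = "W4A16"
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048


def calibration_settings() -> dict:
    """Return the calibration arguments that are forwarded to oneshot()."""
    return {
        "num_calibration_samples": NUM_CALIBRATION_SAMPLES,
        "max_seq_length": MAX_SEQUENCE_LENGTH,
    }


def quantize(dataset, ignore, output_dir, model_id="sarvamai/sarvam-30b"):
    """Hypothetical helper: run GPTQ W4A16 and save compressed safetensors.

    `dataset` is assumed to be an already tokenized Hugging Face dataset;
    `ignore` is the list of module names or patterns to leave unquantized.
    """
    # Heavy imports are deferred so this file can be imported without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM
    from llmcompressor import oneshot  # assumption: recent llmcompressor release
    from llmcompressor.modifiers.quantization import GPTQModifier

    # Load the base model in BF16, the source dtype stated above.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    # Quantize Linear layers to W4A16, skipping everything in `ignore`.
    recipe = GPTQModifier(targets="Linear", scheme=SCHEME, ignore=list(ignore))
    oneshot(model=model, dataset=dataset, recipe=recipe, **calibration_settings())
    # save_compressed writes the compressed safetensors format.
    model.save_pretrained(output_dir, save_compressed=True)
```

Mixture-of-experts models may need extra calibration handling that this sketch does not cover.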

Precision Policy

Left unquantized (on the quantization ignore list):

Embeddings
LM head
Attention modules and projections
Router / gating modules
MoE router-related modules

Main quantized targets:

Linear layers outside the ignore list
Expert / FFN-heavy parts of the model
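
A precision policy like this is usually expressed as name patterns matched against module names. The sketch below illustrates the idea with plain regular expressions. The patterns and the module names in the usage note are hypothetical; the real names depend on the Sarvam-30B remote code and are not listed in this card.

```python
import re

# Hypothetical patterns, one per category on the ignore list above.
IGNORE_PATTERNS = [
    r"lm_head",                # LM head
    r".*embed_tokens",         # embeddings
    r".*self_attn(\..*)?",     # attention modules and their projections
    r".*\.gate",               # router / gating modules
]


def is_ignored(module_name: str) -> bool:
    """True if the module should stay unquantized."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


def split_modules(module_names):
    """Partition module names into (kept, quantized) lists."""
    kept = [n for n in module_names if is_ignored(n)]
    quantized = [n for n in module_names if not is_ignored(n)]
    return kept, quantized
```

With these assumed names, split_modules(["lm_head", "model.layers.0.mlp.gate", "model.layers.0.mlp.experts.3.up_proj"]) keeps the first two and marks only the expert projection for quantization.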

Serving

This submission is intended for vLLM.

Run with:

vllm serve --config vllm_config.yaml

Equivalent explicit command, run from the directory that contains the model files (the . argument is the model path):

vllm serve . \
  --served-model-name sarvam-int4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.88 \
  --max-num-seqs 1
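
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. The served model name sarvam-int4 and port 8000 come from the command above; the localhost address and the helper names are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, model="sarvam-int4", base_url="http://localhost:8000"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to the server and return the first reply's text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("What is the capital of France?") requires the server to be up. Because the command sets --max-num-seqs 1, the server processes one sequence at a time, so concurrent requests wait in a queue rather than being batched.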

Validation

The model was validated through vLLM on seven prompts covering:

English reasoning
BoolQ-style reasoning
Hindi / Indian-language response
Math / science
Medical-style educational synthesis
Multiple choice
Open-ended generation

Observed result:

6 PASS
1 PASS_WITH_FORMAT_WARNING
0 FAIL
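
This card does not say how each status was assigned. As an illustration only, a three-way tally of this shape could be produced by checking content and format separately; the grading rule and function names below are hypothetical.

```python
from collections import Counter

STATUSES = ("PASS", "PASS_WITH_FORMAT_WARNING", "FAIL")


def grade(content_ok: bool, format_ok: bool) -> str:
    """Assumed rule: wrong content fails; right content with bad format warns."""
    if not content_ok:
        return "FAIL"
    return "PASS" if format_ok else "PASS_WITH_FORMAT_WARNING"


def summarize(statuses):
    """Count each status, reporting zero for statuses that never occur."""
    counts = Counter(statuses)
    return {status: counts.get(status, 0) for status in STATUSES}
```

Under this rule, six responses that are correct and well formatted plus one that is correct but badly formatted reproduce the tally above.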

Known Caveats

Requires trust_remote_code=True (the --trust-remote-code flag when serving with vLLM).
Tested with vLLM.
The provided serving config is vllm_config.yaml.
