Sarvam-30B INT4 W4A16 Quantized Model
Base Model
Base model: sarvamai/sarvam-30b
This is an INT4 quantized version of Sarvam-30B using the W4A16 scheme (4-bit weights, 16-bit activations).
Quantization Method
- Method: GPTQ using llmcompressor
- Scheme: W4A16
- Source model dtype during quantization: BF16
- Calibration samples: 128
- Calibration sequence length: 2048
- Saved format: Hugging Face save_pretrained format with compressed safetensors
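To make the W4A16 storage idea concrete, here is a minimal sketch of group-wise symmetric INT4 quantization with plain round-to-nearest. It is not GPTQ: GPTQ additionally uses the calibration samples to choose rounded values that reduce layer output error, which this sketch does not attempt. The group size of 4 and the helper names are assumptions for illustration only, not values taken from this checkpoint.

```python
from typing import List, Tuple

GROUP_SIZE = 4               # assumption for the demo; real checkpoints use larger groups
INT4_MIN, INT4_MAX = -8, 7   # range of a signed 4-bit integer


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT4 round-to-nearest for one group: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    # One scale per group maps the largest magnitude onto code 7.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate 16-bit-style weights from 4-bit codes."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split a weight row into groups and quantize each group independently."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


def dequantize_row(groups) -> List[float]:
    """Concatenate the dequantized groups back into one row."""
    out: List[float] = []
    for codes, scale in groups:
        out.extend(dequantize_group(codes, scale))
    return out
```

Only the weights are stored as 4-bit codes plus per-group scales; at inference time they are expanded back to 16-bit values, which is why activations stay at 16 bits in a W4A16 scheme.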
Precision Policy
Kept unquantized (on the ignore list during quantization):
- Embeddings
- LM head
- Attention modules and projections
- Router / gating modules
- MoE router-related modules
Main quantized target:
- Linear layers outside the ignore list
- Expert / FFN-heavy parts of the model
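The policy above amounts to a name-based filter over the model's linear layers. The sketch below shows one way such an ignore list could be expressed with regular expressions. The patterns and module names are hypothetical and chosen only to mirror the categories listed here; the actual names in the Sarvam-30B checkpoint and the actual ignore list used for this model may differ.

```python
import re
from typing import Sequence

# Hypothetical ignore patterns mirroring the precision policy above.
IGNORE_PATTERNS = [
    r"embed_tokens$",        # embeddings
    r"lm_head$",             # LM head
    r"\.self_attn(\.|$)",    # attention modules and their projections
    r"\.(router|gate)$",     # MoE router / gating modules (not expert gate_proj)
]


def is_quantized(module_name: str,
                 patterns: Sequence[str] = IGNORE_PATTERNS) -> bool:
    """Return True if a linear layer with this name would be quantized,
    i.e. if it matches none of the ignore patterns."""
    return not any(re.search(p, module_name) for p in patterns)
```

With these assumed names, `is_quantized("model.layers.0.mlp.experts.3.up_proj")` is true, while `is_quantized("model.layers.0.mlp.gate")` is false, so expert FFN weights are quantized and the router stays in its original precision.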
Serving
This submission is intended for vLLM.
Run with:
vllm serve --config vllm_config.yaml
Equivalent explicit command:
vllm serve . \
--served-model-name sarvam-int4 \
--trust-remote-code \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.88 \
--max-num-seqs 1
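Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable on localhost at the port given above and uses the served model name from the command; the function names and the max_tokens default are illustrative choices, not part of this repository.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # matches --port 8000; local host is an assumption
MODEL_NAME = "sarvam-int4"           # matches --served-model-name


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("What is the capital of France?")` sends a single request. Because the server is started with --max-num-seqs 1, concurrent requests are processed one at a time.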
Validation
The model was validated through vLLM on seven prompts covering:
- English reasoning
- BoolQ-style reasoning
- Hindi / Indian-language response
- Math / science
- Medical-style educational synthesis
- Multiple choice
- Open-ended generation
Observed result:
- 6 PASS
- 1 PASS_WITH_FORMAT_WARNING
- 0 FAIL
Known Caveats
- Requires trust_remote_code=True.
- Tested with vLLM.
- The provided serving config is vllm_config.yaml.