Orange/Sarvam-30b-GPTQ-w8a16
8-bit GPTQ-quantized version of sarvamai/sarvam-30b,
produced with llmcompressor using a
W8A16 quality-aware quantization scheme and exported as compressed-tensors
(pack-quantized) for native loading by vLLM.
This model is submitted to the Resilient AI Challenge as a compressed variant of
Sarvam-30b suitable for single-GPU deployment via vllm serve.
1. Submission package
This repository is a self-contained model card. Files included:
| File | Role |
|---|---|
| `model-0000X-of-00009.safetensors` + `model.safetensors.index.json` | Quantized weights (compressed-tensors) |
| `config.json` | Model + `quantization_config` (auto-detected by vLLM) |
| `vllm_config.yaml` | Serving configuration consumed by `vllm serve --config` |
| `recipe.yaml` | llmcompressor recipe used during quantization (traceability) |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer + chat template |
| `generation_config.json` | Generation defaults |
| `modeling_sarvam_moe.py`, `configuration_sarvam_moe.py` | Custom modeling code (requires `trust_remote_code=True`) |
2. Serving with vLLM
The model is intended to be served with vLLM >= 0.17.1 using the configuration file shipped at the root of the repository:
vllm serve --config vllm_config.yaml
vllm_config.yaml sets only the parameters required by the challenge spec:
trust-remote-code: true
max-model-len: 65536
gpu-memory-utilization: 0.85
- `trust-remote-code` is required because the model ships custom modeling code referenced through `auto_map` in `config.json`.
- `max-model-len: 65536` matches the maximum context length specified in the Sarvam-30b evaluation footnote.
- `gpu-memory-utilization: 0.85` follows the challenge default.
- The quantization format (compressed-tensors / pack-quantized, W8A16) is auto-detected from `quantization_config` in `config.json`.
- Tensor parallelism and dtype are intentionally left at vLLM defaults so the configuration adapts to the evaluation hardware.
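The two `config.json` fields mentioned above can be inspected with the standard library before serving. This is a minimal sketch, not vLLM's detection logic: `summarize_config` is a hypothetical helper, the key names `quant_method` and `format` are assumptions based on how compressed-tensors checkpoints are commonly laid out, and the example dictionary is a stand-in rather than a copy of the shipped file.

```python
import json

def summarize_config(config: dict) -> dict:
    """Report whether a config needs remote code and which quantization it declares."""
    quant = config.get("quantization_config") or {}
    return {
        # auto_map points at custom classes, which is why trust-remote-code is needed
        "needs_remote_code": bool(config.get("auto_map")),
        # Assumed key names; verify them against the shipped config.json
        "quant_method": quant.get("quant_method"),
        "format": quant.get("format"),
        "ignored_modules": len(quant.get("ignore", [])),
    }

# Illustrative stand-in for config.json; the class name is a placeholder
example = json.loads("""
{
  "auto_map": {"AutoConfig": "configuration_sarvam_moe.PlaceholderConfig"},
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "ignore": ["lm_head"]
  }
}
""")
summary = summarize_config(example)
print(summary)
```

To run it against the real file, replace the stand-in with `json.load(open("config.json"))` from a local copy of the repository.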
vllm_config content
# vLLM serve configuration for Orange/Sarvam-30b-GPTQ-w8a16
# Used by: vllm serve --config vllm_config.yaml
# Reference: https://docs.vllm.ai/en/latest/configuration/serve_args/
trust-remote-code: true
max-model-len: 65536
gpu-memory-utilization: 0.85
Sampling parameters
Sampling parameters (temperature, top_p, top_k, max_new_tokens) are passed by
the evaluation harness on a per-request basis and therefore are not set in
vllm_config.yaml. The parameters expected by the challenge are reproduced
below for reference:
| Benchmark family | Parameters |
|---|---|
| General | max_context_length = 65536 |
| Reasoning & Math | temperature = 1.0, top_p = 1.0, max_new_tokens = 65536 |
| Code & Knowledge | temperature = 1.0, top_p = 1.0, max_new_tokens = 65536 |
| Writing-Bench (generation) | temperature = 0.7, top_p = 0.8, top_k = 20, max_length = 16000 |
| Writing-Bench (scoring) | temperature = 1.0, top_p = 0.95, max_length = 2048 |
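As an illustration of how a harness might pass these values per request, the sketch below builds a chat-completions request body for one benchmark family. Several things here are assumptions: the dictionary keys naming the families, the mapping of `max_new_tokens` to the OpenAI-style `max_tokens` field, the use of `top_k` as a vLLM extension field in the request body, and the served model name. The `max_length` values are left out of the payload because they bound prompt plus completion, which depends on the prompt.

```python
import json

# Sampling settings copied from the table above; the family keys are illustrative
SAMPLING = {
    "reasoning_math": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "code_knowledge": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "writing_generation": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_length": 16000},
    "writing_scoring": {"temperature": 1.0, "top_p": 0.95, "max_length": 2048},
}

def build_chat_payload(family: str, prompt: str,
                       model: str = "Orange/Sarvam-30b-GPTQ-w8a16") -> dict:
    """Build a request body for an OpenAI-compatible chat completions endpoint."""
    params = SAMPLING[family]
    payload = {
        "model": model,  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": params["temperature"],
        "top_p": params["top_p"],
    }
    if "top_k" in params:
        payload["top_k"] = params["top_k"]  # not part of the OpenAI schema
    if "max_new_tokens" in params:
        payload["max_tokens"] = params["max_new_tokens"]  # assumed mapping
    return payload

payload = build_chat_payload("writing_generation", "Write a short product announcement.")
print(json.dumps(payload, indent=2))
```

The resulting dictionary could be sent as JSON to the server's `/v1/chat/completions` route; the host and port depend on how the server was started.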
3. Compression methodology
3.1 Quantization method
- Algorithm: GPTQ (second-order weight optimization) via `llmcompressor`.
- Scheme: W8A16 -- 8-bit weights, 16-bit activations.
- Strategy: Quality-aware -- quantization is selectively applied so that the most quality-sensitive components remain at full precision.
- Export format: `compressed-tensors` (pack-quantized), natively supported by vLLM.
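To make the W8A16 idea concrete, the sketch below quantizes one weight row to 8-bit integers with a single symmetric scale per channel and then restores it. It is plain round-to-nearest and deliberately omits what makes GPTQ distinctive, namely the second-order error compensation applied to the remaining weights. The function names and the clamping range of ±127 are assumptions for illustration; the actual kernels and packing used by compressed-tensors may differ.

```python
def quantize_channel(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one output channel."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the symmetric range
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_channel(q: list[int], scale: float) -> list[float]:
    """Map stored integers back to floating-point weights."""
    return [v * scale for v in q]

row = [0.4, -0.25, 0.1, -1.0]  # toy weights, not taken from the model
q_row, scale = quantize_channel(row)
restored = dequantize_channel(q_row, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q_row, scale, max_error)
```

With round-to-nearest the reconstruction error per weight is bounded by half a quantization step, which is why a single large outlier in a channel (which inflates the scale) costs precision for every other weight in that channel.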
3.2 Quality-aware exclusions
The following layers are kept at full precision (listed in `quantization_config.ignore` of `config.json`). The rationale for each exclusion is:
- Attention projections (`model.layers.*.attention.query_key_value`, `model.layers.*.attention.dense`): attention is sensitive to weight precision, particularly for multilingual inputs and long contexts. Keeping it unquantized preserves positional coherence and cross-token relationships.
- First decoder MLP layer (`model.layers.0.mlp.*`): early layers build foundational token representations reused throughout the network. Errors here compound across all subsequent layers.
- Shared experts (`model.layers.*.mlp.shared_experts.*`): unlike routed experts, which only see a fraction of the tokens, shared experts process every token. Any precision loss here affects all inputs uniformly.
- `lm_head`: kept at full precision to preserve the output token distribution quality.
Routed MoE experts (the bulk of the parameters) are quantized to 8-bit, which is where most of the memory savings come from.
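The effect of the exclusions can be checked by matching module names against the ignore patterns listed in the recipe summary (section 3.4). The sketch below is a simplified stand-in for the matching done by the quantization tooling: entries prefixed with `re:` are treated as regular expressions and all others as exact names. The module names in the example are hypothetical and only chosen to exercise each rule.

```python
import re

# Patterns copied from the recipe's ignore list ("re:" marks a regular expression)
IGNORE = [
    "lm_head",
    r"re:model\.layers\..*\.attention\..*",
    r"re:model\.layers\.0\.mlp\..*",
    r"re:model\.layers\..*\.mlp\.shared_experts\..*",
]

def is_ignored(module_name: str, patterns: list[str] = IGNORE) -> bool:
    """Return True if the module would be left unquantized."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False

# Hypothetical module names
names = [
    "lm_head",
    "model.layers.7.attention.dense",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.7.mlp.shared_experts.down_proj",
    "model.layers.7.mlp.experts.3.up_proj",
]
for name in names:
    status = "kept at full precision" if is_ignored(name) else "quantized to 8-bit"
    print(f"{name}: {status}")
```

Only the last name, a routed expert outside layer 0, falls through every rule and is quantized.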
3.3 Calibration data
A calibration set of 2048 samples was drawn from the sarvamai/indivibe dataset,
split equally across 4 task subsets: chat, code, math, and stem
(512 samples each, shuffled with seed 42).
- Languages: the calibration covers the languages present in `indivibe` (primarily English and Indian languages).
- Calibration sequence length: 2048 tokens.
- Random seed: 42.
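The balanced sampling described above can be sketched with the standard library. This is an illustration under assumptions, not the script used for this model: the pools are placeholder strings instead of dataset rows, `build_calibration_set` is a hypothetical helper, and the order of operations (sample 512 per subset, then shuffle the combined set) is one plausible reading of the description.

```python
import random

SUBSETS = ("chat", "code", "math", "stem")

def build_calibration_set(pools: dict[str, list[str]],
                          per_subset: int = 512,
                          seed: int = 42) -> list[tuple[str, str]]:
    """Draw an equal number of samples from each subset, then shuffle them together."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    selected = []
    for name in SUBSETS:
        picked = rng.sample(pools[name], per_subset)
        selected.extend((name, text) for text in picked)
    rng.shuffle(selected)
    return selected

# Placeholder pools standing in for the real dataset rows
pools = {name: [f"{name}-sample-{i}" for i in range(1000)] for name in SUBSETS}
calibration = build_calibration_set(pools)
print(len(calibration))
```

Fixing the seed matters for traceability: rerunning the function with the same pools yields the same calibration set in the same order.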
3.4 Recipe summary
The exact llmcompressor recipe is shipped as recipe.yaml. Key parameters:
quantization:
method: GPTQ
scheme: W8A16
variant: quality-aware
weights:
bit_width: 8
strategy: channel
symmetric: true
actorder: static
dampening_frac: 0.05
ignore:
- lm_head
- "re:model\\.layers\\..*\\.attention\\..*"
- "re:model\\.layers\\.0\\.mlp\\..*"
- "re:model\\.layers\\..*\\.mlp\\.shared_experts\\..*"
calibration:
num_samples: 2048
max_sequence_length: 2048
seed: 42
3.5 Hardware used for quantization
Quantization was performed on a single NVIDIA A100 (80 GB) GPU.
4. Trade-offs
This W8A16 quality-aware variant has a larger memory footprint than lower-bit quantization schemes. In exchange, it is intended to retain higher quality on reasoning, multilingual, and structured generation tasks. It is designed to fit on a single 80 GB GPU while maximizing output quality.
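A rough back-of-the-envelope calculation shows how the footprint depends on the share of parameters that is quantized. Everything numeric here is an assumption for illustration: the total of 30e9 parameters is inferred from the model name, the quantized fractions are hypothetical, and the estimate ignores quantization scales, the KV cache, and activations.

```python
def weight_memory_gb(total_params: float, quantized_fraction: float,
                     quant_bits: int = 8, full_bits: int = 16) -> float:
    """Estimate weight storage in GB for a partially quantized model."""
    quant_bytes = total_params * quantized_fraction * quant_bits / 8
    full_bytes = total_params * (1 - quantized_fraction) * full_bits / 8
    return (quant_bytes + full_bytes) / 1e9

TOTAL_PARAMS = 30e9         # assumed from the "30b" in the model name
GPU_BUDGET_GB = 80 * 0.85   # 80 GB card at gpu-memory-utilization 0.85

for fraction in (0.0, 0.5, 0.9, 1.0):  # hypothetical quantized shares
    size = weight_memory_gb(TOTAL_PARAMS, fraction)
    print(f"quantized share {fraction:.0%}: ~{size:.0f} GB of weights, "
          f"budget {GPU_BUDGET_GB:.0f} GB")
```

Whatever remains of the budget after the weights is what the server can use for the KV cache, so a smaller weight footprint translates directly into room for longer or more concurrent sequences.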
5. Bias, risks and limitations
The model inherits all limitations of the base model sarvamai/sarvam-30b,
including possible biases, hallucinations, factual errors, and uneven
performance across prompts and languages.
Quantization may additionally introduce:
- task-dependent quality degradation,
- sensitivity to the inference backend and runtime configuration,
- behavioral differences compared with the full-precision base model.
The quality-aware exclusions above are intended to mitigate these risks on the components that are most sensitive to weight precision.
6. License
This model is released under the Apache 2.0 license, identical to the
license of the base model sarvamai/sarvam-30b.
Model Card Contact
Thanks to Binetou Cécile Niang (binetoucecile.niang@orange.com, ng.binetou@outlook.com), Abdoulaye Mbaye (abdoulaye.mbaye@orange.com, mbaye.laye14018@gmail.com), Floriane Behanzin, Lionel Delphin-Poulat, Nour Rammal and Rose Djagbre for adding this model.