Orange/Sarvam-30b-GPTQ-w8a16

8-bit GPTQ-quantized version of sarvamai/sarvam-30b, produced with llmcompressor using a W8A16 quality-aware quantization scheme and exported as compressed-tensors (pack-quantized) for native loading by vLLM.

This model is submitted to the Resilient AI Challenge as a compressed variant of Sarvam-30b suitable for single-GPU deployment via vllm serve.

1. Submission package

This repository is a self-contained model card. Files included:

| File | Role |
|---|---|
| `model-0000X-of-00009.safetensors` + `model.safetensors.index.json` | Quantized weights (compressed-tensors) |
| `config.json` | Model + `quantization_config` (auto-detected by vLLM) |
| `vllm_config.yaml` | Serving configuration consumed by `vllm serve --config` |
| `recipe.yaml` | `llmcompressor` recipe used during quantization (traceability) |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer + chat template |
| `generation_config.json` | Generation defaults |
| `modeling_sarvam_moe.py`, `configuration_sarvam_moe.py` | Custom modeling code (requires `trust_remote_code=True`) |

2. Serving with vLLM

The model is intended to be served with vLLM >= 0.17.1 using the configuration file shipped at the root of the repository:

vllm serve --config vllm_config.yaml

vllm_config.yaml sets only the parameters required by the challenge spec:

trust-remote-code: true
max-model-len: 65536
gpu-memory-utilization: 0.85

trust-remote-code is required because the model ships custom modeling code referenced through auto_map in config.json.
max-model-len: 65536 matches the maximum context length specified in the Sarvam-30b evaluation footnote.
gpu-memory-utilization: 0.85 follows the challenge default.
The quantization format (compressed-tensors / pack-quantized, W8A16) is auto-detected from quantization_config in config.json.
Tensor parallelism and dtype are intentionally left at vLLM defaults so the configuration adapts to the evaluation hardware.
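
To make the mapping between the configuration file and the command line concrete, here is a minimal Python sketch that turns the three settings listed above into the equivalent `vllm serve` flags. The helper `config_to_cli_args` is hypothetical and only handles flat `key: value` lines. Passing the repository id as the positional model argument is an assumption, made because the shipped configuration does not set a model key.

```python
import shlex

# Contents of vllm_config.yaml as listed in this card.
CONFIG_TEXT = """
trust-remote-code: true
max-model-len: 65536
gpu-memory-utilization: 0.85
"""


def config_to_cli_args(text):
    """Turn flat `key: value` lines into command-line style flags."""
    args = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if value.lower() == "true":
            args.append(f"--{key}")  # boolean switch, no value
        elif value.lower() != "false":  # false switches are simply omitted
            args.extend([f"--{key}", value])
    return args


# Assumption: the repo id is given as the positional model argument.
command = ["vllm", "serve", "Orange/Sarvam-30b-GPTQ-w8a16"]
command += config_to_cli_args(CONFIG_TEXT)
print(shlex.join(command))
```

The printed command is only an illustration of what the file expresses; the challenge harness may invoke vLLM differently.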

Content of vllm_config.yaml

# vLLM serve configuration for Orange/Sarvam-30b-GPTQ-w8a16
# Used by: vllm serve --config vllm_config.yaml
# Reference: https://docs.vllm.ai/en/latest/configuration/serve_args/

trust-remote-code: true

max-model-len: 65536

gpu-memory-utilization: 0.85

Sampling parameters

Sampling parameters (temperature, top_p, top_k, max_new_tokens) are passed by the evaluation harness on a per-request basis and therefore are not set in vllm_config.yaml. The parameters expected by the challenge are reproduced below for reference:

| Benchmark family | Parameters |
|---|---|
| General | `max_context_length = 65536` |
| Reasoning & Math | `temperature = 1.0, top_p = 1.0, max_new_tokens = 65536` |
| Code & Knowledge | `temperature = 1.0, top_p = 1.0, max_new_tokens = 65536` |
| Writing-Bench (generation) | `temperature = 0.7, top_p = 0.8, top_k = 20, max_length = 16000` |
| Writing-Bench (scoring) | `temperature = 1.0, top_p = 0.95, max_length = 2048` |
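
As an illustration of per-request sampling, the sketch below builds request bodies for an OpenAI-compatible `/v1/chat/completions` endpoint from the table above. It is not the challenge harness. The family keys and the `build_request` helper are hypothetical, and mapping `max_new_tokens` and `max_length` to the `max_tokens` field is an assumption. `top_k` is not part of the OpenAI schema; it is sent here as an extra sampling field on the assumption that the server accepts it.

```python
import json

MODEL_ID = "Orange/Sarvam-30b-GPTQ-w8a16"

# Sampling settings copied from the table, keyed by hypothetical family names.
# Assumption: max_new_tokens / max_length both map to `max_tokens`.
SAMPLING = {
    "reasoning_math": {"temperature": 1.0, "top_p": 1.0, "max_tokens": 65536},
    "code_knowledge": {"temperature": 1.0, "top_p": 1.0, "max_tokens": 65536},
    "writing_generation": {
        "temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_tokens": 16000,
    },
    "writing_scoring": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 2048},
}


def build_request(family, prompt):
    """Return a JSON-serializable body for POST /v1/chat/completions."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(SAMPLING[family])  # sampling travels with each request
    return body


payload = build_request("writing_generation", "Write a short product note.")
print(json.dumps(payload, indent=2))
```

Because the settings are attached to each request, the same running server can handle all benchmark families without a restart.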

3. Compression methodology

3.1 Quantization method

Algorithm: GPTQ (second-order weight optimization) via llmcompressor.
Scheme: W8A16 -- 8-bit weights, 16-bit activations.
Strategy: Quality-aware -- quantization is selectively applied so that the most quality-sensitive components remain at full precision.
Export format: compressed-tensors (pack-quantized), natively supported by vLLM.
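
To show what "8-bit weights, 16-bit activations" means for storage, here is a small sketch of symmetric per-channel quantization. It uses plain round-to-nearest, not GPTQ: GPTQ additionally adjusts the remaining weights using second-order information to compensate for rounding error, which this toy example does not attempt. The function names and the sample weights are illustrative only.

```python
def quantize_channel(weights, bits=8):
    """Symmetric round-to-nearest quantization of one output channel."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the signed range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_channel(q, scale):
    """Rebuild floating-point weights from integer levels and the scale."""
    return [v * scale for v in q]


row = [0.12, -0.5, 0.031, 0.27]  # toy weights for a single channel
q, scale = quantize_channel(row)
restored = dequantize_channel(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

Each channel stores integer levels plus one scale, and the rounding error per weight is bounded by half a quantization step.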

3.2 Quality-aware exclusions

The following layers are kept at full precision (they are listed in quantization_config.ignore of config.json), each for the reason given:

Attention projections (model.layers.*.attention.query_key_value, model.layers.*.attention.dense): attention is sensitive to weight precision, particularly for multilingual inputs and long contexts. Keeping it unquantized preserves positional coherence and cross-token relationships.
First decoder MLP layer (model.layers.0.mlp.*): early layers build foundational token representations reused throughout the network. Errors here compound across all subsequent layers.
Shared experts (model.layers.*.mlp.shared_experts.*): unlike routed experts, which see only a fraction of the tokens, shared experts process every token. Any precision loss here therefore affects every input.
lm_head: kept at full precision to preserve the output token distribution quality.

Routed MoE experts (the bulk of the parameters) are quantized to 8-bit, which is where most of the memory savings come from.

3.3 Calibration data

A calibration set of 2048 samples was drawn from the sarvamai/indivibe dataset, split equally across 4 task subsets: chat, code, math, and stem (512 samples each, shuffled with seed 42).

Languages: the calibration covers the languages present in indivibe (primarily English and Indian languages).
Calibration sequence length: 2048 tokens.
Random seed: 42.
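
The equal split described above can be sketched as follows. This is not the authors' pipeline: the subset sizes are placeholders, the helper `draw_calibration_indices` is hypothetical, and the exact order of sampling and shuffling in the real run is not stated in this card. Only the subset names, the total of 2048, and the seed come from the text.

```python
import random

SUBSETS = ["chat", "code", "math", "stem"]
TOTAL_SAMPLES = 2048
SEED = 42


def draw_calibration_indices(subset_sizes, total=TOTAL_SAMPLES, seed=SEED):
    """Pick an equal number of row indices per subset, then shuffle."""
    per_subset = total // len(subset_sizes)  # 512 when there are 4 subsets
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    picked = []
    for name, size in subset_sizes.items():
        for idx in rng.sample(range(size), per_subset):
            picked.append((name, idx))
    rng.shuffle(picked)  # interleave the tasks
    return picked


# Placeholder sizes: the real subset sizes are not given in this card.
sizes = {name: 10_000 for name in SUBSETS}
calibration = draw_calibration_indices(sizes)
print(len(calibration), calibration[:3])
```

Using a dedicated `random.Random(seed)` instance means the same seed always yields the same selection, independent of other random calls in the program.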

3.4 Recipe summary

The exact llmcompressor recipe is shipped as recipe.yaml. Key parameters:

quantization:
  method: GPTQ
  scheme: W8A16
  variant: quality-aware
  weights:
    bit_width: 8
    strategy: channel
    symmetric: true
    actorder: static
    dampening_frac: 0.05
  ignore:
    - lm_head
    - "re:model\\.layers\\..*\\.attention\\..*"
    - "re:model\\.layers\\.0\\.mlp\\..*"
    - "re:model\\.layers\\..*\\.mlp\\.shared_experts\\..*"
  calibration:
    num_samples: 2048
    max_sequence_length: 2048
    seed: 42
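
The ignore list above can be checked against module names with a few lines of Python. This sketch assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries are exact names. The module names in the example, in particular the routed-expert path, are hypothetical and may not match the actual Sarvam layer naming.

```python
import re

# Ignore patterns from the recipe summary (YAML escapes removed).
IGNORE = [
    "lm_head",
    r"re:model\.layers\..*\.attention\..*",
    r"re:model\.layers\.0\.mlp\..*",
    r"re:model\.layers\..*\.mlp\.shared_experts\..*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would stay at full precision."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: match from the beginning of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, used only to exercise the patterns.
examples = [
    "lm_head",
    "model.layers.5.attention.query_key_value",
    "model.layers.0.mlp.gate_proj",
    "model.layers.5.mlp.shared_experts.up_proj",
    "model.layers.5.mlp.experts.3.gate_proj",
]
for name in examples:
    print(name, "->", "kept" if is_ignored(name) else "quantized")
```

Note that the layer-0 pattern requires the literal segment `.0.`, so a name under `model.layers.10.mlp` is not excluded by it.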

3.5 Hardware used for quantization

Quantization was performed on a single NVIDIA A100 (80 GB) GPU.

4. Trade-offs

This W8A16 quality-aware variant has a larger memory footprint than lower-bit quantization schemes; in exchange, it aims for higher quality on reasoning, multilingual, and structured generation tasks. It is designed to fit on a single 80 GB GPU while maximizing output quality.
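
The footprint trade-off can be illustrated with back-of-the-envelope arithmetic. Every input below is an assumption, not a measurement: the parameter count is read off the "30b" in the model name, the share of quantized parameters (0.9) is a placeholder, unquantized layers are assumed to be stored in 16 bits, and 80 GB is treated as 80 GiB. Scales, packing overhead, and the KV cache are ignored.

```python
GIB = 1024 ** 3


def weight_bytes(total_params, quantized_fraction, quant_bits=8, full_bits=16):
    """Rough weight storage, ignoring scales and packing overhead."""
    quantized = total_params * quantized_fraction * quant_bits / 8
    full = total_params * (1 - quantized_fraction) * full_bits / 8
    return quantized + full


def vllm_budget_bytes(gpu_bytes, utilization=0.85):
    """Memory vLLM may use given gpu-memory-utilization."""
    return gpu_bytes * utilization


TOTAL_PARAMS = 30e9        # assumption: taken from the "30b" in the name
QUANTIZED_FRACTION = 0.9   # placeholder: the real share is not stated

w8 = weight_bytes(TOTAL_PARAMS, QUANTIZED_FRACTION, quant_bits=8)
w4 = weight_bytes(TOTAL_PARAMS, QUANTIZED_FRACTION, quant_bits=4)
budget = vllm_budget_bytes(80 * GIB)

print(f"8-bit estimate: {w8 / GIB:.1f} GiB")
print(f"4-bit estimate: {w4 / GIB:.1f} GiB")
print(f"vLLM budget at 0.85: {budget / GIB:.1f} GiB")
```

Whatever remains of the budget after the weights are loaded is what vLLM can use for the KV cache, so a larger weight footprint leaves less room for long contexts or concurrent requests.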

5. Bias, risks and limitations

The model inherits all limitations of the base model sarvamai/sarvam-30b, including possible biases, hallucinations, factual errors, and uneven performance across prompts and languages.

Quantization may additionally introduce:

task-dependent quality degradation,
sensitivity to the inference backend and runtime configuration,
behavioral differences compared with the full-precision base model.

The quality-aware exclusions above are intended to mitigate these risks on the components that are most sensitive to weight precision.

6. License

This model is released under the Apache 2.0 license, identical to the license of the base model sarvamai/sarvam-30b.

Model Card Contact

Thanks to Binetou Cécile Niang (binetoucecile.niang@orange.com, ng.binetou@outlook.com), Abdoulaye Mbaye (abdoulaye.mbaye@orange.com, mbaye.laye14018@gmail.com), Floriane Behanzin, Lionel Delphin-Poulat, Nour Rammal and Rose Djagbre for adding this model.
