You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Orange/Sarvam-30b-GPTQ-w8a16

8-bit GPTQ-quantized version of sarvamai/sarvam-30b, produced with llmcompressor using a W8A16 quality-aware quantization scheme and exported as compressed-tensors (pack-quantized) for native loading by vLLM.

This model is submitted to the Resilient AI Challenge as a compressed variant of Sarvam-30b suitable for single-GPU deployment via vllm serve.


1. Submission package

This repository is a self-contained model card. Files included:

File Role
model-0000X-of-00009.safetensors + model.safetensors.index.json Quantized weights (compressed-tensors)
config.json Model + quantization_config (auto-detected by vLLM)
vllm_config.yaml Serving configuration consumed by vllm serve --config
recipe.yaml llmcompressor recipe used during quantization (traceability)
tokenizer.json, tokenizer_config.json, special_tokens_map.json, chat_template.jinja Tokenizer + chat template
generation_config.json Generation defaults
modeling_sarvam_moe.py, configuration_sarvam_moe.py Custom modeling code (requires trust_remote_code=True)

2. Serving with vLLM

The model is intended to be served with vLLM >= 0.17.1 using the configuration file shipped at the root of the repository:

vllm serve --config vllm_config.yaml

vllm_config.yaml sets only the parameters required by the challenge spec:

trust-remote-code: true
max-model-len: 65536
gpu-memory-utilization: 0.85
  • trust-remote-code is required because the model ships custom modeling code referenced through auto_map in config.json.
  • max-model-len: 65536 matches the maximum context length specified in the Sarvam-30b evaluation footnote.
  • gpu-memory-utilization: 0.85 follows the challenge default.
  • The quantization format (compressed-tensors / pack-quantized, W8A16) is auto-detected from quantization_config in config.json.
  • Tensor parallelism and dtype are intentionally left at vLLM defaults so the configuration adapts to the evaluation hardware.

vllm_config content

# vLLM serve configuration for Orange/Sarvam-30b-GPTQ-w8a16
# Used by: vllm serve --config vllm_config.yaml
# Reference: https://docs.vllm.ai/en/latest/configuration/serve_args/

trust-remote-code: true

max-model-len: 65536

gpu-memory-utilization: 0.85
  

Sampling parameters

Sampling parameters (temperature, top_p, top_k, max_new_tokens) are passed by the evaluation harness on a per-request basis and therefore are not set in vllm_config.yaml. The parameters expected by the challenge are reproduced below for reference:

Benchmark family Parameters
General max_context_length = 65536
Reasoning & Math temperature = 1.0, top_p = 1.0, max_new_tokens = 65536
Code & Knowledge temperature = 1.0, top_p = 1.0, max_new_tokens = 65536
Writing-Bench (generation) temperature = 0.7, top_p = 0.8, top_k = 20, max_length = 16000
Writing-Bench (scoring) temperature = 1.0, top_p = 0.95, max_length = 2048

3. Compression methodology

3.1 Quantization method

  • Algorithm: GPTQ (second-order weight optimization) via llmcompressor.
  • Scheme: W8A16 -- 8-bit weights, 16-bit activations.
  • Strategy: Quality-aware -- quantization is selectively applied so that the most quality-sensitive components remain at full precision.
  • Export format: compressed-tensors (pack-quantized), natively supported by vLLM.

3.2 Quality-aware exclusions

The following layers are kept at full precision (listed in quantization_config.ignore of config.json). The rationale for each exclusion:

  • Attention projections (model.layers.*.attention.query_key_value, model.layers.*.attention.dense): attention is sensitive to weight precision, particularly for multilingual inputs and long contexts. Keeping it unquantized preserves positional coherence and cross-token relationships.
  • First decoder MLP layer (model.layers.0.mlp.*): early layers build foundational token representations reused throughout the network. Errors here compound across all subsequent layers.
  • Shared experts (model.layers.*.mlp.shared_experts.*): unlike routed experts which only see a fraction of the tokens, shared experts process every token. Any precision loss here affects all inputs uniformly.
  • lm_head: kept at full precision to preserve the output token distribution quality.

Routed MoE experts (the bulk of the parameters) are quantized to 8-bit, which is where most of the memory savings come from.

3.3 Calibration data

A calibration set of 2048 samples was drawn from the sarvamai/indivibe dataset, split equally across 4 task subsets: chat, code, math, and stem (512 samples each, shuffled with seed 42).

  • Languages: the calibration covers the languages present in indivibe (primarily English and Indian languages).
  • Calibration sequence length: 2048 tokens.
  • Random seed: 42.

3.4 Recipe summary

The exact llmcompressor recipe is shipped as recipe.yaml. Key parameters:

quantization:
  method: GPTQ
  scheme: W8A16
  variant: quality-aware
  weights:
    bit_width: 8
    strategy: channel
    symmetric: true
    actorder: static
    dampening_frac: 0.05
  ignore:
    - lm_head
    - "re:model\\.layers\\..*\\.attention\\..*"
    - "re:model\\.layers\\.0\\.mlp\\..*"
    - "re:model\\.layers\\..*\\.mlp\\.shared_experts\\..*"
  calibration:
    num_samples: 2048
    max_sequence_length: 2048
    seed: 42

3.5 Hardware used for quantization

Quantization was performed on a single NVIDIA A100 (80 GB) GPU.


4. Trade-offs

This W8A16 quality-aware variant uses a larger memory footprint than lower-bit quantization schemes, which is offset by higher quality on reasoning, multilingual, and structured generation tasks. It is designed to fit on a single 80 GB GPU while maximizing output quality.


5. Bias, risks and limitations

The model inherits all limitations of the base model sarvamai/sarvam-30b, including possible biases, hallucinations, factual errors, and uneven performance across prompts and languages.

Quantization may additionally introduce:

  • task-dependent quality degradation,
  • sensitivity to the inference backend and runtime configuration,
  • behavioral differences compared with the full-precision base model.

The quality-aware exclusions above are intended to mitigate these risks on the components that are most sensitive to weight precision.


6. License

This model is released under the Apache 2.0 license, identical to the license of the base model sarvamai/sarvam-30b.

Model Card Contact

Thanks to Binetou C茅cile Niang (binetoucecile.niang@orange.com, ng.binetou@outlook.com), Abdoulaye Mbaye (abdoulaye.mbaye@orange.com, mbaye.laye14018@gmail.com), Floriane Behanzin, Lionel Delphin-Poulat, Nour Rammal and Rose Djagbre for adding this model.

Downloads last month
197
Safetensors
Model size
10B params
Tensor type
F32
I64
I32
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for Orange/Sarvam-30b-GPTQ-w8a16

Quantized
(20)
this model