Instructions for using Girikannan/sarvam-30b-compressed-model with libraries and local apps.

Libraries

How to use Girikannan/sarvam-30b-compressed-model with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Girikannan/sarvam-30b-compressed-model", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Girikannan/sarvam-30b-compressed-model", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Girikannan/sarvam-30b-compressed-model with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Girikannan/sarvam-30b-compressed-model"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Girikannan/sarvam-30b-compressed-model",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
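
The same OpenAI-compatible request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and the default `base_url` assumes the server started above is listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "Girikannan/sarvam-30b-compressed-model"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first reply; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` should return the model's reply once the server is up. Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead.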

Use Docker

docker model run hf.co/Girikannan/sarvam-30b-compressed-model

SGLang

How to use Girikannan/sarvam-30b-compressed-model with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Girikannan/sarvam-30b-compressed-model" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Girikannan/sarvam-30b-compressed-model",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Girikannan/sarvam-30b-compressed-model" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Girikannan/sarvam-30b-compressed-model",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Girikannan/sarvam-30b-compressed-model with Docker Model Runner:
```
docker model run hf.co/Girikannan/sarvam-30b-compressed-model
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Methodology

This model is a fine-tuned version of the Sarvam-30B baseline, developed for the Resilient AI Challenge.

The compression strategy uses Post-Training Quantization (PTQ), with weights stored in the compressed-tensors format at W4A16 precision (4-bit weights, 16-bit activations).
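
To make the W4A16 idea concrete, here is a minimal sketch of 4-bit weight quantization using plain symmetric round-to-nearest on one group of weights. This is an illustration only: the card does not state which PTQ algorithm, group size, or calibration data was used, so the function names and the scaling rule below are assumptions, not the method behind this checkpoint. Activations stay in 16-bit and are not touched.

```python
def quantize_group_int4(weights):
    """Symmetric round-to-nearest quantization of one weight group to int4.

    Returns the integer codes (each in [-8, 7]) and the shared scale.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the group to the code 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from int4 codes and the group scale."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is at most half the group scale, which is why smaller groups (each with its own scale) usually lose less accuracy at the cost of storing more scales.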

The primary objective was to maximize energy efficiency while retaining at least 80% of the baseline Sarvam-30B performance.
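
The 80% target can be expressed as a simple retention ratio. The sketch below assumes a single scalar benchmark score per model where higher is better; the function names are hypothetical and the card does not specify how the challenge aggregates scores.

```python
def retention(compressed_score, baseline_score):
    """Fraction of the baseline score kept by the compressed model."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return compressed_score / baseline_score


def meets_target(compressed_score, baseline_score, threshold=0.80):
    """True if the compressed model keeps at least `threshold` of the baseline."""
    return retention(compressed_score, baseline_score) >= threshold
```

For example, with placeholder scores of 79 for the compressed model and 100 for the baseline, `meets_target(79, 100)` is `False` because retention is 0.79.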

Model Details

Base Model: sarvam-30b
Compression Precision: W4A16
License: Apache 2.0
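
As a rough illustration of what W4A16 means for weight storage, the arithmetic below compares 16-bit and 4-bit weights. It assumes roughly 30 billion parameters (inferred from the base model's name only) and that every weight is quantized; it ignores quantization scales and any layers left in 16-bit, so the real checkpoint size will differ.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


ASSUMED_PARAMS = 30e9  # assumption taken from the "30B" in the model name

fp16_gb = weight_footprint_gb(ASSUMED_PARAMS, 16)  # 16-bit baseline weights
w4_gb = weight_footprint_gb(ASSUMED_PARAMS, 4)     # 4-bit quantized weights
```

Under these assumptions the weights shrink by a factor of four, from about 60 GB to about 15 GB.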

Inference Configuration

The model is optimized to run using the vLLM inference engine.

`vllm_config.yaml`

model: ./models/sarvam-30b-compressed-w4a16
quantization: compressed-tensors
kv-cache-dtype: auto
max-model-len: 8192
trust-remote-code: true
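
Each key in this file corresponds to a `vllm serve` command-line option of the same name. The sketch below shows that mapping for a flat config like the one above. It is a simplified illustration, not vLLM's own loader: the parser handles only unnested `key: value` lines, and treating `model` as the positional argument and `true` values as bare flags reflects an assumption about how vLLM reads its config file.

```python
def parse_flat_config(text):
    """Parse flat `key: value` lines (no nesting, lists, or quoting)."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def to_cli_args(config):
    """Build the equivalent `vllm serve` argument list from a flat config."""
    args = ["vllm", "serve"]
    if config.get("model"):
        args.append(config["model"])  # the model is passed positionally
    for key, value in config.items():
        if key == "model":
            continue
        if value.lower() == "true":
            args.append(f"--{key}")  # boolean switch, no value
        elif value.lower() != "false":
            args += [f"--{key}", value]
    return args
```

Applied to the config above, `to_cli_args` yields a command that starts with `vllm serve ./models/sarvam-30b-compressed-w4a16 --quantization compressed-tensors` and ends with `--trust-remote-code`.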

Evaluation Metrics

The model has been evaluated against the challenge benchmarks:

Technical Reasoning: Advanced Science and Mathematics problem-solving.
Domain-Specific Expertise: Medical knowledge synthesis.
Linguistic Creativity: Narrative generation in English and Indian languages.
Analytical Logic: Complex logical reasoning and deductive tasks.
Energy Monitoring: Power consumption was tracked using the NVIDIA Management Library (NVML) for GPU draw and TDP-relative estimation for CPU load.
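
The energy accounting described in the last item can be sketched as follows. This is not the challenge's measurement harness: the function names, the trapezoidal integration, and the `TDP x utilization x time` CPU formula are assumptions chosen to illustrate the idea. The GPU sampler uses the `pynvml` bindings, whose `nvmlDeviceGetPowerUsage` reports milliwatts; it requires an NVIDIA GPU and is not needed for the arithmetic helpers.

```python
def energy_joules(samples):
    """Integrate (timestamp_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def cpu_energy_estimate_joules(tdp_watts, utilization, duration_s):
    """TDP-relative estimate: assume power scales linearly with utilization."""
    return tdp_watts * utilization * duration_s


def sample_gpu_power_watts(device_index=0):
    """Read the current GPU power draw via NVML (needs pynvml and a GPU)."""
    import pynvml  # imported lazily so the helpers above work without it

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    finally:
        pynvml.nvmlShutdown()
```

In practice one would call `sample_gpu_power_watts` at a fixed interval during inference, record each reading with its timestamp, and pass the list to `energy_joules`.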

Usage

To serve this model for evaluation, use the following command:

vllm serve --config vllm_config.yaml


Safetensors

Model size: 6B params
Tensor type: I64, I32, F16

Model tree for Girikannan/sarvam-30b-compressed-model

Base model: sarvamai/sarvam-30b
Finetuned (10): this model