Commit 4c108e3

Duplicate from sarvamai/sarvam-105b

Co-authored-by: Rahul <rahular@users.noreply.huggingface.co>

This view is limited to 50 files because it contains too many changes.
- .gitattributes +36 -0
- README.md +298 -0
- chat_template.jinja +97 -0
- config.json +56 -0
- configuration_sarvam_moe.py +140 -0
- generation_config.json +6 -0
- hotpatch_vllm.py +114 -0
- model-00001-of-00085.safetensors +3 -0
- model-00002-of-00085.safetensors +3 -0
- model-00003-of-00085.safetensors +3 -0
- model-00004-of-00085.safetensors +3 -0
- model-00005-of-00085.safetensors +3 -0
- model-00006-of-00085.safetensors +3 -0
- model-00007-of-00085.safetensors +3 -0
- model-00008-of-00085.safetensors +3 -0
- model-00009-of-00085.safetensors +3 -0
- model-00010-of-00085.safetensors +3 -0
- model-00011-of-00085.safetensors +3 -0
- model-00012-of-00085.safetensors +3 -0
- model-00013-of-00085.safetensors +3 -0
- model-00014-of-00085.safetensors +3 -0
- model-00015-of-00085.safetensors +3 -0
- model-00016-of-00085.safetensors +3 -0
- model-00017-of-00085.safetensors +3 -0
- model-00018-of-00085.safetensors +3 -0
- model-00019-of-00085.safetensors +3 -0
- model-00020-of-00085.safetensors +3 -0
- model-00021-of-00085.safetensors +3 -0
- model-00022-of-00085.safetensors +3 -0
- model-00023-of-00085.safetensors +3 -0
- model-00024-of-00085.safetensors +3 -0
- model-00025-of-00085.safetensors +3 -0
- model-00026-of-00085.safetensors +3 -0
- model-00027-of-00085.safetensors +3 -0
- model-00028-of-00085.safetensors +3 -0
- model-00029-of-00085.safetensors +3 -0
- model-00030-of-00085.safetensors +3 -0
- model-00031-of-00085.safetensors +3 -0
- model-00032-of-00085.safetensors +3 -0
- model-00033-of-00085.safetensors +3 -0
- model-00034-of-00085.safetensors +3 -0
- model-00035-of-00085.safetensors +3 -0
- model-00036-of-00085.safetensors +3 -0
- model-00037-of-00085.safetensors +3 -0
- model-00038-of-00085.safetensors +3 -0
- model-00039-of-00085.safetensors +3 -0
- model-00040-of-00085.safetensors +3 -0
- model-00041-of-00085.safetensors +3 -0
- model-00042-of-00085.safetensors +3 -0
- model-00043-of-00085.safetensors +3 -0
.gitattributes
ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,298 @@
---
language:
- en
- hi
- bn
- ta
- te
- mr
- gu
- kn
- ml
- pa
- or
- as
- ur
- sa
- ne
- sd
- kok
- mai
- doi
- mni
- sat
- ks
- bo
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---



Want a smaller model? Download [Sarvam-30B](https://huggingface.co/sarvamai/sarvam-30b/)!

## Index

1. [Introduction](#introduction)
2. [Architecture](#architecture)
3. [Benchmarks](#benchmarks)
   - Knowledge & Coding
   - Reasoning & Math
   - Agentic
4. [Inference](#inference)
   - Hugging Face
   - [vLLM](https://github.com/vllm-project/vllm)
   - [SGLang](https://github.com/sgl-project/sglang)
5. [Footnote](#footnote)
6. [Citation](#citation)

## Introduction

**Sarvam-105B** is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.

Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.

A major focus during training was the Indian context and languages, resulting in **state-of-the-art performance across 22 Indian languages** for its model size.

Sarvam-105B is open-sourced under the **Apache License**. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b).

## Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions: `q_head_dim=192`, split into RoPE and NoPE components, with `v_head_dim=128` and an MLA cache head size (`head_dim`) of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context window). A dense `intermediate_size` of 16384 and a `moe_intermediate_size` of 2048, combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model has one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
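
The arithmetic behind these numbers can be checked directly against `config.json` in this repository (a rough sketch; the per-token FFN width is an informal back-of-the-envelope figure, not an official parameter breakdown):

```python
# Rough sketch: derive the attention head dimensions and per-token MoE FFN width
# from the values published in config.json (informal arithmetic only).
qk_nope_head_dim = 128
qk_rope_head_dim = 64
kv_lora_rank = 512
v_head_dim = 128

q_head_dim = qk_nope_head_dim + qk_rope_head_dim      # 192: decoupled query head dim
mla_cache_head_dim = kv_lora_rank + qk_rope_head_dim  # 576: the "head_dim" MLA kernels see

moe_intermediate_size = 2048
num_experts_per_tok = 8
num_shared_experts = 1

# FFN width activated per token: routed experts plus the shared expert,
# compared with the dense intermediate_size used by the dense layer(s).
active_ffn_width = (num_experts_per_tok + num_shared_experts) * moe_intermediate_size  # 18432
dense_intermediate_size = 16384

print(q_head_dim, mla_cache_head_dim, active_ffn_width, dense_intermediate_size)
```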

## Benchmarks

<details>
<summary>Knowledge & Coding</summary>

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |

</details>

<details>
<summary>Reasoning & Math</summary>

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |

</details>

<details>
<summary>Agentic</summary>

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |

> See the footnote for evaluation details.

</details>

## Inference

<details>
<summary>Hugging Face</summary>

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")

def generate_text(
    prompt: str,
    max_new_tokens: int = 2048,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.0,
) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
    )

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

prompts = [
    "Which country won the FIFA World Cup in 2012?",
]

for prompt in prompts:
    templated_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    output = generate_text(templated_prompt, max_new_tokens=512)
    print("Prompt: ", prompt)
    print("Generated text: ", output)
    print("=" * 100)
```
</details>

<details>
<summary>SGLang</summary>

**Install the latest SGLang from source**

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

**Instantiate the model and run**

```python
import sglang as sgl
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
engine = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    mem_fraction_static=0.70,
    trust_remote_code=True,
    dtype="bfloat16",
    moe_runner_backend="flashinfer_cutedsl",
    prefill_attention_backend="fa3",
    decode_attention_backend="flashmla",
    disable_radix_cache=False,
)

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 2048,
    "repetition_penalty": 1.0,
}

prompts = [
    "Which band released the album Dark Side of the Moon in 1973?",
]

outputs = engine.generate(
    [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        for prompt in prompts
    ],
    sampling_params,
)
for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o['text'])
    print("=" * 100)
```
</details>

<details>
<summary>vLLM</summary>

Note: a PR adding native support for the Sarvam models to vLLM is currently open ([link](https://github.com/vllm-project/vllm/pull/33942)). Until it is merged, there are two options.

#### Option 1: install from source (hard)

* Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm)
* Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

#### Option 2: hot-patch (easy)

* Run [hotpatch_vllm.py](./hotpatch_vllm.py) (see the sketch after this list for a programmatic invocation)
* This will do the following:
  * install `vllm==0.15.0`
  * add two Sarvam model entries to vLLM's `registry.py`
  * download `sarvam.py`, the model executor used by both `sarvam-105b` and `sarvam-30b`
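
The patch can also be applied from inside Python (a minimal sketch, assuming `hotpatch_vllm.py` from this repo sits in the current working directory; it is equivalent to running `python hotpatch_vllm.py` from a shell):

```python
# Minimal sketch: apply the vLLM hot-patch programmatically.
# Assumes hotpatch_vllm.py (from this repo) has been downloaded to the current directory.
import subprocess
import sys

subprocess.check_call([sys.executable, "hotpatch_vllm.py"])
```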

Once this is done, you can run vLLM as usual:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    max_model_len=2048,
    tensor_parallel_size=8,
    max_num_seqs=16,
)
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=2048,
    repetition_penalty=1.0,
    spaces_between_special_tokens=True,
)

prompts = [
    "Which artist painted The Persistence of Memory (the melting clocks)?",
]

outputs = llm.generate(
    [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        for prompt in prompts
    ],
    sampling_params,
)
for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o.outputs[0].text)
    print("=" * 100)
```
</details>

## Footnote

* **General settings**: All benchmarks are evaluated with a maximum context length of 65,536 tokens.
* **Reasoning & Math benchmarks** (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
* **Coding & Knowledge benchmarks** (Live Code Bench v6, Arena Hard v2, IF Eval): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
* **Writing Bench**: responses generated using the official Writing-Bench parameters `temperature=0.7, top_p=0.8, top_k=20, max_length=16000`; scoring performed with the official Writing-Bench critic model using `temperature=1.0, top_p=0.95, max_length=2048`.
* **Agentic benchmarks** (BrowseComp, SWE Bench Verified, τ² Bench): evaluated with `temperature=0.5, top_p=1.0, max_new_tokens=32768`.

## Citation

```
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```
chat_template.jinja
ADDED
@@ -0,0 +1,97 @@
{{- '[@BOS@]\n' }}
{%- if tools -%}
<|start_of_turn|><|tool_declare|>
<tools>
{% for tool in tools %}
{{ tool | tojson(ensure_ascii=False) }}
{% endfor %}
</tools>
{{- '<|end_of_turn|>\n' }}{%- endif -%}
{%- macro visible_text(content) -%}
{%- if content is string -%}
{{- content }}
{%- elif content is iterable and content is not mapping -%}
{%- for item in content -%}
{%- if item is mapping and item.type == 'text' -%}
{{- item.text }}
{%- elif item is string -%}
{{- item }}
{%- endif -%}
{%- endfor -%}
{%- elif content is none -%}
{{- '' }}
{%- else -%}
{{- content }}
{%- endif -%}
{%- endmacro -%}
{%- set ns = namespace(last_user_index=-1) %}
{%- for m in messages %}
{%- if m.role == 'user' %}
{% set ns.last_user_index = loop.index0 -%}
{%- endif %}
{%- endfor %}
{% for m in messages %}
{%- if m.role == 'user' -%}<|start_of_turn|><|user|>
{{ visible_text(m.content) }}
{{- '<|nothink|>' if (enable_thinking is defined and not enable_thinking and not visible_text(m.content).endswith("<|nothink|>")) else '' -}}
{{- '<|end_of_turn|>\n' }}
{%- elif m.role == 'assistant' -%}
{{- '<|start_of_turn|><|assistant|>\n' }}
{%- set reasoning_content = '' %}
{%- set content = visible_text(m.content) %}
{%- if m.reasoning_content is string %}
{%- set reasoning_content = m.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_user_index and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() + '</think>'}}
{%- else -%}
{{ '<think></think>' }}
{%- endif -%}
{%- if content.strip() -%}
{{ '\n' + content.strip() }}
{%- endif -%}
{% if m.tool_calls %}
{% for tc in m.tool_calls %}
{%- if tc.function %}
{%- set tc = tc.function %}
{%- endif %}
{{ '\n<tool_call>' + tc.name }}
{% set _args = tc.arguments %}
{% for k, v in _args.items() %}
<arg_key>{{ k }}</arg_key>
<arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>
{% endfor %}
</tool_call>{% endfor %}
{% endif %}
{{- '<|end_of_turn|>\n' }}
{%- elif m.role == 'tool' -%}
{%- if m.content is string -%}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|start_of_turn|><|observation|>' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- m.content }}
{{- '\n</tool_response>' }}
{%- else -%}
<|start_of_turn|><|observation|>{% for tr in m.content %}

<tool_response>
{{ tr.output if tr.output is defined else tr }}
</tool_response>{% endfor -%}
{% endif -%}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|end_of_turn|>\n' }}{%- endif -%}
{%- elif m.role == 'system' -%}
<|start_of_turn|><|system|>
{{ visible_text(m.content) }}
{{- '<|end_of_turn|>\n' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_turn|><|assistant|>\n' }}
{%- endif -%}
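
As a quick illustration of how this template behaves (a minimal sketch, assuming the tokenizer in this repo picks up `chat_template.jinja` and that your `transformers` version forwards extra keyword arguments such as `enable_thinking` to the template, as in the README examples above): with `enable_thinking=False` the template appends `<|nothink|>` to the last user turn, signalling the model to skip its `<think>...</think>` block.

```python
# Minimal sketch: render the chat template with and without thinking enabled.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-105b")
messages = [{"role": "user", "content": "Name three rivers in India."}]

thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
no_thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(thinking_prompt)     # user turn closed by <|end_of_turn|>, assistant turn opened
print(no_thinking_prompt)  # same, but with <|nothink|> appended to the user turn
```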
config.json
ADDED
@@ -0,0 +1,56 @@
{
  "architectures": [
    "SarvamMLAForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attn_implementation": null,
  "auto_map": {
    "AutoConfig": "configuration_sarvam_moe.SarvamMLAConfig",
    "AutoModel": "modeling_sarvam_moe.SarvamMLAModel",
    "AutoModelForCausalLM": "modeling_sarvam_moe.SarvamMLAForCausalLM"
  },
  "default_theta": 10000.0,
  "dtype": "float32",
  "embedding_dropout": 0.0,
  "eos_token_id": 1,
  "first_k_dense_replace": 1,
  "head_dim": 576,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.006,
  "intermediate_size": 16384,
  "kv_lora_rank": 512,
  "max_position_embeddings": 131072,
  "model_type": "sarvam_mla",
  "moe_intermediate_size": 2048,
  "moe_router_enable_expert_bias": true,
  "num_attention_heads": 64,
  "num_experts": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 32,
  "num_shared_experts": 1,
  "output_dropout": 0.0,
  "output_router_logits": false,
  "pad_token_id": 0,
  "q_head_dim": 192,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "deepseek_yarn"
  },
  "rope_theta": 10000.0,
  "routed_scaling_factor": 2.5,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.2",
  "use_cache": true,
  "use_qk_norm": true,
  "v_head_dim": 128,
  "vocab_size": 262144
}
configuration_sarvam_moe.py
ADDED
@@ -0,0 +1,140 @@
from transformers.configuration_utils import PretrainedConfig


class SarvamMLAConfig(PretrainedConfig):
    model_type = "sarvam_mla"

    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.kv_b_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
    }

    def __init__(
        self,
        vocab_size: int = 262144,
        hidden_size: int = 4096,
        num_hidden_layers: int = 32,
        intermediate_size: int = 16384,
        moe_intermediate_size: int = 2048,
        num_experts: int = 128,
        num_experts_per_tok: int = 8,
        num_shared_experts: int = 1,
        first_k_dense_replace: int = 1,
        num_attention_heads: int = 64,
        qk_rope_head_dim: int = 64,
        qk_nope_head_dim: int = 128,
        kv_lora_rank: int = 512,
        v_head_dim: int = 128,
        max_position_embeddings: int = 4096,
        rope_theta: float = 10000.0,
        rope_scaling: dict = None,
        attention_dropout: float = 0.0,
        output_dropout: float = 0.0,
        rms_norm_eps: float = 1e-6,
        hidden_act: str = "silu",
        use_cache: bool = True,
        use_qk_norm: bool = True,
        moe_router_enable_expert_bias: bool = True,
        routed_scaling_factor: float = 2.5,
        output_router_logits: bool = False,
        tie_word_embeddings: bool = False,
        pad_token_id: int = 0,
        eos_token_id: int = 1,
        embedding_dropout: float = 0.0,
        initializer_range: float = 0.006,
        attn_implementation: str = "eager",
        **kwargs,
    ):
        # core geometry
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.intermediate_size = intermediate_size
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings

        # MLA geometry
        self.qk_rope_head_dim = qk_rope_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        self.kv_lora_rank = kv_lora_rank
        self.v_head_dim = v_head_dim
        # convenient derived dim
        self.q_head_dim = qk_rope_head_dim + qk_nope_head_dim
        # vLLM MLA expects "head size" = Lkv + R, not hidden_size/num_heads.
        self.head_dim = int(self.kv_lora_rank + self.qk_rope_head_dim)

        # MoE
        self.moe_intermediate_size = moe_intermediate_size
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.num_shared_experts = num_shared_experts
        self.first_k_dense_replace = first_k_dense_replace

        # Router
        self.moe_router_enable_expert_bias = moe_router_enable_expert_bias
        self.routed_scaling_factor = routed_scaling_factor
        self.output_router_logits = output_router_logits

        # dropouts / norms / init
        self.attention_dropout = attention_dropout
        self.output_dropout = output_dropout
        self.embedding_dropout = embedding_dropout
        self.rms_norm_eps = rms_norm_eps
        self.initializer_range = initializer_range
        self.hidden_act = hidden_act

        # rope / cache
        self.rope_theta = rope_theta
        self.use_cache = use_cache
        self.use_qk_norm = use_qk_norm
        self.rope_scaling = rope_scaling
        self.default_theta = 10000.0

        if self.rope_scaling is None:
            self.rope_scaling = {
                'beta_fast': 32,
                'beta_slow': 1,
                'factor': 40,
                'mscale': 1.0,
                'mscale_all_dim': 1.0,
                'original_max_position_embeddings': 4096,
                'rope_type': 'deepseek_yarn',
            }

        self.attn_implementation = attn_implementation
        self._attn_implementation = attn_implementation

        if "_attn_implementation" in kwargs:
            self._attn_implementation = kwargs.pop("_attn_implementation")
            if hasattr(self, "attn_implementation"):
                self.attn_implementation = self._attn_implementation

        super().__init__(
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

    def convert_rope_params_to_dict(self, ignore_keys_at_rope_validation: set | None = None, **kwargs):
        rope_scaling = kwargs.pop("rope_scaling", None)
        self.rope_parameters = rope_scaling or self.rope_parameters
        self.rope_parameters = self.rope_parameters if self.rope_parameters is not None else {}

        # Standardize and validate the correctness of rotary position embeddings parameters
        self.rope_parameters.setdefault("rope_theta", kwargs.pop("rope_theta", self.default_theta))
        self.standardize_rope_params()
        self.validate_rope(ignore_keys=ignore_keys_at_rope_validation)

        # Convert to float because RoPE fns expect a float. Models on the hub were saved as int
        for key in ["beta_fast", "beta_slow", "factor"]:
            if key in self.rope_parameters:
                self.rope_parameters[key] = float(self.rope_parameters[key])
        return kwargs
generation_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "eos_token_id": 26,
  "pad_token_id": 0,
  "transformers_version": "4.57.2"
}
hotpatch_vllm.py
ADDED
@@ -0,0 +1,114 @@
#!/usr/bin/env python3
from __future__ import annotations

import sys
import subprocess
from pathlib import Path
from urllib.request import urlopen, Request


HF_BLOB_URL = "https://huggingface.co/sarvamai/sarvam-105b/blob/main/sarvam.py"

NEW_LINES = [
    '    "SarvamMoEForCausalLM": ("sarvam", "SarvamMoEForCausalLM"),\n',
    '    "SarvamMLAForCausalLM": ("sarvam", "SarvamMLAForCausalLM"),\n',
]


def run(cmd: list[str]) -> None:
    print(f"+ {' '.join(cmd)}")
    subprocess.check_call(cmd)


def pip_install_vllm() -> None:
    run([sys.executable, "-m", "pip", "install", "vllm==0.15.0"])


def find_vllm_dir() -> Path:
    import vllm  # type: ignore

    vllm_dir = Path(vllm.__file__).resolve().parent
    print(f"Detected vLLM package dir: {vllm_dir}")
    return vllm_dir


def patch_text_generation_models(registry_path: Path) -> None:
    if not registry_path.exists():
        raise FileNotFoundError(f"registry.py not found at: {registry_path}")

    text = registry_path.read_text(encoding="utf-8")
    lines = text.splitlines(keepends=True)

    # Idempotency: if both keys already present, do nothing
    if (
        any('"SarvamMoEForCausalLM"' in l for l in lines)
        and any('"SarvamMLAForCausalLM"' in l for l in lines)
    ):
        print("registry.py already contains Sarvam entries. Skipping patch.")
        return

    # Find the start of the _TEXT_GENERATION_MODELS dict
    start_idx = None
    for i, line in enumerate(lines):
        if line.strip() == "_TEXT_GENERATION_MODELS = {":
            start_idx = i
            break

    if start_idx is None:
        raise RuntimeError(
            "Could not find '_TEXT_GENERATION_MODELS = {' in registry.py. "
            "vLLM version/layout may differ."
        )

    # Find the matching closing brace for that dict using brace depth
    depth = 0
    end_idx = None
    for j in range(start_idx, len(lines)):
        depth += lines[j].count("{")
        depth -= lines[j].count("}")
        if j > start_idx and depth == 0:
            end_idx = j
            break

    if end_idx is None:
        raise RuntimeError("Failed to find end of _TEXT_GENERATION_MODELS dict (brace matching).")

    # Insert new entries just before the closing brace line
    insert_at = end_idx
    lines[insert_at:insert_at] = NEW_LINES

    registry_path.write_text("".join(lines), encoding="utf-8")
    print(f"Patched _TEXT_GENERATION_MODELS in: {registry_path}")


def download_sarvam_py(dst: Path) -> None:
    # Use /raw/ to download file contents, not HTML
    raw_url = HF_BLOB_URL.replace("/blob/", "/raw/")
    print(f"Downloading sarvam.py from: {raw_url}")

    req = Request(raw_url, headers={"User-Agent": "vllm-hotpatch-script"})
    with urlopen(req) as resp:
        data = resp.read()

    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(data)
    print(f"Wrote: {dst}")


def main() -> None:
    pip_install_vllm()

    vllm_dir = find_vllm_dir()
    registry_path = vllm_dir / "model_executor" / "models" / "registry.py"
    sarvam_path = vllm_dir / "model_executor" / "models" / "sarvam.py"

    patch_text_generation_models(registry_path)
    download_sarvam_py(sarvam_path)

    print("\nDone.")
    print(f"- Registry patched: {registry_path}")
    print(f"- Sarvam module installed: {sarvam_path}")


if __name__ == "__main__":
    main()
model-00001-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:020162a3f1413743aa1ed567a2f19061fe63d782529059649b76a050365e547f
size 4941941584

model-00002-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c13f5b15068b2dc76a517f1912e87448493a1e0e7955855e737bfe6931c4e7c7
size 4975543872

model-00003-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:796c018f7d7af6684b5e1e4fc437abc5db6d8208af06c44a01309edc2370880e
size 4999628720

model-00004-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:edae50ee1dfe5c03943d11ecd8285533da729d691d488b7ac47e37cd9ba85e12
size 4977643584

model-00005-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb778a4730409646e42dc066698b85dc64a13157b40cce4ba0de2c1ba2c1670
size 4999628712

model-00006-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fe67565f0b4dd30f46b50ec58b24e16f235e3ba79b3ef2a7a29b1dc75818196e
size 4999628736

model-00007-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:33ee20d2b2a99158fd90b4e156502b79d523bfcf98a2e06c5902b999472a5e90
size 4977643560

model-00008-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0346d9515c3327e35735b685a32177bb0607ff294c3bb080c5943580c663ae11
size 4999628720

model-00009-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3790af0b4aea97c4f443d378bff9ce8b7d79c58bbd96771f8fc98d709a28a21c
size 4999628784

model-00010-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d3e9e1f284dc332bfda0d709fb8fdd93447043ab3b57e3b798bade2775cfb2a
size 4977643504

model-00011-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f03a7da973536f075dfcbcc5afc51d49d393cad1538ade0c2422942b008efae1
size 4999628720

model-00012-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b2d6c8e512c5b81d7fd4d6ad18ea4aa2656bd2aae4ea5b4bf16293b8b432122
size 4977643592

model-00013-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:90e89d4af307285773680b4d663fca8b53f6d066ef27c99694331e6d7024c8bb
size 4999628704

model-00014-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c838e1af34b6e2c609521fe15e2cefd20ac9b8c161865e968bdc103f770caca8
size 4999628728

model-00015-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d19fa7b06f3b221983a671aa09241ac4abddac1bd0c3f9d236fcfa7a496de4b
size 4977643560

model-00016-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1c826916a5dc292db7d59fe56a700c02f1bf61e8d7bb473853c6ca69a4fec1a8
size 4999628720

model-00017-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:54a69b6eec29735a09cf9c790cae51ba4370d4aa1870ef2feca363dbe4eb0181
size 4999628776

model-00018-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5a2f5481e864e89772c9cb5349ad6c4768f9ea3ba94f7d136f53ebf63c27804
size 4977643512

model-00019-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d2c93059fcded634f04218a908a17bad1afb424289fb1f97e5e6dfa93ef3f120
size 4999628720

model-00020-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:603e97a59bb27ec736b5d896a127e45671c8b61c8b86a39e1f54db7e4d888ea9
size 4977643600

model-00021-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f4b207ef1ebcdccae3bbf28e9867f309edb2ce0c3800b280e885e8e266fcd3d4
size 4999628696

model-00022-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d0c3f20a3c72c0d908677c7ff737393c690f3d93b1f543fff21bbcc08bbab01
size 4999628728

model-00023-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d5e851f9a4e74ec7ce50d4a9752c84d66501c5adf6ad87aeb972f1d64f0d2f51
size 4977643568

model-00024-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dd69c64cf1ab98085376201fd63b3c66c45326b16ce45b2f66914f156622c4a0
size 4999628720

model-00025-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:10cafb2c42cfc260913bde7aaed3375274dd5bdb8a3cf68962468905d45915a6
size 4999628776

model-00026-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5487fcda6b3f8468d0f3c5e8898fad88b6bd67cb364afc6e683b757adec9616b
size 4977643640

model-00027-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cf70c35946bad6e6eb0d2a393dba31a7dcea835eb0873b4c905f486810105af7
size 4999628864

model-00028-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:05160d3e48222b667982d4033b29a0b5e0b1c6957276a644a7a82bcc52febcf2
size 4977643744

model-00029-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2dc78fb681b662ee382cfc2390cc1f483e28ef69e07b623adccfb056165ac93b
size 4999628840

model-00030-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e2cad103eec94b218d5b6590373dc1a59217c77458d92c6152b76241981f008a
size 4999628872

model-00031-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:30eea0422e12488ceae5fad99580984723939f2ee85ccbc25b6ac89f270dc51d
size 4977643720

model-00032-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:84850fa473bb9a91f6611838433bb24535ae001c1b77a3d291804eacc4acb639
size 4999628864

model-00033-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a630b2cd2fabdff9ceee3a6cfce65c5f93d8c99de4b12eccc968c37b6649791
size 4999628920

model-00034-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bff2b3792df9fa7999d9c53bd6d4f3591e0dc3ea0a81d22a8bac39eaf6b667bf
size 4977643672

model-00035-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f329c1cb6b407c4b764b5d5913f89918b55e700f9e790bc01681b052802e75f6
size 4999628864

model-00036-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f395cf50d44e672c07b0059a5775636ac2caf849132dbb16befc0b48bfeaf747
size 4977643752

model-00037-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2ec1e4fcfacbc683eaf6db0398212496e23644bcea25f4a436d2d61ff1e3afdc
size 4999628832

model-00038-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f7a3d772244e2d55ec9847767f495645d5116d4df24820c530a47d1fc8259ebd
size 4999628864

model-00039-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:55f257b34fb41b225672561bae1a41c9c7a3c00cff1619748b8fda8c4d41a27c
size 4977643720

model-00040-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f2722c9e30f9f87b364629700ec6ff31bf50284ce3abe58f02eb298612ec5907
size 4999628864

model-00041-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5d343984921f3e7e3839b48ae2f109733707f7accf6851c5e0ba6bd72884a91a
size 4999628912

model-00042-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ba68f16a0ce2a70d59f9ceb0220aed17a8674dc01c7ffb5a274f741b08d7294
size 4977643680

model-00043-of-00085.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2d2955a473ab2c6c3876bde5eabf3aa92192588ad7cca7eedc5adaad7715050d
size 4999628864