 print("thinking content:", thinking_content)
 print("content:", content)
+```
+## vLLM Deployment
+For deployment, you can use `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
+```shell
+vllm serve sarvamai/sarvam-M
+```
+For inference, and to switch between thinking and non-thinking modes, refer to the Python code below:
+```python
+from openai import OpenAI
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+models = client.models.list()
+model = models.data[0].id
+messages = [{"role": "user", "content": "How many times does the letter r appear in the word strawberry?"}]
+# By default, the model is in thinking mode.
+# If you want to disable thinking, add:
+# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
+response = client.chat.completions.create(model=model, messages=messages)
+output_text = response.choices[0].message.content
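+# str.rstrip("</s>") would strip any trailing '<', '/', 's' or '>' characters,
+# so use str.removesuffix (Python 3.9+) to drop only the end-of-sequence token.
+output_text = output_text.removesuffix("</s>")
+if "</think>" in output_text:
+    thinking_content = output_text.split("</think>")[0].rstrip("\n")
+    content = output_text.split("</think>")[-1].lstrip("\n")
+else:
+    thinking_content = ""
+    content = output_text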
+print("reasoning_content:", thinking_content)
+print("content:", content)
+# For the next round, add the assistant's response and reasoning to the messages.
+messages.append(
+    {"role": "assistant", "content": content, "reasoning_content": thinking_content}
+)
+```
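
The parsing and mode-switching steps in the snippet above can be factored into two small helpers. This is a minimal sketch, not part of the model's official tooling: the names `split_thinking` and `build_request` are hypothetical, and it assumes the raw output may end with a literal `</s>` token and separates reasoning from the answer with `</think>`, as in the example above.

```python
from typing import Any, Dict, List, Tuple

EOS_TOKEN = "</s>"        # end-of-sequence marker that may trail the raw output
THINK_END = "</think>"    # tag that closes the reasoning section


def split_thinking(output_text: str) -> Tuple[str, str]:
    """Split raw model output into (thinking_content, content)."""
    # Remove exactly one trailing EOS marker, if present. Slicing is used
    # instead of str.rstrip, which would strip a set of characters.
    if output_text.endswith(EOS_TOKEN):
        output_text = output_text[: -len(EOS_TOKEN)]
    if THINK_END in output_text:
        thinking, _, content = output_text.partition(THINK_END)
        return thinking.rstrip("\n"), content.lstrip("\n")
    # No reasoning section: everything is the final answer.
    return "", output_text


def build_request(
    model: str,
    messages: List[Dict[str, str]],
    enable_thinking: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs: Dict[str, Any] = {"model": model, "messages": messages}
    if not enable_thinking:
        # Thinking is on by default; this disables it via the chat template.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs
```

With a running server, a non-thinking request would then look like `client.chat.completions.create(**build_request(model, messages, enable_thinking=False))`, and the response text can be passed to `split_thinking` to recover the two parts.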