suhara committed · verified
Commit 8ec6619 · 1 Parent(s): a3ff62e

Update README.md

Files changed (1): README.md (+60, -0)

README.md CHANGED
@@ -14,6 +14,7 @@ tags:
 - pytorch
 ---

+
 # Llama-3.1-Nemotron-Nano-4B-v1.1


@@ -52,6 +53,8 @@ Developers designing AI Agent systems, chatbots, RAG systems, and other AI-power

 ## References

+ - [\[2408.11796\] LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796)
+ - [\[2502.00203\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)
 - [\[2505.00949\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)


@@ -177,6 +180,63 @@ pipeline = transformers.pipeline(
 print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))
 ```

+ ## Running a vLLM Server with Tool-call Support
+
+ Llama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. This HF repo hosts a tool-calling parser as well as a chat template in Jinja, both of which can be used to launch a vLLM server.
+ Here is an example command to launch a vLLM server with tool-call support.
+
+ ```console
+ $ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
+
+ $ conda create -n vllm python=3.12 -y
+ $ conda activate vllm
+ $ pip install vllm  # assumed step: vLLM is needed in the fresh environment (no version pinned here)
+
+ $ python -m vllm.entrypoints.openai.api_server \
+     --model Llama-3.1-Nemotron-Nano-4B-v1.1 \
+     --trust-remote-code \
+     --seed 1 \
+     --host "0.0.0.0" \
+     --port 5000 \
+     --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
+     --tensor-parallel-size 1 \
+     --max-model-len 131072 \
+     --gpu-memory-utilization 0.95 \
+     --enforce-eager \
+     --enable-auto-tool-choice \
+     --tool-parser-plugin "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
+     --tool-call-parser "llama_nemotron_json" \
+     --chat-template "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"
+ ```
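
Once the server is up, a quick way to verify it before issuing tool calls is to list the served models over vLLM's OpenAI-compatible API. This is a minimal sketch, assuming the host, port, and `--served-model-name` from the launch command above:

```python
# Sanity check: list the models the server reports before issuing tool calls.
# Assumes the launch command above (host 0.0.0.0, port 5000).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:5000/v1", api_key="dummy")

for model in client.models.list():
    print(model.id)  # expected: Llama-Nemotron-Nano-4B-v1.1 (from --served-model-name)
```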
+
+ You can call the launched server with tool-call support using a Python script like the one below.
+
+ ```python
+ >>> from openai import OpenAI
+ >>> client = OpenAI(
+       base_url="http://0.0.0.0:5000/v1",
+       api_key="dummy",
+     )
+
+ >>> completion = client.chat.completions.create(
+       model="Llama-Nemotron-Nano-4B-v1.1",  # must match --served-model-name above
+       messages=[
+         {"role": "system", "content": "detailed thinking on"},
+         {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"},
+       ],
+       tools=[
+         {"type": "function", "function": {"name": "calculate_tip", "parameters": {"type": "object", "properties": {"bill_total": {"type": "integer", "description": "The total amount of the bill"}, "tip_percentage": {"type": "integer", "description": "The percentage of tip to be applied"}}, "required": ["bill_total", "tip_percentage"]}}},
+         {"type": "function", "function": {"name": "convert_currency", "parameters": {"type": "object", "properties": {"amount": {"type": "integer", "description": "The amount to be converted"}, "from_currency": {"type": "string", "description": "The currency code to convert from"}, "to_currency": {"type": "string", "description": "The currency code to convert to"}}, "required": ["from_currency", "amount", "to_currency"]}}},
+       ],
+     )
+
+ >>> completion.choices[0].message.content
+ '<think>\nOkay, let\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says "the amount for 18% tip," which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\'s likely that it\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\'t relevant here. So, I should call calculate_tip with those values.\n</think>\n\n'
+
+ >>> completion.choices[0].message.tool_calls
+ [ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]
+ ```
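
Note that the model only emits the tool call; executing `calculate_tip` and returning its result to the model is the client's responsibility. The sketch below closes that loop under a few assumptions: a hypothetical local `calculate_tip` implementation (the model card ships none), the `messages` and `tools` literals from the script above bound to variables of those names, and a chat template that accepts an assistant tool call followed by a `tool`-role message.

```python
import json

# Hypothetical implementation of the calculate_tip tool advertised to the model.
def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    return bill_total * tip_percentage / 100  # 100 * 18 / 100 -> 18.0

# Execute the tool call returned in `completion` above.
call = completion.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = calculate_tip(**args)  # 18.0

# Re-send the conversation with the assistant's tool call and the tool result
# so the model can phrase the final answer.
followup = client.chat.completions.create(
    model="Llama-Nemotron-Nano-4B-v1.1",
    messages=messages + [
        {
            "role": "assistant",
            "content": completion.choices[0].message.content or "",
            "tool_calls": [{
                "id": call.id,
                "type": "function",
                "function": {"name": call.function.name, "arguments": call.function.arguments},
            }],
        },
        {"role": "tool", "tool_call_id": call.id, "content": str(result)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)
```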
+
+
 ## Inference:
 **Engine:** Transformers
 **Test Hardware:**