suhara committed · verified
Commit 8ec6619 · 1 Parent(s): a3ff62e

Update README.md

Files changed (1): README.md (+60, -0)

README.md CHANGED
@@ -14,6 +14,7 @@ tags:
 - pytorch
 ---

+
 # Llama-3.1-Nemotron-Nano-4B-v1.1


@@ -52,6 +53,8 @@ Developers designing AI Agent systems, chatbots, RAG systems, and other AI-power

 ## References

+ - [\[2408.11796\] LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796)
+ - [\[2502.00203\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)
 - [\[2505.00949\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)


@@ -177,6 +180,63 @@ pipeline = transformers.pipeline(
 print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))
 ```

+ ## Running a vLLM Server with Tool-call Support
+
+ Llama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. This HF repo hosts a tool-calling parser as well as a chat template in Jinja, both of which can be used to launch a vLLM server.
+ Here is an example command to launch a vLLM server with tool-call support.
+
+ ```console
+ $ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
+
+ $ conda create -n vllm python=3.12 -y
+ $ conda activate vllm
+ $ pip install vllm  # assumed step: vLLM is needed in the fresh environment (no version pinned here)
+
+ $ python -m vllm.entrypoints.openai.api_server \
+     --model Llama-3.1-Nemotron-Nano-4B-v1.1 \
+     --trust-remote-code \
+     --seed 1 \
+     --host "0.0.0.0" \
+     --port 5000 \
+     --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
+     --tensor-parallel-size 1 \
+     --max-model-len 131072 \
+     --gpu-memory-utilization 0.95 \
+     --enforce-eager \
+     --enable-auto-tool-choice \
+     --tool-parser-plugin "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
+     --tool-call-parser "llama_nemotron_json" \
+     --chat-template "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"
+ ```
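
Once the server is up, a quick way to verify it before issuing tool calls is to list the served models over vLLM's OpenAI-compatible API. This is a minimal sketch, assuming the host, port, and `--served-model-name` from the launch command above:

```python
# Sanity check: list the models the server reports before issuing tool calls.
# Assumes the launch command above (host 0.0.0.0, port 5000).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:5000/v1", api_key="dummy")

for model in client.models.list():
    print(model.id)  # expected: Llama-Nemotron-Nano-4B-v1.1 (from --served-model-name)
```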
+
+ You can call the launched server with tool-call support using a Python script like the one below.
+
+ ```python
+ >>> from openai import OpenAI
+ >>> client = OpenAI(
+       base_url="http://0.0.0.0:5000/v1",
+       api_key="dummy",
+     )
+
+ >>> completion = client.chat.completions.create(
+       model="Llama-Nemotron-Nano-4B-v1.1",  # must match --served-model-name above
+       messages=[
+         {"role": "system", "content": "detailed thinking on"},
+         {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"},
+       ],
+       tools=[
+         {"type": "function", "function": {"name": "calculate_tip", "parameters": {"type": "object", "properties": {"bill_total": {"type": "integer", "description": "The total amount of the bill"}, "tip_percentage": {"type": "integer", "description": "The percentage of tip to be applied"}}, "required": ["bill_total", "tip_percentage"]}}},
+         {"type": "function", "function": {"name": "convert_currency", "parameters": {"type": "object", "properties": {"amount": {"type": "integer", "description": "The amount to be converted"}, "from_currency": {"type": "string", "description": "The currency code to convert from"}, "to_currency": {"type": "string", "description": "The currency code to convert to"}}, "required": ["from_currency", "amount", "to_currency"]}}},
+       ],
+     )
+
+ >>> completion.choices[0].message.content
+ '<think>\nOkay, let\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says "the amount for 18% tip," which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\'s likely that it\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\'t relevant here. So, I should call calculate_tip with those values.\n</think>\n\n'
+
+ >>> completion.choices[0].message.tool_calls
+ [ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]
+ ```
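
Note that the model only emits the tool call; executing `calculate_tip` and returning its result to the model is the client's responsibility. The sketch below closes that loop under a few assumptions: a hypothetical local `calculate_tip` implementation (the model card ships none), the `messages` and `tools` literals from the script above bound to variables of those names, and a chat template that accepts an assistant tool call followed by a `tool`-role message.

```python
import json

# Hypothetical implementation of the calculate_tip tool advertised to the model.
def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    return bill_total * tip_percentage / 100  # 100 * 18 / 100 -> 18.0

# Execute the tool call returned in `completion` above.
call = completion.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = calculate_tip(**args)  # 18.0

# Re-send the conversation with the assistant's tool call and the tool result
# so the model can phrase the final answer.
followup = client.chat.completions.create(
    model="Llama-Nemotron-Nano-4B-v1.1",
    messages=messages + [
        {
            "role": "assistant",
            "content": completion.choices[0].message.content or "",
            "tool_calls": [{
                "id": call.id,
                "type": "function",
                "function": {"name": call.function.name, "arguments": call.function.arguments},
            }],
        },
        {"role": "tool", "tool_call_id": call.id, "content": str(result)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)
```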
+
+
 ## Inference:
 **Engine:** Transformers
 **Test Hardware:**