Llama 3.2 1B Tool Calling
This is a tool-calling SFT model based on meta-llama/Llama-3.2-1B-Instruct. It is intended for research and experimentation with single-turn function calling, tool selection, and structured argument generation.
The model was fine-tuned on a processed version of Team-ACE/ToolACE. The training data keeps single-turn tool-use examples and follows the Llama chat format.
Model Details
- Base model: meta-llama/Llama-3.2-1B-Instruct
- Architecture: LlamaForCausalLM
- Task: tool calling / function calling
- Scope: single-turn tool calling
- Format: Llama chat template
- Checkpoint dtype: float32
- Training sequence length: 1024 tokens
- Training samples: 8,194
- Evaluation samples: 432
Data
The model uses Team-ACE/ToolACE after lightweight preprocessing:
- Keep the first user-assistant turn from each conversation.
- Extract tool definitions from the system message.
- Preserve tool-call and refusal examples.
- Tokenize with the Llama chat template.
- Remove examples whose prompt length is greater than 1024 tokens.
- Apply parameter-name perturbation to part of the tool-calling data to improve robustness to schema wording and parameter naming changes.
The processed dataset contains 8,626 examples before the final train/evaluation split.
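The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the record layout (`system`, `conversations`, `from`, `value`) is an assumption about the ToolACE format, and `first_turn`, `keep_example`, and `perturb_parameter_names` are hypothetical helper names. The prompt token count is assumed to come from the Llama chat template tokenizer and is passed in as a plain integer here.

```python
import copy

MAX_PROMPT_TOKENS = 1024  # prompt-length limit stated in the list above


def first_turn(record):
    """Return (system, user, assistant) for the first user-assistant pair, or None.

    Assumes a record shaped like {"system": str, "conversations": [{"from", "value"}]}.
    """
    turns = record["conversations"]
    for i in range(len(turns) - 1):
        if turns[i]["from"] == "user" and turns[i + 1]["from"] == "assistant":
            return record["system"], turns[i]["value"], turns[i + 1]["value"]
    return None


def keep_example(prompt_token_count):
    """Length filter: keep only prompts of at most MAX_PROMPT_TOKENS tokens."""
    return prompt_token_count <= MAX_PROMPT_TOKENS


def perturb_parameter_names(tool, rename):
    """Return a copy of a tool schema with parameters renamed via the `rename` map.

    Hypothetical perturbation: the actual strategy used for training is not
    described in this card.
    """
    tool = copy.deepcopy(tool)
    params = tool["parameters"]
    params["properties"] = {
        rename.get(name, name): spec for name, spec in params["properties"].items()
    }
    params["required"] = [rename.get(name, name) for name in params.get("required", [])]
    return tool
```

If parameter names are perturbed in a schema, the target call in the assistant turn would need the same renaming so that the label stays consistent with the prompt.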
RoPE / Config Compatibility
This repository uses a Llama 3.2 model, so the RoPE configuration is important for correct generation. Different Transformers and inference-server versions may expect different config field names.
For newer Transformers exports, the equivalent configuration is usually represented as:
{
  "rope_parameters": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rope_type": "llama3"
  },
  "dtype": "float32"
}
For older Transformers, SGLang, vLLM, or BFCL-style evaluation stacks, the same values may need to be represented as:
{
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "torch_dtype": "float32"
}
The uploaded model uses the second form for broader compatibility with BFCL/SGLang-style inference. This changes config metadata only; it does not change the model weights.
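If a config arrives in the newer layout and a stack expects the older one, the mapping between the two forms shown above can be sketched as a plain dictionary transformation. `to_legacy_rope_config` is a hypothetical helper, not part of Transformers, and it only handles the fields shown in this section; any other keys pass through untouched.

```python
def to_legacy_rope_config(config):
    """Map the `rope_parameters`/`dtype` layout to `rope_scaling`/`rope_theta`/`torch_dtype`.

    Works on a plain dict (for example, a loaded config.json) and returns a new dict.
    """
    cfg = dict(config)
    if "rope_parameters" in cfg:
        params = dict(cfg.pop("rope_parameters"))
        # In the older layout, rope_theta sits at the top level of the config.
        if "rope_theta" in params:
            cfg["rope_theta"] = params.pop("rope_theta")
        cfg["rope_scaling"] = params
    if "dtype" in cfg:
        cfg["torch_dtype"] = cfg.pop("dtype")
    return cfg
```

In practice this would be applied to the parsed contents of `config.json` and written back with `json.dump`. Only metadata changes; the weights are not touched.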
BFCL v4 Evaluation
The model was evaluated on BFCL v4 and compared with the base Llama-3.2-1B-Instruct-FC model.
| Task | Llama-3.2-1B-Instruct-FC | Llama-3.2-1B-Tool | Improvement |
|---|---|---|---|
| parallel_multiple | 15.00 | 72.00 | +57.00 |
| simple_python | 73.75 | 88.25 | +14.50 |
| parallel | 42.50 | 79.00 | +36.50 |
| simple_java | 16.00 | 62.00 | +46.00 |
| multiple | 51.00 | 87.50 | +36.50 |
| simple_javascript | 40.00 | 66.00 | +26.00 |
| irrelevance | 36.25 | 78.33 | +42.08 |
| live_irrelevance | 67.53 | 66.97 | -0.56 |
| live_parallel_multiple | 0.00 | 45.83 | +45.83 |
| live_multiple | 7.12 | 53.47 | +46.35 |
| live_parallel | 0.00 | 56.25 | +56.25 |
| live_simple | 30.62 | 60.08 | +29.46 |
| live_relevance | 43.75 | 100.00 | +56.25 |
| Average | 32.66 | 70.90 | +38.24 |
Overall, the fine-tuned model raises the BFCL v4 average score from 32.66 to 70.90 (+38.24 points), with the largest gains on parallel_multiple (+57.00), live_parallel (+56.25), and live_relevance (+56.25). The only category without improvement is live_irrelevance, which drops slightly from 67.53 to 66.97 (-0.56).
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "whichcy/llama3.2-1B-tool"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
functions_json = """[
{
"name": "get_weather",
"description": "Get the weather for a location.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city and region."},
"date": {"type": "string", "description": "The date for the forecast."}
},
"required": ["location", "date"]
}
}
]"""
system_prompt = f"""You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function calls in your response.
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.
At each turn, you should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task.
Here is a list of functions in JSON format that you can invoke.
{functions_json}
"""
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "Find the weather in Shanghai tomorrow."},
]
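# Build the prompt with the Llama chat template.
# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=False))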
When using this model for tool calling, include the tool schemas in either the system prompt or the user prompt, depending on your inference framework. The BFCL-style prompt places the available tools in the system prompt.
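Because the prompt asks for calls in the form `[func_name1(params_name1=params_value1, ...)]`, the generated text can be turned into structured calls with Python's standard `ast` module. The following is a minimal sketch under the assumption that the output is a Python-style list of keyword-only calls with literal argument values; `parse_tool_calls` is a hypothetical helper name, and the returned `{"name", "arguments"}` layout is an illustrative choice.

```python
import ast

# Llama 3 end-of-turn markers that may trail the generated text.
SPECIAL_TOKENS = ("<|eot_id|>", "<|end_of_text|>")


def parse_tool_calls(text):
    """Parse '[f(a=1), g(b="x")]' into [{"name": ..., "arguments": {...}}, ...].

    Raises SyntaxError or ValueError when the text is not in the expected format.
    """
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    body = ast.parse(text.strip(), mode="eval").body
    nodes = body.elts if isinstance(body, ast.List) else [body]
    calls = []
    for node in nodes:
        if not isinstance(node, ast.Call):
            raise ValueError("expected a function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        calls.append({
            "name": ast.unparse(node.func),  # handles dotted names; Python 3.9+
            "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords},
        })
    return calls
```

A refusal or free-text answer will not parse, so callers can treat a raised `SyntaxError` or `ValueError` as "no tool call".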
Limitations
- The model is optimized for tool-calling behavior, not general chat quality.
- The model targets single-turn tool calling. Multi-turn tool-use or agentic workflows are not guaranteed.
- Very long tool schemas or long multi-turn conversations may be unreliable because training examples were filtered at 1024 prompt tokens.
- The model may still choose incorrect tools, omit required arguments, or generate malformed calls.
- The model inherits the license and usage restrictions of the Llama 3.2 base model.
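Since the model may pick a wrong tool or omit required arguments, it can help to check each call against the tool schemas before executing it. The sketch below assumes a call is already available as a dict with `name` and `arguments` keys (an illustrative layout) and that the tools follow the JSON schema shape used in the Usage example; `validate_call` is a hypothetical helper and does not check argument types.

```python
def validate_call(call, tools):
    """Return a list of problems for one call; an empty list means these checks passed."""
    by_name = {tool["name"]: tool for tool in tools}
    tool = by_name.get(call["name"])
    if tool is None:
        return [f"unknown tool: {call['name']}"]
    schema = tool.get("parameters", {})
    allowed = set(schema.get("properties", {}))
    problems = []
    # Every parameter listed under "required" must be present.
    for name in schema.get("required", []):
        if name not in call["arguments"]:
            problems.append(f"missing required argument: {name}")
    # Arguments that the schema does not declare are flagged as well.
    for name in call["arguments"]:
        if name not in allowed:
            problems.append(f"unexpected argument: {name}")
    return problems
```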
Citation
Please cite the Llama 3.2 base model and the ToolACE dataset if you use this model in research.