Instructions for using whichcy/llama3.2-1B-tool with libraries and local apps.

Libraries

How to use whichcy/llama3.2-1B-tool with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="whichcy/llama3.2-1B-tool")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("whichcy/llama3.2-1B-tool")
# The checkpoint architecture is LlamaForCausalLM, so load it as a causal LM.
model = AutoModelForCausalLM.from_pretrained("whichcy/llama3.2-1B-tool")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use whichcy/llama3.2-1B-tool with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "whichcy/llama3.2-1B-tool"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "whichcy/llama3.2-1B-tool",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/whichcy/llama3.2-1B-tool

SGLang

How to use whichcy/llama3.2-1B-tool with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "whichcy/llama3.2-1B-tool" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "whichcy/llama3.2-1B-tool",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "whichcy/llama3.2-1B-tool" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "whichcy/llama3.2-1B-tool",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use whichcy/llama3.2-1B-tool with Docker Model Runner:
```
docker model run hf.co/whichcy/llama3.2-1B-tool
```

Llama 3.2 1B Tool Calling

This is a tool-calling SFT model based on meta-llama/Llama-3.2-1B-Instruct. It is intended for research and experimentation with single-turn function calling, tool selection, and structured argument generation.

The model was fine-tuned on a processed version of Team-ACE/ToolACE. The training data keeps single-turn tool-use examples and follows the Llama chat format.

Model Details

Base model: meta-llama/Llama-3.2-1B-Instruct
Architecture: LlamaForCausalLM
Task: tool calling / function calling
Scope: single-turn tool calling
Format: Llama chat template
Checkpoint dtype: float32
Training sequence length: 1024 tokens
Training samples: 8,194
Evaluation samples: 432

Data

The model uses Team-ACE/ToolACE after lightweight preprocessing:

Keep the first user-assistant turn from each conversation.
Extract tool definitions from the system message.
Preserve tool-call and refusal examples.
Tokenize with the Llama chat template.
Remove examples whose prompt length is greater than 1024 tokens.
Apply parameter-name perturbation to part of the tool-calling data to improve robustness to schema wording and parameter naming changes.

The processed dataset contains 8,626 examples before the final train/evaluation split.
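
The card does not include the preprocessing code, so the following is only a minimal sketch of three of the steps above: keeping the first user-assistant turn, dropping over-length prompts, and renaming a parameter consistently in the schema and the target call. The record layout (`role`/`content` messages, a call stored as `{"name", "arguments"}`) and all function names are hypothetical, not the author's implementation.

```python
import copy


def first_turn(conversation):
    """Keep only the first user message and the assistant reply that follows it."""
    for i, msg in enumerate(conversation[:-1]):
        nxt = conversation[i + 1]
        if msg["role"] == "user" and nxt["role"] == "assistant":
            return [msg, nxt]
    return None  # no user-assistant pair found


def within_budget(prompt_token_ids, max_len=1024):
    """True when the tokenized prompt is not longer than the 1024-token limit."""
    return len(prompt_token_ids) <= max_len


def perturb_param_name(tool, call, old, new):
    """Rename one parameter in both the tool schema and the target call.

    Hypothetical interpretation of "parameter-name perturbation": the schema
    and the expected call stay consistent with each other after the rename.
    """
    tool, call = copy.deepcopy(tool), copy.deepcopy(call)
    params = tool["parameters"]
    params["properties"][new] = params["properties"].pop(old)
    params["required"] = [new if p == old else p for p in params.get("required", [])]
    if old in call["arguments"]:
        call["arguments"][new] = call["arguments"].pop(old)
    return tool, call
```

In the actual pipeline the token ids would come from the Llama chat template; here `within_budget` accepts any sequence so the sketch runs without downloading a tokenizer.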

RoPE / Config Compatibility

This repository uses a Llama 3.2 model, so the RoPE configuration is important for correct generation. Different Transformers and inference-server versions may expect different config field names.

For newer Transformers exports, the equivalent configuration is usually represented as:

{
  "rope_parameters": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rope_type": "llama3"
  },
  "dtype": "float32"
}

For older Transformers, SGLang, vLLM, or BFCL-style evaluation stacks, the same values may need to be represented as:

{
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "torch_dtype": "float32"
}

The uploaded model uses the second form for broader compatibility with BFCL/SGLang-style inference. This changes config metadata only; it does not change the model weights.
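
If a stack expects one form and the config provides the other, the mapping between the two layouts shown above is mechanical. A minimal sketch follows, operating on a plain dictionary such as a loaded `config.json`; the function name is hypothetical, and only the keys listed in this section are handled.

```python
import copy


def to_legacy_rope_config(config):
    """Return a copy of `config` using rope_scaling / rope_theta / torch_dtype keys."""
    legacy = copy.deepcopy(config)
    rope = legacy.pop("rope_parameters", None)
    if rope is not None:
        # rope_theta moves out of the nested dict to the top level.
        if "rope_theta" in rope:
            legacy["rope_theta"] = rope.pop("rope_theta")
        # The remaining scaling fields keep their names under rope_scaling.
        legacy["rope_scaling"] = rope
    if "dtype" in legacy:
        legacy["torch_dtype"] = legacy.pop("dtype")
    return legacy
```

The input dictionary is left unmodified, and a config that is already in the older form passes through unchanged.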

BFCL v4 Evaluation

The model was evaluated on BFCL v4 and compared with the base Llama-3.2-1B-Instruct-FC model.

| Task | Llama-3.2-1B-Instruct-FC | Llama-3.2-1B-Tool | Improvement |
| --- | --- | --- | --- |
| `parallel_multiple` | 15.00 | 72.00 | +57.00 |
| `simple_python` | 73.75 | 88.25 | +14.50 |
| `parallel` | 42.50 | 79.00 | +36.50 |
| `simple_java` | 16.00 | 62.00 | +46.00 |
| `multiple` | 51.00 | 87.50 | +36.50 |
| `simple_javascript` | 40.00 | 66.00 | +26.00 |
| `irrelevance` | 36.25 | 78.33 | +42.08 |
| `live_irrelevance` | 67.53 | 66.97 | -0.56 |
| `live_parallel_multiple` | 0.00 | 45.83 | +45.83 |
| `live_multiple` | 7.12 | 53.47 | +46.35 |
| `live_parallel` | 0.00 | 56.25 | +56.25 |
| `live_simple` | 30.62 | 60.08 | +29.46 |
| `live_relevance` | 43.75 | 100.00 | +56.25 |
| Average | 32.66 | 70.90 | +38.24 |

Overall, the fine-tuned model improves the BFCL v4 average score from 32.66 to 70.90, with the largest gains on parallel, multi-tool, and live relevance tasks. The only category without improvement is live_irrelevance, where the score dips slightly from 67.53 to 66.97.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "whichcy/llama3.2-1B-tool"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

functions_json = """[
  {
    "name": "get_weather",
    "description": "Get the weather for a location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string", "description": "The city and region."},
        "date": {"type": "string", "description": "The date for the forecast."}
      },
      "required": ["location", "date"]
    }
  }
]"""

system_prompt = f"""You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function calls in your response.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

At each turn, you should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task.

Here is a list of functions in JSON format that you can invoke.
{functions_json}
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Find the weather in Shanghai tomorrow."},
]

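# Tokenize with the chat template; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding keeps the generated call deterministic.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))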

When using this model for tool calling, include the tool schemas in either the system prompt or the user prompt, depending on your inference framework. The BFCL-style prompt places the available tools in the system prompt.
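
The BFCL-style prompt asks for replies of the form `[func_name1(params_name1=params_value1, ...), func_name2(params)]`. A minimal sketch for turning such a reply into structured calls with Python's standard `ast` module is shown below. It assumes the reply has already been stripped of special tokens and that every argument is passed by keyword as a Python literal; the function name and the returned layout are hypothetical.

```python
import ast


def parse_tool_calls(text):
    """Parse '[f(a=1), g(b="x")]' into a list of {"name", "arguments"} dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a bracketed list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call):
            raise ValueError("list element is not a function call")
        arguments = {}
        for kw in node.keywords:
            if kw.arg is None:
                raise ValueError("**kwargs expansion is not supported")
            # literal_eval only accepts literals, so no model output is executed.
            arguments[kw.arg] = ast.literal_eval(kw.value)
        # ast.unparse keeps dotted names such as "weather.get" intact.
        calls.append({"name": ast.unparse(node.func), "arguments": arguments})
    return calls
```

A reply that is not valid Python syntax raises `SyntaxError`, and a non-literal argument value raises `ValueError`; both are reasonable signals to treat the output as a malformed call or a refusal.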

Limitations

The model is optimized for tool-calling behavior, not general chat quality.
The model targets single-turn tool calling. Multi-turn tool-use or agentic workflows are not guaranteed.
Very long tool schemas or long multi-turn conversations may be unreliable because training examples were filtered at 1024 prompt tokens.
The model may still choose incorrect tools, omit required arguments, or generate malformed calls.
The model inherits the license and usage restrictions of the Llama 3.2 base model.
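
Because the model may pick an unknown tool or omit required arguments, it can help to check each generated call against the tool schemas before executing it. The following is a minimal sketch under the assumption that a call has already been parsed into a `{"name", "arguments"}` dictionary (a hypothetical layout) and that the tools use the JSON schema style from the Usage section; it checks names only, not argument types.

```python
def validate_call(call, tools):
    """Return a list of problems found in one parsed call; an empty list means it passed."""
    by_name = {tool["name"]: tool for tool in tools}
    tool = by_name.get(call["name"])
    if tool is None:
        return [f"unknown tool: {call['name']}"]
    params = tool.get("parameters", {})
    known = set(params.get("properties", {}))
    problems = []
    # Every parameter listed under "required" must be present in the call.
    for name in params.get("required", []):
        if name not in call["arguments"]:
            problems.append(f"missing required argument: {name}")
    # Arguments that the schema does not declare are reported as well.
    for name in call["arguments"]:
        if name not in known:
            problems.append(f"unexpected argument: {name}")
    return problems
```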

Citation

Please cite the Llama 3.2 base model and the ToolACE dataset if you use this model in research.

Safetensors

Model size: 1B params
Tensor type: F32
