# TinyLlama Function Calling (CPU Optimized)

This is a CPU-optimized version of TinyLlama, fine-tuned for function calling.
## Model Details

- **Base Model**: TinyLlama-1.1B-Chat-v1.0
- **Parameters**: 1.1 billion
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Training Data**: Function-calling examples from the Glaive Function Calling v2 dataset
- **Optimization**: LoRA weights merged into the base model and converted to float32 for CPU deployment (a reproduction sketch follows this list)
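The merge-and-convert step can be reproduced with the PEFT library. Below is a minimal sketch, assuming the adapter was trained with PEFT; the `./lora-adapter` path and the output directory name are illustrative, not actual artifacts of this repo:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base, "./lora-adapter")  # hypothetical adapter path

# Fold the low-rank updates into the base weights, then cast to float32 for CPU
merged = model.merge_and_unload().float()
merged.save_pretrained("tinyllama-function-calling-cpu-optimized")
```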
## Key Features

- **Function Calling Capabilities**: The model can identify when functions should be called and generate the corresponding function-call syntax
- **CPU Optimized**: Runs efficiently on low-end hardware without a GPU
- **Lightweight**: Only 1.1B parameters, making it suitable for older hardware
- **Low Resource Requirements**: Needs only 4-6 GB of RAM to load
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("tinyllama-function-calling-cpu-optimized")
tokenizer = AutoTokenizer.from_pretrained("tinyllama-function-calling-cpu-optimized")

# Example prompt for function calling
prompt = """### Instruction:
Given the available functions and the user query, determine which function(s) to call and with what arguments.

Available functions:
{
    "name": "get_exchange_rate",
    "description": "Get the exchange rate between two currencies",
    "parameters": {
        "type": "object",
        "properties": {
            "base_currency": {
                "type": "string",
                "description": "The currency to convert from"
            },
            "target_currency": {
                "type": "string",
                "description": "The currency to convert to"
            }
        },
        "required": [
            "base_currency",
            "target_currency"
        ]
    }
}

User query: What is the exchange rate from USD to EUR?

### Response:"""

# Tokenize and generate a response
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
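The raw output format depends on the fine-tuning data; Glaive Function Calling v2 conventionally marks calls with a `<functioncall>` marker followed by a JSON object. The parser below, which continues from the snippet above, is a sketch under that assumption and should be adapted to whatever this model actually emits:

```python
import json
import re

def extract_function_call(text: str):
    """Return the first JSON object following a <functioncall> marker, or None.

    Assumes the Glaive-style "<functioncall> { ... }" convention; adjust the
    marker and regex to match the model's actual output.
    """
    match = re.search(r"<functioncall>\s*(\{.*\})", text, re.DOTALL)
    if match is None:
        return None  # the model answered in plain text instead
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

call = extract_function_call(response)
if call is not None:
    print(call["name"], call.get("arguments"))
```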
## Performance on Low-End Hardware

The CPU-optimized model requires approximately:

- 4-6 GB of RAM for loading
- 2-4 CPU cores for inference (thread usage can be capped; see the sketch below)
- No GPU

This makes it suitable for:

- Older laptops (2018 and newer)
- Low-end desktops
- Edge devices with ARM processors
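By default PyTorch may spawn one thread per core, which can oversubscribe small machines. A short sketch using standard `torch` calls to cap thread usage (run these before any inference):

```python
import torch

# Cap intra-op parallelism (e.g. matmul workers) at 4 threads
torch.set_num_threads(4)

# Cap inter-op parallelism; this must be set before any parallel work
# starts, otherwise PyTorch raises a RuntimeError
torch.set_num_interop_threads(2)

print(torch.get_num_threads(), torch.get_num_interop_threads())
```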
## Training Process
The model was fine-tuned using LoRA (Low-Rank Adaptation) on the Glaive Function Calling v2 dataset. Only a subset of 50 examples was used for demonstration purposes.
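The exact adapter configuration is not published in this card. For reference, a representative PEFT setup for a 1.1B LLaMA-style model might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values used for this checkpoint:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Illustrative hyperparameters; the actual run's values are not documented here
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 1.1B weights train
```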
## License
This model is licensed under the Apache 2.0 license.