TinyLlama Function Calling (CPU Optimized)

This is a CPU-optimized version of TinyLlama that has been fine-tuned for function calling capabilities.

Model Details

  • Base Model: TinyLlama-1.1B-Chat-v1.0
  • Parameters: 1.1 billion
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Training Data: Function calling examples from Glaive Function Calling v2 dataset
  • Optimization: LoRA weights merged into the base model and converted to float32 for CPU deployment (see the merge sketch below)
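
The merge step can be reproduced with the peft library. A minimal sketch, assuming the trained adapter is saved under the hypothetical local path "./lora-adapter" (the base-model ID is the real TinyLlama checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model in float32 on CPU
base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float32,
)

# Attach the LoRA adapter ("./lora-adapter" is a hypothetical path)
# and fold its weights into the base model
merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()

# Save the standalone, CPU-ready checkpoint
merged.save_pretrained("tinyllama-function-calling-cpu-optimized")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.save_pretrained("tinyllama-function-calling-cpu-optimized")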

Key Features

  1. Function Calling Capabilities: The model can identify when functions should be called and generate appropriate function call syntax
  2. CPU Optimized: Runs efficiently on low-end hardware, with no GPU required
  3. Lightweight: Only 1.1B parameters, making it suitable for older hardware
  4. Low Resource Requirements: Needs only 4-6 GB of RAM to load

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained("tinyllama-function-calling-cpu-optimized")
tokenizer = AutoTokenizer.from_pretrained("tinyllama-function-calling-cpu-optimized")

# Example prompt for function calling
prompt = """### Instruction:
Given the available functions and the user query, determine which function(s) to call and with what arguments.

Available functions:
{
    "name": "get_exchange_rate",
    "description": "Get the exchange rate between two currencies",
    "parameters": {
        "type": "object",
        "properties": {
            "base_currency": {
                "type": "string",
                "description": "The currency to convert from"
            },
            "target_currency": {
                "type": "string",
                "description": "The currency to convert to"
            }
        },
        "required": [
            "base_currency",
            "target_currency"
        ]
    }
}

User query: What is the exchange rate from USD to EUR?

### Response:"""

# Tokenize and generate response
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id  # Llama tokenizers define no pad token
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
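
The raw completion is plain text, so the function call has to be parsed out of it. A minimal sketch, assuming the model emits a JSON object with "name" and "arguments" keys after the "### Response:" marker (the exact output format depends on how the training examples were templated):

import json

# Keep only the text generated after the prompt
completion = response.split("### Response:")[-1].strip()

try:
    # Extract the first {...} span and parse it
    # (json.JSONDecodeError is a subclass of ValueError)
    start = completion.index("{")
    end = completion.rindex("}") + 1
    call = json.loads(completion[start:end])
    print("Function:", call.get("name"))
    print("Arguments:", call.get("arguments"))
except ValueError:
    print("No parseable function call found:", completion)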

Performance on Low-End Hardware

The CPU-optimized model requires approximately:

  • 4-6 GB RAM for loading
  • 2-4 CPU cores for inference (see the thread-pool sketch at the end of this section)
  • No GPU required

This makes it suitable for:

  • Older laptops (2018 and newer)
  • Low-end desktops
  • Edge devices with ARM processors
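
To make the most of a small core count, you can pin PyTorch's CPU thread pools before loading the model. A minimal sketch (the thread counts are illustrative, not tuned values):

import torch

# Call both before loading the model or running any ops
torch.set_num_threads(4)           # intra-op parallelism (matrix multiplies)
torch.set_num_interop_threads(1)   # inter-op parallelism; 1 avoids oversubscription

print("intra-op threads:", torch.get_num_threads())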

Training Process

The model was fine-tuned using LoRA on the Glaive Function Calling v2 dataset; for demonstration purposes, only a 50-example subset of the data was used.
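
The adapter configuration can be approximated with peft. A minimal sketch; the hyperparameters shown (rank, alpha, target modules) are illustrative assumptions, not the exact values used to train this model:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Low-rank adapters on the attention projections (assumed hyperparameters)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable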

License

This model is licensed under the Apache 2.0 license.
