Llama-3.2-3B-ONNX-INT8-StrongTowerApps-Research

🛡️ Project Description

This repository contains an optimized version of Meta's Llama-3.2-3B model. As part of independent research, the model was converted to the ONNX format and dynamically quantized to INT8, shrinking it from an intermediate ~27GB export to just 3.4GB.
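
The final size is close to what one byte per weight would predict. The back-of-the-envelope check below assumes roughly 3.21 billion parameters for Llama-3.2-3B (a figure taken from Meta's published model card, not measured from this repository) and ignores any tensors that stay in floating point after quantization, so treat it as an estimate rather than a measurement:

```python
# Rough size estimate. PARAM_COUNT is an assumed figure, not measured from this repo.
PARAM_COUNT = 3.21e9

def size_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the raw weight size in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

fp32_gb = size_gb(PARAM_COUNT, 4)  # 32-bit floats: 4 bytes per weight
int8_gb = size_gb(PARAM_COUNT, 1)  # 8-bit integers: 1 byte per weight

print(f"float32 weights: ~{fp32_gb:.1f} GB")
print(f"int8 weights:    ~{int8_gb:.1f} GB")
print(f"reduction:       ~{fp32_gb / int8_gb:.0f}x")
```

The published 3.4GB file is somewhat larger than the int8 estimate. That gap would be consistent with quantization parameters and any non-quantized tensors being stored alongside the 8-bit weights, though this has not been verified here.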

🧠 Transformation Process: From Dynamic Code to Static Graph

To achieve the necessary portability and efficiency in local environments, the model went through a critical "compilation" phase of its architecture:

Static Graph Conversion: The dynamic execution logic (PyTorch) was transformed into a static ONNX mathematical graph. This means that each of the operations and connections between the 28 layers of the model was explicitly defined, eliminating the reliance on the Python interpreter during inference.
KV Cache Integration: The model was exported with the text-generation-with-past task, which builds the memory logic (past_key_values) directly into the graph. Each generation step can then reuse the attention keys and values already computed for earlier tokens instead of recomputing them, which makes decoding significantly faster.
Intermediate Technical Expansion: During this process, the original ~12GB model expanded to 27GB. This growth was a necessary technical step due to loop unrolling to optimize CPU performance and comprehensive serialization of the Protobuf format. This "expanded" version was the essential foundation for the subsequent pruning and final quantization.
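
To illustrate why carrying past_key_values between steps saves work, here is a toy single-head attention in plain Python. It is not the exported graph: the projection matrices `W_Q`, `W_K`, `W_V` and the token vectors are hypothetical values chosen only for illustration. The sketch checks that attending over cached keys and values gives the same result as recomputing them for the whole sequence at every step:

```python
import math

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hypothetical projection matrices for a 2-dimensional hidden state.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [-0.2, 0.3]]
W_V = [[0.2, -0.1], [0.4, 0.6]]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

def last_output_without_cache(seq):
    """Recompute every key and value for the whole sequence."""
    keys = [matvec(W_K, x) for x in seq]
    values = [matvec(W_V, x) for x in seq]
    return attend(matvec(W_Q, seq[-1]), keys, values)

# With a cache, each step projects only the newest token and appends it.
cached_keys, cached_values = [], []
projections_with_cache = 0
for step, x in enumerate(tokens, start=1):
    cached_keys.append(matvec(W_K, x))
    cached_values.append(matvec(W_V, x))
    projections_with_cache += 1
    cached_out = attend(matvec(W_Q, x), cached_keys, cached_values)
    full_out = last_output_without_cache(tokens[:step])
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out))

# Without a cache, step t re-projects all t tokens seen so far.
projections_without_cache = sum(range(1, len(tokens) + 1))
print(projections_with_cache, "key/value projections with cache")
print(projections_without_cache, "key/value projections without cache")
```

The cached path grows linearly with the number of generated tokens, while the uncached path grows quadratically. In the real model the same saving applies in every layer.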

⚙️ Technical Details

Base: meta-llama/Llama-3.2-3B
Final Format: ONNX (External Data)
Optimization: Dynamic Quantization (QUInt8)
Usage: Specially designed for local execution on CPU using onnxruntime.
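
As a rough illustration of what unsigned 8-bit quantization does to a tensor, the sketch below maps a handful of floats to the range 0..255 with a single scale and zero point, then maps them back. This is a generic per-tensor asymmetric scheme written in plain Python; the function names and sample weights are hypothetical, and it is not claimed to match ONNX Runtime's internals operator by operator:

```python
def quantize_uint8(values):
    """Map floats to 0..255 using one scale and zero point for the whole list."""
    lo = min(0.0, min(values))  # extend the range so 0.0 is exactly representable
    hi = max(0.0, max(values))
    scale = (hi - lo) / 255.0
    if scale == 0.0:            # all-zero input: any positive scale works
        scale = 1.0
    zero_point = max(0, min(255, round(-lo / scale)))
    quantized = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the 8-bit codes."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.0, -0.253, 0.0, 0.407, 1.55]  # hypothetical sample values
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized codes:", q)
print("scale:", scale, "zero point:", zp)
print("max round-trip error:", max_error)
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per value.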

🚀 Usage Example

Below is a Python script (ask-model.py) demonstrating how to load the ONNX model and the tokenizer to generate text on a CPU.

Prerequisites

Make sure to install the required libraries:

pip install onnxruntime numpy transformers

Code Implementation

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
import time

# --- Configuration ---
model_path = "Llama-3.2-3B-ONNX-INT8-StrongTowerApps-Research/model_quantized.onnx"

tokenizer_name = "Llama-3.2-3B-ONNX-INT8-StrongTowerApps-Research" 
max_new_tokens = 500 # Maximum number of tokens to generate (raise for longer answers)

try:
    print("1. Loading the tokenizer from the local folder...")
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    
    print("2. Loading the ONNX model on CPU...")
    session = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
    
    # Extract metadata to map the KV Cache (past_key_values)
    input_names = [i.name for i in session.get_inputs()]
    output_names = [o.name for o in session.get_outputs()]
    past_kv_names = [name for name in input_names if "past_key_values" in name]
    
    # --- Prepare the Prompt ---
    prompt = "What are the benefits of artificial intelligence in Cybersecurity?"
    print(f"\n[User]: {prompt}\n")
    print("[Llama-3.2-3B-ONNX]: ", end="", flush=True)
    
    # Tokenize the input text
    inputs = tokenizer(prompt, return_tensors="np")
    input_ids = inputs["input_ids"].astype(np.int64)
    attention_mask = inputs["attention_mask"].astype(np.int64)
    seq_len = input_ids.shape[1]
    
    # position_ids: [0, 1, 2, ..., seq_len - 1]
    position_ids = np.arange(0, seq_len, dtype=np.int64).reshape(1, seq_len)
    
    # Initialize the input dictionary for ONNX
    ort_inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids
    }
    
    # Initialize empty 'past_key_values' (sequence length = 0)
    for input_meta in session.get_inputs():
        if "past_key_values" in input_meta.name:
            # Reconstruct the shape: [batch_size, num_heads, 0, head_dim]
            shape = [dim if isinstance(dim, int) else (0 if i == 2 else 1) for i, dim in enumerate(input_meta.shape)]
            dtype = np.float32
            if 'int64' in input_meta.type: dtype = np.int64
            elif 'int32' in input_meta.type: dtype = np.int32
            elif 'float16' in input_meta.type: dtype = np.float16
            ort_inputs[input_meta.name] = np.zeros(shape, dtype=dtype)
            
    # --- Inference Cycle (Generation Loop) ---
    start_time = time.time()
    
    for step in range(max_new_tokens):
        # Execute the model
        outputs = session.run(None, ort_inputs)
        
        # outputs[0] are the logits (predictions). We extract the last token.
        logits = outputs[0]
        next_token_id = np.argmax(logits[:, -1, :], axis=-1)[0]
        
        # If the model predicts the end-of-sequence token, stop before printing it
        if next_token_id == tokenizer.eos_token_id:
            break

        # Print the generated token in real time
        word = tokenizer.decode([int(next_token_id)])
        print(word, end="", flush=True)
            
        # --- Update Inputs for the next step (Using KV Cache) ---
        ort_inputs["input_ids"] = np.array([[next_token_id]], dtype=np.int64)
        ort_inputs["attention_mask"] = np.concatenate([ort_inputs["attention_mask"], np.ones((1, 1), dtype=np.int64)], axis=1)
        ort_inputs["position_ids"] = np.array([[seq_len + step]], dtype=np.int64)
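        # Feed each cache output back as the matching cache input.
        # This relies on the cache outputs following the logits in the same order as the cache inputs.
        for past_name, present_value in zip(past_kv_names, outputs[1:]):
            ort_inputs[past_name] = present_value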
            
    print(f"\n\n[INFO] Generation time: {time.time() - start_time:.2f} seconds.")
    
except Exception as e:
    print(f"\nAn error occurred: {e}")
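
The script above pairs cache inputs with cache outputs by position. If an export orders them differently, pairing by name is a safer alternative. The sketch below assumes the naming convention `past_key_values.<layer>.<key|value>` for inputs and `present.<layer>.<key|value>` for outputs. That convention is an assumption here, so check it against `session.get_inputs()` and `session.get_outputs()` for the actual model. The helper `map_present_to_past` and the sample names are hypothetical:

```python
def map_present_to_past(input_names, output_names):
    """Pair each cache input with the cache output that should feed it next step.

    Assumes inputs named 'past_key_values.<layer>.<key|value>' and outputs
    named 'present.<layer>.<key|value>'.
    """
    prefix = "past_key_values."
    available = set(output_names)
    mapping = {}
    for name in input_names:
        if name.startswith(prefix):
            present_name = "present." + name[len(prefix):]
            if present_name not in available:
                raise KeyError(f"no output named {present_name!r} for input {name!r}")
            mapping[name] = present_name
    return mapping

# Hypothetical names for a 2-layer model, with the outputs deliberately shuffled.
input_names = ["input_ids", "attention_mask", "position_ids",
               "past_key_values.0.key", "past_key_values.0.value",
               "past_key_values.1.key", "past_key_values.1.value"]
output_names = ["logits", "present.1.value", "present.0.key",
                "present.1.key", "present.0.value"]

mapping = map_present_to_past(input_names, output_names)
for past_name, present_name in mapping.items():
    print(past_name, "<-", present_name)
```

Inside the generation loop, the results of `session.run` could then be indexed by name, for example with `results = dict(zip(output_names, outputs))` followed by `ort_inputs[past_name] = results[present_name]` for each pair in the mapping.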

⚖️ Licenses

The quantization process and the optimization pipeline, including static graph transformation and technical compilation, were engineered by Strong Tower Apps™. This distribution leverages ONNX Runtime for high-performance inference, while the underlying model intelligence remains the property of Meta under the Llama 3.2 Community License.

Researcher: Strong Tower Apps™
