Instructions for using axonlabsai/axon-250m with libraries, inference providers, notebooks, and local apps.

Libraries

How to use axonlabsai/axon-250m with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="axonlabsai/axon-250m")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("axonlabsai/axon-250m")
model = AutoModelForCausalLM.from_pretrained("axonlabsai/axon-250m")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use axonlabsai/axon-250m with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="axonlabsai/axon-250m",
	filename="axon-250m-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
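
`create_chat_completion` returns an OpenAI-style response dictionary rather than a plain string. The snippet below is a minimal sketch of pulling the reply text out of that dictionary, assuming the usual `choices[0]["message"]["content"]` layout. `extract_reply` is a hypothetical helper name used for illustration, not part of llama-cpp-python.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice holds a message dict with "role" and "content" keys.
    return choices[0]["message"]["content"].strip()


# Usage with the `llm` object created above (commented out: needs the model file):
# response = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "What is the capital of France?"}]
# )
# print(extract_reply(response))
```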

Local Apps

llama.cpp

How to use axonlabsai/axon-250m with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf axonlabsai/axon-250m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf axonlabsai/axon-250m:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf axonlabsai/axon-250m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf axonlabsai/axon-250m:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf axonlabsai/axon-250m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf axonlabsai/axon-250m:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf axonlabsai/axon-250m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf axonlabsai/axon-250m:Q4_K_M

Use Docker

docker model run hf.co/axonlabsai/axon-250m:Q4_K_M

vLLM

How to use axonlabsai/axon-250m with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "axonlabsai/axon-250m"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "axonlabsai/axon-250m",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
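
The same request can be sent from Python instead of curl. This is a sketch using only the standard library, assuming the server from the previous step is already running on port 8000. `build_chat_request` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running server, so the call is left commented out:
# print(chat("http://localhost:8000", "axonlabsai/axon-250m", "What is the capital of France?"))
```

Because the SGLang server described further down exposes the same endpoint path, passing `http://localhost:30000` as `base_url` should work there as well.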

Use Docker

docker model run hf.co/axonlabsai/axon-250m:Q4_K_M

SGLang

How to use axonlabsai/axon-250m with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "axonlabsai/axon-250m" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "axonlabsai/axon-250m",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "axonlabsai/axon-250m" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "axonlabsai/axon-250m",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama

How to use axonlabsai/axon-250m with Ollama:

```
ollama run hf.co/axonlabsai/axon-250m:Q4_K_M
```

Unsloth Studio

How to use axonlabsai/axon-250m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for axonlabsai/axon-250m to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for axonlabsai/axon-250m to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for axonlabsai/axon-250m to start chatting

Docker Model Runner

How to use axonlabsai/axon-250m with Docker Model Runner:

```
docker model run hf.co/axonlabsai/axon-250m:Q4_K_M
```

Lemonade

How to use axonlabsai/axon-250m with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull axonlabsai/axon-250m:Q4_K_M

Run and chat with the model

lemonade run user.axon-250m-Q4_K_M

List all available models

lemonade list

Axon 250M

A 250M parameter custom chat model by Axon Labs. Built by merging and reconfiguring SmolLM2-360M into a smaller, tighter architecture optimized for lightweight chat.

Note: This model is NOT fine-tuned. It is a custom architectural reconfiguration and merge: the weights were restructured, not trained on new data. It retains the general knowledge of its source models but has not been adapted to any specific task.

Model Details

Parameters: ~362M (F32) — marketed as 250M class
Architecture: LlamaForCausalLM (custom reconfiguration)
Hidden size: 960
Layers: 32
Attention heads: 15
KV heads: 5 (GQA)
Intermediate size: 2560
Max context: 8192 tokens
Vocab size: 49,152
Activation: SiLU
Tokenizer: SmolLM2 tokenizer with ChatML formatting (<|im_start|> / <|im_end|>)
License: MIT
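
The ~362M figure can be checked against the hyperparameters listed above. The sketch below assumes a standard Llama layout with no bias terms, a gated MLP, RMSNorm weights, and tied input/output embeddings. None of these is stated in the card, so treat the result as an estimate.

```python
# Hyperparameters copied from the Model Details list.
HIDDEN = 960
LAYERS = 32
HEADS = 15
KV_HEADS = 5
INTERMEDIATE = 2560
VOCAB = 49_152

head_dim = HIDDEN // HEADS  # 64


def count_parameters() -> int:
    # Assumption: input and output embeddings share one matrix (tied weights).
    embedding = VOCAB * HIDDEN
    # q and o projections are HIDDEN x HIDDEN; k and v are narrower under GQA.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * (KV_HEADS * head_dim)
    # Assumption: Llama-style gated MLP with gate, up, and down projections.
    mlp = 3 * HIDDEN * INTERMEDIATE
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attention + mlp + norms
    # One final RMSNorm after the last layer.
    return embedding + LAYERS * per_layer + HIDDEN


total = count_parameters()
print(f"{total:,} parameters")             # 361,821,120 parameters
print(f"{total * 4 / 1e9:.2f} GB in F32")  # 1.45 GB in F32
```

Under these assumptions the count comes to about 362M, which agrees with the figure in the list.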

Key Differences from Source

Unlike the base SmolLM2-360M, Axon 250M was created through architectural merging and reconfiguration:

Restructured layer count and attention configuration
GQA with 5 KV heads for efficient inference
Custom head dimension of 64
RoPE with theta=100000
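
To see why fewer KV heads help at inference time, the sketch below estimates the key/value cache size at the full 8192-token context. It assumes an F32 cache (4 bytes per value) and batch size 1. The 15-head case is a hypothetical comparison against full multi-head attention, not a configuration of this model.

```python
LAYERS = 32
HEADS = 15
KV_HEADS = 5
HEAD_DIM = 64
MAX_CONTEXT = 8192


def kv_cache_bytes(num_kv_heads: int, tokens: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2 covers the separate key and value tensors in every layer.
    return 2 * LAYERS * num_kv_heads * HEAD_DIM * bytes_per_value * tokens


gqa = kv_cache_bytes(KV_HEADS, MAX_CONTEXT)
mha = kv_cache_bytes(HEADS, MAX_CONTEXT)
print(f"GQA (5 KV heads):  {gqa / 2**20:.0f} MiB")  # 640 MiB
print(f"MHA (15 KV heads): {mha / 2**20:.0f} MiB")  # 1920 MiB
```

With 5 KV heads shared across 15 query heads, the cache is one third the size it would be if every query head had its own keys and values.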

Usage

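from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("axonlabsai/axon-250m", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("axonlabsai/axon-250m")

messages = [{"role": "user", "content": "Hey, what's up?"}]
# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))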
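
The chat template renders messages in ChatML using the `<|im_start|>` / `<|im_end|>` markers listed under Model Details. The function below is a simplified sketch of that format for inspection only. `render_chatml` is a hypothetical name, and the template shipped with the tokenizer may differ, for example by inserting a default system message, so `apply_chat_template` remains the right call for real inference.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in plain ChatML; a simplified stand-in for the real template."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model writes a reply
        # instead of continuing the user turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hey, what's up?"}]))
```

The trailing `<|im_start|>assistant` line is what a generation prompt adds. Without it, a model has no signal that the user turn is over.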

Limitations

NOT fine-tuned — no task-specific training was performed
Very small model with limited reasoning and factual knowledge
Prone to hallucination and incoherent outputs on complex prompts
Best suited for simple chat and experimentation, not production use
The "250M" branding reflects its model class; the actual parameter count is ~362M

About Axon Labs

Axon Labs builds AI models and tools. This is our tiny model — small enough to run anywhere, dumb enough to be funny.

Safetensors

Model size: 0.4B params
Tensor type: F32

Model tree for axonlabsai/axon-250m

Base model: HuggingFaceTB/SmolLM2-360M
Quantized (34): this model
