gary-5
The pocket-sized chat model that can actually chat. Successor to gary-4 (69 KB of beautiful nonsense). gary-5 trades a few megabytes for the ability to, you know, answer questions.
Built on SmolLM2-135M-Instruct (a distilled instruct model by Hugging Face TB), LoRA fine-tuned for the gary persona, merged, and quantized to GGUF. Runs fully offline on basically anything: phone, Raspberry Pi, that laptop from 2014.
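To picture what "LoRA fine-tuned ... merged" means, here is a toy sketch. It is not the training code behind gary-5: the matrices are made up, the rank is 1 instead of the card's 16, and the scaling factor `alpha` is an assumption, since the card does not state one. It only illustrates that folding the low-rank update into the base weight gives the same output as running the base layer and the adapter side by side, which is why a merged model needs no adapter at inference time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: W is (out=2, in=3), A is (r=1, in=3), B is (out=2, r=1).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[2.0], [4.0]]
ALPHA, R = 2.0, 1  # hypothetical values, chosen only for the demo

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Path 1: merge first, then run a single matmul.
y_merged = matmul(merge_lora(W, A, B, ALPHA, R), x)

# Path 2: keep the adapter separate and add its scaled output to the base output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_separate = [[b[0] + (ALPHA / R) * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged, y_separate)
```

Both paths produce the same vector, so the adapter can be dropped once its update is baked into the weights.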
Stats
| Stat | Value |
|---|---|
| Parameters | 135M |
| GGUF Q8_0 | 138 MB |
| Safetensors (bf16) | 260 MB |
| Base | SmolLM2-135M-Instruct (Apache-2.0) |
| Fine-tune | LoRA r=16 on all attn+MLP projections, 101 steps, merged |
| Runs on | CPU, no GPU needed. ~30 tok/s on a sandbox CPU. |
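The file sizes in the table can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below assumes the rounded count of 135M parameters, 2 bytes per weight for bf16, and the ggml Q8_0 layout of 32 int8 weights plus one 2-byte scale per block. The real files also carry metadata and may keep some tensors at higher precision, so expect only rough agreement with the published numbers.

```python
PARAMS = 135_000_000  # rounded parameter count from the table


def bf16_bytes(n_params):
    """bf16 stores each weight in 2 bytes."""
    return n_params * 2


def q8_0_bytes(n_params, block=32):
    """Q8_0 stores each block of 32 weights as 32 int8 values plus a 2-byte scale."""
    n_blocks = -(-n_params // block)  # ceiling division
    return n_blocks * (block + 2)


def mib(n_bytes):
    """Convert bytes to mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


print(f"bf16 estimate: {mib(bf16_bytes(PARAMS)):.0f} MiB")
print(f"Q8_0 estimate: {mib(q8_0_bytes(PARAMS)):.0f} MiB")
```

The estimates land within a few percent of the 260 MB and 138 MB figures, which is about as close as a weights-only calculation can get.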
Sample (greedy, unedited)
Q: Who are you?
A: I'm gary-5, a pocket-sized AI built by Garrett. Small enough to live on
your phone, smart enough to be worth talking to.
Q: What is the capital of France?
A: Paris.
Q: Explain what a firewall does in one sentence.
A: A firewall acts like a powerful wall that filters and blocks unauthorized
traffic before it reaches your device.
Run it
llama.cpp / ollama (recommended, uses the 138 MB GGUF):
llama-cli -m gary-5.Q8_0.gguf -cnv \
-sys "You are gary-5, a pocket-sized AI assistant created by Garrett."
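llama.cpp also ships `llama-server`, which exposes an OpenAI-compatible chat endpoint, so the same GGUF can be queried from a script. The sketch below assumes a server started with something like `llama-server -m gary-5.Q8_0.gguf` and listening on its default `localhost:8080`. The URL and the helper names `build_request` and `ask` are illustrative, not part of the model card.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are gary-5, a pocket-sized AI assistant created by Garrett."
SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed local server address


def build_request(user_text, url=SERVER_URL):
    """Build a POST request carrying the system prompt and one user message."""
    payload = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_text, url=SERVER_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(user_text, url), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer.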
transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gary23w/gary-5")
model = AutoModelForCausalLM.from_pretrained("gary23w/gary-5")
msgs = [{"role":"system","content":"You are gary-5, a pocket-sized AI assistant created by Garrett."},
        {"role":"user","content":"hi"}]
enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt")
print(tok.decode(model.generate(**enc, max_new_tokens=80)[0], skip_special_tokens=True))
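The snippet above decodes the whole sequence, so the printed text includes the prompt as well as the reply. A possible variant that returns only the newly generated tokens is sketched below. It assumes greedy decoding (`do_sample=False`) to mirror the greedy samples on this card, and the helper names `build_messages`, `reply_only`, and `chat` are hypothetical.

```python
SYSTEM_PROMPT = "You are gary-5, a pocket-sized AI assistant created by Garrett."


def build_messages(user_text):
    """Return the chat history: the gary-5 system prompt plus one user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]


def reply_only(output_ids, prompt_len):
    """Drop the first prompt_len token ids, keeping only the generated part."""
    return output_ids[prompt_len:]


def chat(user_text, max_new_tokens=80):
    """Generate a reply greedily and decode only the new tokens."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gary23w/gary-5")
    model = AutoModelForCausalLM.from_pretrained("gary23w/gary-5")
    enc = tok.apply_chat_template(
        build_messages(user_text),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
    prompt_len = enc["input_ids"].shape[1]
    return tok.decode(reply_only(out[0], prompt_len), skip_special_tokens=True)
```

Calling `print(chat("hi"))` downloads the model on first use and prints just the assistant's reply.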
Honest section
It's a 135M model: great at chat, identity, short factual answers, summaries, and one-sentence explanations; it will confidently improvise on hard reasoning and obscure facts. For GPT-4-like performance in your pocket, the recipe is this exact pipeline with a 1-3B base: gary-6, presumably.
The gary lineage: gary-4 (67K params, 69 KB, gibberish, beloved) → gary-5 (135M params, 138 MB, coherent) → gary-6 (TBD, pending Garrett's ambitions).
Model tree for gary23w/gary-5
Base model
HuggingFaceTB/SmolLM2-135M