Stealth-Rifle 🎯
A small, CPU-only roleplay model: a LoRA fine-tune of
Qwen/Qwen2.5-0.5B-Instruct,
trained, quantized, and served entirely within a 16 GB RAM / 2 vCPU budget with
no GPU at any stage. It targets clean, in-character roleplay prose with a strong
anti-"AI-slop" bias, and runs at a usable speed on commodity CPUs.
- Live API (OpenAI-compatible): https://huggingface.co/spaces/cloudunity/stealth-rifle-api
- Source / training pipeline: https://github.com/CloudCompile/stealth-rifle
- Base model: Qwen/Qwen2.5-0.5B-Instruct (494M params)
- Method: LoRA (attention-only) → merged → GGUF → Q4_K_M
- Author: CJ Hauser (@CloudCompile)
Files
| File | Size | What it is |
|---|---|---|
| stealth-rifle-Q4_K_M.gguf | ~380 MB | 4-bit quantized weights; the CPU deployment artifact |
| stealth-rifle-f16.gguf | ~950 MB | Unquantized 16-bit (f16) GGUF (for re-quantizing or GPU offload) |
| lora-adapter/ | ~8.7 MB | The raw LoRA adapter (apply on top of the base model) |
Why this model exists
The design brief was "a roleplay model that runs on 16 GB RAM / 2 CPU with good tokens/sec and really good quality." Frontier RP leaderboards are topped by 70B to 1T-parameter models that need datacenter GPUs; matching them on a 2-core CPU is not physically possible. The honest, hardware-faithful answer is a LoRA fine-tune of a strong small open model, quantized for CPU inference. That is exactly what Stealth-Rifle is: the best-quality RP model that genuinely fits the budget, not a benchmark-gamed claim.
Intended use
- Local / self-hosted roleplay and character chat on CPU-only machines.
- A cheap, always-available OpenAI-compatible endpoint for RP apps and bots.
- A base for further RP fine-tuning (the LoRA adapter is provided).
Out of scope: factual QA, coding, math, or reasoning-heavy tasks. It is a 0.5B creative-writing model, not a general assistant. Not for production use requiring safety guarantees (see Limitations).
Prompt format
The model uses the ChatML template (inherited from Qwen2.5-Instruct) and was trained with an RP-craft system directive prepended to each scenario. For best results, put your character card / scenario in the system message. The directive the model was tuned on:
You are a masterful roleplay partner. Stay in character; write vivid, grounded,
emotionally honest prose. Rules:
- AGENCY: never write the user's character's actions, words, or thoughts.
Control only your own character(s) and the world. End on a beat that invites
their response.
- CONTINUITY: keep voices distinct; track what happened, time, positions,
objects; never contradict established facts. Match the scene's length; don't pad.
- SHOW DON'T TELL: render emotion through action, sensory detail, subtext;
don't name the emotion. Begin with your character's response.
- ANTI-SLOP: no "wasn't X, it was Y"; no filter words; no purple crutches
("ministrations", "shivers ran down", "breath hitched", "tapestry of",
"ghost of a smile", "eyes darkened"); no rhetorical "Or was it?" asides;
vary sentence rhythm.
- TRUTH: let the world push back; characters can refuse or fail. No sycophancy.
--- SCENARIO ---
<your character card / persona / scenario here>
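As a rough illustration of the layout above, the system message can be assembled in a few lines of Python. This is a minimal sketch, not code from the training pipeline: the `build_messages` helper and its argument names are hypothetical, and the directive string is shortened here as a placeholder for the full text shown above.

```python
SCENARIO_MARKER = "--- SCENARIO ---"

def build_messages(directive, scenario, user_turn, history=()):
    """Return a chat `messages` list with directive + scenario in the system slot.

    `history` is an optional sequence of earlier {"role", "content"} dicts.
    """
    system = f"{directive.strip()}\n{SCENARIO_MARKER}\n{scenario.strip()}"
    return [
        {"role": "system", "content": system},
        *history,
        {"role": "user", "content": user_turn},
    ]

messages = build_messages(
    # Placeholder: paste the full directive from above instead of this short form.
    directive="You are a masterful roleplay partner. Stay in character.",
    scenario="You are Kael, a dry-witted exiled mage.",
    user_turn="You find me bleeding by the road. What do you do?",
)
```

The resulting `messages` list can be passed unchanged to any of the chat-completion calls in the Usage section.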
Usage
1. Hosted API (no install)
curl https://cloudunity-stealth-rifle-api.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "stealth-rifle",
"messages": [
{"role": "system", "content": "You are Kael, a dry-witted exiled mage."},
{"role": "user", "content": "You find me bleeding by the road. What do you do?"}
],
"temperature": 0.8,
"max_tokens": 300
}'
Any OpenAI SDK works: point base_url at
https://cloudunity-stealth-rifle-api.hf.space/v1 and pass any (or an empty) API key:
from openai import OpenAI

client = OpenAI(
    base_url="https://cloudunity-stealth-rifle-api.hf.space/v1",
    api_key="not-needed",  # the endpoint does not check the key
)
r = client.chat.completions.create(
    model="stealth-rifle",
    messages=[{"role": "user", "content": "Set the scene in a rainy tavern."}],
)
print(r.choices[0].message.content)
2. Local with llama.cpp
# download + serve in one line (pulls the GGUF from this repo)
llama-server -hf cloudunity/stealth-rifle --hf-file stealth-rifle-Q4_K_M.gguf \
--threads 2 --ctx-size 4096 --chat-template chatml --port 8080
# -> OpenAI API at http://localhost:8080/v1
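If the OpenAI SDK is not installed, the local server can also be reached with the Python standard library alone. The sketch below assumes the server started by the command above is listening on port 8080; the `chat_request` and `send` helpers are illustrative names, and the `"stealth-rifle"` model label is simply carried over from the hosted example rather than something the local server is known to require.

```python
import json
import urllib.request

def chat_request(messages, base_url="http://localhost:8080/v1",
                 model="stealth-rifle", temperature=0.8, max_tokens=300):
    """Build (but do not send) a POST request for the chat-completions route."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(req):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
# print(send(chat_request([{"role": "user", "content": "Set the scene."}])))
```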
3. Apply the LoRA adapter yourself (transformers + peft)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "cloudunity/stealth-rifle",
subfolder="lora-adapter")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
Training
| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-0.5B-Instruct |
| Method | LoRA, r=16, α=32, dropout=0.05 |
| LoRA targets | attention only (q_proj, k_proj, v_proj, o_proj) |
| Precision | fp32 (CPU) |
| Seq length | 512 |
| Batch | 1 with grad-accumulation ×8 |
| LR / schedule | 2e-4, cosine, 3% warmup |
| Epochs | 3 |
| Loss | assistant-only (system/user tokens masked to -100) |
| Hardware | 2 vCPU, ~8 GB RAM, no GPU |
| Wall-clock | ~107 minutes |
| Val loss | 3.46 → 3.07 |
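The table's numbers can be sanity-checked with a little arithmetic. The sketch below counts the trainable LoRA parameters for rank 16 on the four attention projections. The layer shapes (hidden size 896, key/value projection width 128, 24 layers) are assumptions taken from the base model's published configuration, not from this card, and the fp32 size estimate assumes the adapter is stored at training precision.

```python
# Assumed Qwen2.5-0.5B shapes (from the base model's config, not this card).
HIDDEN = 896   # model width
KV_DIM = 128   # key/value projection width (2 KV heads x 64)
LAYERS = 24
RANK = 16      # LoRA r from the table
ALPHA = 32     # LoRA alpha from the table

# (input_dim, output_dim) of each adapted projection.
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_param_count(shapes, rank, layers):
    """Each adapted matrix adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

n_params = lora_param_count(PROJ_SHAPES, RANK, LAYERS)
size_mb = n_params * 4 / 1e6      # 4 bytes per fp32 weight
scaling = ALPHA / RANK            # factor applied to the LoRA update
effective_batch = 1 * 8           # batch size x gradient-accumulation steps

print(n_params, round(size_mb, 2), scaling, effective_batch)
# -> 2162688 8.65 2.0 8
```

Under these assumptions the adapter holds about 2.2M trainable weights, roughly 8.65 MB in fp32, which is in line with the ~8.7 MB listed for lora-adapter/ under Files.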
Memory tricks that made 0.5B fine-tuning fit on a tiny box: gradient checkpointing, attention-only adapters, and a tokenizer strategy that caps the system directive to 50% of the window and keeps the conversation tail so the final assistant turn (the learning signal) is always in-window. Full, reproducible code is in the GitHub repo.
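The windowing idea can be sketched on plain lists of token ids. This is a hedged illustration under stated assumptions, not the repository's code: the `fit_to_window` name, the choice to keep the *head* of the system directive, and the token-level (rather than turn-level) tail cut are all guesses at one reasonable way to do it. Labels follow the assistant-only convention from the table, with -100 on every system and user token.

```python
IGNORE_INDEX = -100  # label value the loss skips

def fit_to_window(system_ids, turns, max_len=512, system_cap=0.5):
    """Cap the system prompt, keep the conversation tail, mask non-assistant tokens.

    `turns` is an ordered list of (role, token_ids) pairs.
    Returns (input_ids, labels), each at most `max_len` long.
    """
    # 1. The system directive may use at most `system_cap` of the window.
    system_ids = list(system_ids[: int(max_len * system_cap)])

    # 2. Flatten the conversation; only assistant tokens carry labels.
    conv_ids, conv_labels = [], []
    for role, ids in turns:
        conv_ids.extend(ids)
        conv_labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))

    # 3. Keep the tail so the final assistant turn stays in-window.
    room = max_len - len(system_ids)
    start = max(0, len(conv_ids) - room)
    input_ids = system_ids + conv_ids[start:]
    labels = [IGNORE_INDEX] * len(system_ids) + conv_labels[start:]
    return input_ids, labels

# Toy example: a 400-token system prompt and a 510-token conversation.
system_ids = list(range(1000, 1400))
turns = [("user", [1] * 300), ("assistant", [2] * 100),
         ("user", [3] * 50), ("assistant", [4] * 60)]
input_ids, labels = fit_to_window(system_ids, turns)
```

In the toy example the system prompt is cut to 256 tokens, the oldest user turn loses its beginning, and the final 60 assistant tokens remain fully labeled.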
Training data
Derived from grimulkan/LimaRP-augmented
(human-written multi-turn roleplay), reformatted to ChatML with the RP-craft
directive. A zero-tolerance safety filter (data/safety.py) hard-drops any
conversation combining a minor indicator with any sexual signal. Adults-only
mature content is retained by default because the benchmark scores NSFW axes; an
SFW-only corpus is a one-flag switch. The filtered training JSONL is intentionally
not redistributed; the builder script regenerates it.
Evaluation
Scored with rp-benchmark's own
rule-based graders (objective_metrics + slop_detectors) over all 28 standard +
adversarial seeds, generated through the local llama.cpp server. No API key or
LLM judge is involved; these are deterministic craft metrics.
| Metric | Value |
|---|---|
| Mean objective score (0-100) | 62.7 |
| Mean AI-slop density (weight / 1k chars, lower is better) | 0.14 |
| Generation speed (Q4_K_M, 2 threads) | ~30-37 tok/s |
The very low slop density indicates the anti-slop training signal landed well. The full judged arena (community ELO, multi-turn judge, flaw-hunter vs. frontier models) requires an OpenRouter key and is not reflected here.
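To make the "weight / 1k chars" unit concrete, here is a minimal sketch of a density metric of that shape. It is not rp-benchmark's detector: the phrase list is borrowed from the ANTI-SLOP rule in the prompt directive, and the uniform weights and the `slop_density` name are hypothetical.

```python
# Hypothetical weights; the benchmark's real detectors and weights are not shown here.
SLOP_WEIGHTS = {
    "breath hitched": 1.0,
    "tapestry of": 1.0,
    "ghost of a smile": 1.0,
    "shivers ran down": 1.0,
}

def slop_density(text, weights=SLOP_WEIGHTS):
    """Total matched phrase weight per 1,000 characters (lower is better)."""
    if not text:
        return 0.0
    lowered = text.lower()
    total = sum(w * lowered.count(phrase) for phrase, w in weights.items())
    return total * 1000 / len(text)
```

Under this definition, a score of 0.14 would correspond to roughly one weight-1 hit per 7,000 characters of generated text.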
Limitations & risks
- Small model. 0.5B params: expect occasional repetition, shallow long-range continuity, and rare agency slips (writing for the user's character). It will not rival large frontier RP models on nuance.
- No safety alignment beyond data filtering. Mature content is present in training data; do not deploy to minors or in contexts requiring content guarantees. Add your own moderation layer for public deployments.
- English-centric and tuned specifically for roleplay; weak on general tasks.
- Outputs are fiction and may be inconsistent or factually wrong.
License
Released under Apache-2.0, the same license as the Qwen2.5-0.5B-Instruct base model. Training data is subject to the terms of the LimaRP-augmented dataset. You are responsible for compliant, lawful use.
Citation
@misc{stealthrifle2026,
title = {Stealth-Rifle: a CPU-only roleplay fine-tune of Qwen2.5-0.5B},
author = {Hauser, CJ},
year = {2026},
url = {https://huggingface.co/cloudunity/stealth-rifle}
}