Stealth-Rifle 🎯

A small, CPU-only roleplay model: a LoRA fine-tune of Qwen/Qwen2.5-0.5B-Instruct that was trained, quantized, and served entirely within a 16 GB RAM / 2 vCPU budget, with no GPU at any stage. It targets clean, in-character roleplay prose with a strong anti-"AI-slop" bias and generates roughly 30–37 tokens/s at Q4_K_M on 2 CPU threads.

Live API (OpenAI-compatible): https://huggingface.co/spaces/cloudunity/stealth-rifle-api
Source / training pipeline: https://github.com/CloudCompile/stealth-rifle
Base model: Qwen/Qwen2.5-0.5B-Instruct (494M params)
Method: LoRA (attention-only) → merged → GGUF → Q4_K_M
Author: CJ Hauser (@CloudCompile)

Files

| File | Size | What it is |
| --- | --- | --- |
| `stealth-rifle-Q4_K_M.gguf` | ~380 MB | 4-bit quantized weights; the CPU deployment artifact |
| `stealth-rifle-f16.gguf` | ~950 MB | Unquantized 16-bit (half-precision) GGUF, for re-quantizing or GPU offload |
| `lora-adapter/` | ~8.7 MB | The raw LoRA adapter (apply on top of the base model) |

Why this model exists

The design brief was "a roleplay model that runs on 16 GB RAM / 2 CPU with good tokens/sec and really good quality." Frontier RP leaderboards are topped by 70B–1T-parameter models that need datacenter GPUs; matching them on a 2-core CPU is not physically possible. The honest, hardware-faithful answer is a LoRA fine-tune of a strong small open model, quantized for CPU inference. That is exactly what Stealth-Rifle is: an RP model built to get the most quality out of that budget, not a benchmark-gamed claim.

Intended use

Local / self-hosted roleplay and character chat on CPU-only machines.
A cheap, always-available OpenAI-compatible endpoint for RP apps and bots.
A base for further RP fine-tuning (the LoRA adapter is provided).

Out of scope: factual QA, coding, math, or reasoning-heavy tasks — it is a 0.5B creative-writing model, not a general assistant. Not for production use requiring safety guarantees (see Limitations).

Prompt format

The model uses the ChatML template (inherited from Qwen2.5-Instruct) and was trained with an RP-craft system directive prepended to each scenario. For best results, put your character card / scenario in the system message. The directive the model was tuned on:

You are a masterful roleplay partner. Stay in character; write vivid, grounded,
emotionally honest prose. Rules:
- AGENCY: never write the user's character's actions, words, or thoughts.
  Control only your own character(s) and the world. End on a beat that invites
  their response.
- CONTINUITY: keep voices distinct; track what happened, time, positions,
  objects; never contradict established facts. Match the scene's length; don't pad.
- SHOW DON'T TELL: render emotion through action, sensory detail, subtext;
  don't name the emotion. Begin with your character's response.
- ANTI-SLOP: no "wasn't X, it was Y"; no filter words; no purple crutches
  ("ministrations", "shivers ran down", "breath hitched", "tapestry of",
  "ghost of a smile", "eyes darkened"); no rhetorical "Or was it?" asides;
  vary sentence rhythm.
- TRUTH: let the world push back; characters can refuse or fail. No sycophancy.

--- SCENARIO ---
<your character card / persona / scenario here>
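A minimal sketch of how the directive and a character card can be combined into the system message, and how the resulting conversation looks once rendered in ChatML. The helper names `build_system_message` and `render_chatml` are hypothetical, and `DIRECTIVE` is abbreviated here; in practice it would hold the full directive text above. Chat-completions endpoints normally apply the template server-side, so hand-rendering like this is mainly useful for inspecting the prompt or for raw completion endpoints.

```python
# Abbreviated placeholder: substitute the full directive text shown above.
DIRECTIVE = "You are a masterful roleplay partner. Stay in character; ..."


def build_system_message(directive: str, scenario: str) -> str:
    """Join the RP-craft directive and the character card with the separator line."""
    return f"{directive.strip()}\n\n--- SCENARIO ---\n{scenario.strip()}"


def render_chatml(messages: list[dict]) -> str:
    """Render messages as ChatML and open an assistant turn for generation."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": build_system_message(
            DIRECTIVE, "You are Kael, a dry-witted exiled mage."
        ),
    },
    {"role": "user", "content": "You find me bleeding by the road. What do you do?"},
]
prompt = render_chatml(messages)
print(prompt)
```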

Usage

1. Hosted API (no install)

curl https://cloudunity-stealth-rifle-api.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stealth-rifle",
    "messages": [
      {"role": "system", "content": "You are Kael, a dry-witted exiled mage."},
      {"role": "user", "content": "You find me bleeding by the road. What do you do?"}
    ],
    "temperature": 0.8,
    "max_tokens": 300
  }'

Any OpenAI SDK works: point base_url at https://cloudunity-stealth-rifle-api.hf.space/v1 and pass any placeholder string as the API key.

from openai import OpenAI
client = OpenAI(base_url="https://cloudunity-stealth-rifle-api.hf.space/v1",
                api_key="not-needed")
r = client.chat.completions.create(
    model="stealth-rifle",
    messages=[{"role": "user", "content": "Set the scene in a rainy tavern."}],
)
print(r.choices[0].message.content)

2. Local with llama.cpp

# download + serve in one line (pulls the GGUF from this repo)
llama-server -hf cloudunity/stealth-rifle --hf-file stealth-rifle-Q4_K_M.gguf \
  --threads 2 --ctx-size 4096 --chat-template chatml --port 8080
# -> OpenAI API at http://localhost:8080/v1
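Once the local server is running, it can be queried without installing any SDK. The sketch below uses only the Python standard library and assumes the server is listening on port 8080 as in the command above. The helper names `build_request` and `chat` are hypothetical, and the `"model"` value is simply carried over from the hosted example; whether the local server requires a particular value there is an assumption worth checking against your llama.cpp version.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the command above


def build_request(system: str, user: str, temperature: float = 0.8,
                  max_tokens: int = 300) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": "stealth-rifle",  # assumption: copied from the hosted example
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(system: str, user: str) -> str:
    """Send one system + user exchange and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(system, user), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("You are Kael, a dry-witted exiled mage.",
#            "Set the scene in a rainy tavern."))
```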

3. Apply the LoRA adapter yourself (transformers + peft)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and its tokenizer, then attach the adapter from this repo.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "cloudunity/stealth-rifle",
                                  subfolder="lora-adapter")
model.eval()  # inference mode: disables the LoRA dropout used in training

Training


| Setting | Value |
| --- | --- |
| Base | `Qwen/Qwen2.5-0.5B-Instruct` |
| Method | LoRA, r=16, α=32, dropout=0.05 |
| LoRA targets | attention only (`q_proj, k_proj, v_proj, o_proj`) |
| Precision | fp32 (CPU) |
| Seq length | 512 |
| Batch | 1 with grad-accumulation ×8 |
| LR / schedule | 2e-4, cosine, 3% warmup |
| Epochs | 3 |
| Loss | assistant-only (system/user tokens masked to -100) |
| Hardware | 2 vCPU, ~8 GB RAM, no GPU |
| Wall-clock | ~107 minutes |
| Val loss | 3.46 → 3.07 |
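The assistant-only loss can be illustrated with plain Python lists. This is a sketch of the general masking technique, not the repository's code: the function name `mask_labels` and the span representation are assumptions. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_labels(input_ids: list[int],
                assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Copy token ids into labels only inside assistant spans [start, end)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Toy sequence: 3 system tokens, 2 user tokens, 3 assistant tokens.
input_ids = [11, 12, 13, 21, 22, 31, 32, 33]
labels = mask_labels(input_ids, [(5, 8)])
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33]
```

Only the final three positions carry a learning signal, so the model is never trained to reproduce the directive or the user's lines.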

Memory tricks that made 0.5B fine-tuning fit on a tiny box: gradient checkpointing, attention-only adapters, and a tokenizer strategy that caps the system directive to 50% of the window and keeps the conversation tail so the final assistant turn (the learning signal) is always in-window. Full, reproducible code is in the GitHub repo.
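The windowing strategy described above can be sketched as follows, assuming the example is already tokenized into a system part and a conversation part. The function name `fit_to_window` is hypothetical and the cut here is made at token level; the actual pipeline may cut at turn boundaries or differ in other details.

```python
def fit_to_window(system_ids: list[int], convo_ids: list[int],
                  window: int = 512, system_cap: float = 0.5) -> list[int]:
    """Cap the system tokens at a fraction of the window, then keep the
    conversation tail so the final assistant turn stays inside the window."""
    max_system = int(window * system_cap)
    system_part = system_ids[:max_system]
    budget = window - len(system_part)
    # Guard against budget == 0, where a [-0:] slice would return everything.
    tail = convo_ids[-budget:] if budget > 0 else []
    return system_part + tail


# A 400-token directive and a 300-token conversation, packed into 512 tokens.
system_ids = list(range(400))
convo_ids = list(range(1000, 1300))
packed = fit_to_window(system_ids, convo_ids)
print(len(packed), packed[-1])  # 512 1299
```

Trimming the front of the conversation rather than the end drops older context first, which is the trade-off that keeps the last assistant turn available as the training target.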

Training data

Derived from grimulkan/LimaRP-augmented (human-written multi-turn roleplay), reformatted to ChatML with the RP-craft directive. A zero-tolerance safety filter (data/safety.py) hard-drops any conversation combining a minor indicator with any sexual signal. Adults-only mature content is retained by default because the benchmark scores NSFW axes; an SFW-only corpus is a one-flag switch. The filtered training JSONL is intentionally not redistributed — the builder script regenerates it.
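The hard-drop rule is a conjunction at the conversation level: a conversation is removed when both kinds of signal appear anywhere in it. The sketch below shows only that logic. It is not the contents of data/safety.py; the function names are hypothetical and the term lists are neutral placeholders standing in for the real vocabularies.

```python
import re


def compile_terms(terms: list[str]) -> re.Pattern:
    """Compile a case-insensitive whole-word matcher for a list of terms."""
    alternation = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def should_drop(conversation: list[str], minor_re: re.Pattern,
                explicit_re: re.Pattern) -> bool:
    """Drop when both signal types occur anywhere in the conversation."""
    text = "\n".join(conversation)
    return bool(minor_re.search(text)) and bool(explicit_re.search(text))


def filter_corpus(conversations: list[list[str]], minor_re: re.Pattern,
                  explicit_re: re.Pattern) -> list[list[str]]:
    """Keep only the conversations that do not trigger the hard-drop rule."""
    return [c for c in conversations
            if not should_drop(c, minor_re, explicit_re)]


# Placeholder vocabularies; the real lists are not reproduced here.
minor_re = compile_terms(["placeholder_minor"])
explicit_re = compile_terms(["placeholder_explicit"])

corpus = [
    ["a turn with placeholder_minor", "a turn with placeholder_explicit"],
    ["a turn with placeholder_explicit only"],
    ["an ordinary turn"],
]
kept = filter_corpus(corpus, minor_re, explicit_re)
print(len(kept))  # 2
```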

Evaluation

Scored with rp-benchmark's own rule-based graders (objective_metrics + slop_detectors) over all 28 standard + adversarial seeds, generated through the local llama.cpp server. No API key / LLM judge involved — these are deterministic craft metrics.

| Metric | Value |
| --- | --- |
| Mean objective score (0–100) | 62.7 |
| Mean AI-slop density (weight / 1k chars, ↓ better) | 0.14 |
| Generation speed (Q4_K_M, 2 threads) | ~30–37 tok/s |

The low slop density (0.14 weight per 1k characters) suggests the anti-slop training signal took hold. The full judged arena (community ELO, multi-turn judge, flaw-hunter vs. frontier models) requires an OpenRouter key and is not reflected here.
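To make the unit "weight per 1k characters" concrete, here is a minimal sketch of a weighted slop-density calculation. It is not rp-benchmark's detector: the pattern list is borrowed from the phrases named in the prompt directive, and the uniform weights of 1.0 are an assumption for illustration.

```python
import re

# Phrase pattern -> weight. Both the list and the weights are illustrative.
SLOP_PATTERNS = {
    r"ministrations": 1.0,
    r"shivers? ran down": 1.0,
    r"breath hitched": 1.0,
    r"tapestry of": 1.0,
    r"ghost of a smile": 1.0,
    r"eyes darkened": 1.0,
}


def slop_density(text: str) -> float:
    """Total matched weight divided by text length in thousands of characters."""
    if not text:
        return 0.0
    total = sum(
        weight * len(re.findall(pattern, text, flags=re.IGNORECASE))
        for pattern, weight in SLOP_PATTERNS.items()
    )
    return total / (len(text) / 1000)


sample = "Her breath hitched.".ljust(1000)  # one hit in exactly 1000 characters
print(slop_density(sample))  # 1.0
```

Under this definition, a score of 0.14 corresponds to roughly one weight-1 hit every 7,000 characters of output.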

Limitations & risks

Small model. 0.5B params: expect occasional repetition, shallow long-range continuity, and rare agency slips (writing for the user's character). It will not rival large frontier RP models on nuance.
No safety alignment beyond data filtering. Mature content is present in training data; do not deploy to minors or in contexts requiring content guarantees. Add your own moderation layer for public deployments.
English-centric, tuned specifically for roleplay — weak on general tasks.
Outputs are fiction and may be inconsistent or factually wrong.

License

Released under Apache-2.0, the same license as the base model Qwen2.5-0.5B-Instruct. Training data is subject to the terms of the LimaRP-augmented dataset. You are responsible for compliant, lawful use.

Citation

@misc{stealthrifle2026,
  title  = {Stealth-Rifle: a CPU-only roleplay fine-tune of Qwen2.5-0.5B},
  author = {Hauser, CJ},
  year   = {2026},
  url    = {https://huggingface.co/cloudunity/stealth-rifle}
}
