Stealth-Rifle 🎯

A small, CPU-only roleplay model: a LoRA fine-tune of Qwen/Qwen2.5-0.5B-Instruct that was trained, quantized, and served entirely within a 16 GB RAM / 2 vCPU budget, with no GPU at any stage. It targets clean, in-character roleplay prose with a strong anti-"AI-slop" bias, and runs at roughly 30–37 tok/s on 2 CPU threads (Q4_K_M).


Files

| File | Size | What it is |
| --- | --- | --- |
| stealth-rifle-Q4_K_M.gguf | ~380 MB | 4-bit quantized weights; the CPU deployment artifact |
| stealth-rifle-f16.gguf | ~950 MB | Unquantized 16-bit (f16) GGUF, for re-quantizing or GPU offload |
| lora-adapter/ | ~8.7 MB | The raw LoRA adapter (apply on top of the base model) |

Why this model exists

The design brief was "a roleplay model that runs on 16 GB RAM / 2 CPU with good tokens/sec and really good quality." Frontier RP leaderboards are topped by 70B–1T-parameter models that need datacenter GPUs; matching them on a 2-core CPU is not physically possible. The honest, hardware-faithful answer is a LoRA fine-tune of a strong small open model, quantized for CPU inference. That is exactly what Stealth-Rifle is: the best-quality RP model that genuinely fits the budget, not a benchmark-gamed claim.


Intended use

  • Local / self-hosted roleplay and character chat on CPU-only machines.
  • A cheap, always-available OpenAI-compatible endpoint for RP apps and bots.
  • A base for further RP fine-tuning (the LoRA adapter is provided).

Out of scope: factual QA, coding, math, or reasoning-heavy tasks. It is a 0.5B creative-writing model, not a general assistant. Not for production use requiring safety guarantees (see Limitations).


Prompt format

The model uses the ChatML template (inherited from Qwen2.5-Instruct) and was trained with an RP-craft system directive prepended to each scenario. For best results, put your character card / scenario in the system message. The directive the model was tuned on:

You are a masterful roleplay partner. Stay in character; write vivid, grounded,
emotionally honest prose. Rules:
- AGENCY: never write the user's character's actions, words, or thoughts.
  Control only your own character(s) and the world. End on a beat that invites
  their response.
- CONTINUITY: keep voices distinct; track what happened, time, positions,
  objects; never contradict established facts. Match the scene's length; don't pad.
- SHOW DON'T TELL: render emotion through action, sensory detail, subtext;
  don't name the emotion. Begin with your character's response.
- ANTI-SLOP: no "wasn't X, it was Y"; no filter words; no purple crutches
  ("ministrations", "shivers ran down", "breath hitched", "tapestry of",
  "ghost of a smile", "eyes darkened"); no rhetorical "Or was it?" asides;
  vary sentence rhythm.
- TRUTH: let the world push back; characters can refuse or fail. No sycophancy.

--- SCENARIO ---
<your character card / persona / scenario here>
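To make the layout concrete, here is a minimal sketch of how the directive, the scenario marker, and a character card could be joined into one system message and rendered as ChatML. The helper names `build_system_message` and `render_chatml` are illustrative and not part of this repo, and the directive is shortened for the example. In practice llama-server or the tokenizer's chat template does this rendering for you.

```python
SCENARIO_MARKER = "--- SCENARIO ---"


def build_system_message(directive: str, scenario: str) -> str:
    """Join the RP-craft directive and the character card with the scenario marker."""
    return f"{directive.strip()}\n\n{SCENARIO_MARKER}\n{scenario.strip()}"


def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render each turn as an im_start/im_end block, optionally opening the assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Shortened stand-in for the full directive shown above (assumption for brevity).
directive = "You are a masterful roleplay partner. Stay in character."
messages = [
    {
        "role": "system",
        "content": build_system_message(directive, "You are Kael, a dry-witted exiled mage."),
    },
    {"role": "user", "content": "You find me bleeding by the road. What do you do?"},
]
prompt = render_chatml(messages)
print(prompt)
```

The printed string is what the model sees: one system block holding directive plus scenario, one user block, and an open assistant turn.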

Usage

1. Hosted API (no install)

curl https://cloudunity-stealth-rifle-api.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stealth-rifle",
    "messages": [
      {"role": "system", "content": "You are Kael, a dry-witted exiled mage."},
      {"role": "user", "content": "You find me bleeding by the road. What do you do?"}
    ],
    "temperature": 0.8,
    "max_tokens": 300
  }'

Any OpenAI SDK works: point base_url at https://cloudunity-stealth-rifle-api.hf.space/v1 and pass any (or an empty) API key:

from openai import OpenAI
client = OpenAI(base_url="https://cloudunity-stealth-rifle-api.hf.space/v1",
                api_key="not-needed")
r = client.chat.completions.create(
    model="stealth-rifle",
    messages=[{"role": "user", "content": "Set the scene in a rainy tavern."}],
)
print(r.choices[0].message.content)

2. Local with llama.cpp

# download + serve in one line (pulls the GGUF from this repo)
llama-server -hf cloudunity/stealth-rifle --hf-file stealth-rifle-Q4_K_M.gguf \
  --threads 2 --ctx-size 4096 --chat-template chatml --port 8080
# -> OpenAI API at http://localhost:8080/v1

3. Apply the LoRA adapter yourself (transformers + peft)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the adapter stored in this repo's lora-adapter/ folder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "cloudunity/stealth-rifle",
                                  subfolder="lora-adapter")
# The tokenizer (and its ChatML template) comes from the base model.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

Training

| Setting | Value |
| --- | --- |
| Base | Qwen/Qwen2.5-0.5B-Instruct |
| Method | LoRA, r=16, α=32, dropout=0.05 |
| LoRA targets | attention only (q_proj, k_proj, v_proj, o_proj) |
| Precision | fp32 (CPU) |
| Seq length | 512 |
| Batch | 1 with grad-accumulation ×8 |
| LR / schedule | 2e-4, cosine, 3% warmup |
| Epochs | 3 |
| Loss | assistant-only (system/user tokens masked to -100) |
| Hardware | 2 vCPU, ~8 GB RAM, no GPU |
| Wall-clock | ~107 minutes |
| Val loss | 3.46 → 3.07 |
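Two of the settings above are easier to picture in code: the assistant-only loss mask and the warmup-plus-cosine learning-rate schedule. The sketch below is a pure-Python illustration, not the repo's training code. The function names are hypothetical, and the linear warmup shape and decay to zero are assumptions, since the card only states "cosine, 3% warmup".

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_labels(input_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Copy input_ids into labels, replacing every non-assistant position with -100."""
    return [tok if is_asst else IGNORE_INDEX for tok, is_asst in zip(input_ids, assistant_mask)]


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4, warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr (assumed), then cosine decay to zero (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Toy sequence: three system/user tokens followed by two assistant tokens.
ids = [11, 12, 13, 14, 15]
is_assistant = [False, False, False, True, True]
labels = mask_labels(ids, is_assistant)
print(labels)  # [-100, -100, -100, 14, 15]

# Batch 1 with 8 accumulation steps gives an effective batch of 8 sequences per update.
effective_batch = 1 * 8
print(effective_batch)  # 8
```

Only the positions that keep their token id contribute to the loss, so the model is graded on the assistant's prose and not on reproducing the directive or the user's turns.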

Memory tricks that made 0.5B fine-tuning fit on a tiny box: gradient checkpointing, attention-only adapters, and a tokenizer strategy that caps the system directive to 50% of the window and keeps the conversation tail so the final assistant turn (the learning signal) is always in-window. Full, reproducible code is in the GitHub repo.
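The windowing strategy can be sketched as follows, working on already-tokenized inputs. This is an illustration under stated assumptions and not the repo's implementation: `fit_to_window` is a hypothetical name, turns are kept whole where possible, and when the final turn alone overflows the remaining budget its tail is kept (that fallback is an assumption).

```python
def fit_to_window(
    system_ids: list[int],
    turns: list[list[int]],
    max_len: int = 512,
    system_cap: float = 0.5,
) -> list[int]:
    """Cap the system prompt at a share of the window, then fill the rest with the newest turns."""
    sys_budget = int(max_len * system_cap)
    system_kept = system_ids[:sys_budget]
    remaining = max_len - len(system_kept)

    kept: list[list[int]] = []
    for turn in reversed(turns):  # walk from the newest turn backwards
        if len(turn) <= remaining:
            kept.append(turn)
            remaining -= len(turn)
            continue
        if not kept:
            # Assumed fallback: the final turn alone is too long, so keep its tail.
            kept.append(turn[-remaining:])
        break

    kept.reverse()  # restore chronological order
    return system_kept + [tok for turn in kept for tok in turn]


# Toy example: a 400-token directive and three turns of 100, 150, and 60 tokens.
system_ids = list(range(400))
turns = [[1] * 100, [2] * 150, [3] * 60]
window = fit_to_window(system_ids, turns)
print(len(window))  # 466
```

In the toy example the directive is cut to 256 tokens, the oldest turn is dropped, and the final 60-token turn (the one that carries the loss) is always present.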

Training data

Derived from grimulkan/LimaRP-augmented (human-written multi-turn roleplay), reformatted to ChatML with the RP-craft directive. A zero-tolerance safety filter (data/safety.py) hard-drops any conversation combining a minor indicator with any sexual signal. Adults-only mature content is retained by default because the benchmark scores NSFW axes; an SFW-only corpus is a one-flag switch. The filtered training JSONL is intentionally not redistributed; the builder script regenerates it.
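The hard-drop rule is a conversation-level AND of two signals. A minimal sketch of that logic is shown below; it is not the contents of data/safety.py, the function name `should_drop` is hypothetical, and the pattern lists are placeholders supplied by the caller.

```python
import re


def should_drop(
    conversation: list[str],
    minor_patterns: list[str],
    sexual_patterns: list[str],
) -> bool:
    """True if any minor indicator AND any sexual signal appear anywhere in the conversation."""
    text = "\n".join(conversation).lower()
    has_minor = any(re.search(p, text) for p in minor_patterns)
    has_sexual = any(re.search(p, text) for p in sexual_patterns)
    return has_minor and has_sexual


# Placeholder patterns for illustration only; a real filter would use curated lists.
MINOR = [r"\bminor_marker\b"]
SEXUAL = [r"\bexplicit_marker\b"]

flagged = should_drop(["a scene tagged minor_marker", "later tagged explicit_marker"], MINOR, SEXUAL)
adult_only = should_drop(["a scene tagged explicit_marker"], MINOR, SEXUAL)
print(flagged, adult_only)  # True False
```

Because the check runs over the whole conversation, the two signals do not need to occur in the same message for the conversation to be dropped, while adult-only content with no minor indicator is retained.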


Evaluation

Scored with rp-benchmark's own rule-based graders (objective_metrics + slop_detectors) over all 28 standard + adversarial seeds, generated through the local llama.cpp server. No API key or LLM judge is involved; these are deterministic craft metrics.

| Metric | Value |
| --- | --- |
| Mean objective score (0–100) | 62.7 |
| Mean AI-slop density (weight / 1k chars, lower is better) | 0.14 |
| Generation speed (Q4_K_M, 2 threads) | ~30–37 tok/s |

The very low slop density indicates the anti-slop training signal landed well. The full judged arena (community ELO, multi-turn judge, flaw-hunter vs. frontier models) requires an OpenRouter key and is not reflected here.
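For intuition about the "weight / 1k chars" unit, here is a rough sketch of a density metric of that shape. It is not rp-benchmark's slop_detectors code: the patterns are taken from the ANTI-SLOP rule in the directive, and the unit weights are assumptions.

```python
import re

# Hypothetical pattern weights; the benchmark's real detectors and weights may differ.
SLOP_WEIGHTS = {
    r"breath hitched": 1.0,
    r"shivers? ran down": 1.0,
    r"ghost of a smile": 1.0,
    r"tapestry of": 1.0,
}


def slop_density(text: str, weights: dict[str, float] = SLOP_WEIGHTS) -> float:
    """Summed weight of all pattern matches per 1,000 characters of text."""
    if not text:
        return 0.0
    total = sum(
        w * len(re.findall(p, text, flags=re.IGNORECASE)) for p, w in weights.items()
    )
    return total / (len(text) / 1000)


# One hit in exactly 1,000 characters gives a density of 1.0.
sample = "Her breath hitched. " + "x" * 980
print(slop_density(sample))  # 1.0
print(slop_density("Rain hammered the shutters."))  # 0.0
```

Under these unit weights, a density of 0.14 would correspond to about one flagged phrase per 7,000 characters of output.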


Limitations & risks

  • Small model. 0.5B params: expect occasional repetition, shallow long-range continuity, and rare agency slips (writing for the user's character). It will not rival large frontier RP models on nuance.
  • No safety alignment beyond data filtering. Mature content is present in training data; do not deploy to minors or in contexts requiring content guarantees. Add your own moderation layer for public deployments.
  • English-centric, tuned specifically for roleplay; weak on general tasks.
  • Outputs are fiction and may be inconsistent or factually wrong.

License

Released under Apache-2.0, the same license as the base model (Qwen2.5-0.5B-Instruct). Training data is subject to the terms of the LimaRP-augmented dataset. You are responsible for compliant, lawful use.

Citation

@misc{stealthrifle2026,
  title  = {Stealth-Rifle: a CPU-only roleplay fine-tune of Qwen2.5-0.5B},
  author = {Hauser, CJ},
  year   = {2026},
  url    = {https://huggingface.co/cloudunity/stealth-rifle}
}