How to use from the llama-cpp-python library
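# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the quantized GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="build-small-hackathon/limp-mode-leap1",
    filename="limpmode-leap1-Q4_K_M.gguf",
)

# Send a single chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])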

limp-mode-leap1: roadside triage fine-tune of Qwen3.5-4B

The brain of Limp Mode, an offline roadside copilot. It is fine-tuned to read a driver's messy description of a car problem and answer with a strict-JSON triage response: a STOP / CAUTION / DRIVE verdict, plain-language reasoning, over-inclusive hazard flags (they feed a deterministic safety floor downstream), no-tools roadside checks, a self-rescue plan adapted to how far away help is, and an anti-upsell script for the mechanic. It works in English and Spanish.

Training

  • Data: [N] examples, synthetic conversations from a frontier teacher grounded in verified knowledge bases (3,369 OBD codes, 64 ISO dashboard symbols, 38 hidden-gotcha entries, 15 roadside procedures), passed through deterministic quality gates: JSON schema, severity-floor consistency, enum vocabulary, knowledge grounding, 4-gram dedup, and n-gram decontamination against the eval suite. Includes adversarial slices: noisy retrievals whose correct answer ignores the provided context, and benign cases that punish overcaution.
  • Method: LoRA (r=32, alpha=64, completion-only loss) via Unsloth on Modal (L40S), thinking disabled, 3 epochs.
  • Formats: LoRA adapter, merged fp16, and GGUF Q4_K_M for llama.cpp.
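
Two of the quality gates listed under Data, 4-gram dedup and n-gram decontamination against the eval suite, can be sketched roughly as follows. This is a minimal illustration, not the project's actual gate code: the function names, whitespace tokenization, the Jaccard threshold of 0.8, and the decontamination n-gram size of 8 are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[Gram]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Gram], b: Set[Gram]) -> float:
    """Jaccard similarity of two n-gram sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_examples(
    examples: List[str],
    eval_texts: List[str],
    dedup_threshold: float = 0.8,  # assumed value, not from the model card
    decontam_n: int = 8,           # assumed value, not from the model card
) -> List[str]:
    """Drop eval-contaminated examples, then near-duplicates by 4-gram overlap."""
    eval_grams: Set[Gram] = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, decontam_n)

    kept: List[str] = []
    kept_grams: List[Set[Gram]] = []
    for example in examples:
        # Decontamination: any n-gram shared with the eval suite disqualifies.
        if ngrams(example, decontam_n) & eval_grams:
            continue
        # Dedup: compare 4-gram sets against every example kept so far.
        grams = ngrams(example, 4)
        if any(jaccard(grams, seen) >= dedup_threshold for seen in kept_grams):
            continue
        kept.append(example)
        kept_grams.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of kept examples, which is fine for a sketch but would likely need hashing or an index at dataset scale.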

Evaluation: 202-case golden suite

The metrics are safety-asymmetric: "dangerous-as-safe" (expected STOP, answered DRIVE) must be 0. Both rows are measured through the identical pipeline, so the difference is the fine-tune.

| stage | verdict accuracy | dangerous-as-safe | schema valid | knowledge surfaced |
|---|---|---|---|---|
| base Qwen3.5-4B, full pipeline | 83.2% | 0 | 99.5% | 98.9% |
| this model, full pipeline | 92.6% | 0 | 100% | 97.9% |
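
As a rough illustration of how the two headline metrics are defined, the sketch below scores a list of (expected, predicted) verdict pairs. It is not the public eval harness; the function name and the input shape are assumptions for the example.

```python
from typing import Dict, List, Tuple


def score_verdicts(cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score (expected, predicted) verdict pairs.

    verdict_accuracy: fraction of cases where the prediction matches exactly.
    dangerous_as_safe: count of cases expected STOP but answered DRIVE.
    """
    total = len(cases)
    correct = sum(1 for expected, predicted in cases if expected == predicted)
    dangerous = sum(
        1 for expected, predicted in cases
        if expected == "STOP" and predicted == "DRIVE"
    )
    return {
        "verdict_accuracy": correct / total if total else 0.0,
        "dangerous_as_safe": dangerous,
    }
```

Note the asymmetry: answering CAUTION for an expected DRIVE case lowers accuracy but is not counted as dangerous, while a single STOP answered as DRIVE violates the must-be-0 requirement regardless of accuracy.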

Per category, the fine-tuned model scores 100% on OBD-code and dashboard-symbol cases, 94.6% on hidden-cause cases, and 91.5% on free-form judgment. The honest soft spots are benign cases (81%, a little residual overcaution) and Spanish (84%).

Eval harness, suite, and full traces are public: https://huggingface.co/datasets/build-small-hackathon/limp-mode-traces

Usage

Deployed inside Limp Mode's pipeline: deterministic intake (symbols/OBD) → IDF retrieval over the gotchas KB → this model (strict JSON contract) → deterministic severity floor that can raise but never lower the verdict. Use the system prompt from the Space repo's app/pipeline.py for faithful behavior.

llama-server -m limpmode-leap1-Q4_K_M.gguf --port 8080 -ngl 99
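
Once the server is running, it can be queried over llama-server's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. The helper names, the temperature setting, and the "verdict" JSON field name are assumptions for illustration; the real contract and system prompt live in the Space repo's app/pipeline.py.

```python
import json
import urllib.request

# Matches the --port 8080 flag in the command above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"
VERDICTS = {"STOP", "CAUTION", "DRIVE"}


def build_payload(system_prompt: str, user_text: str) -> dict:
    """Build an OpenAI-style chat request body."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.0,  # assumed setting, not from the model card
    }


def parse_verdict(raw_content: str) -> dict:
    """Parse the model's JSON reply and check the verdict enum.

    The "verdict" key is a hypothetical field name for this sketch.
    """
    data = json.loads(raw_content)
    if not isinstance(data, dict) or data.get("verdict") not in VERDICTS:
        raise ValueError("reply is not a JSON object with a valid verdict")
    return data


def triage(system_prompt: str, user_text: str, url: str = SERVER_URL) -> dict:
    """Send one triage request to a running llama-server and parse the reply."""
    body = json.dumps(build_payload(system_prompt, user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return parse_verdict(reply["choices"][0]["message"]["content"])
```

Calling triage(system_prompt, "oil light is on and the engine is knocking") would return the parsed JSON object, which should then go through the deterministic severity floor rather than being shown to the user directly.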

Limitations

A 4B model for safety-adjacent advice: it is deliberately caged. The surrounding app never lets it downgrade hard-evidence emergencies, never lets it paraphrase verified procedures, and shows the user every safety override. Use it with the cage.
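
The core of the cage, a floor that can raise but never lower the model's verdict and that records every override for display, can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the app's code: the names, the severity ordering DRIVE < CAUTION < STOP, and the shape of the result are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering: a higher number means a more severe verdict.
SEVERITY = {"DRIVE": 0, "CAUTION": 1, "STOP": 2}


@dataclass
class FloorResult:
    verdict: str          # final verdict shown to the user
    overridden: bool      # True if the floor raised the model's verdict
    note: Optional[str]   # human-readable override message, if any


def apply_severity_floor(model_verdict: str, floor: str) -> FloorResult:
    """Return the more severe of the model verdict and the deterministic floor."""
    if model_verdict not in SEVERITY or floor not in SEVERITY:
        raise ValueError("verdicts must be one of STOP, CAUTION, DRIVE")
    if SEVERITY[floor] > SEVERITY[model_verdict]:
        note = f"Safety floor raised verdict from {model_verdict} to {floor}"
        return FloorResult(verdict=floor, overridden=True, note=note)
    # The floor never lowers: a model verdict at or above the floor stands.
    return FloorResult(verdict=model_verdict, overridden=False, note=None)
```

Because the function only ever takes the maximum severity, a model answer of DRIVE against hard evidence for STOP is corrected, while a cautious model answer is left alone.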
