First Resort (4B, MLX 4-bit)

A purpose-built language model for offline survival and first-aid reference, powering the First Resort iOS app (free on the App Store, requires iOS 17 or later). Fine-tuned from Qwen3.5-4B and quantized to 4-bit MLX format for on-device inference on Apple Silicon iPhones and iPads.

TL;DR

This model is the inference engine inside a free iOS app that gives civilians action-first survival and first-aid guidance with no internet connection required. It is not a general-purpose chatbot, not a medical professional, and not a substitute for emergency services. If you can call 911, call 911. This model is for the case where you cannot.

Model details

  • Base model: unsloth/Qwen3.5-4B
  • Adaptation: LoRA rank 64 with rsLoRA, fine-tuned on 11,053 training examples
  • Quantization: 4-bit MLX (this repository); also available as a 4-bit LoRA adapter for PyTorch/transformers
  • Languages: English
  • License: Apache 2.0 (inherits from Qwen3.5-4B base)
  • Intended runtime: Apple Silicon (M1 or later) via MLX on iOS 17+ devices

The model is trained with assistant_only_loss, so it learns the answer style without imitating the user's question style. It uses a custom chat template (chat_template.jinja in this repo) that reliably emits <|im_end|> as the end-of-sequence token. Reliable termination was the primary fix over earlier iterations, which suffered from runaway generation.
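As a rough illustration of what assistant-only loss masking means, the sketch below replaces every non-assistant label with -100, the value that PyTorch's cross-entropy loss ignores by default. It is not the TRL implementation: the per-token role list and the function name mask_non_assistant are hypothetical stand-ins for the mask that the trainer derives from the chat template.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens keep their ids.

    `roles` is a hypothetical per-token annotation used for illustration;
    a real trainer derives the equivalent mask from the chat template.
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy sequence: the ids are made up, the roles mark which turn each token is in.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]
labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32]
```

Only the last two positions contribute to the loss, which is why the model picks up the answer style without being trained to reproduce system or user text.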

Intended use

This model is intended to be used inside the First Resort iOS app. It can also be loaded directly via MLX for research, evaluation, or downstream fine-tuning. The model is designed to answer questions about:

  • First aid (bleeding, fractures, burns, CPR, hypothermia, heat exhaustion, allergic reactions, choking)
  • Survival in wilderness, desert, marine, mountain, and cold environments
  • Improvised tool use and gear substitutions when full equipment is unavailable
  • Recognizing danger signs that require evacuation or professional help
  • What to do during natural disasters (earthquakes, floods, wildfires, severe weather)

It is not intended for:

  • Medication dosing (it is trained to decline these)
  • Legal advice
  • Long-form conversation, fiction, or creative writing
  • Replacing emergency services or professional medical care

How to use

MLX (Apple Silicon, recommended)

```python
from mlx_lm import load, generate

# Download (or load from the local cache) the 4-bit MLX weights and tokenizer.
model, tokenizer = load("strifero/first-resort-mlx-4bit")

# Canonical system prompt; every training record uses this exact text.
SYSTEM = (
    "You are a pocket survival and first-aid reference for civilians in "
    "emergencies. Answer in short, direct sentences. Lead with the action. "
    "No hedging. No bureaucratic language. If a question is outside your "
    "scope (medication doses, legal advice, real-time data, procedures "
    "requiring medical training), say so directly and redirect to the "
    "right resource."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "My friend just got bit by a rattlesnake on the calf. What do I do?"},
]
# Render the conversation with the bundled chat template and open the assistant turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# verbose=True already streams the response to stdout, so the returned text is
# stored instead of being printed a second time.
response = generate(model, tokenizer, prompt, max_tokens=512, verbose=True)
```

Chat template

The model expects the standard Qwen ChatML format:

```text
<|im_start|>system
{SYSTEM_PROMPT}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
```

Use the included chat_template.jinja for best results. The model is trained to terminate at <|im_end|>, so make sure your inference setup treats it as a stop token.
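For setups that cannot load the Jinja template, the layout above can be approximated by hand. The following is a minimal sketch, assuming the template adds nothing beyond what is shown in this section; the bundled chat_template.jinja remains the authoritative source and may differ in details. The helper names build_chatml_prompt and truncate_at_stop are illustrative, not part of any library.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages):
    """Render role/content dicts in the ChatML layout and open the assistant turn."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def truncate_at_stop(text, stop=IM_END):
    """Keep only the text before the first stop token, if one is present."""
    return text.split(stop, 1)[0]


messages = [
    {"role": "system", "content": "You are a pocket survival and first-aid reference."},
    {"role": "user", "content": "How do I treat a minor burn?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)

# If a runtime returns raw text that runs past the stop token, cut it manually.
print(truncate_at_stop("Cool the burn under running water.<|im_end|>leftover"))
```

Truncating after the fact is only a fallback: a runtime that honors the stop token also stops spending compute once the answer is complete.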

Recommended generation settings

| Parameter | Value | Notes |
| --- | --- | --- |
| max_tokens | 200 to 1000 | Most answers terminate well under 200. 1000 is the format-compliance cap. |
| temperature | 0.0 to 0.3 | Deterministic at 0.0; 0.3 if you want slight variation. |
| top_p | 1.0 | |
| repetition_penalty | 1.0 to 1.1 | 1.1 helps avoid loops on rare prompts. |
| do_sample | False | Greedy decoding is the default. |

Training data

11,053 supervised fine-tuning examples in messages format, drawn from these slices:

| Slice | Records | Source |
| --- | --- | --- |
| base | 9,274 | Q&A pairs grounded in passages from military, FAA, NOAA, USCG, ready.gov, WMS, AHA, and first-aid public-domain sources |
| adversarial | 1,350 | Hand-curated edge cases and tricky questions across the corpus |
| graceful_decline | 168 | Out-of-scope questions (medication doses, legal advice, real-time data) with model trained to refuse and redirect |
| short_query | 95 | Short conversational queries to avoid over-formal responses on simple inputs |
| filler_v3_1 | 45 | Acknowledgment-style responses ("thanks", "ok") |
| snake_anchors | 31 | Hand-written snake bite first-aid records |
| hypothermia_anchors | 23 | Hand-written hypothermia first-aid records |

All training records use the same canonical system prompt shown in the "How to use" section above. Validation set is a stratified 10% holdout.
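A stratified holdout keeps each slice represented in the validation set in proportion to its size. The sketch below shows one way to do that in plain Python. The "slice" field name, the fixed seed, and the rounding rule are all assumptions made for illustration, since the actual split script is not described here.

```python
import random
from collections import defaultdict


def stratified_holdout(records, key="slice", fraction=0.10, seed=0):
    """Split records into (train, val), holding out `fraction` of every slice.

    Each slice contributes at least one validation record, so a slice with a
    single record would end up entirely in the validation set.
    """
    rng = random.Random(seed)
    by_slice = defaultdict(list)
    for rec in records:
        by_slice[rec[key]].append(rec)

    train, val = [], []
    for name in sorted(by_slice):  # sorted for a reproducible order
        group = by_slice[name][:]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy data: 100 "base" records and 20 "graceful_decline" records.
records = (
    [{"slice": "base", "id": i} for i in range(100)]
    + [{"slice": "graceful_decline", "id": i} for i in range(20)]
)
train, val = stratified_holdout(records)
print(len(train), len(val))  # 108 12
```

Splitting per slice matters most for the small slices: a plain random 10% split could easily leave a slice of a few dozen records with no validation examples at all.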

The corpus is derived from public-domain and government-published material (US military field manuals, NOAA weather safety guides, USCG marine survival publications, FAA emergency procedures, ready.gov disaster preparedness, AHA first aid guidelines, Wilderness Medical Society protocols).

Training procedure

| Hyperparameter | Value |
| --- | --- |
| Base model | unsloth/Qwen3.5-4B (8-bit) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| rsLoRA | enabled |
| Target modules | 7 (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) |
| Trainable parameters | 84.9M (1.84% of base) |
| Epochs | 2 |
| Learning rate | 5e-5 |
| Max sequence length | 1,280 |
| Per-device batch size | 2 |
| Gradient accumulation | 4 |
| Effective batch size | 16 (2 GPUs x 2 x 4) |
| Max grad norm | 1.0 |
| Loss masking | assistant_only_loss |
| Chat template | Custom (chat_template.jinja in repo) |
| Trainer | TRL SFTTrainer via Unsloth |
| Hardware | 2x NVIDIA RTX A4500 (NVLink, DDP) |
| Wall time | 3 hours 5 minutes |

Eval loss was lowest at the end of epoch 2 (0.96, versus 1.10 after epoch 1), so the published checkpoint is the one from the end of epoch 2.

After training, the LoRA adapter was merged into the base model and converted to MLX 4-bit format using mlx-lm tooling.
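Two rows of the hyperparameter table are derived quantities, and a short calculation makes them concrete. Standard LoRA scales the adapter update by alpha / rank, while rank-stabilized LoRA (rsLoRA) scales it by alpha / sqrt(rank). The effective batch size is the product of GPU count, per-device batch size, and gradient-accumulation steps. The function names below are illustrative only.

```python
import math


def lora_scale(alpha, rank, rslora=False):
    """Scaling applied to the LoRA update: alpha/rank, or alpha/sqrt(rank) for rsLoRA."""
    return alpha / math.sqrt(rank) if rslora else alpha / rank


def effective_batch_size(num_gpus, per_device, grad_accum):
    """Number of examples contributing to each optimizer step."""
    return num_gpus * per_device * grad_accum


print(lora_scale(128, 64))               # 2.0  (standard LoRA)
print(lora_scale(128, 64, rslora=True))  # 16.0 (rsLoRA, as configured here)
print(effective_batch_size(2, 2, 4))     # 16
```

With rank 64 and alpha 128, enabling rsLoRA makes the adapter scaling eight times larger than the standard formula would, which is worth keeping in mind when comparing this learning rate with non-rsLoRA runs.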

Evaluation

Evaluated on a 272-question held-out test set drawn from the same source categories as the training corpus, with passages and questions never seen during training. Each model response was graded by an LLM judge (Qwen 3.6 35b-a3b at temperature 0) on four axes (each 1 to 5):

  • faithful: Is the answer supported by the source passage?
  • voice: Does it match the airline-pilot direct, action-first style?
  • answerable: Does it answer the question asked?
  • safety: Is the advice safe for an untrained civilian?

Each response gets a verdict of keep, fix, or reject based on the per-axis scores.

Aggregate result

| Metric | Value |
| --- | --- |
| Aggregate score (sum of axes / 20 * 100) | 59.6 / 100 |
| keep | 27 (9.9%) |
| fix | 163 (59.9%) |
| reject | 82 (30.1%) |

Per-axis averages (out of 5)

| Axis | Average |
| --- | --- |
| faithful | 2.07 |
| voice | 3.68 |
| answerable | 3.26 |
| safety | 2.90 |
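The aggregate and the verdict percentages can be reproduced from the figures reported above. This sketch applies the stated formula (sum of axes / 20 * 100) to the rounded per-axis averages, which gives roughly 59.55. That is consistent with the reported 59.6, given that the published averages are themselves rounded. The variable and function names are illustrative.

```python
axis_averages = {"faithful": 2.07, "voice": 3.68, "answerable": 3.26, "safety": 2.90}
verdicts = {"keep": 27, "fix": 163, "reject": 82}


def aggregate_score(axes, max_per_axis=5):
    """Sum of axis scores as a percentage of the maximum possible total."""
    return sum(axes.values()) / (len(axes) * max_per_axis) * 100


def verdict_rates(counts):
    """Share of each verdict as a percentage, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(n / total * 100, 1) for name, n in counts.items()}


score = aggregate_score(axis_averages)
rates = verdict_rates(verdicts)
print(f"aggregate: {score:.2f}")
print(rates)  # {'keep': 9.9, 'fix': 59.9, 'reject': 30.1}
```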

Format compliance

A separate format-compliance smoke test of 14 prompts at max_tokens=1000 showed all 14 responses terminating cleanly via <|im_end|>, with token counts ranging from 42 to 138. Clean termination was the primary fix over earlier iterations.
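A check of this kind can be sketched in a few lines. The version below assumes each sample records the raw decoded text (with special tokens kept) and the number of generated tokens. The sample data, field names, and function names are made up for illustration and are not the smoke test's actual outputs or code.

```python
IM_END = "<|im_end|>"


def terminated_cleanly(raw_text, n_tokens, max_tokens=1000):
    """True if the stop token appears and the token cap was not reached."""
    return IM_END in raw_text and n_tokens < max_tokens


def compliance_summary(samples, max_tokens=1000):
    """Count clean terminations and report the token-count range of those runs."""
    clean = [s for s in samples if terminated_cleanly(s["text"], s["tokens"], max_tokens)]
    counts = [s["tokens"] for s in clean]
    return {
        "clean": len(clean),
        "total": len(samples),
        "min_tokens": min(counts) if counts else None,
        "max_tokens": max(counts) if counts else None,
    }


# Made-up samples for illustration; the third one simulates a runaway generation.
samples = [
    {"text": "Apply firm pressure to the wound.<|im_end|>", "tokens": 42},
    {"text": "Get to high ground now.<|im_end|>", "tokens": 57},
    {"text": "Keep the limb still and keep the limb still and", "tokens": 1000},
]
summary = compliance_summary(samples)
print(summary)  # {'clean': 2, 'total': 3, 'min_tokens': 42, 'max_tokens': 57}
```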

Notes on the score

The 59.6 aggregate looks low in isolation, but the judge model grades harshly. The faithful axis in particular penalizes any answer that does not directly cite the source passage, even when the answer is factually correct from general knowledge. The voice and answerable scores (3.68 and 3.26 out of 5) indicate that the model produces reasonable, on-topic responses in the intended action-first style.

An earlier in-house, hand-curated 33-question rubric (used during development for go/no-go decisions) gave this model 85.6 out of 100. That rubric is a different metric and is not directly comparable to the 272-question held-out evaluation above.

Limitations and safety

This model is a small language model fine-tuned on a survival and first-aid corpus. Like any small language model, it will sometimes:

  • Hallucinate plausible-sounding but incorrect details
  • Be confidently wrong about specifics (drug doses, exact angles, numeric thresholds)
  • Miss context cues that would change the right action
  • Produce responses that are correct in isolation but inappropriate for the specific situation

This model is not a substitute for professional medical care, emergency services, or trained survival expertise. If you have phone signal and an emergency, call your local emergency number. The model is intended as a reference when professional help is unavailable or delayed.

The model is trained to decline medication-dose questions and legal-advice questions. If you observe it providing specific drug dosages, treat the output as untrusted and verify against a real medical source.

Children should not use this model unsupervised. The app is rated 4+ on the App Store, but the underlying subject matter (injury, emergency response, medical situations) is not appropriate for unsupervised use by younger children.

The model has no awareness of real-time information. It cannot see your location, the current weather, your medical history, or the actual condition of the person you are asking about. Treat every output as a starting point for thinking, not as a step-by-step prescription.

Related artifacts

The iOS app collects zero data about the user. All inference happens on-device. The privacy policy linked above documents this in detail.

License

Apache 2.0, inheriting from the Qwen3.5-4B base model. You are free to use, modify, and redistribute this model, including for commercial use, subject to the standard Apache 2.0 conditions.

Acknowledgments

  • Qwen team for the Qwen3.5 base model
  • Unsloth for the training and quantization tooling
  • TRL team at Hugging Face for the SFT trainer
  • Apple MLX team for the inference framework
  • Public-domain content from US military, NOAA, USCG, FAA, ready.gov, AHA, and Wilderness Medical Society publications

Contact

For questions about the model or the First Resort app, email support@strifetech.com.

For bug reports on the training pipeline, open an issue on the GitHub repo.

For commercial licensing inquiries beyond what Apache 2.0 grants, reach out via email.
