Medical Wayfinder — Gemma 4 E2B

Navigation got you to the parking lot. Medical Wayfinder gets you to the doctor.

A fine-tuned Gemma 4 E2B for on-device healthcare facility wayfinding in English and Spanish. Patients describe a destination ("cardiology", "MRI", "where's parking for the children's ER?"); the model returns step-by-step directions with landmarks, accessibility info, and check-in instructions — all running locally on a phone via llama.cpp + Metal GPU. No PHI leaves the device.

Submission for the Gemma 4 Good Hackathon. Code repository: github.com/jmdevita/medical-wayfinder.

Quick start

This repo ships three artifacts. Pick the one that matches your use case:

Run inference with llama.cpp / Ollama / LM Studio (GGUF)

# llama.cpp
llama-cli -hf jmdevita/medical-wayfinder-gemma-4-e2b --jinja \
  --model-file gemma-4-e2b-it.Q4_K_M.gguf

# Ollama
ollama create medical-wayfinder -f Modelfile
ollama run medical-wayfinder

Load the merged model with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "jmdevita/medical-wayfinder-gemma-4-e2b",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("jmdevita/medical-wayfinder-gemma-4-e2b")

Apply the LoRA adapter to your own base copy

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-4-e2b-it")
model = PeftModel.from_pretrained(base, "jmdevita/medical-wayfinder-gemma-4-e2b")

What it does

The model is a wayfinding assistant, not a medical advisor. Given:

  1. A system prompt that defines a strict JSON response contract (destinations, steps, accessibility badges, disambiguation prompts, arrival markers)
  2. A CONTEXT block describing a specific facility — its departments, entrances, parking lots, and topology graph
  3. A user query in English or Spanish

…it emits a structured JSON response that the host app parses into a multi-step walking guide. Five facilities ship in the open-source app: Atrius Boston Kenmore, Kaiser Panorama City, Massachusetts General, Southern JP, and Tufts Medical Center.

A deterministic Dart-side orchestrator handles alias lookup and Dijkstra path-finding over a hand-authored topology graph — the model handles intent classification, hedging, multilingual phrasing, and accessibility-aware step formatting.

Training details

Base model google/gemma-4-e2b-it (5.1B params, 2.3B effective). Trained from Unsloth's 4-bit quantized variant unsloth/gemma-4-e2b-it-unsloth-bnb-4bit for memory efficiency on a consumer GPU.
Adapter LoRA, rank 8
Training steps 78
Dataset size 310 examples
Dataset source 100% synthetic, generated by a larger teacher LLM (qwen3.5-122B) against a published generation prompt at training/data/prompts/generation.txt. Curated 1000 directional phrases from public call-center datasets anchor the synthetic data (no real patient queries).
Training framework Unsloth
Quantization GGUF Q4_K_M (3.4 GB) for on-device inference
Verification Merged-then-quantized GGUF SHA differs from base (14638e2b… vs e781b34b…), confirming the adapter is in the weights

Evaluation

Held-out 100-case eval suite. Same production system prompt runs against base and fine-tune; only weights change. Judge: gpt-oss-120b (cross-family, JSON-schema-constrained, default reasoning effort). Suite is published verbatim at training/data/eval/eval_suite.jsonl.

Headline

Metric Base Fine-tune Δ
Mean rubric score (1-5) 3.62 3.98 +0.36
Strict pass (corr ≥ 4 AND mean ≥ 3.5) 28% 38% +10 pp
Soft pass (corr ≥ 3 AND mean ≥ 3.5) 47% 56% +9 pp
English mean 3.57 3.94 +0.37
Spanish mean 3.92 4.17 +0.25

Per-criterion

Criterion Base Fine-tune Δ
Scope Handling 3.37 4.20 +0.83
Correctness 3.07 3.46 +0.39
Accessibility 3.59 3.88 +0.29
Landmarks 3.17 3.38 +0.21
Format 4.90 4.96 +0.06

Every criterion lifts. Scope Handling moved most — the targeted round-3 distillation pass added 20 batches of scope_enforcement examples and explicitly forbid "I'm not able to give medical advice" hedging.

Spanish now outscores English under the production configuration (4.17 vs 3.94, gap of -0.23). Training set is ~30% Spanish examples after the bilingual category pass.

One trade-off worth flagging

Verbatim route-copy rate (the model's ability to reproduce landmark prose character-for-character) regresses with the May-15 prompt revision (67% → 50% on the same fine-tune). The longer, more directive new prompt nudges the model to paraphrase. Other metrics improve, so the net is positive on mean, strict pass, soft pass, and Spanish — but the verbatim cost is the largest single regression in the eval matrix.

Full eval methodology — including a 2×2 controlled comparison (model × prompt), per-criterion failure-mode breakdown, and rubric design rationale — is reproducible from the committed eval suite. See the four canonical JSONs:

  • training/output/eval_results/eval_summary_gemma4_e2b_2026-05-15T22-29-19.json (base + new prompt)
  • training/output/eval_results/eval_summary_gemma4-e2b-wf-cp78_2026-05-15T22-55-36.json (cp78 + new prompt)
  • Plus the two old-prompt runs for the 2×2 controls

Run env/bin/python training/scripts/eval_runner.py with the corresponding model and the system prompt at health_wayfinder/assets/system_prompt.txt to reproduce.

Intended use

  • In-scope: Hospital/clinic wayfinding queries in English or Spanish, against a CONTEXT block derived from a structured facility JSON file. The model expects the system prompt at health_wayfinder/assets/system_prompt.txt and emits a JSON response per the schema in that prompt.
  • Out of scope: Medical advice, diagnosis, triage, appointment scheduling, EHR integration, billing inquiries, or any clinical decision. The system prompt explicitly classifies these as out-of-scope and the model is trained to deflect them.
  • Deployment target: On-device on iOS via llama.cpp + Metal GPU. Q4_K_M quantization fits a ~3.4 GB binary in the app bundle; first-launch copy to Documents directory.

Limitations and known issues

  • Eval is directional, not statistically significant — 100 cases at a single seed.
  • The eval suite was authored alongside the data contract, which biases the results in the way training-adjacent evals always do. The suite is published verbatim for reproducibility.
  • Training data is 100% synthetic — anchored with curated real-world directional phrases but no real patient queries. Anchoring with 50–100 real queries is the next dataset improvement.
  • Hedging on edge cases — "I can't walk far" or "I'm on the orange line" still get over-applied medical-question template responses ~25% of the time. Further prompt sharpening has diminishing returns; the fix is more diverse retraining data.
  • Per-prompt verbatim trade-off documented above.
  • Multimodal (photo re-orientation) has the camera path live and the model path stubbed; that's a V2 item. The BF16-mmproj.gguf in this repo is published for future multimodal work but unused by the current app.

License

This model is a derivative of Google's Gemma 4 E2B and is therefore subject to the Gemma Terms of Use in addition to anything stated here. By downloading you agree to those terms.

The training data, eval suite, and accompanying code in the GitHub repository are licensed CC-BY 4.0.

Citation

@misc{medical-wayfinder-2026,
  author = {De Vita, Julian},
  title  = {Medical Wayfinder: On-device fine-tuned Gemma 4 E2B for multilingual hospital navigation},
  year   = {2026},
  url    = {https://huggingface.co/jmdevita/medical-wayfinder-gemma-4-e2b},
  note   = {Gemma 4 Good Hackathon submission},
}

Acknowledgements

  • Google DeepMind for the Gemma 4 family
  • The Unsloth team for the fine-tuning framework (~2× faster training on a consumer GPU)
  • OpenStreetMap contributors — facility data is derived in part from OSM under ODbL §4.3
  • Jamshidi et al. (HERD 2025), Sela et al. (AMIA 2018), González Cueto et al. (JGIM 2024) for the peer-reviewed evidence underpinning the problem framing — see SOURCES.md in the GitHub repo

Trained 2× faster with Unsloth.

Downloads last month
178
Safetensors
Model size
5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support