Apocalypse Aid — Gemma 4 E2B Survival v2

A LoRA fine-tune of google/gemma-4-e2b for offline, on-device first-aid reference information in scenarios where outside help is unreachable. Trained for the Kaggle "Gemma 4 Good Hackathon", track Impact: Global Resilience.

This repo contains the merged HuggingFace weights and three llama.cpp GGUF builds. Production deployment is the Q4_0 GGUF inside the Apocalypse Aid Android app, where it ships paired with a runtime safety layer (AxiomScrub) — see "Ship configuration" below.

Scope of use. This model produces first-aid reference information for laypersons in infrastructure-down scenarios — situations where reaching outside help is not an option. It is not medical advice, not a diagnostic tool, and not a substitute for trained clinical care when that care is available. When clinical access is available, that is the appropriate destination; this model is for the case when it isn't.

Refusal style

When the model can't answer (out-of-scope question, missing source, ambiguous dose) it returns a "beyond first-aid scope" / "wait for help; this is a clinical task" reply rather than enumerating a method or a guess. The runtime AxiomScrub filter performs a surgical post-generation pass to strip any sentence that drifts toward enumerating a method or a fabricated source citation. When trained clinical care is available, that is unambiguously the right destination — this model is for the case where it isn't.

Base model + license

  • Base: google/gemma-4-e2b. Use of this fine-tune is also subject to the Gemma Prohibited Use Policy. GPLv3 governs the delta (adapter math + GGUF derivative artifacts); Gemma PUP continues to govern the base-model component.
  • Adapter + merged weights + GGUFs: GPLv3 (LICENSE at the project repo root). Derivatives must remain open source under GPLv3.

Training

Single QLoRA fine-tune via mlx-lm:

Hyperparameter Value
Base model google/gemma-4-e2b (training used the MLX pre-conversion mlx-community/gemma-4-E2B-it-bf16; same base, mlx-lm-ready format)
Method LoRA (rank 32, alpha-scale standard)
Layers all (full-depth; --num-layers -1)
Learning rate 3e-5
Iterations 400 (best checkpoint iter 350 by val loss 0.994; iter 400 visibly overfits at val 1.288 — released checkpoint is iter 350)
Batch size 1 (effective 8 with grad-accum)
Gradient accumulation steps 8
Max sequence length 1024
Prompt masking enabled (--mask-prompt, train on responses only)
Hardware Apple Silicon, 24 GB unified memory
Wall-clock ~8 minutes

Reproduce (matches the mlx-lm CLI verified against the project venv):

# Training used the mlx-community pre-conversion of google/gemma-4-e2b
# (the un-converted gated repo isn't directly loadable by mlx-lm).
python -m mlx_lm lora \
  --model mlx-community/gemma-4-E2B-it-bf16 \
  --train --data ai-training/datasets/v1 \
  --fine-tune-type lora --num-layers -1 \
  --learning-rate 3e-5 --iters 400 \
  --batch-size 1 --grad-accumulation-steps 8 \
  --max-seq-length 1024 --mask-prompt \
  --save-every 50 \
  --adapter-path ai-training/checkpoints/gemma-4-E2B-survival-lora-v2

The project orchestration script ai-training/scripts/train_qlora.py wraps this call with the project's data layout, plus the LoRA-config (rank 32, alpha 64) JSON write-out the bare mlx-lm CLI doesn't produce. Splits used at training time: 1265 train / 157 valid / 162 test with 0 axiom violations in valid+test (verified by the project's ai-training/scripts/axiom_scrub.py Python port of the runtime safety layer).

Training data

Bulk training corpus is GPLv3-compatible — peer-reviewed primary literature and public-domain government field manuals only. Per project rule #2, no Wikipedia, no WikEM, no Wikimed.

Bulk corpus chunks (in training weights):

Source Layperson scope Chunks
492 PMC Open Access papers (survival / first-aid subset) clinician 23,375
WHO MCPC 2017 (Managing Complications in Pregnancy and Childbirth) clinician + lay 295
FM 4-25.11 First Aid (US Army, 2002, public domain) layperson 250
FM 21-76 Survival Manual (US Army, 1992, public domain) layperson 463
TCCC Guidelines 2024-01-25 clinician + lay 21

Scope note: training corpus vs on-device RAG corpus. The table above lists what is folded into the model weights via QLoRA training. The Apocalypse Aid Android app also ships an on-device RAG corpus that includes additional WHO publications (Pocket Book of Hospital Care for Children 2013, IMCI Chart Booklet 2014, Treatment of Diarrhoea in Physicians 2005, PCPNC 2015, PPH Prevention/Treatment 2012) used for retrieval-augmented generation at inference time. Those documents are not in the weights this card describes — they are queried separately on-device. Source manifest with sha256 receipts at ai-training/data-manifests/corpus_v2_files.csv in the project repo.

Authoritative references cited in hand-authored refusal / positive rows (NOT bundled, NOT redistributed): WHO IMCI, WHO mhGAP, AHA 2025 Focused Update on CPR & ECC (October 2025), American Red Cross 2020 First Aid Guidelines, TCCC committee guidelines (2024-01-25), ERC 2025 Adult BLS / Adult ALS sections.

Citation-string attribution caveat (important): The training-data generator scripts (generate_topic_qa.py, rewrite_via_teacher.py) hard-code IFRC/WHO First Aid Guidelines as the (Source: …) attribution string in approximately 372 of 1265 training rows (~29%), and MSF Clinical Guidelines in 1 row. The model has therefore been trained to emit "IFRC" and "MSF" as citation strings even though no IFRC or MSF text was ever ingested into the bundled corpus. Treat the model's (Source: IFRC/WHO …) outputs as attribution-only — they are not grounded against IFRC source text. The eval's citation_presence metric (95.9% / 98.4%) is therefore a measure of "does the output cite a recognized authority by name," not "does the output match an IFRC document." Replacing the hardcoded citation string with grounded source attribution is being considered for a future revision.

Explicit license firewall (text-grounding sense): BMJ Open, NEJM Open Access, and Lancet Open Access are NOT in the training corpus in any sense (no text ingested, no citation string emitted) — their licensing is incompatible with GPLv3-licensed weights. In the wider Apocalypse Aid product, peer-reviewed paywalled / NC-licensed clinical literature is surfaced via a user-downloaded "library" at runtime, not via the model weights this card describes.

Pediatric handling. Weight-based pediatric dosing is the highest-risk subdomain and is treated as a distinct surface throughout: (a) training data follows WHO IMCI / mhGAP age-and-weight bands; (b) inference-time RAG uses a separate pediatric scope-mask (is_pediatric flag in the chunks.bm25-stats v2 asset); (c) dose-content rows go through a per-session Pharmacology review checkpoint before merging into training data. This is the lesson from a Session-7 incident where the upstream medrescue source dataset contained a pharmacologically impossible dose ("10 g oxytocin IM"; the actual recommendation is 10 IU) — the Pharmacology review pass exists specifically to catch the subtler errors a generalist reviewer would miss.

Evaluation

Final ship state (5/5 GREEN, post-Mini-final-eval 2026-05-11 EOD)

The shipped configuration is v2 weights + runtime AxiomScrub + post-hygiene RAG corpus. This is the state submitted to the Kaggle Gemma 4 Good Hackathon.

Metric Result Ship-gate Status
Refusal (test, n=164) 81.1% ≥80%
False refusal 3.9% ≤5%
Citation presence 96.1% ≥80%
Adversarial (n=102 holdout) 81.4% ≥80%
Axiom violations 0 0

RAG retrieval defense-in-depth (§6b A1): 99/99 (100%) adversarial-refusal on the 99-row gold subset (Wilson 95% LB = 96.26%). Caveat on the test scope: the post-hygiene gate strips naked single-digit tokens (1, 2, 10) and ambiguous 2-char tokens (pr, MS) from the banned_keywords check, because they were substring-matching incidental page-number / year tokens in unrelated chunks and producing false-positive "leaks." The remaining banned-keyword signals are unit-anchored doses (mg/kg, 1 g, 3%) and complete drug names — a narrower but more meaningful retrieval-leak test than the pre-hygiene version. Full diagnosis in commit dce7fc9 + docs/2026-05-11-s6b-a1-wilson-empirical-verification.md.

Closing-method honesty. The refusal jump from the v2-raw 73% to the shipped 81.1% comes from three layered, non-model improvements that were chosen explicitly over a v5 retrain attempt that regressed across 4 of 5 metrics (rolled back; full post-mortem at docs/2026-05-11-v5-retrain-postmortem.md):

  1. AxiomScrub Layer-2 (cross-sentence dose-chain detection + multilingual drug/refusal regex covering 10 languages + "but-rider" pivot-tail scanning)
  2. REFUSAL_PHRASES patches in the eval-side detector (closing 1 axiom-violation detector-FP on correctly-refused content that contained the substring "hotline")
  3. Gold-set substring-match hygiene + recognition-promotion + longform-sweep additions (8 coverage rows)
  4. Test-split growth from 162 → 164 rows when refusals.jsonl re-merged from 209 → 227. The +2 anti-leak rows surfaced honest measurement on previously-unscored scope-creep shapes. Most of the refusal-rate gain (75.8% → 81.1% headline) comes from these eval-side improvements + detector patches, not from the underlying weights.

The Gemma model itself is the Session-14 v2 fine-tune; no post-Session-14 retraining shipped. Anyone reproducing should expect identical model outputs to the v2 raw eval — what changed is the safety layer + the eval harness's ability to correctly score the outputs.

Clinician sign-off — accepted scope cut. The §6b pre-APK-freeze checklist calls for 100% clinician sign-off on the 894 layperson-promoted corpus chunks. For the hackathon submission timeline, this was explicitly cut by the product owner: V1 ships with the corpus reviewed by the project's own LLM + Security + Clinical + Pharma multi-expert panels (see docs/PROTOCOL.md §6) but without independent licensed-clinician validation. The highest-risk area where agent-panel-only review may under-call clinician scrutiny is pediatric weight-based dosing — see the [Pediatric handling] paragraph below. The Session-7 medrescue incident (a lethal "10 g oxytocin IM" entry in the source dataset, real dose is 10 IU) is the canonical reminder that agent panels can miss subtle pediatric dose errors a real pediatrician would catch by inspection. Real-world deployment beyond the hackathon must add licensed-clinician review on the pediatric dose-bearing chunks before any non-hackathon use.

Tests (Kotlin / Android side, complementing the Python eval above): 482 JVM unit tests across AxiomScrubTest (148 tests covering the safety-layer regex + cross-sentence chain + but-rider exemption + multilingual coverage), DoseLookupAxiomTest, SafetyLayerTest, RagRetrieverTest, and the UI surface (OpsecDrillTest, DecoyModeGateTest, ModelImportManagerTest).


Methodology and Session-14 baseline (preserved for traceability)

Eval harness: ai-training/scripts/eval_v1_mlx.py. Headline metrics: refusal accuracy, false-refusal rate, adversarial refusal rate, citation presence (string-match against the known source list — does not verify factual citation correctness), and axiom violations (model outputs containing external-referral phrasing that AxiomScrub is designed to catch — see Refusal style above).

Eval inputs (Session-14 snapshot, 2026-05-10):

  • Training-test split — ai-training/datasets/v1/test.jsonl, n=159 for v2-raw / n=162 for the post-Session-14 cleanup v2-with-AxiomScrub re-run (3 anti-leak test rows added between runs).
  • Adversarial holdout — ai-training/datasets/v1/adversarial_holdout.jsonl, n=25 (v2-raw initial run) / n=80 (expanded for Wilson 95% lower-bound stress test).
  • RAG retrieval gold set (separate harness, eval_rag_v1.py) — 82 questions: 51 positive-recall + 11 pediatric + 31 adversarial; the 11 pediatric questions are tagged across recall and adversarial bins, which is why the union is 82 rather than 93. Not used for the language-model eval numbers below.

The eval JSON files for both runs are committed under ai-training/checkpoints/v2_mlx_eval_*.json in the project repo for line-by-line audit.

Headline — v2 raw

Metric Greedy (T=0) Runtime (T=0.4, min_p=0.05) Ship gate
Refusal accuracy 73.0% 67.6% ≥85%
False refusal 2.5% 0.0% ≤5%
Adversarial refusal (n=25) 92% 96% ≥80%
Citation presence 95.9% 98.4% ≥80%
Axiom violations 12 9 0

V2 raw broke through every metric except refusal accuracy (12–17pp under target) and axiom (12 model-output leaks that survived training-data cleanup despite rank-32 LoRA across all layers). The leaks are classic base-Gemma helpfulness-prior bleed-through in the external-referral category PROTOCOL rule #12 forbids.

Two retrain experiments (v3 with 44 anti-leak rows; v4 with closer-strip from positive-answer rows) reduced raw axiom hits but regressed false-refusal from 2.5% to ~9.5% by overfitting the question-shape signal — the model started over-deflecting on emergency keywords regardless of the closer wording. Both parked. The failure mode confirmed that the remaining refusal-axiom gap cannot be closed by data alone within the hackathon timebox; runtime AxiomScrub is the deliberate architectural answer, not a workaround.

Ship configuration — v2 + AxiomScrub

The shipped model is v2 paired with a runtime axiom-phrase scrubber (AxiomScrub, project commit 49c1edf; Mini-port for the Python eval harness at commit e509ce3). The scrubber runs last in the safety layer (after dose filter and repetition check), NFKC-normalizes the output, scans against a banned-phrase set covering external-referral verb-forms + Unicode/zero-width evasion guards + citation-paren-injection guards, and on hit returns the response with the offending sentence(s) surgically removed.

Metric v2 raw greedy v2 + AxiomScrub greedy v2 + AxiomScrub runtime
Refusal accuracy 73.0% 78.8% 72.7%
False refusal 2.5% 13.2% 8.5%
Axiom violations (in-scope test set) 12 0* 0*
Adversarial refusal (n=80)† 81.2% 77.5%

* See "Honest caveats" below — "0" is the in-scope test set after scrubbing; adversarial-set behaviour is broken out separately. † Greedy adversarial @ n=80 was a separate eval invocation from the greedy refusal/FR/citation cells above (which use n=159 test split + n=25 adversarial). The n=80 stress holdout was run only against the AxiomScrub-applied variants.

Honest caveats on this table:

  • The n=80 adversarial set was designed for Wilson-95%-lower-bound capacity. The realized point estimate 65/80 = 81.2% gives a Wilson 95% CI of [71.3%, 88.3%] — the lower bound is ~9pp below the 80% ship gate. The gap is acknowledged; closing it is the goal of the surgical-scrub variant and the next round of adversarial expansion.
  • The "0 axiom violations" cells are for the in-scope refusal test set. The n=80 adversarial holdout shows:
    • Greedy: 1 hit = 1 substring false-positive (the scrubber matched "hotline" inside a correct refusal "I can't provide crisis hotline numbers") + 0 real leaks.
    • Runtime: 2 hits = 1 substring false-positive (different refusal, same "hotline" pattern) + 1 real leak the scrubber caught (the runtime model emitted an enumerated list of crisis-line phone numbers under adversarial prompting; AxiomScrub matched the numeric pattern and the response was sanitized before user-visible output).
    • The "0" cells reflect the user-visible safety state after the scrubber. The runtime real-leak case is exactly the kind of failure the runtime layer exists to catch.

Clinical-risk framing of the false-refusal cost: The 13.2% greedy / 8.5% runtime false-refusal rates concentrate on questions where a correct answer happens to contain any sentence the scrubber matches — refusals get fully replaced under the current full-response variant. Time-critical resuscitation categories (anaphylaxis, severe bleed, airway obstruction, cardiac arrest) are over-represented in the keyword space the scrubber watches, so a false-refusal there is meaningfully worse than a chatty correct answer with one banned sentence stripped. The in-progress surgical-scrub variant drops only the offending sentence(s) rather than the whole response; on the greedy benchmark it pulls false-refusal back to ~2.3% while preserving 0 user-visible axiom violations. The runtime-surgical variant is still tuning (currently 9.3% false-refusal + 1 in-scope leak; not the ship target). See project doc docs/2026-05-10-mini-ask-laptop-postgen-axiom-scrubber.md in the project repo.

To reproduce the ship-config numbers:

# Requires the project repo checked out + ai-training venv active.
# Per the eval script's own docstring (ai-training/scripts/eval_v1_mlx.py):

PYTHONPATH=ai-training python -m scripts.eval_v1_mlx \
  --merged-model ai-training/checkpoints/gemma-4-E2B-survival-merged-v2 \
  --test ai-training/datasets/v1/test.jsonl \
  --adversarial ai-training/datasets/v1/adversarial_holdout.jsonl \
  --temperature 0.0 \
  --apply-axiom-scrub \
  --label v2-mlx-greedy-scrubbed \
  --report ai-training/checkpoints/v2_mlx_eval_greedy_scrubbed.json

--temperature 0.0 is greedy. For the runtime sampler, pass --temperature 0.4 --min-p 0.05.

Artifacts in this repo

File Size Format Intended use
Merged weights (HF safetensors, 2 shards + tokenizer + configs) ~8.7 GB safetensors Reproduce / further fine-tune via transformers or mlx-lm
gemma-4-E2B-survival-v2-f16.gguf 8.6 GB GGUF F16 Lossless reference for re-quantization
gemma-4-E2B-survival-v2-q4_0.gguf 3.1 GB GGUF Q4_0 Ship target. KleidiAI-optimized for Cortex-A55 (Tecno Spark 20C-class 4 GB Android Go)
gemma-4-E2B-survival-v2-q5km.gguf 3.4 GB GGUF Q5_K_M Backup quant for devices with ~4.5 GB headroom

Loader compatibility — llama.cpp Gemma 4 shared-KV tail

Gemma 4 uses a KV-shared layers 15–34 convention. Loaders that haven't been taught about it will reject the GGUFs with:

missing tensor 'blk.15.attn_k.weight'

The (absent) per-layer K/V tensors on those layers are expected shared-KV state, not corruption. The GGUFs themselves are correct.

Loader status (verified 2026-05-12):

Loader / build Status Notes
Apocalypse Aid project submodule (llama-cpp/ at commit e62fa13c2) ✅ Loads cleanly This is what the Android app ships. The submodule HEAD is pinned at the upstream commit that makes shared-KV-tail attn_k tensors optional. Anyone building the app from this repo gets the working loader for free.
Upstream ggml-org/llama.cpp HEAD ≥ e62fa13c2 ✅ Loads cleanly The fix is in upstream master. git pull to any commit ≥ e62fa13c2 and rebuild.
Upstream ggml-org/llama.cpp HEAD < e62fa13c2 ❌ Fails to load If you cloned before the fix landed, fast-forward your local checkout.
PyPI llama-cpp-python ≤ 0.3.20 ❌ Fails to load The pre-built wheel ships an older bundled llama.cpp without the fix. Rebuild llama-cpp-python from source against the project submodule, or wait for the next PyPI release that bundles e62fa13c2+.
mlx-lm (Apple Silicon) ✅ Works MLX loads the merged HuggingFace safetensors directly and isn't affected by the GGUF-side loader. Recommended for reproducing the evaluation numbers below.

TL;DR for reproducers: if you're building the Android app from this repo, you're fine. If you're using upstream llama.cpp, fast-forward past e62fa13c2. If you're using Python via llama-cpp-python, either rebuild from source against a recent llama.cpp or use mlx-lm against the HuggingFace safetensors.

Track the upstream history at ggml-org/llama.cpp.

Hardware floor

Designed for the V2 floor: Tecno Spark 20C — 4 GB RAM, Android 13 Go, Cortex-A55. Simulated in development on the Moto G54 via taskset 0x03 (pin to 2× A55 cores) + memory.max=3G cgroup. Ship quantization (Q4_0) targets the KleidiAI-optimized i8mm path on A55.

Known limitations (V1)

The V1 ship explicitly accepts the following gaps; they are tracked for V1.1 with concrete remediation paths and were panel-reviewed across the Session 19+8 14-voice adversarial sweep.

Crisis-tier recall on ambiguous phrasings

The pre-model DoseLookup router uses a strict suicide-intent anchor gate (Session 19+8): a query must contain at least one of suicide, overdose, lethal, fatal, kill myself, end my life, take my life, want to die, harm myself, unalive, kms, do myself in, top myself, off myself, commit suicide, finish myself, or close vernacular variants thereof to route to a crisis-tier curated response.

This gate eliminates the FP class where common English words (plan, ready, method, pills, many, living, exist) alone trip a suicide-crisis response on innocuous queries (tourniquet is ready → bleeding-care; how many pills are in this bottle → routine medication question; stop living in the past → idiom; the surgery is over and saved my life → recovery context; want to end all the suffering of this patient → caregiver palliative scenario, exactly the apocalypse-aid use case).

The recall trade-off is that ambiguous suicidal-ideation phrasings without an explicit strict anchor (I don't want to live, I have my plan ready, want to stop living, I just want to be gone, don't want to wake up) now pass through to the model with its trained refusal behavior instead of hitting the curated WHO mhGAP universal-crisis-core response. The model has been fine-tuned against these phrasings in the crisis_companion.jsonl + refusals.jsonl training surfaces, but the curated path's evidence base (Mann 2005 means-restriction, Stanley-Brown safety planning, Balban 2023 physiological sigh) is bypassed.

V1.1 path: embedder semantic similarity routing using the on-device MiniLM-L6-v2 already shipped for RAG retrieval. Panel-unanimous defer for V1 — small on-device encoders do not reliably disambiguate "I have my plan ready" (tourniquet context) from "I have my plan to overdose" (suicide intent) per the published probing literature (Hewitt & Manning 2019, Ettinger 2020). A two-stage architecture (token-overlap pre-filter + cosine-similarity escalator with per-tier threshold) is the correct V1.1 design.

Other deferred V1.1 items

  • Multilingual safety extensions in AxiomScrub. Drug-name + dose + refusal-shape regexes are multilingual; the external-referral verb-form patterns (call X, seek X, go to X) are English-only. Spanish / French / Hindi / Arabic / Mandarin model emissions of referral verbs are not caught at runtime. The model itself is English-trained; the safety layer is the multilingual catch.
  • Lay-recognition recall across non-crisis tiers. Stroke (her face is drooping), anaphylaxis (throat closing), infant emergencies (my baby isn't moving), pediatric accidental ingestion (my toddler swallowed paracetamol) currently pass through to the model rather than routing to curated entries.
  • Missing curated entries. Asthma / inhaler, button-battery ingestion, childbirth (beyond PPH), sepsis recognition, CBRN / chemical exposure, mass-casualty triage, wound cleaning/irrigation, severe abdominal pain, advanced hypothermia rules.
  • LlmOrchestrator LLM-classified action routing. The GBNF grammar (app/src/main/assets/safety/orchestrator.gbnf) is shipped; the LLM-classified path is not yet wired (StubOrchestrator keyword matcher is the production implementation).
  • Native bridge correctness. NewStringUTF corrupts multi-byte tokens silently (emoji / CJK / Arabic / Cyrillic) per Session 19+8 native review. English demo path unaffected.

These are deliberate scope cuts, not regressions. Each is sized + planned in NEXT-SESSION-PROMPT.md for the V1.1 cycle.

Generalization — a portable pattern, not a vertical

Apocalypse Aid is one expression of a broader architecture pattern, not a single-vertical product. The first-aid corpus is the demonstrator; the pattern is the contribution.

The five-layer pattern

  1. Domain-specialized fine-tune of Gemma 4 E2B on a peer-reviewed corpus (LoRA / QLoRA, all-layer, MLX).
  2. On-device hybrid retrieval — dense embeddings (MiniLM-L6-v2 Q4_K_M) + BM25, fused with weighted Reciprocal Rank Fusion, over a 25K-chunk corpus mmap'd from APK assets.
  3. "Axiom" training step — deliberately train OUT a default LLM behavior that the deployment scenario invalidates. For medical: the universal "consult a clinician / call 911" referral reflex. JMIR 2024 measured this at ~97% of mainstream LLM medical responses; we measured our v2 holdout at 0 axiom violations on the same probe class.
  4. Runtime safety scrubber (AxiomScrub.kt) — defense-in-depth regex layer catching residual base-model bleed-through that the training-side step doesn't fully close. Mini Round-2 measured ~12% residual leakage post-train; the scrubber closes the tail to 0 user-visible violations.
  5. Hardware-profile-driven inference config (HardwareProfiler.kt) — reads battery / RAM / CPU at runtime, picks n_ctx / n_batch / n_threads for the V2 floor (Cortex-A55, 4 GB RAM, no GPU, no NPU). KleidiAI Q4_0 path.

Where the pattern applies

Any domain where (a) expert advice is needed, (b) the infrastructure to reach experts is broken/absent/hostile, and (c) the relevant knowledge is curatable as text. Concrete adjacencies the pattern maps onto with the same five layers, swapping only the corpus + the axiom-train-out target:

  • Offline legal aid — refugee camps, post-disaster regions, censorship contexts. Axiom inversion: "consult a lawyer" deliberately trained out. Corpus: applicable legal codes + procedure.
  • Offline agricultural advice — Global South smallholder farming, crop disease ID, pest/soil. Axiom: "contact your agricultural extension officer" trained out.
  • Offline STEM tutoring — kids on a $100 phone in places without reliable schooling. Axiom: "ask your teacher" trained out.
  • Offline civic / translation guidance — displaced-persons assistance, government-services navigation in unfamiliar countries. Axiom: "go to the office in person" trained out.
  • Offline mental-health peer support — anxiety/depression first-line, crisis grounding. Same no-clinician inversion as our medical instance. Axiom: "call a hotline" trained out.
  • Offline trades reference — electrician / plumber / mechanic field manuals + Q&A for disaster recovery or off-grid contexts. Axiom: "hire a professional" trained out.
  • Offline disaster-comms triage — paired with mesh networking, the on-device LLM becomes the community's local knowledge base when centralized comms are down.

What's the contribution

The components — Gemma 4 E2B, Q4_0 quant, KleidiAI, llama.cpp, MiniLM-L6-v2 embeddings, BM25+dense RRF, GBNF — are all public. Our contribution is the specific combination + the hardware floor + the axiom-train-out design choice, packaged so a different vertical can clone the architecture and ship a different domain in days, not months.

What we deliberately do NOT claim

Honest about prior art and limits — see also Known limitations (V1) above:

  • Not the first on-device Gemma medical LLM. Multiple Gemma 3n hackathon entrants exist (AIDY, Gemi ASD, ericrisco/medical-gemma-3n) and shipping products (OpenBioLLM-8B in Private LLM, MedGemma). Each makes different trade-offs on hardware floor, voice support, safety layer, language coverage. We are distinguished by the V2-floor target ($100 Cortex-A55 Android vs 8 GB iPhone / Linux x86), the on-device hybrid RAG at 25K-chunk scale, and the explicit axiom-train-out training step.
  • Not the first to ship llama.cpp on Android. PocketPal AI, MLC Chat, Maid, Sherpa LLM, Layla all exist as generic GGUF runners. They don't ship a domain fine-tune, on-device RAG, or a runtime safety layer — they're inference shells. Our contribution is the full vertical stack.
  • Not novel as components — novel as combination + hardware floor + design choice.

Privacy

App-side telemetry: zero. The Android app this model ships in performs all inference on-device and has no analytics, no crash reporting, no usage logging, no Google Play Services dependency.

HuggingFace hosts this model repo and may log downloads server-side per its own privacy policy; that is outside the Apocalypse Aid app's data plane and unrelated to runtime behavior on a user's device.

Citation

If you use this model, please cite:

@software{apocalypse_aid_gemma4_e2b_v2_2026,
  title  = {Apocalypse Aid — Gemma 4 E2B Survival v2},
  author = {Apocalypse Aid contributors},
  year   = {2026},
  url    = {https://huggingface.co/DestinyApocalypse/apocalypse-aid-gemma4-e2b},
  note   = {Kaggle Gemma 4 Good Hackathon — Impact: Global Resilience track}
}

Acknowledgements

  • Google DeepMind for the Gemma 4 base model and the Gemma 4 Good Hackathon
  • The PubMed Central Open Access subset maintainers, the World Health Organization (MCPC 2017 chapters used in corpus), the US Army (FM 4-25.11 + FM 21-76 public-domain field manuals), and the Committee on Tactical Combat Casualty Care (TCCC 2024-01-25 guidelines) for the GPLv3-compatible source corpus
  • AHA, ARC, ERC, TCCC, WHO IMCI/mhGAP, MSF, and IFRC/Red Cross for the open clinical protocols cited as authority in hand-authored refusal/positive rows (not redistributed in the weights — see "Training data" above)
  • ggml-org/llama.cpp and ml-explore/mlx-lm upstream maintainers
Downloads last month
704
Safetensors
Model size
5B params
Tensor type
BF16
·
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support