Araras-Gemma4-E4B — Clinical Decision Support for Brazilian Rare-Disease Care

The first Gemma 4 fine-tune purpose-built as offline clinical decision support for healthcare professionals working in Brazil's SUS (Sistema Único de Saúde, the public health system).

"Eu levei vinte anos pra ter o meu diagnóstico de doença rara. Não dá pra mudar isso pra mim. Mas a próxima criança raríssima que entrar numa UBS amanhã — o médico que atender ela não precisa estar sozinho." — Dimas, fundador da Raras, paciente raro (distonia mioclônica, ORPHA:36899)

Submission to the Gemma 4 Good Hackathon.


⚕️ Intended use

Clinical decision support for licensed healthcare professionals (physicians, nurses, community health agents, geneticists) attending patients in Brazil, especially in remote regions where rare-disease expertise doesn't reach.

  • Not a patient-facing diagnostic tool
  • Not a replacement for clinical judgment
  • Augments the physician with structured differentials, PCDT lookups, and CEAF medication info
  • Runs offline on smartphone, laptop, or any device that runs llama.cpp / Ollama / MLX

Compliance posture

Positioned as Software as a Medical Device (SaMD), in the Clinical Decision Support category, under ANVISA Res. 657/2022, 751/2022, and 830/2023. Aligned with CFM guidance: AI augments physicians and never replaces them. Every clinical claim is grounded in verifiable public sources (gov.br/conitec, bvsms, PubMed).


Model details

  • Base: unsloth/gemma-4-E4B-it (8B total / 4.5B effective parameters)
  • Method: QLoRA SFT via Unsloth — r=8, α=8, dropout=0, NEFTune α=5, train_on_responses_only
  • Training data: 120,740 train / 5,137 val examples (deduplicated), all PT-BR rare-disease content:
    • 108k curated rare-disease Q&A (HPO/OMIM/ORPHA-grounded)
    • 26k MedPT PT-BR clinical pairs
    • 10k ultra-rare disease longitudinal cases
    • 10k tool-calling examples (HPO normalize, PCDT lookup, CID-10 map)
    • 5.7k knowledge-graph triples (RarasNet Neo4j — 10,468 rare diseases)
    • 3.2k FindZebra hard cases + 1.3k ReDis-QA
  • Compute: A100 80GB on Vertex AI, ~2 hours, ~$25 total cost
  • Native Gemma 4 features used:
    • <|channel>thought thinking blocks (toggleable per request)
    • 128K context (fits full patient laudo + PCDT + retrieved cases)
    • System role for safety constraints
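
A few derived quantities follow directly from the training setup listed above. The sketch below is plain arithmetic on the figures in this card; the variable names are illustrative, and it assumes the conventional LoRA scaling factor α/r (a rank-stabilized variant would scale differently).

```python
# Derived quantities from the training setup listed above.
# All inputs are copied from this card; the names are illustrative only.

LORA_R = 8
LORA_ALPHA = 8

# Conventional LoRA scales the low-rank update by alpha / r (assumed here).
lora_scale = LORA_ALPHA / LORA_R

TRAIN_EXAMPLES = 120_740
VAL_EXAMPLES = 5_137
total_examples = TRAIN_EXAMPLES + VAL_EXAMPLES
val_fraction = VAL_EXAMPLES / total_examples

# Approximate per-source counts as listed, in thousands of examples.
source_counts_k = {
    "curated rare-disease Q&A": 108.0,
    "MedPT clinical pairs": 26.0,
    "ultra-rare longitudinal cases": 10.0,
    "tool-calling examples": 10.0,
    "knowledge-graph triples": 5.7,
    "FindZebra hard cases": 3.2,
    "ReDis-QA": 1.3,
}
listed_total_k = sum(source_counts_k.values())

print(f"LoRA scale (alpha / r): {lora_scale:.2f}")
print(f"Deduplicated total: {total_examples:,} ({val_fraction:.1%} validation)")
print(f"Sum of listed sources: ~{listed_total_k:.1f}k")
```

The listed sources add up to more than the deduplicated total, which would be consistent with the per-source counts being pre-deduplication figures, although the card does not state this explicitly.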

Companion artifacts

| Repo | Format | Size | Use case |
| --- | --- | --- | --- |
| Raras-AI/araras-gemma4-e4b-v4-sota | PEFT LoRA adapter | 147 MB | This model; apply to base Gemma 4 E4B |
| Raras-AI/araras-gemma4-e4b-v4-gguf | GGUF Q4_K_M / Q5_K_M | 5.3 / 5.7 GB | llama.cpp, Ollama, edge |
| Raras-AI/araras-hpo-brasil | sentence-transformers | 340 MB | PT-BR clinical → HPO matcher (companion) |
| Raras-AI/araras-hpo-brasil-int8 | ONNX INT8 | 85 MB | Edge HPO matching |
| 🤗 RareBench-BR (public) | JSONL dataset | 2.5 MB | 835 SUS-grounded validated cases |

End-to-end pipeline

PT-BR free text (report, medical record, consultation transcript)
    ↓
[1] 🧬 araras-hpo-brasil (BioLORD-2023 fine-tune for PT-BR)
       Normalizes regional clinical idiom → HPO codes
       "amarelão" (jaundice) → HP:0000952 · "bebê molinho" (floppy baby) → HP:0001252
    ↓
[2] 🧠 araras-gemma4-e4b Q4_K_M (this model, 5.3 GB, llama.cpp)
       Generates the top-5 ranked differentials in PT-BR
    ↓
[3] ✅ Hybrid canonical ORPHA resolver (production technique, e.g. MedCAT/scispaCy)
       Tier 1: strict substring match on 10,468-keyword PT-BR dict (~30ms)
       Tier 2: BioLORD semantic fallback @ cosine ≥ 0.78 (~50ms, only if Tier 1 fails)
       Returns None (honest abstention) if neither tier matches confidently
    ↓
[4] 📋 PCDT overlay: 24 structured Ministry of Health PCDTs → CEAF + reference center
    ↓
Output: structured clinical decision support (differentials + PCDT + SUS conduta + reference center)

Total stack footprint: 5.5 GB. Runs offline on iPhone, Android, or a laptop with no cloud calls, so patient data never leaves the device and LGPD exposure is minimized.
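
Step [3] can be sketched as a two-tier lookup with abstention. This is a minimal illustration, not the pipeline's implementation: the three-entry dictionary is a hypothetical sample (the ORPHA codes should be verified against Orphanet before any use), and character-trigram cosine similarity stands in for the BioLORD embeddings, which need a model download. Only the 0.78 threshold and the tier order come from the description above.

```python
import math
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical sample: accent-stripped PT-BR disease name -> ORPHA code.
# The real dictionary is described as having 10,468 keywords.
ORPHA_DICT = {
    "fenilcetonuria": "ORPHA:716",
    "anemia falciforme": "ORPHA:232",
    "atrofia muscular espinhal tipo 1": "ORPHA:83330",
}


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Fenilcetonúria' matches 'fenilcetonuria'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigram_vector(text: str) -> Counter:
    """Stand-in embedding: character-trigram counts (NOT BioLORD)."""
    padded = f"  {normalize(text)} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve_orpha(prediction: str, threshold: float = 0.78) -> Optional[str]:
    """Return a canonical ORPHA code for a predicted disease name, or None."""
    query = normalize(prediction)
    # Tier 1: strict substring match, longest keyword first so the most
    # specific name wins.
    for keyword in sorted(ORPHA_DICT, key=len, reverse=True):
        if keyword in query:
            return ORPHA_DICT[keyword]
    # Tier 2: similarity fallback, accepted only at or above the threshold.
    query_vec = trigram_vector(prediction)
    scores = {kw: cosine(query_vec, trigram_vector(kw)) for kw in ORPHA_DICT}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return ORPHA_DICT[best]
    # Neither tier matched confidently: abstain instead of guessing.
    return None


print(resolve_orpha("Quadro de Fenilcetonúria (PKU)"))  # Tier 1 hit
print(resolve_orpha("fenilcetonurea"))                   # Tier 2 hit (misspelling)
print(resolve_orpha("cefaleia tensional"))               # abstains
```

Returning `None` rather than the nearest neighbour keeps an unmatched differential visibly unresolved, which is the "honest abstention" behaviour the pipeline describes.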


Evaluation — RareBench-BR

We built RareBench-BR, the first benchmark to combine diagnosis (Dx) and SUS conduta (clinical management) in PT-BR. It is anchored in 52,343 real anonymized SUS patient trajectories (APAC, CNS-linked, 2017-2020+).

Results on full unified RareBench-BR_SUS (833 cases, 0 errors)

| Metric | Araras-Gemma4 (Q4_K_M offline, 4.5B) | DeepSeek V4 (cloud, ~600B, 36-case subsample) |
| --- | --- | --- |
| R@1 (clinical name) | 41.2% | 86.1% |
| R@3 | 47.1% | 91.7% |
| 🔥 Track B PCDT-correct | 76.8% (331/431) | 91.7% (33/36) |
| Errors | 0 | n/a |
| Latency p50 | 6.5s | 4.1s |
| Cost per query | $0 (local) | ~$0.001 |
| Params | 4.5B | ~600B |
| Device | phone | data center |
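
For readers unfamiliar with the R@k rows, the sketch below shows one common way to compute recall at k. It assumes a case counts as a hit when the ground-truth disease name appears among the top k ranked differentials under a case-insensitive comparison; the benchmark's actual matching rule may differ. The three toy cases are made up for illustration and are not benchmark data.

```python
from typing import Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose gold label appears in the top-k predictions."""
    if not gold or len(ranked_predictions) != len(gold):
        raise ValueError("need one non-empty prediction list per gold label")
    hits = 0
    for predictions, truth in zip(ranked_predictions, gold):
        top_k = {name.casefold() for name in predictions[:k]}
        if truth.casefold() in top_k:
            hits += 1
    return hits / len(gold)


# Toy data (illustrative only): one ranked differential list per case.
toy_predictions = [
    ["Fenilcetonúria", "Tirosinemia", "Galactosemia"],
    ["Talassemia", "Anemia falciforme", "Esferocitose"],
    ["Distrofia muscular", "Miastenia", "Botulismo"],
]
toy_gold = ["fenilcetonúria", "anemia falciforme", "AME tipo 1"]

r_at_1 = recall_at_k(toy_predictions, toy_gold, k=1)
r_at_3 = recall_at_k(toy_predictions, toy_gold, k=3)
print(f"R@1 = {r_at_1:.3f}, R@3 = {r_at_3:.3f}")
```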

Per-layer breakdown (full 833 cases):

| Layer | N | R@1 | R@3 |
| --- | --- | --- | --- |
| L3_v2: PCDT atypical presentations | 135 | 27.4% | 32.6% |
| L4: Hard BR (tropical phenocopies, founder, IEI) | 79 | 17.7% | 19.0% |
| L5_v2: SUS-grounded synthesis | 619 | 47.2% | 53.8% |
| L5_realsus standalone (common SUS rare diseases) | 240 | 70.4% | 78.3% |
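
The per-layer figures can be checked against the headline numbers with a case-weighted average. The sketch below assumes the unified set is exactly L3_v2 + L4 + L5_v2 (135 + 79 + 619 = 833) and that the standalone L5_realsus row is reported separately; the function name is illustrative.

```python
# Per-layer results copied from the breakdown above: (cases, R@1 %, R@3 %).
# L5_realsus (n=240) is left out on the assumption that it is a standalone set.
LAYERS = {
    "L3_v2": (135, 27.4, 32.6),
    "L4": (79, 17.7, 19.0),
    "L5_v2": (619, 47.2, 53.8),
}


def pooled(metric_index: int) -> float:
    """Case-weighted average of one metric column across the layers."""
    cases = sum(row[0] for row in LAYERS.values())
    weighted = sum(row[0] * row[metric_index] for row in LAYERS.values())
    return weighted / cases


total_cases = sum(row[0] for row in LAYERS.values())
pooled_r1 = pooled(1)
pooled_r3 = pooled(2)
print(total_cases, round(pooled_r1, 1), round(pooled_r3, 1))  # 833 41.2 47.1
```

The pooled values reproduce the 41.2% R@1 and 47.1% R@3 reported for the full unified benchmark.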

Why Track B = 76.8% is the headline: in roughly 3 out of every 4 cases where ground truth specifies a CEAF-dispensed medication (331 of 431), Araras recommends the exact molecule SUS actually pays for. We are not aware of any other model that reports this number, likely because none was trained on this data. The cloud frontier model (DeepSeek, ~600B) reaches 91.7%, though on a 36-case subsample; Araras reaches 76.8% with roughly 130× fewer parameters (4.5B vs ~600B), fully offline, at $0 marginal cost.
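
Because the two Track B figures come from very different sample sizes (431 vs 36 cases), it helps to look at their uncertainty. The sketch below recomputes both proportions and adds 95% Wilson score intervals; the interval method is our choice for illustration and is not part of the benchmark's reported methodology.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (hits, evaluable cases) copied from the results table above.
TRACK_B = {"Araras-Gemma4": (331, 431), "DeepSeek": (33, 36)}

for name, (hits, n) in TRACK_B.items():
    low, high = wilson_interval(hits, n)
    print(f"{name}: {hits / n:.1%} (95% CI {low:.1%} to {high:.1%}, n={n})")

# Parameter ratio using the approximate sizes quoted in the table.
param_ratio = 600 / 4.5
print(f"Parameter ratio: ~{param_ratio:.0f}x")
```

The interval for the 36-case subsample is several times wider than the one for the 431-case set, so the gap between the two models is less precisely known than the point estimates suggest.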

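Vs. prior Qwen3.5-9B baseline on L1: 16.6% R@1 at 64s

Against the 41.2% R@1 and 6.5s p50 latency reported above, Araras-Gemma4 is about 2.5× more accurate and 9-10× faster, with fewer parameters. The baseline figure is on L1, which is not among the layers of the unified set above, so the comparison is indicative rather than like-for-like.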


Example clinical conversations

These are real outputs from the deployed model, taken from multi-turn conversations with HPO normalization applied upstream.

Example 1 — AME tipo 1 (UBS in Petrolina-PE, no internet)

Médico: Lactente feminino, 4 meses, natural de Petrolina (PE), pais primos. Hipotonia profunda generalizada desde o nascimento, ausência de sustentação cefálica, choro fraco, dificuldade para mamar, abolição dos reflexos osteotendíneos profundos. SpO2 95%.

Araras (4.5s): "Olá! 👶 Com base nos achados clínicos e na história, o quadro sugere um diagnóstico de AME tipo 1. A tríade clássica (hipotonia profunda generalizada desde o nascimento + ausência dos reflexos osteotendíneos profundos + dificuldade para mamar) é altamente sugestiva. Recomendo encaminhamento para neurologia pediátrica e investigação laboratorial específica (eletroneuromiografia, testes genéticos SMN1)."

Médico turn 2: Como agilizar acesso ao tratamento pelo SUS?

Araras (5.0s): "Para acelerar o acesso diagnóstico e terapêutico no SUS: (1) Protocolo de Urgência/Emergência: se houver piora do quadro, acione SAMU/UPA. (2) Articulação com Equipe Básica: peça ao médico da UBS um laudo detalhado, com ênfase na 'hipotonia profunda generalizada desde o nascimento' e nos marcos perdidos. (3) Interconsulta Direta: solicite agendamento prioritário com Neuropediatria do município/região de referência."

Example 2 — Crise vaso-oclusiva em Doença Falciforme (UBS rural BA)

Médico: Adolescente masculino, 15 anos, zona rural de Conceição do Coité (BA). Internado com dor torácica intensa + dispneia. Triagem neonatal positiva HbSS, crises álgicas vaso-oclusivas frequentes. Hb 6.8, retic 12%, leuco 18k. Sat 89%.

Araras (1.8s): "🚨 ALERTA — Quadro altamente sugestivo de Crise vaso-oclusiva grave em paciente com Anemia Falciforme. Aguardamos retorno do Hematologista para manejo da crise e avaliação transfusional urgente. Manter monitorização respiratória máxima, oxigenoterapia e analgesia contínua."

Médico turn 2: É síndrome torácica aguda? Que conduta urgente?

Araras (3.5s): "⚠️ Conduta IMEDIATA: (1) Oxigênio 10L/min, manter saturação >92%. (2) Analgesia potente (opioides IV) e manejo do choque (fluidos com cautela). (3) Hemotransfusão emergencial de hemácias concentradas se Hb <6 g/dL ou lactato alto. (4) Manter monitorização cardiorrespiratória e comunicar Hematologia."

Example 3 — Fenilcetonúria (triagem neonatal Manaus-AM)

Médico: RN masculino, 12 dias de vida, Manaus (AM). Teste do pezinho com fenilalanina sérica 18 mg/dL (normal <2). Pais consanguíneos.

Araras (4.3s): "🚨 ALERTA ALTO RISCO. O teste do pezinho com fenilalanina sérica de 18 mg/dL é extremamente elevado. Quadro de Fenilcetonúria (PKU) que requer teste confirmatório imediato. Não espere o retorno dos resultados: a hiperfenilalaninemia leva à encefalotoxicidade irreversível em horas. AÇÃO IMEDIATA: (1) Notificar UTI Neonatal/Neuropediatria. (2) Iniciar controle metabólico (fórmula livre de Phe). (3) Coleta de aminoácidos plasmáticos e urinários."


How to use

Option A — llama.cpp (recommended for clinical edge deployment)

hf download Raras-AI/araras-gemma4-e4b-v4-gguf araras-gemma4-e4b-v4-Q4_K_M.gguf --local-dir ./

llama-server -m araras-gemma4-e4b-v4-Q4_K_M.gguf -ngl 99 -c 8192 --jinja

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "Você é ARARAS, copiloto de apoio à decisão clínica em doenças raras..."},
    {"role": "user", "content": "<caso clínico em PT-BR>"}
  ],
  "temperature": 0.5, "top_p": 0.95, "top_k": 64, "repeat_penalty": 1.15
}'
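
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above: the helper names `build_payload` and `ask` are ours, the system prompt is left truncated exactly as in the original, and the response parsing assumes the OpenAI-style `choices[0].message.content` shape that the chat completions endpoint returns.

```python
import json
import urllib.request

# Truncated on purpose: the full system prompt is not given in this card.
SYSTEM_PROMPT = "Você é ARARAS, copiloto de apoio à decisão clínica em doenças raras..."


def build_payload(case_text, history=None):
    """Build the chat request body; `history` holds earlier user/assistant turns."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history or [])
    messages.append({"role": "user", "content": case_text})
    return {
        "messages": messages,
        # Sampling settings copied from the curl example above.
        "temperature": 0.5,
        "top_p": 0.95,
        "top_k": 64,
        "repeat_penalty": 1.15,
    }


def ask(case_text, history=None,
        url="http://127.0.0.1:8080/v1/chat/completions", timeout=120):
    """POST the case to a locally running llama-server and return the reply text."""
    body = json.dumps(build_payload(case_text, history)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

For a second turn, pass the first exchange back in `history` as a list of `{"role": "user", ...}` and `{"role": "assistant", ...}` dictionaries, then call `ask` with the follow-up question.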

Option B — transformers + PEFT

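# Load the base model, then attach the Araras LoRA adapter on top of it.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base = "unsloth/gemma-4-E4B-it"
adapter = "Raras-AI/araras-gemma4-e4b-v4-sota"

tok = AutoTokenizer.from_pretrained(base)
# bf16 weights, placed automatically across the available devices
model = AutoModelForCausalLM.from_pretrained(base, dtype=torch.bfloat16, device_map="auto")
# Wrap the base model with the adapter weights from the Hub
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # inference mode: disables dropout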

Option C — Full pipeline (BioLORD + Gemma + ORPHA lookup + PCDT)

git clone https://github.com/rarasAI/araras-gemma4
cd araras-gemma4 && pip install -r requirements.txt
python -m araras_gemma4.pipeline "<caso clínico>"

Limitations

  • Quantization Q4 ceiling: Q4_K_M loses ~3-5% accuracy vs full bf16. Use Q5_K_M or full adapter for research.
  • Ultra-rare disease coverage: training data heavily skewed toward the 12 CEAF-covered diseases. Performance drops on ultra-rare conditions (<100 cases described in BR).
  • SMA subtype confusion: distinguishing SMA type 1 vs 2 vs 3 from phenotype text alone is genuinely hard without genetic typing, so subtype-level predictions should be treated as provisional.
  • ORPHA code hallucination: like all rare-disease LLMs, may emit invented ORPHA numbers. Always use the canonical lookup post-processor in the pipeline.
  • Not for emergency triage: this is for longitudinal-care decision support, not ER triage.
  • PT-BR primary: model trained primarily for Brazilian Portuguese clinical idiom. English/Spanish performance not characterized.
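
The ORPHA hallucination point above lends itself to a simple guard: extract every ORPHA code the model emits and check it against a canonical list. The sketch below is an illustration, not the pipeline's post-processor; `CANONICAL_ORPHA` is a hypothetical three-code sample, and a real deployment would load the full Orphanet nomenclature instead.

```python
import re

# Matches codes written like "ORPHA:716" or "orpha: 716".
ORPHA_PATTERN = re.compile(r"ORPHA:\s*(\d+)", re.IGNORECASE)

# Hypothetical sample; replace with the full Orphanet code list in practice.
CANONICAL_ORPHA = {"716", "232", "83330"}


def audit_orpha_codes(model_output, canonical=CANONICAL_ORPHA):
    """Split the ORPHA codes found in the text into (known, unknown) lists."""
    known, unknown = [], []
    for code in ORPHA_PATTERN.findall(model_output):
        code = code.lstrip("0") or "0"  # treat "0716" and "716" alike
        bucket = known if code in canonical else unknown
        if code not in bucket:  # keep first-seen order, drop repeats
            bucket.append(code)
    return known, unknown


sample = "Hipóteses: fenilcetonúria (ORPHA:716) ou ORPHA:999999999."
known_codes, unknown_codes = audit_orpha_codes(sample)
print("known:", known_codes)
print("unknown (flag for review):", unknown_codes)
```

Codes that land in the unknown list should be dropped or flagged for review rather than shown to the clinician as if they were valid identifiers.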

Companion: RareBench-BR_Public

Together with this model we release RareBench-BR_Public: 835 validated cases spanning 24 official Ministry of Health PCDTs plus hard Brazilian rare-disease cases. To our knowledge, it is the first benchmark that scores rare-disease LLMs on Brazilian SUS conduta (PCDT + CEAF medication).


Citation

@misc{raras_gemma4_2026,
  author = {Raras Team},
  title  = {Araras-Gemma4-E4B: Open Gemma 4 Clinical Decision Support for Brazilian Rare-Disease Care},
  year   = {2026},
  month  = {may},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/Raras-AI/araras-gemma4-e4b-v4-sota},
  howpublished = {Submitted to Gemma 4 Good Hackathon, Google DeepMind / Kaggle},
}

Acknowledgments

Built on the work of Google DeepMind (Gemma 4), Unsloth (training recipe), FremyCompany (BioLORD-2023), Chen et al. (RareBench L1), Conitec / Ministry of Health Brazil (PCDTs), Casa dos Raros (model of local-clinician intervention), and the rare-disease open-science community.

Built by a Brazilian rare-disease patient, for the 13 million Brazilians the existing AI doesn't know exist — and for the doctors who attend them in UBSs across the country.
