Araras-Gemma4-E4B — Clinical Decision Support for Brazilian Rare-Disease Care

The first Gemma 4 fine-tune purpose-built as offline clinical decision support for healthcare professionals serving patients in Brazil's public health system, SUS (Sistema Único de Saúde).

"It took me twenty years to get my rare-disease diagnosis. That can't be changed for me. But the next child with an ultra-rare disease who walks into a UBS [primary care unit] tomorrow: the doctor who sees that child doesn't have to be alone." (Dimas, founder of Raras, rare-disease patient with myoclonic dystonia, ORPHA:36899)

Submission to the Gemma 4 Good Hackathon.


⚕️ Intended use

Clinical decision support for licensed healthcare professionals (physicians, nurses, community health agents, geneticists) caring for patients in Brazil, especially in remote regions that rare-disease expertise doesn't reach.

  • Not a patient-facing diagnostic tool
  • Not a replacement for clinical judgment
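  • Augments the physician with structured differential diagnoses, PCDT lookups (Protocolos Clínicos e Diretrizes Terapêuticas, the Ministry of Health's clinical protocols), and CEAF medication info (drugs dispensed through the Componente Especializado da Assistência Farmacêutica)
  • Runs offline on a smartphone, a laptop, or any device that runs llama.cpp, Ollama, or MLX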

Compliance posture

Positioned as Software as a Medical Device (SaMD), specifically Clinical Decision Support, under ANVISA Res. 657/2022, 751/2022, and 830/2023. Aligned with CFM (Conselho Federal de Medicina) guidance: AI augments physicians and never replaces them. Every clinical claim is grounded in verifiable public sources (gov.br/conitec, bvsms, PubMed).


Model details

  • Base: unsloth/gemma-4-E4B-it (8B total / 4.5B effective parameters)
  • Method: QLoRA SFT via Unsloth — r=8, α=8, dropout=0, NEFTune α=5, train_on_responses_only
  • Training data: 120,740 train / 5,137 val examples (deduplicated), all PT-BR rare-disease content:
    • 108k curated rare-disease Q&A (HPO/OMIM/ORPHA-grounded)
    • 26k MedPT PT-BR clinical pairs
    • 10k ultra-rare disease longitudinal cases
    • 10k tool-calling examples (HPO normalize, PCDT lookup, CID-10 map)
    • 5.7k knowledge-graph triples (RarasNet Neo4j — 10,468 rare diseases)
    • 3.2k FindZebra hard cases + 1.3k ReDis-QA
  • Compute: A100 80GB on Vertex AI, ~2 hours, ~$25 total cost
  • Native Gemma 4 features used:
    • <|channel>thought thinking blocks (toggleable per request)
    • 128K context (fits a full patient report, or laudo, plus the relevant PCDT and retrieved cases)
    • System role for safety constraints
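Three of the training settings above are easy to misread, so here is a framework-free sketch of what they mean. It assumes the Hugging Face convention that a label of -100 is ignored by the loss (the usual way response-only training is implemented), standard LoRA scaling of α/r (so r=8, α=8 gives a scale of 1.0), and the NEFTune noise bound α/√(L·d) from the NEFTune paper, where L is sequence length and d is embedding width. The function names are hypothetical; this is not Unsloth's code.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips this label by default

def mask_prompt_labels(input_ids: Sequence[int], response_start: int) -> List[int]:
    """Labels for response-only training: prompt positions get IGNORE_INDEX."""
    if not 0 <= response_start <= len(input_ids):
        raise ValueError("response_start must index into input_ids")
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

def lora_scaling(r: int, alpha: float) -> float:
    """Standard (non-rsLoRA) scale applied to the low-rank update B @ A."""
    return alpha / r

def neftune_noise_bound(alpha: float, seq_len: int, hidden_dim: int) -> float:
    """NEFTune adds Uniform(-bound, bound) noise to input embeddings during training."""
    return alpha / math.sqrt(seq_len * hidden_dim)

# Made-up token ids: the first 5 belong to the system/user prompt.
ids = [2, 105, 2364, 106, 107, 4040, 4041, 1]
print(mask_prompt_labels(ids, 5))  # [-100, -100, -100, -100, -100, 4040, 4041, 1]
print(lora_scaling(8, 8))  # 1.0
```

With α equal to r, the adapter update is added at unit scale, and the loss is driven only by the assistant's tokens, never by the clinical case text in the prompt.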

Companion artifacts

| Repo | Format | Size | Use case |
|---|---|---|---|
| Raras-AI/araras-gemma4-e4b-v4-sota | PEFT LoRA adapter | 147 MB | This model; apply to base Gemma 4 E4B |
| Raras-AI/araras-gemma4-e4b-v4-gguf | GGUF Q4_K_M / Q5_K_M | 5.3 / 5.7 GB | llama.cpp, Ollama, edge |
| Raras-AI/araras-hpo-brasil | sentence-transformers | 340 MB | PT-BR clinical text → HPO matcher (companion) |
| Raras-AI/araras-hpo-brasil-int8 | ONNX INT8 | 85 MB | Edge HPO matching |
| 🤗 RareBench-BR (public) | JSONL dataset | 2.5 MB | 835 SUS-grounded validated cases |

End-to-end pipeline

PT-BR free text (laudo/medical report, prontuário/chart, consultation transcript)
    ↓
[1] 🧬 araras-hpo-brasil (BioLORD-2023 fine-tune for PT-BR)
       Normalizes regional clinical idiom → HPO codes
       "amarelão" (jaundice) → HP:0000952 · "bebê molinho" (floppy baby) → HP:0001252
    ↓
[2] 🧠 araras-gemma4-e4b Q4_K_M (this model, 5.3 GB, llama.cpp)
       Generates the top-5 ranked differential diagnoses in PT-BR
    ↓
[3] ✅ Hybrid canonical ORPHA resolver (production technique, e.g. MedCAT/scispaCy)
       Tier 1: strict substring match on 10,468-keyword PT-BR dict (~30ms)
       Tier 2: BioLORD semantic fallback @ cosine ≥ 0.78 (~50ms, only if Tier 1 fails)
       Returns None (honest abstention) if neither tier matches confidently
    ↓
[4] 📋 PCDT overlay: 24 structured Ministry of Health PCDTs → CEAF medication + reference center
    ↓
Output: structured clinical decision support: differentials + PCDT + SUS management (conduta) + reference center
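Step [3] is the part most worth seeing in code. The sketch below is a minimal, hypothetical version of the two-tier resolver, not the pipeline's implementation: `resolve_orpha`, `toy_embed`, and the placeholder codes are illustrative, a 26-dimensional letter-count vector stands in for BioLORD embeddings, and the longest-keyword tie-break in Tier 1 is an assumption.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def resolve_orpha(
    predicted_name: str,
    keyword_to_orpha: Dict[str, str],
    embed: Callable[[str], Vector],
    threshold: float = 0.78,
) -> Optional[str]:
    """Map a free-text disease name to a canonical code, or abstain with None."""
    text = predicted_name.casefold()

    # Tier 1: strict substring match against the keyword dictionary.
    # Assumption: the longest matching keyword wins (e.g. "ame tipo 1" over "ame").
    hits: List[str] = [kw for kw in keyword_to_orpha if kw.casefold() in text]
    if hits:
        return keyword_to_orpha[max(hits, key=len)]

    # Tier 2: semantic fallback, only reached when Tier 1 found nothing.
    # A real deployment would precompute the keyword embeddings once.
    query = embed(predicted_name)
    best_kw, best_sim = None, -1.0
    for kw in keyword_to_orpha:
        sim = cosine(query, embed(kw))
        if sim > best_sim:
            best_kw, best_sim = kw, sim
    if best_kw is not None and best_sim >= threshold:
        return keyword_to_orpha[best_kw]
    return None  # honest abstention: no tier matched confidently

def toy_embed(text: str) -> List[float]:
    """Stand-in for BioLORD: a 26-dim letter-count vector (accented letters ignored)."""
    vec = [0.0] * 26
    for ch in text.casefold():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

toy_dict = {  # hypothetical placeholder codes, not real ORPHA numbers
    "fenilcetonúria": "ORPHA:TOY-1",
    "atrofia muscular espinhal tipo 1": "ORPHA:TOY-2",
}
print(resolve_orpha("Suspeita de fenilcetonúria clássica", toy_dict, toy_embed))  # ORPHA:TOY-1 (Tier 1)
print(resolve_orpha("fenilcetonuria", toy_dict, toy_embed))  # ORPHA:TOY-1 (Tier 2, missing accent)
print(resolve_orpha("xyz", toy_dict, toy_embed))  # None
```

The design choice that matters is the order: the cheap exact tier runs first, the semantic tier only fills gaps, and a below-threshold similarity returns None instead of the nearest guess.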

Total stack footprint: 5.5 GB. Runs offline on iPhone, Android, and laptops. Zero cloud: patient data never leaves the device, which minimizes LGPD exposure.


Evaluation — RareBench-BR

We built RareBench-BR, the first benchmark that evaluates both diagnosis (Dx) and SUS clinical management (conduta) in PT-BR. It is anchored in 52,343 real, anonymized SUS patient trajectories (APAC records, CNS-linked, 2017-2020+).

Results on full unified RareBench-BR_SUS (833 cases, 0 errors)

| Metric | Araras-Gemma4 (Q4_K_M, offline, 4.5B) | DeepSeek V4 (cloud, ~600B, 36-case subsample) |
|---|---|---|
| R@1 (clinical name) | 41.2% | 86.1% |
| R@3 | 47.1% | 91.7% |
| 🔥 Track B PCDT-correct | 76.8% (331/431) | 91.7% (33/36) |
| Errors | 0 | n/a |
| Latency p50 | 6.5 s | 4.1 s |
| Cost per query | $0 (local) | ~$0.001 |
| Params | 4.5B | ~600B |
| Device | phone | data center |

Per-layer breakdown (full 833 cases):

| Layer | N | R@1 | R@3 |
|---|---|---|---|
| L3_v2: PCDT atypical presentations | 135 | 27.4% | 32.6% |
| L4: Hard BR (tropical phenocopies, founder, IEI) | 79 | 17.7% | 19.0% |
| L5_v2: SUS-grounded synthesis | 619 | 47.2% | 53.8% |
| L5_realsus standalone (common SUS rare diseases) | 240 | 70.4% | 78.3% |
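The three evaluated layers sum to the full bench (135 + 79 + 619 = 833), so the headline R@1 and R@3 are case-weighted means of the per-layer numbers; L5_realsus is reported standalone and is not part of that sum. The sketch below shows that arithmetic plus a plain R@k definition. It assumes R@k counts a hit when the gold diagnosis appears among the top-k ranked differentials after name matching; the benchmark's exact matching rule is not spelled out here, and the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Share of cases whose gold diagnosis appears among the top-k differentials."""
    if not gold or len(ranked) != len(gold):
        raise ValueError("ranked and gold must be non-empty and of equal length")
    hits = sum(g in list(preds)[:k] for preds, g in zip(ranked, gold))
    return hits / len(gold)

def pooled(layers: Dict[str, Tuple[int, float]]) -> float:
    """Case-weighted mean of per-layer scores: sum(n_i * s_i) / sum(n_i)."""
    total = sum(n for n, _ in layers.values())
    return sum(n * s for n, s in layers.values()) / total

# Per-layer (N, score) pairs from the table above; L5_realsus is excluded.
r_at_1 = {"L3_v2": (135, 0.274), "L4": (79, 0.177), "L5_v2": (619, 0.472)}
r_at_3 = {"L3_v2": (135, 0.326), "L4": (79, 0.190), "L5_v2": (619, 0.538)}
print(f"R@1 {pooled(r_at_1):.1%}, R@3 {pooled(r_at_3):.1%}")  # R@1 41.2%, R@3 47.1%
```

Because L5_v2 contributes 619 of the 833 cases, the pooled score sits close to its per-layer value; the hard L4 layer moves the headline very little.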

Why Track B = 76.8% is the headline: in roughly 3 of every 4 cases where the ground truth specifies a CEAF-dispensed medication, Araras recommends the exact molecule SUS actually pays for. To our knowledge no one publishes this number, because no one has trained on this data. The cloud frontier model (DeepSeek, ~600B) reaches 91.7% on its 36-case subsample; Araras reaches 76.8% on all 431 evaluable cases with roughly 130× fewer parameters, fully offline, at $0 marginal cost.
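Track B only scores cases whose ground truth names a CEAF-dispensed medication (431 of the 833), which is why its denominator differs from the diagnosis metrics. A minimal sketch of that scoring, assuming a hit means the gold molecule appears, after casefolding and accent stripping, in the model's list of recommended drugs; the normalization rule, the function names, and the toy drug names are assumptions, not the benchmark's published scorer.

```python
import unicodedata
from typing import Iterable, List, Optional, Tuple

def normalize_drug(name: str) -> str:
    """Casefold and strip accents, e.g. 'Hidroxiuréia' -> 'hidroxiureia'."""
    decomposed = unicodedata.normalize("NFKD", name.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def track_b(cases: Iterable[Tuple[Optional[str], List[str]]]) -> Tuple[int, int]:
    """Return (correct, evaluable); only cases with a gold CEAF drug are evaluable."""
    correct = evaluable = 0
    for gold_drug, recommended in cases:
        if gold_drug is None:
            continue  # no CEAF medication in the ground truth: outside Track B
        evaluable += 1
        if normalize_drug(gold_drug) in {normalize_drug(d) for d in recommended}:
            correct += 1
    return correct, evaluable

# Toy cases: (gold CEAF drug or None, drugs the model recommended).
toy = [("Hidroxiuréia", ["hidroxiureia", "ácido fólico"]), ("droga-b", ["droga-c"]), (None, ["droga-d"])]
correct, evaluable = track_b(toy)
print(f"{correct}/{evaluable}")  # 1/2
```

Applied to the full bench, the same ratio gives 331/431 ≈ 76.8%.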

Vs. the prior Qwen3.5-9B baseline on L1 (16.6% R@1 at 64 s), Araras-Gemma4 is 2.5× more accurate, 9× faster, and smaller.


Example clinical conversations

These are real outputs from the deployed model, translated here from PT-BR. Conversations are multi-turn, with HPO normalization upstream.

Example 1: SMA type 1 / AME tipo 1 (UBS in Petrolina-PE, no internet)

Physician: Female infant, 4 months old, from Petrolina (PE), parents are cousins. Profound generalized hypotonia since birth, no head control, weak cry, difficulty breastfeeding, absent deep tendon reflexes. SpO2 95%.

Araras (4.5 s): "Hello! 👶 Based on the clinical findings and history, the picture suggests a diagnosis of SMA type 1. The classic triad (profound generalized hypotonia since birth + absent deep tendon reflexes + difficulty breastfeeding) is highly suggestive. I recommend referral to pediatric neurology and a specific laboratory workup (electroneuromyography, SMN1 genetic testing)."

Physician, turn 2: How can access to treatment through SUS be expedited?

Araras (5.0 s): "To speed up diagnostic and therapeutic access in SUS: (1) Urgency/Emergency protocol: if the condition worsens, call SAMU/UPA. (2) Coordination with the primary care team: ask the UBS physician for a detailed report (laudo), emphasizing 'profound generalized hypotonia since birth' and the lost milestones. (3) Direct interconsultation: request priority scheduling with pediatric neurology in the municipality or reference region."

Example 2: Vaso-occlusive crisis in sickle cell disease (rural UBS, BA)

Physician: Male adolescent, 15 years old, rural area of Conceição do Coité (BA). Admitted with severe chest pain + dyspnea. Positive newborn screening for HbSS, frequent vaso-occlusive pain crises. Hb 6.8, retic 12%, WBC 18k. Sat 89%.

Araras (1.8 s): "🚨 ALERT: Picture highly suggestive of a severe vaso-occlusive crisis in a patient with sickle cell anemia. Awaiting the hematologist's input for crisis management and urgent transfusion assessment. Maintain maximal respiratory monitoring, oxygen therapy, and continuous analgesia."

Physician, turn 2: Is it acute chest syndrome? What is the urgent management?

Araras (3.5 s): "⚠️ IMMEDIATE management: (1) Oxygen 10 L/min, keep saturation >92%. (2) Potent analgesia (IV opioids) and shock management (fluids with caution). (3) Emergency transfusion of packed red blood cells if Hb <6 g/dL or lactate is high. (4) Maintain cardiorespiratory monitoring and notify Hematology."

Example 3: Phenylketonuria (newborn screening, Manaus-AM)

Physician: Male newborn, 12 days old, Manaus (AM). Heel-prick test (teste do pezinho) with serum phenylalanine of 18 mg/dL (normal <2). Consanguineous parents.

Araras (4.3 s): "🚨 HIGH-RISK ALERT. A heel-prick serum phenylalanine of 18 mg/dL is extremely elevated. This picture of phenylketonuria (PKU) requires immediate confirmatory testing. Do not wait for the results to come back: hyperphenylalaninemia leads to irreversible encephalotoxicity within hours. IMMEDIATE ACTION: (1) Notify the neonatal ICU/pediatric neurology. (2) Start metabolic control (Phe-free formula). (3) Collect plasma and urine amino acids."


How to use

Option A — llama.cpp (recommended for clinical edge deployment)

hf download Raras-AI/araras-gemma4-e4b-v4-gguf araras-gemma4-e4b-v4-Q4_K_M.gguf --local-dir ./

llama-server -m araras-gemma4-e4b-v4-Q4_K_M.gguf -ngl 99 -c 8192 --jinja

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "Você é ARARAS, copiloto de apoio à decisão clínica em doenças raras..."},
    {"role": "user", "content": "<caso clínico em PT-BR>"}
  ],
  "temperature": 0.5, "top_p": 0.95, "top_k": 64, "repeat_penalty": 1.15
}'
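For scripted use, the same request can be sent from Python with only the standard library. This is a sketch assuming llama-server is running as in the command above and returns the OpenAI-style `choices[0].message.content` response shape; `build_payload` and `ask` are hypothetical helper names, and the system prompt is left for you to supply because it is elided above.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080/v1/chat/completions"

def build_payload(system_prompt: str, case_text: str) -> dict:
    """Chat request with the same sampling settings as the curl example."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": case_text},
        ],
        "temperature": 0.5,
        "top_p": 0.95,
        "top_k": 64,
        "repeat_penalty": 1.15,
    }

def ask(payload: dict, url: str = SERVER, timeout: float = 120.0) -> str:
    """POST the payload to llama-server and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # A generous timeout: p50 latency is several seconds on edge hardware.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask(build_payload(system_prompt, "<caso clínico em PT-BR>"))` returns the model's reply as a string.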

Option B — transformers + PEFT

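from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base = "unsloth/gemma-4-E4B-it"
adapter = "Raras-AI/araras-gemma4-e4b-v4-sota"

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # apply the LoRA adapter on top of the base weights
model.eval()

# Format the conversation with the model's chat template, then sample with the Option A settings
messages = [
    {"role": "system", "content": "Você é ARARAS, copiloto de apoio à decisão clínica em doenças raras..."},
    {"role": "user", "content": "<caso clínico em PT-BR>"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.5, top_p=0.95, top_k=64, repetition_penalty=1.15)
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))  # only the new tokens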

Option C — Full pipeline (BioLORD + Gemma + ORPHA lookup + PCDT)

git clone https://github.com/rarasAI/araras-gemma4
cd araras-gemma4 && pip install -r requirements.txt
python -m araras_gemma4.pipeline "<caso clínico>"

Limitations

  • Quantization Q4 ceiling: Q4_K_M loses ~3-5% accuracy vs full bf16. Use Q5_K_M or full adapter for research.
  • Ultra-rare disease coverage: training data is heavily skewed toward the 12 CEAF-covered diseases. Performance drops on ultra-rare conditions (<100 cases described in Brazil).
  • SMA subtype confusion: distinguishing SMA type 1 vs 2 vs 3 from phenotype text alone is genuinely hard without genetic typing. We document this honestly.
  • ORPHA code hallucination: like all rare-disease LLMs, may emit invented ORPHA numbers. Always use the canonical lookup post-processor in the pipeline.
  • Not for emergency triage: this is for longitudinal-care decision support, not ER triage.
  • PT-BR primary: model trained primarily for Brazilian Portuguese clinical idiom. English/Spanish performance not characterized.
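One way to follow the advice on ORPHA code hallucination is to pull every ORPHA code out of the model's text and split it into codes that exist in the canonical dictionary and codes that do not. This is a minimal sketch: `split_orpha_codes` is a hypothetical name, ORPHA:9999999 is an invented code used to stand in for a hallucination, and in practice `canonical` would be built from the pipeline's disease dictionary rather than the toy set used here.

```python
import re
from typing import List, Set, Tuple

# Matches "ORPHA:36899" and loose variants such as "orpha: 36899".
ORPHA_RE = re.compile(r"\bORPHA:\s*(\d+)\b", re.IGNORECASE)

def split_orpha_codes(text: str, canonical: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (known, unknown) ORPHA codes in order of appearance in the model output."""
    known: List[str] = []
    unknown: List[str] = []
    for number in ORPHA_RE.findall(text):
        code = f"ORPHA:{number}"
        (known if code in canonical else unknown).append(code)
    return known, unknown

canonical = {"ORPHA:36899"}  # toy set; the real one comes from the disease dictionary
print(split_orpha_codes("Hipótese: ORPHA:36899; alternativa orpha: 9999999", canonical))
# (['ORPHA:36899'], ['ORPHA:9999999'])
```

Codes in the unknown list should be dropped, or re-derived from the disease name through the canonical resolver, rather than shown to the clinician.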

Companion: RareBench-BR_Public

Together with this model we release RareBench-BR_Public: 835 validated cases spanning 24 official Ministry of Health PCDTs plus hard Brazilian rare-disease cases. It is the first benchmark in the world that scores rare-disease LLMs on Brazilian SUS clinical management (conduta: PCDT + CEAF medication).


Citation

@misc{raras_gemma4_2026,
  author = {Raras Team},
  title  = {Araras-Gemma4-E4B: Open Gemma 4 Clinical Decision Support for Brazilian Rare-Disease Care},
  year   = {2026},
  month  = {may},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/Raras-AI/araras-gemma4-e4b-v4-sota},
  howpublished = {Submitted to Gemma 4 Good Hackathon, Google DeepMind / Kaggle},
}

Acknowledgments

Built on the work of Google DeepMind (Gemma 4), Unsloth (training recipe), FremyCompany (BioLORD-2023), Chen et al. (RareBench L1), Conitec / Ministry of Health Brazil (PCDTs), Casa dos Raros (model of local-clinician intervention), and the rare-disease open-science community.

Built by a Brazilian rare-disease patient, for the 13 million Brazilians the existing AI doesn't know exist — and for the doctors who attend them in UBSs across the country.


Evaluation results (self-reported; RareBench-BR_SUS unified, 833 cases, all layers, 0 errors)

  • R@1 (full unified bench): 0.412
  • R@3 (full unified bench): 0.471
  • Track B PCDT-correct (331/431 evaluable cases; model recommends the exact CEAF drug): 0.768
  • L5_v2 R@1 (SUS-grounded subset, n=619): 0.472
  • L5_realsus standalone R@1 (n=240, common SUS rare diseases): 0.704