MedGemma Health Chat — MLX 8-bit

8-bit MLX quantized version of the MedGemma Health Chat model, optimized for Apple Silicon inference with near-lossless quality.

Model Details

Base model: google/medgemma-1.5-4b-it (fine-tuned via LoRA, merged)
Quantization: 8-bit, group size 64, affine mode
Format: MLX (Apple's machine learning framework for Apple silicon)
Architecture: Gemma 3 4B (34 layers, 2560 dim, 8 heads, 4 KV heads)
Parameters: 4.4B (quantized to ~3.8GB)
Context length: 131,072 tokens (2,048 recommended for health chat)

How to Use

With mlx-lm

# Install mlx-lm
pip install mlx-lm

# Generate directly from HF Hub
mlx_lm.generate \
  --model bisonnetworking/medgemma-health-chat-mlx-8bit \
  --prompt "I've had a sore throat and low-grade fever for 3 days. What should I do?" \
  --system-prompt "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers."

# Interactive chat
mlx_lm.chat --model bisonnetworking/medgemma-health-chat-mlx-8bit

With Python

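from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Download (or reuse the cached copy of) the quantized weights and tokenizer
model, tokenizer = load("bisonnetworking/medgemma-health-chat-mlx-8bit")

messages = [
    {
        "role": "system",
        "content": "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers.",
    },
    {
        "role": "user",
        "content": "I've had a sore throat and low-grade fever (100.4) for 3 days. No cough, no swollen lymph nodes. What should I do?",
    },
]

# Render the conversation with the model's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object
# instead of a temp= keyword argument
sampler = make_sampler(temp=0.7)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler)
print(response)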
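
For multi-turn use, the conversation history has to stay within whatever context budget you choose (this card suggests 2,048 tokens for health chat). The following is a minimal sketch of one way to do that and is not part of mlx-lm: `trim_history` and the `count_tokens` callable are hypothetical names, and the count ignores the few extra tokens the chat template adds per turn. In real use, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`.

```python
def trim_history(messages, count_tokens, budget=2048):
    """Drop the oldest non-system turns until the history fits the budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    count_tokens: callable mapping a string to a token count (assumption:
        supplied by the caller, e.g. built on the model's tokenizer).
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep the system prompt and at least the most recent turn.
    while len(turns) > 1 and total(system + turns) > budget:
        turns.pop(0)
        # Keep the history starting on a user turn so roles still alternate.
        while len(turns) > 1 and turns[0]["role"] != "user":
            turns.pop(0)
    return system + turns
```

The trimmed list can be passed to `tokenizer.apply_chat_template` exactly like `messages` in the example above.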

As an OpenAI-Compatible Server

mlx_lm.server --model bisonnetworking/medgemma-health-chat-mlx-8bit --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bisonnetworking/medgemma-health-chat-mlx-8bit",
    "messages": [
      {"role": "system", "content": "You are a Primary Care Physician..."},
      {"role": "user", "content": "I have a sore throat. What should I do?"}
    ]
  }'
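
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8080. `build_chat_request` and `ask` are illustrative helper names, `max_tokens` and `temperature` are standard OpenAI-style fields that the server is assumed to honor, and passing the same repo ID that was given to `--model` is an assumption about how the server resolves the `model` field.

```python
import json
import urllib.request

MODEL_ID = "bisonnetworking/medgemma-health-chat-mlx-8bit"


def build_chat_request(system, user, base_url="http://localhost:8080",
                       max_tokens=256, temperature=0.7):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(system, user, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(system, user, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-style response layout: first choice, assistant message content.
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(ask("You are a board-certified Primary Care Physician.",
#           "I have a sore throat. What should I do?"))
```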

Quantization Details

Tool: mlx-lm 0.31.3
Bits: 8
Group size: 64
Mode: affine
Converted from: bisonnetworking/medgemma-health-chat-merged (F16)

8-bit MLX quantization is near-lossless: in practice, responses are hard to tell apart from those of the original F16 model. Weights are split into groups of 64, and each group is quantized to 8 bits with its own affine scale and bias. Choose this variant when you have the memory and want maximum quality.
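
To make the grouping concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It illustrates the idea only and is not MLX's actual kernel: the function names are hypothetical, rounding details may differ, and the bits-per-weight helper assumes each group's scale and bias are stored as 16-bit floats (32 bits of overhead per group).

```python
import random


def quantize_group(weights, bits=8):
    """Map one group of floats to integer codes plus a (scale, bias) pair."""
    levels = (1 << bits) - 1                  # 255 steps for 8 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0         # constant group: any scale works
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo                   # lo serves as the group's bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~ code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=64, bits=8):
    """Quantize a flat list of weights one group at a time."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize(groups):
    """Concatenate the reconstructed values of every group."""
    out = []
    for codes, scale, bias in groups:
        out.extend(dequantize_group(codes, scale, bias))
    return out


def bits_per_weight(bits=8, group_size=64, overhead_bits=32):
    """Code bits plus the per-group scale and bias amortized over the group."""
    return bits + overhead_bits / group_size


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_err:.2e}")
print(f"effective bits per weight: {bits_per_weight()}")
```

Because each group has its own scale, the rounding error of any weight is bounded by half of that group's step size, which is why small groups and 8-bit codes keep the reconstruction so close to the original.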

4-bit vs 8-bit

Property	4-bit	8-bit
Size	~2.5GB	~3.8GB
Quality	High	Near-lossless
Memory (16GB Mac)	Comfortable	Comfortable
Memory (8GB Mac)	Tight but works	Not recommended
Use case	General health chat	Maximum quality responses

Personas

The model supports 6 medical personas via system prompts:

Persona	Specialty
Primary Care	General practice, common conditions
Internal Medicine	Complex adult medicine, multi-system disorders
Clinical Nutritionist	Dietary interventions, nutritional therapy
Exercise Specialist	Therapeutic exercise, sports performance
Best Doctor	Cross-specialty integration, comprehensive care
Chronic Health	Chronic illness management, diagnostic mysteries
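
Selecting a persona comes down to choosing the system prompt. The sketch below is illustrative: only the Primary Care prompt is quoted from this card, while prompts for the other personas are generated from the table with an assumed template that may not match the wording used in training. `PERSONAS`, `system_prompt`, and `build_messages` are hypothetical names.

```python
# Persona -> specialty, copied from the table above.
PERSONAS = {
    "Primary Care": "General practice, common conditions",
    "Internal Medicine": "Complex adult medicine, multi-system disorders",
    "Clinical Nutritionist": "Dietary interventions, nutritional therapy",
    "Exercise Specialist": "Therapeutic exercise, sports performance",
    "Best Doctor": "Cross-specialty integration, comprehensive care",
    "Chronic Health": "Chronic illness management, diagnostic mysteries",
}

# The only system prompt quoted verbatim in this card.
PRIMARY_CARE_PROMPT = (
    "You are a board-certified Primary Care Physician. "
    "Provide direct, clinical guidance. No tables. No AI disclaimers."
)


def system_prompt(persona):
    """Return a system prompt for the given persona name."""
    if persona not in PERSONAS:
        raise KeyError(f"unknown persona: {persona!r}")
    if persona == "Primary Care":
        return PRIMARY_CARE_PROMPT
    # Assumed template: the exact prompts for the other personas
    # are not given in this card.
    return (
        f"You are acting in the {persona} persona ({PERSONAS[persona]}). "
        "Provide direct, clinical guidance."
    )


def build_messages(persona, question):
    """Build a chat message list for the chosen persona."""
    return [
        {"role": "system", "content": system_prompt(persona)},
        {"role": "user", "content": question},
    ]
```

The list returned by `build_messages` has the same shape as the `messages` used in the Python example and in the server request body.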

Related Models

MLX 4-bit — smaller, slightly lower quality
GGUF Q4_K_M / Q8_0 — for llama.cpp/Ollama
Merged 16-bit — full precision
LoRA adapter — for PEFT
Training dataset

Limitations

Only runs on Apple Silicon (M1/M2/M3/M4 or later)
8-bit uses more memory than 4-bit — ensure your Mac has sufficient RAM
Not a substitute for professional medical advice
MedGemma base model is gated — users need approved access to the original model

Model Card Contact

bisonnetworking
