MedGemma Health Chat: MLX 8-bit

8-bit MLX quantized version of the MedGemma Health Chat model, optimized for Apple Silicon inference with near-lossless quality.

Model Details

  • Base model: google/medgemma-1.5-4b-it (fine-tuned via LoRA, merged)
  • Quantization: 8-bit, group size 64, affine mode
  • Format: MLX (Apple's machine learning framework for Apple Silicon)
  • Architecture: Gemma 3 4B (34 layers, 2560 dim, 8 heads, 4 KV heads)
  • Parameters: 4.4B (quantized to ~3.8GB)
  • Context length: 131,072 tokens (a 2,048-token window is enough for health chat)

How to Use

With mlx-lm

# Install mlx-lm
pip install mlx-lm

# Generate directly from HF Hub
mlx_lm.generate \
  --model bisonnetworking/medgemma-health-chat-mlx-8bit \
  --prompt "I've had a sore throat and low-grade fever for 3 days. What should I do?" \
  --system-prompt "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers."

# Interactive chat
mlx_lm.chat --model bisonnetworking/medgemma-health-chat-mlx-8bit

With Python

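from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Downloads the weights from the HF Hub on first use, then loads them from the local cache
model, tokenizer = load("bisonnetworking/medgemma-health-chat-mlx-8bit")

messages = [
    {
        "role": "system",
        "content": "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers.",
    },
    {
        "role": "user",
        "content": "I've had a sore throat and low-grade fever (100.4) for 3 days. No cough, no swollen lymph nodes. What should I do?",
    },
]

# Render the conversation with the model's chat template, ending with the assistant turn marker
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object;
# older releases accepted temp=0.7 directly as a keyword argument to generate()
sampler = make_sampler(temp=0.7)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler)
print(response)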

As an OpenAI-Compatible Server

mlx_lm.server --model bisonnetworking/medgemma-health-chat-mlx-8bit --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bisonnetworking/medgemma-health-chat-mlx-8bit",
    "messages": [
      {"role": "system", "content": "You are a Primary Care Physician..."},
      {"role": "user", "content": "I have a sore throat. What should I do?"}
    ]
  }'
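
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is running locally on port 8080 as started above and that it returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_chat_request` and `ask`, the `max_tokens` default, and the use of the full repo id in the `model` field are illustrative choices, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is listening on localhost:8080 as shown above.
BASE_URL = "http://localhost:8080/v1/chat/completions"
# Assumption: the full repo id is passed as the model name.
MODEL_ID = "bisonnetworking/medgemma-health-chat-mlx-8bit"


def build_chat_request(system_prompt, user_message, max_tokens=256):
    """Build (but do not send) a chat completion POST request."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(system_prompt, user_message, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(system_prompt, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("You are a Primary Care Physician...", "I have a sore throat. What should I do?"))` should print the reply. Keeping request construction separate from sending makes the payload easy to inspect without a live server.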

Quantization Details

8-bit MLX quantization provides near-lossless quality; the difference from the original F16 model is imperceptible in practice. Weights are divided into groups of 64 and each group is quantized to 8 bits with an affine scale and bias. This is the best choice when you have the memory and want maximum quality.
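
The group-wise affine scheme described above can be illustrated in plain Python. This is a rough sketch of the idea, not MLX's actual kernel: it assumes each group stores one scale and one bias (taken here as the group minimum), that both are kept in 16 bits, and it ignores how MLX packs the 8-bit values in memory. All function names are hypothetical.

```python
import random

GROUP_SIZE = 64          # weights per quantization group
BITS = 8                 # bits per quantized weight
LEVELS = (1 << BITS) - 1  # 255 steps between the group min and max


def quantize_group(weights):
    """Affine-quantize one group so that w is approximately scale * q + bias."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / LEVELS
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works, every q is 0
    bias = w_min
    q = [min(LEVELS, max(0, round((w - bias) / scale))) for w in weights]
    return q, scale, bias


def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * v + bias for v in q]


def max_roundtrip_error(weights):
    """Largest absolute error after quantizing and dequantizing group by group."""
    worst = 0.0
    for start in range(0, len(weights), GROUP_SIZE):
        group = weights[start:start + GROUP_SIZE]
        q, scale, bias = quantize_group(group)
        restored = dequantize_group(q, scale, bias)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


def bits_per_weight(bits=BITS, group_size=GROUP_SIZE, meta_bits=16):
    """Effective storage cost, assuming a 16-bit scale and 16-bit bias per group."""
    return bits + 2 * meta_bits / group_size


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(4 * GROUP_SIZE)]
print(f"max round-trip error: {max_roundtrip_error(weights):.6f}")
print(f"bits per weight: {bits_per_weight()}")
```

Each weight lands within half a quantization step of its original value, and a step is at most 1/255 of the group's range, which is why 8-bit output stays so close to the unquantized model. Under the stated 16-bit assumption, the per-group scale and bias add 0.5 bits per weight on top of the 8 bits.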

4-bit vs 8-bit

| Property | 4-bit | 8-bit |
|---|---|---|
| Size | ~2.5GB | ~3.8GB |
| Quality | High | Near-lossless |
| Memory (16GB Mac) | Comfortable | Comfortable |
| Memory (8GB Mac) | Tight but works | Not recommended |
| Use case | General health chat | Maximum quality responses |

Personas

The model supports 6 medical personas via system prompts:

| Persona | Specialty |
|---|---|
| Primary Care | General practice, common conditions |
| Internal Medicine | Complex adult medicine, multi-system disorders |
| Clinical Nutritionist | Dietary interventions, nutritional therapy |
| Exercise Specialist | Therapeutic exercise, sports performance |
| Best Doctor | Cross-specialty integration, comprehensive care |
| Chronic Health | Chronic illness management, diagnostic mysteries |
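
A persona is selected by changing the system prompt. The sketch below builds the `messages` list for a chosen persona. Only the Primary Care wording comes from the examples in this card; the role titles for the other five personas are hypothetical placeholders modeled on it, and `PERSONA_TITLES` and `build_messages` are illustrative names.

```python
# Role title inserted into the system prompt for each persona.
# Only "Primary Care" matches the wording used elsewhere in this card;
# the other five titles are hypothetical and may need adjusting.
PERSONA_TITLES = {
    "Primary Care": "board-certified Primary Care Physician",
    "Internal Medicine": "board-certified Internal Medicine physician",
    "Clinical Nutritionist": "Clinical Nutritionist",
    "Exercise Specialist": "Exercise Specialist",
    "Best Doctor": "physician who integrates care across specialties",
    "Chronic Health": "physician focused on chronic illness management",
}

PROMPT_TEMPLATE = (
    "You are a {title}. Provide direct, clinical guidance. "
    "No tables. No AI disclaimers."
)


def build_messages(persona, question):
    """Return a chat message list with the persona's system prompt first."""
    if persona not in PERSONA_TITLES:
        known = ", ".join(sorted(PERSONA_TITLES))
        raise ValueError(f"Unknown persona {persona!r}; expected one of: {known}")
    system_prompt = PROMPT_TEMPLATE.format(title=PERSONA_TITLES[persona])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template` or sent as the `messages` field of a server request.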

Limitations

  • Only runs on Apple Silicon (M1/M2/M3/M4 or later)
  • 8-bit uses more memory than 4-bit, so make sure your Mac has sufficient RAM
  • Not a substitute for professional medical advice
  • MedGemma base model is gated; users need approved access to the original model

Model Card Contact

bisonnetworking
