MedGemma Health Chat - MLX 8-bit
8-bit MLX quantized version of the MedGemma Health Chat model, optimized for Apple Silicon inference with near-lossless quality.
Model Details
- Base model: google/medgemma-1.5-4b-it (fine-tuned via LoRA, merged)
- Quantization: 8-bit, group size 64, affine mode
- Format: MLX (Apple's machine learning framework for Apple Silicon)
- Architecture: Gemma 3 4B (34 layers, 2560 dim, 8 heads, 4 KV heads)
- Parameters: 4.4B (quantized to ~3.8GB)
- Context length: 131,072 tokens (a 2,048-token window is suggested for health chat)
How to Use
With mlx-lm
# Install mlx-lm
pip install mlx-lm
# Generate directly from HF Hub
mlx_lm.generate \
--model bisonnetworking/medgemma-health-chat-mlx-8bit \
--prompt "I've had a sore throat and low-grade fever for 3 days. What should I do?" \
--system-prompt "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers."
# Interactive chat
mlx_lm.chat --model bisonnetworking/medgemma-health-chat-mlx-8bit
With Python
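from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Download (if needed) and load the quantized weights and tokenizer from the Hub
model, tokenizer = load("bisonnetworking/medgemma-health-chat-mlx-8bit")

messages = [
    {
        "role": "system",
        "content": "You are a board-certified Primary Care Physician. Provide direct, clinical guidance. No tables. No AI disclaimers.",
    },
    {
        "role": "user",
        "content": "I've had a sore throat and low-grade fever (100.4) for 3 days. No cough, no swollen lymph nodes. What should I do?",
    },
]

# Render the conversation into a single prompt string using the model's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object, not a `temp` keyword
sampler = make_sampler(temp=0.7)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler)
print(response)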
As an OpenAI-Compatible Server
mlx_lm.server --model bisonnetworking/medgemma-health-chat-mlx-8bit --port 8080
# Then use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "bisonnetworking/medgemma-health-chat-mlx-8bit",
"messages": [
{"role": "system", "content": "You are a Primary Care Physician..."},
{"role": "user", "content": "I have a sore throat. What should I do?"}
]
}'
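The same request can be sent from Python without any extra dependencies. The following is a minimal sketch, assuming the server was started on port 8080 as shown above and that it returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_payload` and `chat`, and the `max_tokens` and `temperature` values, are illustrative choices rather than anything prescribed by this model card.

```python
import json
import urllib.request

# Assumptions: the server runs locally on port 8080 and the model is
# addressed by its full Hub repo ID.
BASE_URL = "http://localhost:8080/v1/chat/completions"
MODEL_ID = "bisonnetworking/medgemma-health-chat-mlx-8bit"


def build_payload(user_text, system_text, max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(user_text, system_text, url=BASE_URL, timeout=120):
    """POST the request to the running server and return the reply text."""
    body = json.dumps(build_payload(user_text, system_text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Show the request body without contacting the server.
print(json.dumps(build_payload("I have a sore throat. What should I do?",
                               "You are a board-certified Primary Care Physician."),
                 indent=2))
```

With the server running, `chat("I have a sore throat. What should I do?", "You are a board-certified Primary Care Physician.")` should return the generated reply as a string.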
Quantization Details
- Tool: mlx-lm 0.31.3
- Bits: 8
- Group size: 64
- Mode: affine
- Converted from: bisonnetworking/medgemma-health-chat-merged (F16)
8-bit MLX quantization provides near-lossless quality: the difference from the original F16 model is imperceptible in practice. Weights are divided into groups of 64 and each group is quantized to 8 bits with an affine scale and bias. This is the best choice when you have the memory and want maximum quality.
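To make the group-wise affine scheme concrete, here is a rough pure-Python sketch of the idea. It is not MLX's actual kernel. It assumes the simplest min/max formulation, in which each group of 64 weights gets a scale of `(max - min) / 255` and a bias equal to the group minimum, and it assumes the scale and bias are each stored in 16 bits. All function names are illustrative.

```python
import random


def quantize_group(weights, bits=8):
    """Affine-quantize one group of floats to unsigned integer codes."""
    levels = (1 << bits) - 1                 # 255 distinct steps for 8 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:                         # constant group: any scale works
        scale = 1.0
    bias = w_min
    codes = [round((w - bias) / scale) for w in weights]
    return codes, scale, bias


def quantize(weights, group_size=64, bits=8):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights: w is roughly code * scale + bias."""
    restored = []
    for codes, scale, bias in groups:
        restored.extend(code * scale + bias for code in codes)
    return restored


def bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Storage cost per weight, counting one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]   # toy weights, 4 groups
groups = quantize(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(f"max abs reconstruction error: {max_err:.2e}")
print(f"effective bits per weight: {bits_per_weight():.2f}")
```

In this formulation the reconstruction error of any weight is bounded by half of its group's scale, which is why a fine 8-bit grid over a small group stays so close to the original values.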
4-bit vs 8-bit
| Property | 4-bit | 8-bit |
|---|---|---|
| Size | ~2.5GB | ~3.8GB |
| Quality | High | Near-lossless |
| Memory (16GB Mac) | Comfortable | Comfortable |
| Memory (8GB Mac) | Tight but works | Not recommended |
| Use case | General health chat | Maximum quality responses |
Personas
The model supports 6 medical personas via system prompts:
| Persona | Specialty |
|---|---|
| Primary Care | General practice, common conditions |
| Internal Medicine | Complex adult medicine, multi-system disorders |
| Clinical Nutritionist | Dietary interventions, nutritional therapy |
| Exercise Specialist | Therapeutic exercise, sports performance |
| Best Doctor | Cross-specialty integration, comprehensive care |
| Chronic Health | Chronic illness management, diagnostic mysteries |
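Persona selection can be wrapped in a small helper that builds the `messages` list. The sketch below is illustrative only: the Primary Care prompt is the one used in the examples in this card, while the wording generated for the other five personas is a hypothetical template and may not match the prompts used during fine-tuning. The names `PERSONAS`, `system_prompt`, and `build_messages` are likewise assumptions.

```python
# Persona names and specialties, taken from the table above.
PERSONAS = {
    "Primary Care": "General practice, common conditions",
    "Internal Medicine": "Complex adult medicine, multi-system disorders",
    "Clinical Nutritionist": "Dietary interventions, nutritional therapy",
    "Exercise Specialist": "Therapeutic exercise, sports performance",
    "Best Doctor": "Cross-specialty integration, comprehensive care",
    "Chronic Health": "Chronic illness management, diagnostic mysteries",
}

# The only system prompt actually shown in this model card.
PRIMARY_CARE_PROMPT = (
    "You are a board-certified Primary Care Physician. "
    "Provide direct, clinical guidance. No tables. No AI disclaimers."
)


def system_prompt(persona):
    """Return a system prompt for the given persona name."""
    if persona not in PERSONAS:
        raise KeyError(f"unknown persona: {persona!r}")
    if persona == "Primary Care":
        return PRIMARY_CARE_PROMPT
    # Hypothetical wording: the real prompts for these personas are not
    # documented in this card.
    focus = PERSONAS[persona].lower()
    return (
        f"You are acting in the {persona} persona (focus: {focus}). "
        "Provide direct, clinical guidance. No tables. No AI disclaimers."
    )


def build_messages(persona, user_text):
    """Build a chat-template-ready message list for one user turn."""
    return [
        {"role": "system", "content": system_prompt(persona)},
        {"role": "user", "content": user_text},
    ]


for message in build_messages("Clinical Nutritionist", "How much protein do I need?"):
    print(message["role"], "->", message["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` or used as the `messages` field of a server request.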
Related Models
- MLX 4-bit: smaller, slightly lower quality
- GGUF Q4_K_M / Q8_0: for llama.cpp/Ollama
- Merged 16-bit: full precision
- LoRA adapter: for PEFT
- Training dataset
Limitations
- Only runs on Apple Silicon (M1/M2/M3/M4 or later)
- 8-bit uses more memory than 4-bit, so ensure your Mac has sufficient RAM
- Not a substitute for professional medical advice
- MedGemma base model is gated, so users need approved access to the original model