Phi-4-Multimodal-Instruct — MLX 8-bit
An 8-bit quantized Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.
Converted by Ferox AI · Vision-language inference on MacBook / Mac Studio / Mac Pro without cloud dependencies.
| Property | Value |
|---|---|
| Parameters | 5.6B (pre-LoRA-fusion) |
| Quantization | 8-bit, group_size=64 (backbone only; SigLIP encoder remains FP16) |
| Disk size | ~5.5 GB |
| Base model | microsoft/Phi-4-multimodal-instruct |
| License | MIT |
| Modality | Vision + Text (Phase 1; audio deferred) |
This variant offers a balance between the 4-bit model's memory efficiency and the bf16 model's full precision. Recommended for systems with 16 GB+ unified memory where accuracy is prioritized over memory footprint.
Other variants: 4-bit (smallest) · bf16 (full precision)
Quickstart
from mlx_vlm import load, generate
model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-8bit")
output = generate(
    model,
    processor,
    "Describe this image in detail.",
    ["path/to/image.jpg"],
    max_tokens=512,
    verbose=False,
)
print(output)
Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:
pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"
Benchmark Results
Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16 — direct comparison should account for both the precision difference and sample-size variance.
| Benchmark | This Model (8-bit) | 4-bit | bf16 | Microsoft FP16 (full) | Metric |
|---|---|---|---|---|---|
| ChartQA | 85.0 | 86.0 | 85.0 | 81.4 | Relaxed Accuracy |
| DocVQA | 86.1 | 82.8 | 86.2 | 93.2 | ANLS |
| MMMU | 29.0 | 24.0 | 31.0 | 55.1 | Accuracy |
| OCRBench | 850 | 840 | 840 | 844 | Score / 1000 |
| TextVQA | 81.0 | 80.0 | 82.0 | 75.6 | Accuracy |
| AI2D | —‡ | 83.0 | 90.0 | 82.3 | Accuracy |
| MathVista | —‡ | 58.0 | 58.0 | 62.4 | Accuracy |
| ScienceQA | —‡ | 95.8†| 100.0†| 97.5 | Accuracy |
† ScienceQA: scored on the 48 image-bearing questions of the 100-sample subset (text-only questions excluded).

‡ Not yet evaluated for the 8-bit variant. These three benchmarks were measured for the 4-bit and bf16 variants but have not been re-run at 8-bit; they will be added in a future update.
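For readers unfamiliar with the two less common metrics in the table, the sketch below shows how ANLS (DocVQA) and relaxed accuracy (ChartQA) are commonly defined. It is an illustration only: the internal harness used for the scores above is not published here, and its text normalization and number parsing may differ. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Per-sample ANLS: best (1 - normalized edit distance) over the references.

    Matches whose normalized distance reaches the threshold count as 0.
    Lowercasing and stripping are assumed normalization steps.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        ref_norm = ref.strip().lower()
        longest = max(len(pred), len(ref_norm))
        distance = levenshtein(pred, ref_norm) / longest if longest else 0.0
        best = max(best, 1.0 - distance if distance < threshold else 0.0)
    return best


def relaxed_match(prediction: str, reference: str, tolerance: float = 0.05) -> bool:
    """Relaxed accuracy: numeric answers may deviate by up to 5% (relative);
    non-numeric answers fall back to a case-insensitive exact match."""
    try:
        pred_value = float(prediction)
        ref_value = float(reference)
    except ValueError:
        return prediction.strip().lower() == reference.strip().lower()
    if ref_value == 0.0:
        return pred_value == 0.0
    return abs(pred_value - ref_value) / abs(ref_value) <= tolerance
```

A benchmark score is then the mean of the per-sample values, scaled to 0-100. This simplified version does not parse percent signs or thousands separators, which a production harness would likely handle.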
Quantization fidelity
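On the five benchmarks measured at 8-bit, scores stay within about 2 points of the unquantized bf16 variant (DocVQA 86.1 vs 86.2, ChartQA 85.0 vs 85.0, TextVQA 81.0 vs 82.0, MMMU 29.0 vs 31.0, and OCRBench 850 vs 840 on a 1000-point scale). With 100 samples per benchmark, a single question shifts an accuracy score by a full point, so differences of this size are within sampling noise. On these tasks, 8-bit group quantization shows no measurable loss for this model.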
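To put the 100-sample scores in perspective, the following sketch computes a 95% Wilson score interval for an accuracy measured on n questions. It assumes each question is an independent pass/fail trial, so it applies to the accuracy-style metrics but not directly to ANLS, which is a per-sample score between 0 and 1. The helper name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ChartQA at 8-bit: 85 correct out of 100.
low, high = wilson_interval(85, 100)
print(f"ChartQA 8-bit: {low:.3f} to {high:.3f}")

# MMMU: 29/100 (8-bit) vs 31/100 (bf16).
print("MMMU 8-bit:", wilson_interval(29, 100))
print("MMMU bf16: ", wilson_interval(31, 100))
```

For 85 out of 100 the interval spans roughly 77% to 91%, and the MMMU intervals for 29 and 31 correct answers overlap almost entirely, so a gap of one or two points between variants cannot be distinguished from sampling noise at this sample size.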
Note on MMMU
The 100-sample MMMU scores (29.0% 8-bit, 24.0% 4-bit, 31.0% bf16) fall well below Microsoft's reported 55.1%. To isolate the cause, we ran the full 900-sample MMMU validation set on the unquantized bf16 variant and obtained 27.9%, consistent with the subset. This rules out quantization and subset sampling as the cause of the gap, and the bf16 variant matching or exceeding Microsoft's scores on ChartQA, TextVQA, and AI2D argues against a weight-conversion problem. We were unable to reproduce Microsoft's 55.1% and attribute the difference to evaluation-harness and answer-extraction handling for MMMU's multiple-choice format (prompt formatting and option parsing) rather than to the model's underlying capability, which is better reflected by the document-, chart-, and OCR-focused benchmarks above.
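The sketch below illustrates how much option parsing alone can move a multiple-choice score. It compares a strict extractor with a more lenient one on a handful of made-up replies. These replies are invented for illustration and are not outputs of this model; the two extractors are hypothetical and are not the parsers used by our harness or by Microsoft.

```python
import re
from typing import Callable, Optional

OPTION_LETTERS = "ABCD"

# "answer is (B)", "Answer: C" ... but not "answer is Delhi".
_ANSWER_PHRASE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])")
# A reply that starts with "B", "(D)", "A." or "C:".
_LEADING_LETTER = re.compile(r"^\(?([A-D])\)?(?:[.:)]|\s*$)")


def extract_strict(reply: str) -> Optional[str]:
    """Accept the reply only if it is exactly one option letter."""
    text = reply.strip()
    return text if len(text) == 1 and text in OPTION_LETTERS else None


def extract_lenient(reply: str) -> Optional[str]:
    """Also accept common phrasings such as 'The answer is (C).'"""
    text = reply.strip()
    for pattern in (_ANSWER_PHRASE, _LEADING_LETTER):
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def accuracy(replies: list[str], gold: list[str],
             extract: Callable[[str], Optional[str]]) -> float:
    """Fraction of replies whose extracted letter equals the gold letter."""
    correct = sum(1 for reply, answer in zip(replies, gold) if extract(reply) == answer)
    return correct / len(gold)


# Invented replies for illustration only.
replies = [
    "B",
    "The answer is (C).",
    "Answer: A",
    "(D) because the slope is negative.",
    "I think it is probably B",
]
gold = ["B", "C", "A", "D", "B"]

print("strict: ", accuracy(replies, gold, extract_strict))
print("lenient:", accuracy(replies, gold, extract_lenient))
```

On these five replies the strict parser credits one answer and the lenient parser credits four, even though the underlying replies are identical. A real harness also has to decide how to score replies that neither pattern recognizes, such as the last one.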
Architecture
| Component | Details |
|---|---|
| Backbone | Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE (131K context) |
| Vision encoder | SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152 |
| Vision projection | 2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072) |
| Vision LoRA | rank=256, alpha=512 (~370M params) — pre-fused into backbone weights before quantization |
| Quantization | 8-bit with group_size=64. Applied to backbone linear layers only; SigLIP remains FP16 |
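The last two table rows describe a fuse-then-quantize pipeline. As a rough sketch of the idea, and not the actual conversion code, the example below merges a LoRA update into a weight matrix using the standard formula W' = W + (alpha / rank) · B · A, then applies per-group affine 8-bit quantization. It uses plain Python lists on toy matrices. The function names are hypothetical, and the min/max affine scheme and the 16-bit per-group scale and bias are assumptions about how the quantized weights are stored.

```python
def fuse_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    weight: out x in, lora_a: rank x in, lora_b: out x rank.
    With rank=256 and alpha=512 the scaling factor is 2.0.
    """
    scale = alpha / rank
    fused = []
    for i, row in enumerate(weight):
        fused.append([
            w + scale * sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            for j, w in enumerate(row)
        ])
    return fused


def quantize_group(values, bits=8):
    """Map one group to integer codes plus a shared scale and bias (the group minimum)."""
    levels = (1 << bits) - 1
    low, high = min(values), max(values)
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((v - low) / scale) for v in values]
    return codes, scale, low


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values: code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Quantize a weight row in consecutive groups of group_size values."""
    return [quantize_group(row[start:start + group_size], bits)
            for start in range(0, len(row), group_size)]


def effective_bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Storage cost per weight, assuming one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size
```

Under these assumptions, each reconstructed weight differs from the original by at most half a quantization step within its group, and the storage cost is 8.5 bits per quantized weight rather than exactly 8. Fusing before quantizing means the LoRA update is quantized together with the base weights and no separate adapter matrices are needed at inference time.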
Variant comparison
| Variant | Disk Size | Memory (approx) | Best For |
|---|---|---|---|
| 4-bit | ~3.9 GB | ~5 GB | 8 GB devices, memory-constrained workflows |
| 8-bit (this) | ~5.5 GB | ~7 GB | 16 GB devices, balanced accuracy/memory |
| bf16 | ~8.5 GB | ~10 GB | 24+ GB devices, maximum accuracy |
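As a small illustration of how the table might be applied, the hypothetical helper below picks the largest variant whose "Best For" device tier fits a given amount of unified memory. The thresholds are copied from the table; the helper itself is not part of any released tooling, and real headroom also depends on image resolution, context length, and other running applications.

```python
from typing import Optional

# (variant, minimum device memory in GB from "Best For", approximate runtime memory in GB)
VARIANTS = [
    ("bf16", 24, 10),
    ("8-bit", 16, 7),
    ("4-bit", 8, 5),
]


def pick_variant(unified_memory_gb: float) -> Optional[str]:
    """Return the highest-precision variant recommended for this device, if any."""
    for name, min_device_gb, _approx_runtime_gb in VARIANTS:
        if unified_memory_gb >= min_device_gb:
            return name
    return None  # below the smallest tier listed in the table


for memory_gb in (8, 16, 36):
    print(memory_gb, "GB ->", pick_variant(memory_gb))
```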
Intended Use
This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include document understanding, chart interpretation, visual question answering, OCR, and educational content analysis.
Limitations
- 100-sample evaluations. Benchmark scores are computed on subsets, not full datasets.
- Partial benchmark coverage at 8-bit. AI2D, MathVista, and ScienceQA have not yet been evaluated for this variant (see the benchmark table).
- Vision and text only. Audio support from the original architecture is not included (Phase 1).
- Apple Silicon required. MLX targets Apple's unified memory architecture (M1/M2/M3/M4).
Citation
@misc{feroxai2026phi4mlx,
  title={Phi-4-Multimodal-Instruct MLX Conversion},
  author={Ferox AI},
  year={2026},
  url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-8bit},
  note={8-bit quantized MLX port of microsoft/Phi-4-multimodal-instruct}
}
Acknowledgments
- Microsoft Research for Phi-4-multimodal-instruct
- Apple MLX team for the MLX framework
- Prince Canuma for mlx-vlm