Phi-4-Multimodal-Instruct — MLX bf16
A full-precision (bfloat16) Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.
Converted by Ferox AI · Lossless weight conversion — maximum accuracy for systems with sufficient unified memory.
| Property | Value |
|---|---|
| Parameters | 5.6B (pre-LoRA-fusion) |
| Precision | bfloat16 (no quantization) |
| Disk size | ~8.5 GB |
| Base model | microsoft/Phi-4-multimodal-instruct |
| License | MIT |
| Modality | Vision + Text (Phase 1; audio deferred) |
This is the reference-quality variant with no quantization loss. Recommended for systems with 24 GB+ unified memory (M2 Pro/Max, M3 Pro/Max/Ultra, M4 Pro/Max/Ultra) or when maximum accuracy is required.
Quantized variants: 4-bit (~3.9 GB) · 8-bit (~5.5 GB)
Quickstart

    from mlx_vlm import load, generate

    # Download (or reuse from cache) the converted weights and processor
    model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-bf16")

    # Run one image + text prompt
    output = generate(
        model,
        processor,
        "Describe this image in detail.",
        ["path/to/image.jpg"],
        max_tokens=512,
        verbose=False,
    )
    print(output)

Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:

    pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"

The quotes keep the shell from treating `>=` as an output redirection.
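The minimum versions above can also be checked from Python before loading the model. The following is a minimal sketch, not part of mlx-vlm: `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helper names, and the parser only understands plain dotted numeric versions (pre-release suffixes are ignored). `importlib.metadata.version` is the standard-library call that reads an installed package's version.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in this model card.
MINIMUMS = {"mlx-vlm": "0.1.0", "mlx": "0.22.0"}


def parse_version(text: str) -> tuple:
    """Turn '0.22.1' into (0, 22, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS) -> dict:
    """Return {package: problem} for every requirement that is not met."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    print(check_requirements() or "all requirements met")
```

An empty result means both packages are present at or above the stated minimums; it does not confirm that the installed mlx-vlm build includes Phi-4-MM architecture support.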
Benchmark Results
Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16.
| Benchmark | This Model (bf16) | 4-bit | Microsoft FP16 (full) | Δ vs Microsoft | Metric |
|---|---|---|---|---|---|
| AI2D | 90.0 | 83.0 | 82.3 | +7.7 | Accuracy |
| ChartQA | 85.0 | 86.0 | 81.4 | +3.6 | Relaxed Accuracy |
| DocVQA | 86.2 | 82.8 | 93.2 | −7.0 | ANLS |
| MathVista | 58.0 | 58.0 | 62.4 | −4.4 | Accuracy |
| MMMU | 31.0 | 24.0 | 55.1 | −24.1 | Accuracy |
| OCRBench | 840 | 840 | 844 | −4 | Score / 1000 |
| ScienceQA | 100.0† | 95.8† | 97.5 | +2.5 | Accuracy |
| TextVQA | 82.0 | 80.0 | 75.6 | +6.4 | Accuracy |
†ScienceQA: 48 of 100 samples scored (image-bearing questions only; 52 text-only questions excluded).
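Two of the metrics in the table are less familiar than plain accuracy. The sketch below follows the commonly published definitions: ANLS (used for DocVQA) scores each answer by normalized Levenshtein similarity and zeroes it when the normalized distance reaches 0.5, and ChartQA's relaxed accuracy accepts numeric answers within 5% of the target. The evaluation harness used for this card is internal, so its exact normalization may differ; the function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_single(prediction: str, references: list, threshold: float = 0.5) -> float:
    """Best similarity over all reference answers; 0 once the distance is too large."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        ref_norm = ref.strip().lower()
        longest = max(len(pred), len(ref_norm))
        distance = levenshtein(pred, ref_norm) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers may deviate by `tolerance`; other answers must match exactly."""
    try:
        p, t = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= tolerance
```

The dataset-level score is the mean of `anls_single` (or of `relaxed_match` as 0/1) over all samples, which is why a single OCR slip lowers ANLS only slightly while a wrong chart value costs a full point of relaxed accuracy.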
Conversion fidelity
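The bf16 MLX conversion exceeds Microsoft's PyTorch FP16 reference on 4 of 8 benchmarks (AI2D, ChartQA, ScienceQA, TextVQA) and comes within 4 points out of 1000 on OCRBench. This is consistent with a lossless weight conversion, although 100-sample subsets cannot prove it. We attribute the shortfalls on DocVQA (−7.0) and MathVista (−4.4) to the sampling variance of a 100-sample evaluation; the same variance applies to the positive deltas, which should not be read as real gains over the reference.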
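How much a 100-sample score can move by chance is easy to estimate. The sketch below treats each sample as an independent right/wrong trial (a binomial model) and expresses the gap to Microsoft's score in standard errors. This is a simplification under stated assumptions: it takes Microsoft's full-dataset score as exact, assumes the subset is a random sample, and only applies to per-sample right/wrong metrics, so ANLS and OCRBench are left out. The numbers are copied from the benchmark table.

```python
import math

# Accuracy rows from the benchmark table: (this model bf16, Microsoft reference).
ACCURACY_ROWS = {
    "AI2D": (90.0, 82.3),
    "MathVista": (58.0, 62.4),
    "MMMU": (31.0, 55.1),
}


def standard_error_points(score_percent: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def delta_in_standard_errors(ours: float, reference: float, n: int = 100) -> float:
    """Gap to the reference score, measured in standard errors of our estimate."""
    return (ours - reference) / standard_error_points(ours, n)


if __name__ == "__main__":
    for name, (ours, reference) in ACCURACY_ROWS.items():
        se = standard_error_points(ours, 100)
        z = delta_in_standard_errors(ours, reference)
        print(f"{name}: delta={ours - reference:+.1f}, SE={se:.1f}, delta/SE={z:+.2f}")
```

With the table's numbers, MathVista's deficit is under one standard error, while MMMU's is more than five, which is why MMMU is discussed separately.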
Note on MMMU
The 100-sample MMMU score (31.0%) is well below Microsoft's reported 55.1%. Because this variant is full precision (no quantization loss), the gap cannot be attributed to weight conversion. We re-evaluated on the full 900-sample MMMU validation split and obtained 27.9%, consistent with the subset. We were unable to reproduce Microsoft's 55.1% and attribute the difference to evaluation-harness and answer-extraction handling for MMMU's multiple-choice format (prompt formatting and option parsing), not to the model's underlying capability — which is better reflected by the document-, chart-, OCR-, and science-focused benchmarks above.
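To illustrate how much option parsing alone can change a multiple-choice score, the sketch below scores the same five responses with a strict and a lenient extractor. Both extractors and the sample responses are hypothetical; they are not the parsers used by our harness or by Microsoft, and the lenient one is deliberately simple (it would misread a sentence that begins with the word "A", for example).

```python
import re
from typing import Callable, List, Optional


def extract_strict(response: str) -> Optional[str]:
    """Accept only a response that is exactly one option letter."""
    text = response.strip()
    return text if re.fullmatch(r"[A-D]", text) else None


def extract_lenient(response: str) -> Optional[str]:
    """Also accept forms such as '(B)', 'B. Paris', 'Answer: B', 'The answer is (B)'."""
    # Case-insensitive "answer [is][:]" followed by an upper-case option letter.
    match = re.search(r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-D])\)?(?![A-Za-z])", response)
    if match:
        return match.group(1)
    # Otherwise, an option letter at the very start of the response.
    match = re.match(r"\s*\(?([A-D])\)?(?![A-Za-z])", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str],
             extractor: Callable[[str], Optional[str]]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    hits = sum(extractor(r) == g for r, g in zip(responses, gold))
    return hits / len(responses)


# Hypothetical model outputs that all choose option B.
RESPONSES = ["B", "(B)", "Answer: B", "The answer is (B) because it is the capital.", "B. Paris"]
GOLD = ["B"] * len(RESPONSES)

if __name__ == "__main__":
    print("strict :", accuracy(RESPONSES, GOLD, extract_strict))
    print("lenient:", accuracy(RESPONSES, GOLD, extract_lenient))
```

Every response here picks the right option, yet the strict extractor credits only the first one. A gap of this kind comes from the harness and says nothing about the model's answers.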
Architecture
| Component | Details |
|---|---|
| Backbone | Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE (131K context) |
| Vision encoder | SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152 |
| Vision projection | 2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072). Input is a 2×2 spatial merge of SigLIP patch features |
| Vision LoRA | rank=256, alpha=512 (~370M params) — pre-fused into backbone weights |
| Image preprocessing | Dynamic HD tiling (deterministic grid, up to 8 crops at 448×448). PIL + NumPy only; zero PyTorch dependency at inference |
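The dimensions in the table are tied together by simple arithmetic, which the sketch below makes explicit. The class and property names are illustrative, not the model's config keys. One value is an assumption that the table does not state: a SigLIP patch size of 14 pixels, used only to estimate how many vision tokens one 448×448 crop produces. That estimate ignores any separator tokens or global thumbnail the preprocessor may add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backbone:
    hidden_size: int = 3072
    query_heads: int = 24
    kv_heads: int = 8
    head_dim: int = 128

    @property
    def query_width(self) -> int:
        """Total width of the query projection (equals hidden_size here)."""
        return self.query_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        """Width of each of the key and value projections under GQA."""
        return self.kv_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        """How many query heads share one key/value head."""
        return self.query_heads // self.kv_heads


@dataclass(frozen=True)
class Vision:
    hidden_size: int = 1152
    heads: int = 16
    merge: int = 2          # 2x2 spatial merge before the projection
    crop_size: int = 448
    max_crops: int = 8
    patch_size: int = 14    # ASSUMPTION: not stated in the table

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.heads

    @property
    def projection_input(self) -> int:
        """Merging a 2x2 block concatenates four patch features."""
        return self.hidden_size * self.merge ** 2

    @property
    def tokens_per_crop(self) -> int:
        patches_per_side = self.crop_size // self.patch_size
        return (patches_per_side // self.merge) ** 2


if __name__ == "__main__":
    backbone, vision = Backbone(), Vision()
    print(backbone.query_width, backbone.kv_width, backbone.queries_per_kv_head)
    print(vision.head_dim, vision.projection_input, vision.tokens_per_crop)
```

Under the patch-size assumption, each crop contributes 256 tokens after the merge, so the maximum of 8 crops adds roughly 2,048 image tokens to the prompt.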
Weight provenance
Weights are converted from microsoft/Phi-4-multimodal-instruct in five steps: download → fuse vision LoRA → remap keys → transpose LoRA matrices → serialize as MLX safetensors. No quantization is applied, and every step is deterministic, so the conversion is fully reproducible from the base model.
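The "fuse vision LoRA" step can be pictured with the standard LoRA merge, W' = W + (alpha / rank) · B · A. The sketch below applies that formula to a toy 2×2 weight in pure Python. It is an illustration of the general technique, not the conversion script itself, whose code is not shown here; `fuse_lora` and `matmul` are hypothetical names. The only values taken from this card are rank=256 and alpha=512, which give a scale of 2.0.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a).

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    """
    assert len(lora_a) == len(lora_b[0]), "rank dimensions of A and B must agree"
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Scale implied by the Vision LoRA row of the architecture table.
SCALE = 512 / 256  # alpha / rank = 2.0

# Toy example with rank 1 so the arithmetic is easy to follow.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # (rank=1, in=2)
B = [[1.0], [3.0]]      # (out=2, rank=1)

if __name__ == "__main__":
    print(fuse_lora(W, A, B, SCALE))
```

After fusion the adapter matrices are no longer needed at inference time, which matches the "pre-fused into backbone weights" note in the architecture table.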
Variant comparison
| Variant | Disk Size | Memory (approx) | Best For |
|---|---|---|---|
| 4-bit | ~3.9 GB | ~5 GB | 8 GB devices, memory-constrained workflows |
| 8-bit | ~5.5 GB | ~7 GB | 16 GB devices, balanced accuracy/memory |
| bf16 (this) | ~8.5 GB | ~10 GB | 24+ GB devices, maximum accuracy |
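The "Best For" column can be turned into a small lookup. The sketch below encodes only the device tiers listed in the table; `recommend_variant` is a hypothetical helper, and the caller supplies the machine's unified memory in GB rather than the code detecting it.

```python
from typing import Optional

# (minimum unified memory in GB, variant), largest tier first, from the table.
TIERS = [
    (24, "bf16"),
    (16, "8-bit"),
    (8, "4-bit"),
]


def recommend_variant(unified_memory_gb: float) -> Optional[str]:
    """Pick the highest-precision variant whose device tier fits the machine."""
    for minimum_gb, variant in TIERS:
        if unified_memory_gb >= minimum_gb:
            return variant
    return None  # below the smallest tier listed in the table


if __name__ == "__main__":
    for memory in (8, 16, 18, 36):
        print(memory, "GB ->", recommend_variant(memory))
```

The tiers leave headroom beyond the approximate memory figures in the table, since the operating system and other applications share the same unified memory.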
Intended Use
This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include document understanding, chart interpretation, visual question answering, OCR, and educational content analysis.
Limitations
- 100-sample evaluations. Benchmark scores are computed on subsets, not full datasets.
- Vision and text only. Audio support from the original architecture is not included (Phase 1).
- Memory requirements. Requires ~10 GB unified memory. Use the 4-bit or 8-bit variant for constrained devices.
- Apple Silicon required. MLX targets Apple's unified memory architecture.
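The ~10 GB figure depends on prompt length, because the key/value cache grows with every token. A rough estimate, using the backbone numbers from the architecture table (32 layers, 8 KV heads, head_dim=128), is sketched below. It assumes an unquantized bf16 cache that is kept for the full sequence in every layer, and it ignores activations and framework overhead, so treat the result as a lower bound rather than a measurement.

```python
WEIGHTS_GB = 8.5  # approximate disk size of the bf16 weights, from this card


def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes held by keys and values for `num_tokens` tokens (the factor 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def estimated_memory_gb(num_tokens: int) -> float:
    """Weights plus KV cache, in decimal GB."""
    return WEIGHTS_GB + kv_cache_bytes(num_tokens) / 1e9


if __name__ == "__main__":
    for tokens in (4_096, 32_768, 131_072):
        print(f"{tokens:>7} tokens: ~{estimated_memory_gb(tokens):.1f} GB")
```

Under these assumptions the cache costs 128 KiB per token: about 0.5 GiB at 4,096 tokens, but 16 GiB at the full 131,072-token context, which is one reason to prefer a machine with 24 GB or more for long prompts.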
Citation

    @misc{feroxai2026phi4mlx,
      title={Phi-4-Multimodal-Instruct MLX Conversion},
      author={Ferox AI},
      year={2026},
      url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-bf16},
      note={Full-precision (bf16) MLX port of microsoft/Phi-4-multimodal-instruct}
    }

Acknowledgments
- Microsoft Research for Phi-4-multimodal-instruct
- Apple MLX team for the MLX framework
- Prince Canuma for mlx-vlm