Phi-4-Multimodal-Instruct — MLX 4-bit
A 4-bit quantized Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.
Converted by Ferox AI · Vision-language inference on MacBook / Mac Studio / Mac Pro without cloud dependencies.
| Property | Value |
|---|---|
| Parameters | 5.6B (pre-LoRA-fusion) |
| Quantization | 4-bit, group_size=64 (backbone only; SigLIP encoder remains FP16) |
| Disk size | ~3.9 GB |
| Base model | microsoft/Phi-4-multimodal-instruct |
| License | MIT |
| Modality | Vision + Text (Phase 1; audio deferred) |
Other variants: bf16 (full precision) · 8-bit
Quickstart
```python
from mlx_vlm import load, generate

model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-4bit")

output = generate(
    model,
    processor,
    "Describe this image in detail.",
    ["path/to/image.jpg"],
    max_tokens=512,
    verbose=False,
)
print(output)
```
Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:
```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect.
pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"
```
Benchmark Results
Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16 — direct comparison should account for both the precision difference and sample-size variance.
| Benchmark | This Model (4-bit) | bf16 | Microsoft FP16 (full dataset) | Metric |
|---|---|---|---|---|
| AI2D | 83.0 | 90.0 | 82.3 | Accuracy |
| ChartQA | 86.0 | 85.0 | 81.4 | Relaxed Accuracy |
| DocVQA | 82.8 | 86.2 | 93.2 | ANLS |
| MathVista | 58.0 | 58.0 | 62.4 | Accuracy |
| MMMU | 24.0 | 31.0 | 55.1 | Accuracy |
| OCRBench | 840 | 840 | 844 | Score / 1000 |
| ScienceQA | 95.8† | 100.0† | 97.5 | Accuracy |
| TextVQA | 80.0 | 82.0 | 75.6 | Accuracy |
†ScienceQA: 48 of 100 samples scored (image-bearing questions only; 52 text-only questions excluded).
Quantization impact
Averaged over the seven benchmarks other than MMMU, 4-bit quantization produces a mean accuracy delta of −2.2 percentage points relative to bf16 (about −2.8 if MMMU is included), which is within the expected range for 4-bit group quantization on a model of this scale.
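The mean delta can be rechecked from the benchmark table. The sketch below is a minimal recomputation, assuming OCRBench's score out of 1000 is divided by 10 so that it sits on the same percentage scale as the other rows (an assumption, not something the table states). On these numbers, the −2.2 figure matches the mean over the seven rows other than MMMU, while the mean over all eight rows comes out near −2.8.

```python
# Scores copied from the benchmark table as (4-bit, bf16).
# Assumption: OCRBench (reported out of 1000) is divided by 10 to get a percentage.
scores = {
    "AI2D": (83.0, 90.0),
    "ChartQA": (86.0, 85.0),
    "DocVQA": (82.8, 86.2),
    "MathVista": (58.0, 58.0),
    "MMMU": (24.0, 31.0),
    "OCRBench": (840 / 10, 840 / 10),
    "ScienceQA": (95.8, 100.0),
    "TextVQA": (80.0, 82.0),
}


def mean_delta(names):
    """Mean of (4-bit minus bf16) over the named benchmarks, in percentage points."""
    deltas = [scores[name][0] - scores[name][1] for name in names]
    return sum(deltas) / len(deltas)


all_names = list(scores)
without_mmmu = [name for name in all_names if name != "MMMU"]

mean_all = mean_delta(all_names)
mean_without_mmmu = mean_delta(without_mmmu)

print(f"all 8 benchmarks: {mean_all:+.2f} pp")
print(f"excluding MMMU:   {mean_without_mmmu:+.2f} pp")
```

With only 100 samples per benchmark (48 for ScienceQA), individual deltas of a few points are within sampling noise, so the mean is more informative than any single row.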
Note on MMMU
The 100-sample MMMU scores (24.0% 4-bit, 31.0% bf16) fall well below Microsoft's reported 55.1%. To isolate the cause, we ran a full 900-sample MMMU validation on the lossless bf16 variant and obtained 27.9% — consistent with the subset, which confirms the gap is not caused by quantization or weight conversion. We were unable to reproduce Microsoft's 55.1% and attribute the difference to evaluation-harness and answer-extraction handling for MMMU's multiple-choice format (prompt formatting and option parsing), rather than to the model's underlying capability — which is better reflected by the document-, chart-, OCR-, and science-focused benchmarks above.
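As a rough check on the "consistent with the subset" claim, the sampling noise of a 100-sample estimate can be approximated with a binomial standard error. This is a sketch under simplifying assumptions: samples are treated as independent draws, the 900-sample bf16 result (27.9%) is taken as the underlying accuracy, and the same range is applied to the 4-bit score even though it is a different set of weights. The helper name `standard_error` is illustrative only.

```python
import math


def standard_error(p, n):
    """Binomial standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


p_full = 0.279    # bf16 accuracy on the 900-sample validation run
n_subset = 100    # size of the benchmark subsets used in the table

se = standard_error(p_full, n_subset)
low = p_full - 1.96 * se
high = p_full + 1.96 * se

print(f"standard error at n={n_subset}: {100 * se:.1f} pp")
print(f"approximate 95% range: {100 * low:.1f}% to {100 * high:.1f}%")

for label, score in [("4-bit subset", 0.240), ("bf16 subset", 0.310)]:
    print(f"{label}: inside range = {low <= score <= high}")
```

Under these assumptions both subset scores fall inside the approximate range, so the subset and full-set numbers do not contradict each other. The gap to 55.1% is far larger than this noise.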
Architecture
| Component | Details |
|---|---|
| Backbone | Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE positional encoding (131K context) |
| Vision encoder | SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152 |
| Vision projection | 2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072). Input is a 2×2 spatial merge of SigLIP patch features |
| Vision LoRA | rank=256, alpha=512 (~370M parameters) — pre-fused into backbone weights before quantization |
| Image preprocessing | Dynamic HD tiling (deterministic grid, up to 8 crops at 448×448). PIL + NumPy only; zero PyTorch dependency at inference |
| Quantization | 4-bit with group_size=64. Applied to backbone linear layers only; SigLIP encoder weights remain in FP16 |
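The vision rows in the table imply some concrete tensor shapes. The NumPy sketch below traces them for a single 448×448 crop. It assumes a SigLIP patch size of 14 (not stated in this card), a row-major order for the 2×2 concatenation, a tanh-approximated GELU, and random stand-in weights without biases. The names `merge_2x2` and `gelu` are illustrative and are not part of mlx-vlm.

```python
import numpy as np

CROP = 448          # crop side in pixels (from the table)
PATCH = 14          # assumption: SigLIP-SO400M patch size, not stated in this card
VISION_DIM = 1152   # SigLIP hidden_size (from the table)
TEXT_DIM = 3072     # backbone hidden_size (from the table)


def merge_2x2(features):
    """Concatenate each 2x2 block of patch features along the channel axis."""
    h, w, c = features.shape
    x = features.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (h/2, w/2, 2, 2, c)
    return x.reshape(h // 2, w // 2, 4 * c)


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
side = CROP // PATCH                             # patches per side
patches = rng.standard_normal((side, side, VISION_DIM), dtype=np.float32)

merged = merge_2x2(patches)                      # one 4608-wide vector per 2x2 block
tokens = merged.reshape(-1, merged.shape[-1])    # flatten the grid into a token sequence

# Random stand-ins for Linear(4608->3072) and Linear(3072->3072); biases omitted.
w1 = rng.standard_normal((4 * VISION_DIM, TEXT_DIM), dtype=np.float32) * 0.02
w2 = rng.standard_normal((TEXT_DIM, TEXT_DIM), dtype=np.float32) * 0.02
projected = gelu(tokens @ w1) @ w2

print(patches.shape, merged.shape, tokens.shape, projected.shape)
```

Under the patch-size assumption, each crop contributes 256 image tokens to the backbone, so an image tiled into the maximum of 8 crops would contribute on the order of 2,000 tokens before any global view or separator tokens are counted.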
Weight provenance
Weights are converted from microsoft/Phi-4-multimodal-instruct using a deterministic pipeline:
- Download source checkpoint (PyTorch safetensors)
- Fuse vision LoRA adapters into backbone weights (eliminates runtime adapter overhead)
- Remap weight keys to MLX naming conventions
- Transpose LoRA matrices (PEFT → MLX format)
- Quantize backbone to 4-bit (SigLIP excluded)
- Serialize as MLX safetensors
The conversion and quantization pipeline is deterministic and fully reproducible from the base model.
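The quantization step in the list above uses 4-bit codes with group_size=64. The NumPy sketch below shows a generic affine group quantizer of that shape: each run of 64 weights shares one scale and one offset. It is an illustration of the idea, not the MLX kernel or the conversion script, and the function names and the 16-bit storage of scale and offset are assumptions.

```python
import numpy as np

BITS = 4
GROUP_SIZE = 64


def quantize_groups(w, bits=BITS, group_size=GROUP_SIZE):
    """Affine-quantize each run of `group_size` weights along the last axis."""
    levels = 2**bits - 1                                   # highest code value (15)
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)               # guard constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    # Assumption: scale and offset are stored in 16-bit floats, one pair per group.
    return codes, scale.astype(np.float16), w_min.astype(np.float16)


def dequantize_groups(codes, scale, offset, shape):
    """Reconstruct approximate weights as scale * code + offset."""
    w_hat = codes.astype(np.float32) * scale.astype(np.float32) + offset.astype(np.float32)
    return w_hat.reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 3072), dtype=np.float32) * 0.02   # toy weight slice
codes, scale, offset = quantize_groups(w)
w_hat = dequantize_groups(codes, scale, offset, w.shape)

# 4-bit code per weight plus a 16-bit scale and 16-bit offset shared by 64 weights.
bits_per_weight = BITS + 2 * 16 / GROUP_SIZE

print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
print("effective bits per weight:", bits_per_weight)
```

Smaller groups track local weight ranges more closely at the cost of more scale and offset storage. Under the stated storage assumption, group_size=64 costs 4.5 bits per quantized weight.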
Intended Use
This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include:
- Document understanding and extraction (invoices, forms, reports)
- Chart and diagram interpretation
- Visual question answering
- OCR and text recognition in images
- Educational content analysis
Out of scope
- Audio processing (Phase 2, not included in this release)
- Production deployment without application-level safety filtering
- Use cases requiring guaranteed factual accuracy without human verification
Limitations
- 100-sample evaluations. Benchmark scores are computed on subsets, not full datasets. Expect variance relative to full-dataset evaluations.
- Vision-only. This is a Phase 1 release covering the vision modality. Audio support from the original Phi-4-multimodal architecture is not included.
- No runtime LoRA switching. Vision LoRA adapters are pre-fused; the model cannot dynamically swap adapters.
- Apple Silicon required. MLX is designed for Apple's unified memory architecture (M1/M2/M3/M4). This model will not run on CUDA or CPU-only systems.
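The no-runtime-switching limitation follows from what fusion does to the weights. The NumPy sketch below assumes the standard LoRA update W' = W + (alpha / rank) · B·A with PEFT-style shapes, using the rank and alpha from the architecture table. The helper `fuse_lora` and the toy layer sizes are hypothetical and are not taken from the conversion pipeline.

```python
import numpy as np

RANK, ALPHA = 256, 512      # from the architecture table
SCALE = ALPHA / RANK        # standard LoRA scaling factor


def fuse_lora(w, a, b, scale=SCALE):
    """Return W + scale * (B @ A) for A of shape (rank, in) and B of shape (out, rank)."""
    return w + scale * (b @ a)


rng = np.random.default_rng(0)
d_in, d_out = 512, 384      # toy sizes, far smaller than the real layers
w = rng.standard_normal((d_out, d_in), dtype=np.float32)
a = rng.standard_normal((RANK, d_in), dtype=np.float32) * 0.01
b = rng.standard_normal((d_out, RANK), dtype=np.float32) * 0.01
x = rng.standard_normal((4, d_in), dtype=np.float32)

# Runtime adapter: base projection plus a separate low-rank path.
y_adapter = x @ w.T + SCALE * ((x @ a.T) @ b.T)

# Fused: one matmul against the merged weight, with no adapter path left.
w_fused = fuse_lora(w, a, b)
y_fused = x @ w_fused.T

print("outputs match:", np.allclose(y_adapter, y_fused, atol=1e-3))
print("stored matrices after fusion: 1 (was 3)")
```

After fusion only the merged matrix is kept and then quantized, so the base weight and the adapter can no longer be separated. That removes the extra matmuls at inference time and is also why a different adapter cannot be swapped in.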
Citation
If you use this model in your work, please cite:
```bibtex
@misc{feroxai2026phi4mlx,
  title={Phi-4-Multimodal-Instruct MLX Conversion},
  author={Ferox AI},
  year={2026},
  url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-4bit},
  note={4-bit quantized MLX port of microsoft/Phi-4-multimodal-instruct}
}
```
Acknowledgments
- Microsoft Research for the Phi-4-multimodal-instruct model and technical report
- Apple MLX team for the MLX framework
- Prince Canuma for mlx-vlm