Phi-4-Multimodal-Instruct — MLX bf16

A full-precision (bfloat16) Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.

Converted by Ferox AI · Lossless weight conversion — maximum accuracy for systems with sufficient unified memory.


Parameters	5.6B (pre-LoRA-fusion)
Precision	bfloat16 (no quantization)
Disk size	~8.5 GB
Base model	microsoft/Phi-4-multimodal-instruct
License	MIT
Modality	Vision + Text (Phase 1; audio deferred)

This is the reference-quality variant with no quantization loss. Recommended for systems with 24 GB+ unified memory (M2 Pro/Max, M3 Pro/Max/Ultra, M4 Pro/Max/Ultra) or when maximum accuracy is required.

Quantized variants: 4-bit (~3.9 GB) · 8-bit (~5.5 GB)

Quickstart

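from mlx_vlm import load, generate

# Load the bf16 weights and the matching processor from the Hugging Face Hub
# (or from the local cache if already downloaded).
model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-bf16")

# Run a single-image prompt.
output = generate(
    model,
    processor,
    "Describe this image in detail.",  # text prompt
    ["path/to/image.jpg"],             # list of image paths
    max_tokens=512,
    verbose=False,
)
print(output)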

Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:

pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"

Benchmark Results

Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16.

Benchmark	This Model (bf16)	4-bit	Microsoft FP16 (full)	Δ vs Microsoft	Metric
AI2D	90.0	83.0	82.3	+7.7	Accuracy
ChartQA	85.0	86.0	81.4	+3.6	Relaxed Accuracy
DocVQA	86.2	82.8	93.2	−7.0	ANLS
MathVista	58.0	58.0	62.4	−4.4	Accuracy
MMMU	31.0	24.0	55.1	−24.1	Accuracy
OCRBench	840	840	844	−4	Score / 1000
ScienceQA	100.0†	95.8†	97.5	+2.5	Accuracy
TextVQA	82.0	80.0	75.6	+6.4	Accuracy

† ScienceQA: 48 of 100 samples scored (image-bearing questions only; 52 text-only questions excluded).

Conversion fidelity

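On 4 of 8 benchmarks (AI2D, ChartQA, ScienceQA, TextVQA), the bf16 MLX conversion exceeds Microsoft's PyTorch FP16 reference scores, and on OCRBench it is within 4 points out of 1000 (840 vs 844). This is consistent with a lossless weight conversion, although 100-sample subsets cannot confirm it on their own. The gaps on DocVQA (−7.0) and MathVista (−4.4) are within the range expected from sampling variance at this sample size. MMMU is discussed separately below.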
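
To put the 100-sample variance in rough numbers, the sketch below computes a binomial standard error for an accuracy-style score. It treats every sample as an independent pass/fail trial, which is an assumption (ANLS, for example, is not strictly binary), and the helper name `binomial_se_points` is illustrative rather than part of any evaluation harness.

```python
import math

def binomial_se_points(score_pct: float, n: int) -> float:
    """Standard error, in percentage points, of an accuracy measured on n samples.

    Assumes independent pass/fail samples (a simplification for ANLS-style metrics).
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# (bf16 score, delta vs Microsoft) copied from the benchmark table, n = 100.
rows = {
    "MathVista": (58.0, -4.4),
    "DocVQA": (86.2, -7.0),
    "MMMU": (31.0, -24.1),
}

for name, (score, delta) in rows.items():
    se = binomial_se_points(score, 100)
    ratio = abs(delta) / se
    print(f"{name}: SE ~ {se:.1f} pts, delta = {delta:+.1f} pts, |delta| / SE ~ {ratio:.1f}")
```

Under this approximation the MathVista gap is under one standard error and the DocVQA gap is about two, while the MMMU gap is about five, which is why MMMU is treated separately.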

Note on MMMU

The 100-sample MMMU score (31.0%) is well below Microsoft's reported 55.1%. Because this variant is full precision, the gap cannot be attributed to quantization loss. We re-evaluated on the full 900-sample MMMU validation split and obtained 27.9%, consistent with the subset. We were unable to reproduce Microsoft's 55.1% and attribute the difference to how our evaluation harness handles MMMU's multiple-choice format (prompt formatting, option parsing, and answer extraction), not to the model's underlying capability. We consider the document-, chart-, OCR-, and science-focused benchmarks above a better reflection of that capability.
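
As an illustration of how much answer extraction alone can move a multiple-choice score, the sketch below scores the same responses with a strict parser and a more lenient one. Both parsers and the sample responses are hypothetical. They are not the harness used for the scores above, nor Microsoft's.

```python
import re
from typing import Callable, Optional

def extract_strict(response: str) -> Optional[str]:
    """Accept only a response that is exactly one option letter, e.g. 'B'."""
    text = response.strip()
    return text if re.fullmatch(r"[A-D]", text) else None

# "(B) ..." at the start, or a bare letter followed by punctuation or end of text.
_LEADING_OPTION = re.compile(r"\(([A-D])\)|([A-D])(?:[.:)]|\s*$)")

def extract_lenient(response: str) -> Optional[str]:
    """Also accept phrasings such as '(B)', 'B. text', or 'The answer is B.'."""
    text = response.strip()
    # "answer is X" / "answer: X"; only the keyword part is case-insensitive.
    found = re.search(r"(?i:answer\s*(?:is|:)?)\s*\(?([A-D])\)?(?![A-Za-z])", text)
    if found:
        return found.group(1)
    lead = _LEADING_OPTION.match(text)
    return (lead.group(1) or lead.group(2)) if lead else None

def accuracy(extract: Callable[[str], Optional[str]], samples) -> float:
    """Percentage of samples whose extracted letter equals the gold letter."""
    hits = sum(extract(response) == gold for response, gold in samples)
    return 100.0 * hits / len(samples)

# Hypothetical (response, gold answer) pairs; every response is correct in substance.
samples = [
    ("B", "B"),
    ("(C)", "C"),
    ("The answer is A.", "A"),
    ("D. 42 meters", "D"),
]

print("strict :", accuracy(extract_strict, samples))
print("lenient:", accuracy(extract_lenient, samples))
```

On these four made-up responses the strict parser credits only the bare-letter answer, while the lenient parser credits all four, even though the underlying answers are identical.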

Architecture

Component	Details
Backbone	Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE (131K context)
Vision encoder	SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152
Vision projection	2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072). Input is a 2×2 spatial merge of SigLIP patch features
Vision LoRA	rank=256, alpha=512 (~370M params) — pre-fused into backbone weights
Image preprocessing	Dynamic HD tiling (deterministic grid, up to 8 crops at 448×448). PIL + NumPy only; zero PyTorch dependency at inference
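
The dimensions in the table can be cross-checked with a short NumPy sketch of a 2×2 spatial merge. The function name, the row-major patch layout, and the concatenation order within each 2×2 block are assumptions made for illustration. The converted model may order the merged features differently.

```python
import numpy as np

HIDDEN = 1152  # SigLIP hidden_size from the table

def merge_2x2(patches: np.ndarray) -> np.ndarray:
    """Concatenate each 2x2 block of patch features along the channel axis.

    patches: (H, W, C) grid of patch features with even H and W.
    returns: ((H // 2) * (W // 2), 4 * C) sequence of merged features.
    """
    h, w, c = patches.shape
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    blocks = patches.reshape(h // 2, 2, w // 2, 2, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # (H/2, W/2, 2, 2, C)
    return blocks.reshape((h // 2) * (w // 2), 4 * c)

grid = np.zeros((4, 4, HIDDEN), dtype=np.float32)  # toy 4x4 patch grid
merged = merge_2x2(grid)
print(merged.shape)  # (4, 4608)

# Consistency checks against the architecture table.
assert 16 * 72 == HIDDEN         # vision heads x head_dim == hidden_size
assert merged.shape[1] == 4608   # input width of Linear(4608 -> 3072)
assert 24 * 128 == 3072          # backbone query heads x head_dim == hidden_size
```

The merge reduces the number of visual tokens by a factor of four while widening each token from 1152 to 4608 features, which is the input width the projection MLP expects.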

Weight provenance

Weights are converted from microsoft/Phi-4-multimodal-instruct using a deterministic pipeline: download → fuse vision LoRA → remap keys → transpose LoRA matrices → serialize as MLX safetensors. No quantization is applied, and the conversion is fully reproducible from the base model.
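
The "fuse vision LoRA" step presumably corresponds to the standard LoRA merge, in which a low-rank update scaled by alpha / rank is added to the base weight. The NumPy sketch below uses toy dimensions and an (out, in) weight layout. Both are assumptions, and this is not the conversion script itself. With rank=256 and alpha=512 the scale would be 2.0, which the toy values reproduce.

```python
import numpy as np

def fuse_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * B @ A (standard LoRA merge).

    weight: (out, in)   base linear weight
    lora_a: (rank, in)  down-projection
    lora_b: (out, rank) up-projection
    """
    scale = alpha / rank
    return weight + scale * (lora_b @ lora_a)

rng = np.random.default_rng(0)
out_dim, in_dim, rank, alpha = 6, 8, 2, 4  # toy sizes; alpha / rank == 512 / 256
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((rank, in_dim))
B = rng.standard_normal((out_dim, rank))
x = rng.standard_normal(in_dim)

fused = fuse_lora(W, A, B, rank, alpha)

# The fused weight gives the same output as running base + adapter separately.
adapter_out = W @ x + (alpha / rank) * (B @ (A @ x))
assert np.allclose(fused @ x, adapter_out)
```

After fusion the adapter matrices are no longer needed at inference time, so the fused model has the same layer shapes as the base backbone.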

Variant comparison

Variant	Disk Size	Memory (approx)	Best For
4-bit	~3.9 GB	~5 GB	8 GB devices, memory-constrained workflows
8-bit	~5.5 GB	~7 GB	16 GB devices, balanced accuracy/memory
bf16 (this)	~8.5 GB	~10 GB	24+ GB devices, maximum accuracy
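
The table can be read as a lookup from device memory to variant. The helper below is a hypothetical convenience, not part of mlx-vlm. Its thresholds are simply the device sizes from the "Best For" column.

```python
def pick_variant(unified_memory_gb: float) -> str:
    """Map a device's unified memory to the variant suggested in the table."""
    if unified_memory_gb >= 24:
        return "bf16"
    if unified_memory_gb >= 16:
        return "8-bit"
    if unified_memory_gb >= 8:
        return "4-bit"
    raise ValueError("the table lists no variant for devices below 8 GB")

for memory_gb in (8, 16, 24, 64):
    print(f"{memory_gb} GB -> {pick_variant(memory_gb)}")
```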

Intended Use

This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include document understanding, chart interpretation, visual question answering, OCR, and educational content analysis.

Limitations

100-sample evaluations. Benchmark scores are computed on subsets, not full datasets.
Vision-only. Audio support from the original architecture is not included (Phase 1).
Memory requirements. Requires ~10 GB unified memory. Use the 4-bit or 8-bit variant for constrained devices.
Apple Silicon required. MLX targets Apple's unified memory architecture.

Citation

@misc{feroxai2026phi4mlx,
  title={Phi-4-Multimodal-Instruct MLX Conversion},
  author={Ferox AI},
  year={2026},
  url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-bf16},
  note={Full-precision (bf16) MLX port of microsoft/Phi-4-multimodal-instruct}
}

Acknowledgments

Microsoft Research for Phi-4-multimodal-instruct
Apple MLX team for the MLX framework
Prince Canuma for mlx-vlm


Evaluation results

Benchmark	Split	Samples	Metric	Score (self-reported)
AI2D	test	100	Accuracy	90.0
ChartQA	test	100	Relaxed Accuracy	85.0
DocVQA	validation	100	ANLS	86.2
TextVQA	validation	100	Accuracy	82.0
OCRBench	test	100	Score / 1000	840
ScienceQA	test	48 (image-only)	Accuracy	100.0
MathVista	(not stated)	100	Accuracy	58.0