Phi-4-Multimodal-Instruct — MLX 4-bit

A 4-bit quantized Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.

Converted by Ferox AI · Vision-language inference on MacBook / Mac Studio / Mac Pro without cloud dependencies.


Parameters	5.6B (pre-LoRA-fusion)
Quantization	4-bit, group_size=64 (backbone only; SigLIP encoder remains FP16)
Disk size	~3.9 GB
Base model	microsoft/Phi-4-multimodal-instruct
License	MIT
Modality	Vision + Text (Phase 1; audio deferred)

Other variants: bf16 (unquantized) · 8-bit

Quickstart

from mlx_vlm import load, generate

model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-4bit")

output = generate(
    model,
    processor,
    "Describe this image in detail.",
    ["path/to/image.jpg"],
    max_tokens=512,
    verbose=False,
)
print(output)

Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:

pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"

Benchmark Results

Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16 — direct comparison should account for both the precision difference and sample-size variance.

Benchmark	This Model (4-bit)	bf16	Microsoft FP16 (full dataset)	Metric
AI2D	83.0	90.0	82.3	Accuracy
ChartQA	86.0	85.0	81.4	Relaxed Accuracy
DocVQA	82.8	86.2	93.2	ANLS
MathVista	58.0	58.0	62.4	Accuracy
MMMU	24.0	31.0	55.1	Accuracy
OCRBench	840	840	844	Score / 1000
ScienceQA	95.8†	100.0†	97.5	Accuracy
TextVQA	80.0	82.0	75.6	Accuracy

† ScienceQA: 48 of 100 samples scored (image-bearing questions only; 52 text-only questions excluded).
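
To give a feel for the sample-size variance mentioned above, the sketch below computes a normal-approximation 95% interval for the accuracy-type rows. It is an illustration, not part of the evaluation harness: the function name `binomial_interval` is hypothetical, each sample is assumed to be an independent right/wrong outcome, and DocVQA (ANLS) and OCRBench (a summed score) are left out because they are not simple per-sample accuracies.

```python
import math

def binomial_interval(accuracy_pct, n, z=1.96):
    """Normal-approximation interval (in percentage points) for an accuracy on n samples."""
    p = accuracy_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n)
    half_width = z * standard_error * 100.0
    # Clamp to the valid range; the approximation is rough near 0% and 100%.
    return max(0.0, accuracy_pct - half_width), min(100.0, accuracy_pct + half_width)

# 4-bit scores copied from the table; n is the number of scored samples.
subset_scores = {
    "AI2D": (83.0, 100),
    "ChartQA": (86.0, 100),
    "MathVista": (58.0, 100),
    "MMMU": (24.0, 100),
    "ScienceQA": (95.8, 48),
    "TextVQA": (80.0, 100),
}

for name, (score, n) in subset_scores.items():
    low, high = binomial_interval(score, n)
    print(f"{name:10s} {score:5.1f}  95% interval ~ [{low:5.1f}, {high:5.1f}]")
```

At n=100 the half-width is several percentage points for most rows, so differences of that size between columns should not be over-interpreted.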

Quantization impact

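Averaged over all eight benchmarks (with OCRBench rescaled to a percentage), the 4-bit model scores 2.8 percentage points below bf16; excluding MMMU, whose scores are discussed below, the mean delta is −2.2 points. Both figures are within the expected range for 4-bit group quantization on a model of this scale.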
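
The averages can be recomputed directly from the benchmark table. This is a minimal sketch: the dictionary copies the table's 4-bit and bf16 columns, and rescaling OCRBench from a score out of 1000 to a percentage is an assumption about how the rows are combined. On these inputs the mean over all eight rows comes out near −2.8 points, and near −2.2 when MMMU is left out.

```python
# (4-bit score, bf16 score) per benchmark, copied from the table.
# OCRBench is rescaled from /1000 to a percentage (assumption).
scores = {
    "AI2D": (83.0, 90.0),
    "ChartQA": (86.0, 85.0),
    "DocVQA": (82.8, 86.2),
    "MathVista": (58.0, 58.0),
    "MMMU": (24.0, 31.0),
    "OCRBench": (84.0, 84.0),
    "ScienceQA": (95.8, 100.0),
    "TextVQA": (80.0, 82.0),
}

def mean_delta(table, exclude=()):
    """Mean of (4-bit minus bf16) over the benchmarks not listed in `exclude`."""
    deltas = [q4 - bf16 for name, (q4, bf16) in table.items() if name not in exclude]
    return sum(deltas) / len(deltas)

print(f"all benchmarks: {mean_delta(scores):+.1f}")
print(f"excluding MMMU: {mean_delta(scores, exclude={'MMMU'}):+.1f}")
```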

Note on MMMU

The 100-sample MMMU scores (24.0% 4-bit, 31.0% bf16) fall well below Microsoft's reported 55.1%. To isolate the cause, we ran the full 900-sample MMMU validation set on the lossless bf16 variant and obtained 27.9%. That result is consistent with the subset, which indicates that the gap is not caused by quantization, weight conversion, or subset sampling. We were unable to reproduce Microsoft's 55.1% and attribute the difference to how the evaluation harness handles MMMU's multiple-choice format (prompt formatting, option parsing, and answer extraction) rather than to the model's underlying capability, which is better reflected by the document-, chart-, OCR-, and science-focused benchmarks above.
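
To illustrate why answer extraction matters for a multiple-choice benchmark, here is a small sketch of an option parser. It is not the parser used by our harness or by Microsoft; the function name `extract_choice` and both patterns are hypothetical. The point is that the same free-form response can be scored as correct, wrong, or unparseable depending on rules like these.

```python
import re
from typing import Optional

def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the option letter a response commits to, or None if none is found."""
    text = response.strip()

    # Pattern 1: explicit phrases such as "Answer: B" or "the answer is (B)".
    # The keyword is case-insensitive; the option letter must be upper case.
    explicit = re.search(
        rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{options}])\)?(?![A-Za-z])", text
    )
    if explicit:
        return explicit.group(1)

    # Pattern 2: a response that opens with "(B)", "B.", "B:", "B)" or is just "B".
    # A bare letter followed by a word ("A cat ...") is deliberately rejected.
    leading = re.match(rf"\(([{options}])\)|([{options}])(?:[.:)]|\s*$)", text)
    if leading:
        return leading.group(1) or leading.group(2)

    return None

for reply in ["The answer is (B).", "Answer: C", "(D) because the bars are taller",
              "A cat sits on the mat", "Based on the figure, it is unclear."]:
    print(repr(reply), "->", extract_choice(reply))
```

A stricter or looser second pattern changes how many responses count as unanswered, which is one way two harnesses can report different accuracies for the same model outputs.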

Architecture

Component	Details
Backbone	Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE positional encoding (131K context)
Vision encoder	SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152
Vision projection	2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072). Input is a 2×2 spatial merge of SigLIP patch features
Vision LoRA	rank=256, alpha=512 (~370M parameters) — pre-fused into backbone weights before quantization
Image preprocessing	Dynamic HD tiling (deterministic grid, up to 8 crops at 448×448). PIL + NumPy only; zero PyTorch dependency at inference
Quantization	4-bit with group_size=64. Applied to backbone linear layers only; SigLIP encoder weights remain in FP16
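
The dimensions in the table can be cross-checked with a few lines of NumPy. The sketch below assumes the 2×2 spatial merge concatenates the four neighbouring patch features along the channel axis (which is what yields 4 × 1152 = 4608 inputs); the ordering of the four patches, the 16×16 patch grid, the random weights, the omitted biases, and the tanh form of GELU are all illustrative choices, not details taken from the model.

```python
import numpy as np

SIGLIP_DIM = 1152   # vision encoder hidden_size (from the table)
LLM_DIM = 3072      # backbone hidden_size (from the table)

# Head counts times head_dim should reproduce the hidden sizes.
assert 16 * 72 == SIGLIP_DIM
assert 24 * 128 == LLM_DIM

def merge_2x2(patches):
    """Concatenate each 2x2 block of patch features along the channel axis.

    patches: (H, W, C) with even H and W  ->  (H/2 * W/2, 4 * C)
    """
    h, w, c = patches.shape
    blocks = patches.reshape(h // 2, 2, w // 2, 2, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)          # (H/2, W/2, 2, 2, C)
    return blocks.reshape((h // 2) * (w // 2), 4 * c)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
grid = rng.standard_normal((16, 16, SIGLIP_DIM)).astype(np.float32)   # toy patch grid
w1 = rng.standard_normal((4 * SIGLIP_DIM, LLM_DIM)).astype(np.float32) * 0.02
w2 = rng.standard_normal((LLM_DIM, LLM_DIM)).astype(np.float32) * 0.02

merged = merge_2x2(grid)            # one row per 2x2 block, 4608 features each
tokens = gelu(merged @ w1) @ w2     # Linear -> GELU -> Linear, biases omitted
print(merged.shape, tokens.shape)
```

The merge cuts the number of vision tokens by a factor of four while widening each one, so the first projection layer is the only place the 4608-wide features appear.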

Weight provenance

Weights are converted from microsoft/Phi-4-multimodal-instruct using a deterministic pipeline:

Download source checkpoint (PyTorch safetensors)
Fuse vision LoRA adapters into backbone weights (eliminates runtime adapter overhead)
Remap weight keys to MLX naming conventions
Transpose LoRA matrices (PEFT → MLX format)
Quantize backbone to 4-bit (SigLIP excluded)
Serialize as MLX safetensors

The conversion and quantization pipeline is deterministic and fully reproducible from the base model.
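
The fuse and quantize steps above can be sketched on a toy layer. This is an illustration under stated assumptions, not the conversion script: the function names are hypothetical, the LoRA matrices are assumed to be in PEFT layout (A is rank × in, B is out × rank), and the quantizer is a generic per-group affine scheme (minimum plus scale per group of 64 weights) that may differ in detail from MLX's own implementation. Only rank, alpha, bit width, and group size come from this card.

```python
import numpy as np

RANK, ALPHA = 256, 512        # vision LoRA hyperparameters from the architecture table
BITS, GROUP_SIZE = 4, 64      # quantization settings from this card

def fuse_lora(w, lora_a, lora_b, rank=RANK, alpha=ALPHA):
    """W' = W + (alpha / rank) * B @ A, with A: (rank, in) and B: (out, rank)."""
    return w + (alpha / rank) * (lora_b @ lora_a)

def quantize_groups(w, bits=BITS, group_size=GROUP_SIZE):
    """Affine-quantize each run of `group_size` weights along the input axis."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)          # guard against constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, 2**bits - 1).astype(np.uint8)
    return codes, scale, w_min

def dequantize_groups(codes, scale, w_min):
    """Map integer codes back to approximate weights of shape (out, in)."""
    groups = codes * scale + w_min
    return groups.reshape(groups.shape[0], -1)

rng = np.random.default_rng(0)
out_dim, in_dim = 384, 512                             # toy layer size
w = rng.standard_normal((out_dim, in_dim)).astype(np.float32) * 0.02
lora_a = rng.standard_normal((RANK, in_dim)).astype(np.float32) * 0.01
lora_b = rng.standard_normal((out_dim, RANK)).astype(np.float32) * 0.01

w_fused = fuse_lora(w, lora_a, lora_b)
codes, scale, w_min = quantize_groups(w_fused)
w_hat = dequantize_groups(codes, scale, w_min)

# The fused layer should match the base layer plus the adapter path.
x = rng.standard_normal(in_dim).astype(np.float32)
adapter_out = w @ x + (ALPHA / RANK) * (lora_b @ (lora_a @ x))
print("fusion mismatch:       ", np.abs(w_fused @ x - adapter_out).max())
print("max quantization error:", np.abs(w_hat - w_fused).max())
```

Fusing before quantizing means the adapter's contribution is rounded together with the base weights, which is why the adapters cannot be swapped out afterwards.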

Intended Use

This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include:

Document understanding and extraction (invoices, forms, reports)
Chart and diagram interpretation
Visual question answering
OCR and text recognition in images
Educational content analysis

Out of scope

Audio processing (Phase 2, not included in this release)
Production deployment without application-level safety filtering
Use cases requiring guaranteed factual accuracy without human verification

Limitations

100-sample evaluations. Benchmark scores are computed on subsets, not full datasets. Expect variance relative to full-dataset evaluations.
Vision-only. This is a Phase 1 release covering the vision modality. Audio support from the original Phi-4-multimodal architecture is not included.
No runtime LoRA switching. Vision LoRA adapters are pre-fused; the model cannot dynamically swap adapters.
Apple Silicon required. MLX is designed for Apple's unified memory architecture (M1/M2/M3/M4). This model will not run on CUDA or CPU-only systems.
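
A script can check for a suitable host before attempting to load the weights. The helper below is a hypothetical convenience, not part of mlx-vlm; it uses only the standard library and assumes that a native arm64 macOS Python is what MLX needs. A Python interpreter running under Rosetta reports x86_64 and would fail this check even on Apple Silicon hardware.

```python
import platform

def is_apple_silicon() -> bool:
    """True when the interpreter runs natively on an arm64 macOS machine."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"

if is_apple_silicon():
    print("arm64 macOS detected; MLX models should be loadable.")
else:
    print("Not a native arm64 macOS interpreter; this model is unlikely to load here.")
```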

Citation

If you use this model in your work, please cite:

@misc{feroxai2026phi4mlx,
  title={Phi-4-Multimodal-Instruct MLX Conversion},
  author={Ferox AI},
  year={2026},
  url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-4bit},
  note={4-bit quantized MLX port of microsoft/Phi-4-multimodal-instruct}
}

Acknowledgments

Microsoft Research for the Phi-4-multimodal-instruct model and technical report
Apple MLX team for the MLX framework
Prince Canuma for mlx-vlm

Evaluation results (self-reported)

Benchmark	Split	Metric	Score
AI2D	test	Accuracy (n=100)	83.0
ChartQA	test	Relaxed Accuracy (n=100)	86.0
DocVQA	validation	ANLS (n=100)	82.8
TextVQA	validation	Accuracy (n=100)	80.0
OCRBench	test	Score / 1000 (n=100)	840
ScienceQA	test	Accuracy (n=48, image-only)	95.8
MathVista	not stated	Accuracy (n=100)	58.0