Phi-4-Multimodal-Instruct — MLX 8-bit

An 8-bit quantized Apple MLX conversion of microsoft/Phi-4-multimodal-instruct for native inference on Apple Silicon.

Converted by Ferox AI · Vision-language inference on MacBook / Mac Studio / Mac Pro without cloud dependencies.


Parameters	5.6B (pre-LoRA-fusion)
Quantization	8-bit, group_size=64 (backbone only; SigLIP encoder remains FP16)
Disk size	~5.5 GB
Base model	microsoft/Phi-4-multimodal-instruct
License	MIT
Modality	Vision + Text (Phase 1; audio deferred)

This variant offers a balance between the 4-bit model's memory efficiency and the bf16 model's full precision. Recommended for systems with 16 GB+ unified memory where accuracy is prioritized over memory footprint.

Other variants: 4-bit (smallest) · bf16 (full precision)

Quickstart

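from mlx_vlm import load, generate

# Load the 8-bit weights and the matching processor
# (fetched from the Hugging Face Hub if not already cached).
model, processor = load("ferox-ai/Phi-4-multimodal-instruct-mlx-8bit")

# Ask a question about one local image; replace the path with your own file.
output = generate(
    model,
    processor,
    "Describe this image in detail.",
    ["path/to/image.jpg"],
    max_tokens=512,
    verbose=False,
)
print(output)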

Requires mlx-vlm >= 0.1.0 with Phi-4-MM architecture support. Install dependencies:

pip install "mlx-vlm>=0.1.0" "mlx>=0.22.0"
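To confirm that an existing environment already satisfies these minimums, a check along the following lines may help. It is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helper names, and the simplified parser only handles plain dotted versions, not every version string pip accepts.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.22.1' into (0, 22, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad to three components so '0.22' compares equal to '0.22.0'.
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(package, minimum):
    """Return True or False, or None when the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions taken from the install line above.
for name, minimum in (("mlx-vlm", "0.1.0"), ("mlx", "0.22.0")):
    print(name, ">=", minimum, "->", meets_minimum(name, minimum))
```

A result of `None` means the package is missing, `False` means it is installed but older than the stated minimum.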

Benchmark Results

Evaluated with our internal evaluation harness on a single Apple Silicon device. Scores are computed on a 100-sample subset of each benchmark. Microsoft's reference scores are reported on the full dataset using PyTorch FP16 — direct comparison should account for both the precision difference and sample-size variance.

Benchmark	This Model (8-bit)	4-bit	bf16	Microsoft FP16 (full)	Metric
ChartQA	85.0	86.0	85.0	81.4	Relaxed Accuracy
DocVQA	86.1	82.8	86.2	93.2	ANLS
MMMU	29.0	24.0	31.0	55.1	Accuracy
OCRBench	850	840	840	844	Score / 1000
TextVQA	81.0	80.0	82.0	75.6	Accuracy
AI2D	—‡	83.0	90.0	82.3	Accuracy
MathVista	—‡	58.0	58.0	62.4	Accuracy
ScienceQA	—‡	95.8†	100.0†	97.5	Accuracy

† ScienceQA: scored on the 48 image-bearing questions of the 100-sample subset (text-only questions excluded).
‡ Not yet evaluated for the 8-bit variant. These three benchmarks were measured for the 4-bit and bf16 variants but have not been re-run at 8-bit; they will be added in a future update.
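To get a feel for the sample-size variance mentioned above, the sketch below computes a binomial standard error for a few table entries. It assumes each sample is scored as simply right or wrong, so it applies to the accuracy-style metrics but not to ANLS, which is a continuous per-sample score. The helper name `standard_error_points` is hypothetical and the calculation is not part of our evaluation harness.

```python
import math


def standard_error_points(score_pct, n):
    """Binomial standard error of an accuracy, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (benchmark, reported score, number of scored samples) from the table above.
rows = [
    ("ChartQA (8-bit)", 85.0, 100),
    ("MMMU (8-bit)", 29.0, 100),
    ("ScienceQA (4-bit)", 95.8, 48),
]

for name, score, n in rows:
    se = standard_error_points(score, n)
    # 1.96 * SE is the half-width of a 95% interval under a normal approximation.
    print(f"{name}: {score:.1f} +/- {1.96 * se:.1f} points (n={n})")
```

With only 100 samples the uncertainty is several points wide, which is worth keeping in mind when comparing variants that differ by one or two points.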

Quantization fidelity

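On the five benchmarks measured at 8-bit, scores stay within 2 percentage points of the unquantized bf16 variant (ChartQA 85.0 vs 85.0, DocVQA 86.1 vs 86.2, TextVQA 81.0 vs 82.0, MMMU 29.0 vs 31.0, OCRBench 850 vs 840 out of 1000). On a 100-sample subset these gaps amount to at most about two samples, so they cannot be distinguished from sampling noise. This suggests that 8-bit group quantization costs little or no accuracy for this model on these tasks.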

Note on MMMU

The 100-sample MMMU scores (29.0% 8-bit, 24.0% 4-bit, 31.0% bf16) fall well below Microsoft's reported 55.1%. To isolate the cause, we ran the full 900-sample MMMU validation set on the unquantized bf16 variant and obtained 27.9%, consistent with the subset. Because the unquantized model shows the same gap, quantization is not the cause. We were unable to reproduce Microsoft's 55.1% and attribute the difference to evaluation-harness and answer-extraction handling for MMMU's multiple-choice format (prompt formatting and option parsing) rather than to weight conversion or the model's underlying capability, which is better reflected by the document-, chart-, and OCR-focused benchmarks above.
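To illustrate how much answer extraction alone can move a multiple-choice score, here is a small sketch with two parsers applied to the same responses. It is not our evaluation harness and not Microsoft's; the function names, the regular expressions, and the example responses are all made up for illustration.

```python
import re


def extract_choice_strict(response, options="ABCD"):
    """Accept only a response that is exactly one option letter."""
    text = response.strip()
    return text if len(text) == 1 and text in options else None


def extract_choice_lenient(response, options="ABCD"):
    """Also accept '(B)', 'B. text', and 'The answer is B.' style replies."""
    # 1) Explicit phrasing such as "answer is B" or "Answer: (C)".
    explicit = re.search(
        rf"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([{options}])\)?(?![A-Za-z])",
        response,
    )
    if explicit:
        return explicit.group(1)
    # 2) A leading option letter followed by punctuation or end of text.
    #    A bare "A " is rejected so the article "A" is not read as option A.
    leading = re.match(rf"\s*\(?([{options}])\)?(?:[.:)]|\s*$)", response)
    return leading.group(1) if leading else None


def accuracy(extract, responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter."""
    return sum(extract(r) == gold for r in responses) / len(responses)


# Four hypothetical model replies that all choose option B.
responses = ["B", "(B)", "The answer is B.", "B. Paris"]
print("strict :", accuracy(extract_choice_strict, responses, "B"))
print("lenient:", accuracy(extract_choice_lenient, responses, "B"))
```

All four replies pick the same option, yet the strict parser credits only one of them. Differences of this kind between harnesses are one plausible source of the gap described above.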

Architecture

Component	Details
Backbone	Phi-4-Mini (3.8B) — 32 transformer layers, hidden_size=3072, 24 query heads / 8 KV heads (GQA), head_dim=128, LongRoPE (131K context)
Vision encoder	SigLIP-SO400M NaViT — 27 layers, 16 heads, head_dim=72, hidden_size=1152
Vision projection	2-layer MLP: Linear(4608→3072) → GELU → Linear(3072→3072)
Vision LoRA	rank=256, alpha=512 (~370M params) — pre-fused into backbone weights before quantization
Quantization	8-bit with group_size=64. Applied to backbone linear layers only; SigLIP remains FP16
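The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the table; two parts are assumptions rather than statements about the conversion pipeline: that the LoRA was merged with the standard scale of alpha / rank, and that each quantization group stores one 16-bit scale and one 16-bit bias alongside its 8-bit weights.

```python
# Values copied from the architecture table above.
backbone = {"hidden_size": 3072, "query_heads": 24, "kv_heads": 8, "head_dim": 128}
vision = {"hidden_size": 1152, "heads": 16, "head_dim": 72}
projection_in = 4608

# Head count times per-head width should reproduce the hidden size.
assert backbone["query_heads"] * backbone["head_dim"] == backbone["hidden_size"]
assert vision["heads"] * vision["head_dim"] == vision["hidden_size"]

# Grouped-query attention: number of query heads sharing one KV head.
queries_per_kv_head = backbone["query_heads"] // backbone["kv_heads"]

# The projection input is an integer multiple of the vision hidden size.
vision_features_per_input = projection_in // vision["hidden_size"]

# ASSUMPTION: standard LoRA merge, W_fused = W + (alpha / rank) * B @ A.
lora_scale = 512 / 256


def effective_bits(bits, group_size, overhead_bits=32):
    """Bits stored per weight once per-group metadata is included.

    ASSUMPTION: overhead_bits=32 models one 16-bit scale plus one 16-bit bias
    per group.
    """
    return bits + overhead_bits / group_size


print(queries_per_kv_head, vision_features_per_input, lora_scale, effective_bits(8, 64))
# prints: 3 4 2.0 8.5
```

Under these assumptions the quantized backbone layers cost about 8.5 bits per weight rather than exactly 8.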

Variant comparison

Variant	Disk Size	Memory (approx)	Best For
4-bit	~3.9 GB	~5 GB	8 GB devices, memory-constrained workflows
8-bit (this)	~5.5 GB	~7 GB	16 GB devices, balanced accuracy/memory
bf16	~8.5 GB	~10 GB	24+ GB devices, maximum accuracy
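The table can be turned into a simple selection rule. The helper below is a hypothetical sketch, not part of any library: `pick_variant` is an illustrative name, and the thresholds are read directly from the "Best For" column.

```python
def pick_variant(unified_memory_gb):
    """Suggest a variant label for a given amount of unified memory.

    Thresholds follow the variant comparison table: 24+ GB -> bf16,
    16 GB -> 8-bit, 8 GB -> 4-bit. Returns None below 8 GB, which the
    table does not cover.
    """
    if unified_memory_gb >= 24:
        return "bf16"
    if unified_memory_gb >= 16:
        return "8-bit"
    if unified_memory_gb >= 8:
        return "4-bit"
    return None


for memory_gb in (8, 16, 18, 32):
    print(f"{memory_gb} GB -> {pick_variant(memory_gb)}")
```

The approximate memory figures leave headroom on each tier for the operating system and other applications, so treat the result as a starting point rather than a hard limit.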

Intended Use

This model is designed for local, on-device vision-language inference on Apple Silicon hardware. Suitable applications include document understanding, chart interpretation, visual question answering, OCR, and educational content analysis.

Limitations

100-sample evaluations. Benchmark scores are computed on subsets, not full datasets.
Partial benchmark coverage at 8-bit. AI2D, MathVista, and ScienceQA have not yet been evaluated for this variant (see the benchmark table).
Vision and text only. Audio support from the original architecture is not included (Phase 1).
Apple Silicon required. MLX targets Apple's unified memory architecture (M1/M2/M3/M4).

Citation

@misc{feroxai2026phi4mlx,
  title={Phi-4-Multimodal-Instruct MLX Conversion},
  author={Ferox AI},
  year={2026},
  url={https://huggingface.co/ferox-ai/Phi-4-multimodal-instruct-mlx-8bit},
  note={8-bit quantized MLX port of microsoft/Phi-4-multimodal-instruct}
}

Acknowledgments

Microsoft Research for Phi-4-multimodal-instruct
Apple MLX team for the MLX framework
Prince Canuma for mlx-vlm

Evaluation results (self-reported, n=100)

Benchmark	Split	Metric	Score
ChartQA	test	Relaxed Accuracy	85.0
DocVQA	validation	ANLS	86.1
MMMU	validation	Accuracy	29.0
OCRBench	test	Score / 1000	850
TextVQA	validation	Accuracy	81.0