Has anyone got MLX Swift to work with mlx-community/gpt-oss-20b-MXFP4-Q4 or mlx-community/gpt-oss-120b-MXFP4-Q4?
If I do this:
% mlx_lm.generate --model mlx-community/gpt-oss-20b-MXFP4-Q4 --prompt "hello"
==========
<|channel|>analysis<|message|>We need to respond to user greeting. Provide friendly response.<|end|><|start|>assistant<|channel|>final<|message|>Hello! How can I help you today?
==========
Prompt: 68 tokens, 136.969 tokens-per-sec
Generation: 31 tokens, 103.760 tokens-per-sec
Peak memory: 11.297 GB
The model works perfectly. But if I try to load and run the same model from Swift, I get the following error:
Failed to parse config.json for model 'mlx-community/gpt-oss-20b-MXFP4-Q4': Type mismatch at 'quantization.mode'
I'm using the Example code found here: https://github.com/ml-explore/mlx-swift-examples
Other models load, but not gpt-oss!
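For reference, this is roughly the loading path I'm using (a trimmed-down sketch of the example app's code; the LLMModelFactory/loadContainer names are from the current mlx-swift-examples API):

import MLXLLM
import MLXLMCommon

// Loading this configuration is where the config.json parse error appears;
// other model ids load fine with the same code.
let configuration = ModelConfiguration(id: "mlx-community/gpt-oss-20b-MXFP4-Q4")
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)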
Here's my config:
% system_profiler SPSoftwareDataType SPHardwareDataType | grep -E "Model Name:|Model Identifier:|Processor Name:|Chip:|Processor Speed:|Number of Processors:|Total Number of Cores:|Memory:|System Version:"
System Version: macOS 26.0 (25A353)
Secure Virtual Memory: Enabled
Model Name: MacBook Pro
Model Identifier: Mac14,6
Chip: Apple M2 Max
Total Number of Cores: 12 (8 performance and 4 efficiency)
Memory: 64 GB
There's an issue logged about it here: https://github.com/ml-explore/mlx-swift-examples/issues/386
Alas, no response yet.
Claude research suggests the following:
GPT-OSS-20B works with MLX Swift, but requires conversion
GPT-OSS-20B is compatible with MLX Swift as of August 2025, but cannot run in its native MXFP4 format. Pre-converted models that work with the mlx-swift-examples repository are available from the MLX community. However, the model requires a format conversion that introduces performance trade-offs, and some features, such as fine-tuning, have unresolved issues.
Current support status for GPT-OSS-20B
GPT-OSS-20B gained official support in MLX Swift Examples in version 2.25.6 (August 2025) through pull request #371. The model runs successfully on Apple Silicon devices with converted weights, achieving 5-40 tokens/second depending on quantization and hardware.
Available pre-converted models
The MLX community provides several ready-to-use conversions:
- mlx-community/gpt-oss-20b-mlx-q8 - 8-bit quantized version (~12GB)
- InferenceIllusionist/gpt-oss-20b-MLX-4bit - 4-bit quantized version (~11GB)
- lmstudio-community/gpt-oss-20b-MLX-8bit - Alternative 8-bit implementation
These models work directly with MLX Swift using the standard MLXLLM library from the mlx-swift-examples repository. Implementation requires MLX-LM v0.26.3 or later, and the model uses OpenAI's harmony response format, which must be configured correctly.
Why native MXFP4 format doesn't work
The fundamental incompatibility stems from GPT-OSS-20B's use of MXFP4 (4.25-bit) quantization, a format specifically designed for NVIDIA hardware that MLX doesn't natively support.
Technical limitations blocking native support
MXFP4 is a custom floating-point format with 32-element block quantization using E8M0 scales and E2M1 data encoding. MLX's quantization system uses integer-based schemes with different block structures, making direct MXFP4 inference impossible. GitHub Issue #367 confirms MLX has no plans for native MXFP4 support.
The conversion process introduces overhead: models must be upcast from MXFP4 to BF16 (increasing size from 12.8GB to ~42GB), then re-quantized to MLX-compatible formats. This results in a 64% larger memory footprint than the original and 20-40% slower inference compared to native MXFP4 performance.
Unresolved fine-tuning bug: Users report consistent NaN values during LoRA fine-tuning (Issue #361), making the model unsuitable for on-device adaptation despite working inference.
Alternative approaches for running GPT-OSS-20B
Direct MLX implementation (recommended)
Use the pre-converted MLX models with your existing mlx-swift-examples code:
import MLXLMCommon  // ModelConfiguration lives here in recent mlx-swift-examples releases

// Point the loader at a pre-converted 8-bit MLX build of gpt-oss-20b
let modelConfig = ModelConfiguration(
    id: "mlx-community/gpt-oss-20b-mlx-q8",
    overrideTokenizer: "PreTrainedTokenizer"
)
This provides 30-40 tokens/second on M3 Max with 32GB+ RAM, though iPhone deployment remains challenging due to the 16GB minimum memory requirement.
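For completeness, here is a rough sketch of running generation with that configuration, following the pattern in the LLMEval example; exact signatures and parameter names may differ between versions of mlx-swift-examples:

import MLXLLM
import MLXLMCommon

// Download/load the model, then run a short generation inside the container.
let container = try await LLMModelFactory.shared.loadContainer(configuration: modelConfig)
let result = try await container.perform { context in
    let input = try await context.processor.prepare(input: .init(prompt: "hello"))
    return try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.6),
        context: context
    ) { tokens in
        tokens.count < 256 ? .more : .stop  // cap generation at ~256 tokens
    }
}
print(result.output)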
Core ML conversion
Converting GPT-OSS-20B to Core ML format using coremltools enables hardware-accelerated inference through Apple's official framework. However, the 20B parameter count pushes against Core ML's practical limits, and no pre-converted versions exist yet.
Hybrid cloud-edge deployment
Run a smaller MLX model locally for latency-sensitive tasks while keeping GPT-OSS-20B on a server. This balances performance with the model's full capabilities, especially useful for iOS apps where memory constraints make local deployment impractical.
MLX Swift's current model ecosystem
MLX Swift supports an extensive range of architectures that work without conversion issues, offering strong alternatives to GPT-OSS-20B.
Models with comparable capabilities (7B-20B range)
Qwen2.5-14B-Instruct provides similar reasoning capabilities with better MLX optimization. InternLM2-20B matches the parameter count with native MLX support and 128k context length. Mistral-7B-Instruct-v0.3 offers excellent performance-per-token efficiency for resource-constrained deployments.
The framework natively supports the Llama, Phi, Gemma, and Starcoder families, all available as pre-quantized 4-bit and 8-bit models from mlx-community. These models bypass MXFP4 conversion issues entirely while maintaining strong performance on Apple Silicon.
Specialized alternatives
For code generation, Codestral-22B outperforms GPT-OSS-20B while running efficiently in MLX. Mathematical reasoning tasks benefit from Mathstral-7B, which provides focused capabilities in a smaller footprint. Multi-modal applications can leverage Pixtral-12B with native image understanding.
Practical recommendations
For immediate deployment: Use mlx-community/gpt-oss-20b-mlx-q8 with your existing mlx-swift-examples setup. Ensure your Mac has 24GB+ RAM for smooth operation. Set MLX GPU cache to 20GB for optimal performance.
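In MLX Swift the cache limit can be set programmatically; a minimal sketch (the 20GB value follows the recommendation above, and the limit is given in bytes):

import MLX

// Raise the GPU buffer cache limit to 20 GB (value is in bytes).
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024 * 1024)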
For iOS apps: GPT-OSS-20B remains impractical due to memory constraints. Deploy Qwen2.5-7B-4bit or Phi-3.5-mini-4bit locally, using API calls to GPT-OSS-20B when needed.
For production systems: Consider alternatives like Qwen2.5-14B or InternLM2-20B that offer similar capabilities without conversion overhead. These models provide 2-3x faster inference with native MLX optimization.
The community has created working solutions for GPT-OSS-20B on MLX Swift, but the conversion requirements and performance penalties make native MLX models more practical for most use cases. The model works, but choosing an alternative designed for MLX will provide better performance and fewer complications.
I got it to work! The key was moving to lmstudio-community/gpt-oss-20b-MLX-8bit
It's slow (14 tokens/sec) and takes more storage space and RAM, but that's good enough for now. Maybe MLX will support gpt-oss in its native 4-bit quant in the future.
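For anyone hitting the same error: the only change needed in my case was the model id (same loading code as above):

// Swapping in the lmstudio-community 8-bit conversion is what fixed the parse error for me
let configuration = ModelConfiguration(id: "lmstudio-community/gpt-oss-20b-MLX-8bit")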
Also a big thanks to @ibrahimcetin for https://github.com/ibrahimcetin/MLXSampleApp/tree/main which helped me a lot.