Qwen2.5-7B-Instruct - MLX mxfp4 Quantized
- Repository: johnlockejrr/Qwen2.5-7B-Instruct-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
Model Summary
This repository contains an MLX-quantized version of Qwen2.5-7B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 14-15 GB (FP16) to approximately 5-6 GB while maintaining strong instruction-following performance.
This quantized model is ideal for:
- local assistants
- offline workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 8 GB, 16 GB, or 24 GB Apple Silicon machines
Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:
python3 quantize.py \
--model_name Qwen/Qwen2.5-7B-Instruct \
--save_model_path models/qwen2.5-7b-instruct-mxfp4 \
--q_mode mxfp4 \
--q_bits 4 \
--q_group_size 64
Resulting model size: approximately 5-6 GB
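The size reduction can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the storage needed for the weights alone; the parameter count (7.6 billion) and the 8-bit shared scale per group are assumptions, not values taken from this repository. Real footprints are larger than this lower bound, since some tensors may stay in higher precision and inference adds a KV cache and activations.

```python
def weight_bytes(num_params, bits_per_weight, group_size=0, scale_bits=0):
    """Approximate bytes needed to store the weights alone."""
    bits = bits_per_weight
    if group_size:
        # Each group of weights shares one scale of `scale_bits` bits.
        bits += scale_bits / group_size
    return num_params * bits / 8


NUM_PARAMS = 7.6e9  # assumed parameter count for Qwen2.5-7B-Instruct

fp16 = weight_bytes(NUM_PARAMS, 16)
# Assumption: 4-bit values plus one 8-bit scale per group of 64 weights.
q4 = weight_bytes(NUM_PARAMS, 4, group_size=64, scale_bits=8)

print(f"FP16 weights:  {fp16 / 1e9:.1f} GB")
print(f"4-bit weights: {q4 / 1e9:.1f} GB")
```

The ratio between the two estimates (roughly 4x) is the part that carries over to measured sizes; the absolute numbers depend on the assumptions above.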
Running the Model (MLX)
CLI (mlx-lm)
mlx_lm.generate \
--model johnlockejrr/Qwen2.5-7B-Instruct-mxfp4 \
--prompt "Write a poem about the Fibonacci numbers." \
--max-tokens 512
Python API
from mlx_lm import load, generate

# Download (on first use) and load the quantized weights and tokenizer
model, tokenizer = load("johnlockejrr/Qwen2.5-7B-Instruct-mxfp4")
prompt = "Explain recursion in simple terms."
output = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(output)
Chat Mode
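from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-7B-Instruct-mxfp4")
messages = [
    {"role": "user", "content": "What is a binary search tree?"}
]
# Format the conversation with the model's chat template, then generate
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt)
print(response)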
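For a multi-turn conversation, the full message history has to be passed through the chat template on every turn. The following is a minimal sketch of that bookkeeping, not part of mlx-lm: `chat_turn` and `make_mlx_reply_fn` are hypothetical helper names, and the mlx-lm import is deferred so the history logic can be exercised on any machine.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(history: List[Message], user_text: str,
              reply_fn: Callable[[List[Message]], str]) -> str:
    """Append the user message, obtain a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = reply_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def make_mlx_reply_fn(repo_id: str, max_tokens: int = 256):
    """Build a reply function backed by mlx-lm (imported lazily)."""
    from mlx_lm import load, generate  # requires mlx-lm on Apple Silicon

    model, tokenizer = load(repo_id)

    def reply_fn(history: List[Message]) -> str:
        # Re-render the whole conversation so the model sees earlier turns.
        prompt = tokenizer.apply_chat_template(
            history, add_generation_prompt=True
        )
        return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

    return reply_fn
```

On an Apple Silicon machine, one would create `reply_fn = make_mlx_reply_fn("johnlockejrr/Qwen2.5-7B-Instruct-mxfp4")`, start with `history = []`, and call `chat_turn(history, "...", reply_fn)` once per user message.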
Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 20-30 tokens/sec |
| Peak memory usage | approximately 5.3 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
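A generation-speed figure like the one above can be reproduced by timing a token stream. This is a generic sketch, not necessarily how the numbers in the table were obtained; `measure_tokens_per_second` is a hypothetical helper that works with any iterable yielding one item per generated token.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_tokens_per_second(
    stream: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Consume a token stream and return (token_count, tokens_per_second)."""
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    # Guard against a zero-length interval (e.g. an empty stream).
    rate = count / elapsed if elapsed > 0 else 0.0
    return count, rate
```

With mlx-lm, the stream could be the iterator returned by `stream_generate(model, tokenizer, prompt, max_tokens=512)`, assuming it yields one response per generated token. Timed this way, the result includes prompt processing, so it will read slightly lower than a pure decoding rate.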
Repository Contents
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
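After downloading, it can be useful to confirm that every weight shard referenced by the index file is present. The sketch below assumes the index follows the usual Hugging Face sharded-safetensors layout, in which a `weight_map` object maps tensor names to shard file names; `missing_shards` is a hypothetical helper and the directory path is whatever local folder holds the files.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard files named in model.safetensors.index.json that are absent."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (model_dir / name).exists()]
```

An empty list means all shards listed in the index were found; any returned names point to an incomplete download.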
License
This model inherits the license of the original model:
Qwen2.5-7B-Instruct License: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#license
Please review the license before using this model in commercial applications.
Limitations and Bias
- The model may generate incorrect or insecure code.
- It may hallucinate APIs or functions.
- It may produce biased or harmful statements if prompted.
- It should not be used for production-critical code without human review.
Acknowledgements
- Qwen Team for the original Qwen2.5-7B-Instruct model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model