Qwopus3.5-4B-Coder-MTP-oQ6
An oMLX oQ-quantized version of Qwopus3.5-4B-Coder-MTP optimized for efficient local inference on Apple Silicon devices.
About Qwopus3.5-4B-Coder
Qwopus3.5-4B-Coder is a compact coding and agent-oriented model built on the Qwen3.5 4B family.
The model is designed for:
- Coding assistance
- Agent workflows
- Tool use
- Debugging
- Structured reasoning
- Software engineering tasks
- Local development environments
The training recipe combines reasoning-oriented techniques, agent trajectories, and coding-focused instruction tuning to improve stability and practical coding performance.
About This Quantization
This repository contains an oMLX oQ6 mixed-precision quantization of the original model.
Unlike traditional uniform quantization methods, oQ allocates precision dynamically according to layer sensitivity. Critical model components retain higher precision while less sensitive components are compressed more aggressively.
Benefits include:
- Reduced memory consumption
- Reduced storage requirements
- Better quality retention than uniform low-bit quantization
- Faster local inference
- Improved efficiency on Apple Silicon hardware
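The idea of spending more bits on sensitive layers can be sketched in a few lines. This is only an illustration of sensitivity-driven mixed precision in general, not the oMLX oQ algorithm: the `Layer` class, the sensitivity scores, the layer names, and the greedy budget rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str           # hypothetical layer name
    num_params: int     # number of weights in the layer
    sensitivity: float  # hypothetical score: higher means quantization hurts more


def allocate_bits(layers, target_avg_bits=6.0, choices=(4, 6, 8)):
    """Greedy mixed-precision allocation under an average-bit budget.

    Every layer starts at the lowest width. Layers are then visited from most
    to least sensitive, and each one takes the widest format that still keeps
    the parameter-weighted average at or below the target.
    """
    widths = sorted(choices)
    total_params = sum(layer.num_params for layer in layers)
    budget = target_avg_bits * total_params      # total bits available
    bits = {layer.name: widths[0] for layer in layers}
    spent = widths[0] * total_params             # cost of the all-lowest baseline

    for layer in sorted(layers, key=lambda l: l.sensitivity, reverse=True):
        for width in reversed(widths[1:]):
            extra = (width - widths[0]) * layer.num_params
            if spent + extra <= budget:
                bits[layer.name] = width
                spent += extra
                break
    return bits, spent / total_params


# Toy model: small but sensitive embedding/head layers, one large robust MLP.
layers = [
    Layer("embed", 100, 0.9),
    Layer("attn", 200, 0.7),
    Layer("mlp", 600, 0.2),
    Layer("lm_head", 100, 0.8),
]
bits, avg_bits = allocate_bits(layers)
print(bits, avg_bits)
```

In this toy setup the small, sensitive layers end up at the widest format while the large, robust layer stays at the narrowest one, so the average remains under the budget.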
Multi-Token Prediction (MTP)
This release preserves the model's Multi-Token Prediction (MTP) components.
MTP lets the model predict several future tokens per step rather than one, which can improve generation efficiency in runtimes that make use of it. Keeping these components also maintains compatibility with runtimes and workflows that support MTP-enabled Qwen-family models.
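One common way a runtime can exploit multi-token predictions is draft-and-verify: the extra predictions are treated as a draft, and only the prefix the main model agrees with is kept. The sketch below is a generic toy version of that idea and makes no claim about how this model or any particular runtime implements MTP. `verify_draft` and `toy_target` are hypothetical names, and a real runtime would typically verify the whole draft in one batched forward pass rather than token by token.

```python
def verify_draft(target_next, context, draft):
    """Keep the longest prefix of `draft` that the target model agrees with.

    target_next(tokens) is a stand-in for the main model's greedy next token.
    Returns the tokens to append: the accepted draft tokens, followed by one
    token from the target (a correction on mismatch, a bonus token otherwise).
    """
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # whole draft accepted
    return accepted


def toy_target(tokens):
    """Toy 'model': the next token is the previous one plus 1, modulo 10."""
    return (tokens[-1] + 1) % 10


context = [1, 2, 3]
partly_wrong = verify_draft(toy_target, context, [4, 5, 9, 7])
fully_right = verify_draft(toy_target, context, [4, 5, 6])
print(partly_wrong, fully_right)
```

The more draft tokens are accepted per step, the fewer sequential steps the main model needs, which is where the efficiency gain comes from.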
Recommended Settings
For best results:
temp: 1.0
top_p: 0.95
top_k: 20
min_p: 0
rep_penalty: 1.05
presence_penalty: 1.5
enable_thinking: true
These settings provide a good balance between exploration, acceptance rate, and generation quality when paired with a Qwen3.5 target model. For faster and more accurate responses, consider using a DFlash model: https://huggingface.co/z-lab/Qwen3.5-4B-DFlash or https://huggingface.co/yugeshkarunamurthy/Qwen3.5-4b-Dflash-6bit-MLX
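When the model is served behind an OpenAI-compatible endpoint (for example, the one started by `mlx_lm.server`), the values above can be sent per request. The sketch below only builds the JSON body and does not contact a server. The field names in `FIELD_NAMES` and the `chat_template_kwargs` key are assumptions; which of them a given server version honors should be checked against its documentation.

```python
import json

# Values copied from the recommended settings above.
RECOMMENDED = {
    "temp": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
    "rep_penalty": 1.05,
    "presence_penalty": 1.5,
    "enable_thinking": True,
}

# Assumed mapping from the card's setting names to request field names.
FIELD_NAMES = {
    "temp": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "min_p": "min_p",
    "rep_penalty": "repetition_penalty",
    "presence_penalty": "presence_penalty",
}


def build_request(model, user_message, settings=RECOMMENDED):
    """Build a chat-completions request body carrying the sampling settings."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    for key, field in FIELD_NAMES.items():
        if key in settings:
            body[field] = settings[key]
    if "enable_thinking" in settings:
        # Assumption: the server forwards this to the chat template.
        body["chat_template_kwargs"] = {"enable_thinking": settings["enable_thinking"]}
    return body


request_body = build_request(
    "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6",
    "Write a Python function that implements binary search.",
)
print(json.dumps(request_body, indent=2))
```

Keeping the mapping in one dictionary makes it easy to drop or rename a field if a server rejects it.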
Intended Use
This model is suitable for:
- Code generation
- Code review
- Debugging assistance
- Agentic coding workflows
- Terminal assistants
- IDE integrations
- Research and experimentation
- Local AI development
Usage
MLX-LM
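from mlx_lm import load, generate

# Load the quantized weights and tokenizer from a local path or a Hub repo ID
model, tokenizer = load("path/to/model")

# Apply the chat template so the prompt matches the model's instruction format
messages = [{"role": "user", "content": "Write a Python function that implements binary search."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
)
print(response)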
Claude Code
This model works well as a local coding model for Claude Code workflows where fast iteration, code generation, debugging, and repository assistance are required.
Quantization Details
| Item | Value |
|---|---|
| Base Model | Qwopus3.5-4B-Coder-MTP |
| Quantization Method | oMLX oQ |
| Format | MLX |
| MTP Preserved | Yes |
| Architecture | Qwen3.5 Family |
Performance Notes
Performance depends on:
- Context length
- Runtime implementation
- Hardware configuration
- Quantization parameters
- Prompt style
Users are encouraged to benchmark the model on their own workloads.
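A small timing harness is usually enough for a first comparison across prompts or settings. The sketch below is runtime-agnostic: `generate_fn` and `count_tokens` are hypothetical callables that you would supply (for instance, a wrapper around your runtime's generate call and its tokenizer), and the stand-ins used here only simulate work.

```python
import statistics
import time


def benchmark(generate_fn, prompts, count_tokens, repeats=3):
    """Return the median tokens-per-second for each prompt.

    generate_fn(prompt) -> generated text (supplied by the caller)
    count_tokens(text)  -> number of tokens in that text (supplied by the caller)
    """
    results = {}
    for prompt in prompts:
        rates = []
        for _ in range(repeats):
            start = time.perf_counter()
            output = generate_fn(prompt)
            elapsed = time.perf_counter() - start
            tokens = count_tokens(output)
            rates.append(tokens / elapsed if elapsed > 0 else float("inf"))
        results[prompt] = statistics.median(rates)
    return results


def fake_generate(prompt):
    """Stand-in for a real model call: sleeps briefly and returns 50 'tokens'."""
    time.sleep(0.001)
    return "tok " * 50


def whitespace_tokens(text):
    """Stand-in token counter; a real run would use the model's tokenizer."""
    return len(text.split())


rates = benchmark(fake_generate, ["short prompt", "a longer coding prompt"], whitespace_tokens)
for prompt, rate in rates.items():
    print(f"{prompt!r}: {rate:.0f} tokens/sec")
```

Using the median over a few repeats reduces the effect of one-off stalls such as first-run warm-up.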
Limitations
This model inherits the strengths and limitations of the original Qwopus3.5-4B-Coder model.
Quantization may introduce:
- Minor reductions in reasoning quality
- Small changes in generation behavior
- Reduced performance on certain edge-case tasks
Results will vary depending on hardware and inference settings.
Credits
Original Model
- Jackrong — Qwopus3.5-4B-Coder-MTP
Quantization
- oMLX
- MLX Ecosystem
Citation
If you use the original model in research, please cite the original Qwopus model authors and repository.
Disclaimer
This repository contains a community-generated quantized checkpoint and is not an official release from the original model authors.
Please evaluate the model carefully before deploying it in production environments.