Instructions to use yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6 with libraries and local apps.

Libraries

How to use yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"
        }
      ]
    }
  }
}
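If `~/.pi/agent/models.json` already lists other providers, pasting the block above over the file would drop them. A minimal sketch of merging the entry instead, assuming the file has the `providers` layout shown above; the helper name `add_mlx_provider` is hypothetical and not part of Pi:

```python
import json
from pathlib import Path


def add_mlx_provider(config_path, model_id, base_url="http://localhost:8080/v1"):
    """Merge an "mlx-lm" provider entry into a Pi models.json file.

    Hypothetical helper: it only mirrors the JSON layout shown above.
    """
    path = Path(config_path).expanduser()
    # Start from the existing config so other providers are preserved.
    config = json.loads(path.read_text()) if path.exists() else {}
    providers = config.setdefault("providers", {})
    providers["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

It could be called as `add_mlx_provider("~/.pi/agent/models.json", "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6")`.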

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6

Run Hermes

hermes

MLX LM

How to use yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
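The same request can be sent from Python with only the standard library. This is a sketch, assuming the server is listening on port 8080 (the port used in the Pi and Hermes configuration above) and returns the usual OpenAI-style `choices[0].message.content` field; the function names are illustrative:

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8080/v1"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumes an OpenAI-style response body.
    return payload["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("Hello", "yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6"))` should return the reply as a string.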

Qwopus3.5-4B-Coder-MTP-oQ6

An oMLX oQ-quantized version of Qwopus3.5-4B-Coder-MTP optimized for efficient local inference on Apple Silicon devices.

About Qwopus3.5-4B-Coder

Qwopus3.5-4B-Coder is a compact coding and agent-oriented model built on the Qwen3.5 4B family.

The model is designed for:

Coding assistance
Agent workflows
Tool use
Debugging
Structured reasoning
Software engineering tasks
Local development environments

The training recipe combines reasoning-oriented techniques, agent trajectories, and coding-focused instruction tuning to improve stability and practical coding performance.

About This Quantization

This repository contains an oMLX oQ6 mixed-precision quantization of the original model.

Unlike traditional uniform quantization methods, oQ allocates precision dynamically according to layer sensitivity. Critical model components retain higher precision while less sensitive components are compressed more aggressively.

Benefits include:

Reduced memory consumption
Reduced storage requirements
Better quality retention than uniform low-bit quantization
Faster local inference
Improved efficiency on Apple Silicon hardware
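As a rough illustration of sensitivity-based precision allocation, the toy sketch below upgrades the most sensitive layers to a higher bit width while a size-weighted average stays within a budget. It is not the actual oQ algorithm, whose details are not described here; the layer names, sizes, sensitivity scores, and the `allocate_bits` helper are all hypothetical.

```python
def allocate_bits(sensitivity, sizes, avg_bits, low=4, high=8):
    """Toy mixed-precision allocation (illustrative only, not oQ itself).

    Every layer starts at `low` bits. Layers are visited from most to least
    sensitive and upgraded to `high` bits whenever the size-weighted average
    would still be at or below `avg_bits`.
    """
    total = sum(sizes.values())
    budget = avg_bits * total  # total number of bits we may spend
    bits = {name: low for name in sensitivity}
    spent = low * total
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = (high - low) * sizes[name]
        if spent + extra <= budget:
            bits[name] = high
            spent += extra
    return bits


# Hypothetical layers: relative parameter counts and sensitivity scores.
sizes = {"embed": 100, "attention": 200, "mlp": 600, "lm_head": 100}
sensitivity = {"embed": 0.9, "attention": 0.5, "mlp": 0.1, "lm_head": 0.8}
plan = allocate_bits(sensitivity, sizes, avg_bits=6)
average = sum(plan[n] * sizes[n] for n in sizes) / sum(sizes.values())
print(plan, average)
```

In this made-up example the large, low-sensitivity `mlp` block stays at 4 bits and the other layers move to 8 bits, for an average of 5.6 bits per weight.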

Multi-Token Prediction (MTP)

This release preserves the model's Multi-Token Prediction (MTP) components.

MTP lets the model predict several future tokens per step, which can improve generation efficiency in runtimes that support it. Preserving these components also helps maintain compatibility with runtimes and workflows built for MTP-enabled Qwen-family models.
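One common way multi-token predictions are turned into a speed-up is draft-and-verify decoding: extra tokens are proposed cheaply, and only those that the main prediction path agrees with are kept. The sketch below shows a greedy acceptance rule for a single step. It is a generic illustration, not a description of how this model or any particular runtime implements MTP, and `accept_draft` is a hypothetical name.

```python
def accept_draft(draft, verified):
    """Keep drafted tokens up to the first disagreement (greedy rule).

    draft:    token ids proposed ahead of time
    verified: token ids the main model picks at the same positions,
              optionally followed by one extra token
    """
    kept = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            # First mismatch: take the verified token and stop.
            kept.append(checked)
            return kept
        kept.append(proposed)
    # Every drafted token matched; use the extra verified token if present.
    if len(verified) > len(draft):
        kept.append(verified[len(draft)])
    return kept


print(accept_draft([1, 2, 3], [1, 2, 9, 4]))  # [1, 2, 9]
print(accept_draft([1, 2, 3], [1, 2, 3, 4]))  # [1, 2, 3, 4]
```

The more drafted tokens are accepted per step, the fewer full forward passes are needed for the same output.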

Recommended Settings

For best results:

temp: 1.0
top_p: 0.95
top_k: 20
min_p: 0
rep_penalty: 1.05
presence_penalty: 1.5
enable_thinking: true

These settings provide a good balance between exploration, acceptance rate, and generation quality when paired with a Qwen3.5 target model. Consider using a DFlash model for more accurate and faster responses: https://huggingface.co/z-lab/Qwen3.5-4B-DFlash or https://huggingface.co/yugeshkarunamurthy/Qwen3.5-4b-Dflash-6bit-MLX
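A sketch of passing the settings listed above to mlx-lm from Python. It assumes a recent mlx-lm in which `make_sampler` accepts `top_k`, and a chat template that honors `enable_thinking`; both are assumptions to check against your installed version. `presence_penalty` is deliberately not passed, because support for it depends on the runtime. The helper names are hypothetical.

```python
RECOMMENDED = {
    "temp": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "rep_penalty": 1.05,
    "presence_penalty": 1.5,  # not passed below; support varies by runtime
    "enable_thinking": True,
}


def split_settings(settings):
    """Group the settings by the mlx-lm component that consumes them."""
    sampler = {k: settings[k] for k in ("temp", "top_p", "top_k", "min_p")}
    processors = {"repetition_penalty": settings["rep_penalty"]}
    template = {"enable_thinking": settings["enable_thinking"]}
    return sampler, processors, template


def generate_with_settings(model, tokenizer, user_prompt, settings=None, max_tokens=512):
    """Generate a reply using the recommended sampling settings."""
    # Imported lazily so the helpers above work without mlx-lm installed.
    from mlx_lm import generate
    from mlx_lm.sample_utils import make_logits_processors, make_sampler

    sampler_kw, processor_kw, template_kw = split_settings(settings or RECOMMENDED)
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prompt}],
        add_generation_prompt=True,
        **template_kw,  # assumption: the template accepts enable_thinking
    )
    return generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=max_tokens,
        sampler=make_sampler(**sampler_kw),
        logits_processors=make_logits_processors(**processor_kw),
    )
```

After `model, tokenizer = load(...)`, a call such as `generate_with_settings(model, tokenizer, "Explain this traceback")` would apply the sampler and repetition penalty in one place.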

Intended Use

This model is suitable for:

Code generation
Code review
Debugging assistance
Agentic coding workflows
Terminal assistants
IDE integrations
Research and experimentation
Local AI development

Usage

MLX-LM

from mlx_lm import load, generate

model, tokenizer = load("path/to/model")

response = generate(
    model,
    tokenizer,
    prompt="Write a Python function that implements binary search.",
    max_tokens=512,
)

print(response)

Claude Code

This model works well as a local coding model for Claude Code workflows where fast iteration, code generation, debugging, and repository assistance are required.

Quantization Details

Item	Value
Base Model	Qwopus3.5-4B-Coder-MTP
Quantization Method	oMLX oQ
Format	MLX
MTP Preserved	Yes
Architecture	Qwen3.5 Family

Performance Notes

Performance depends on:

Context length
Runtime implementation
Hardware configuration
Quantization parameters
Prompt style

Users are encouraged to benchmark the model on their own workloads.
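A minimal timing harness for such a benchmark might look like the sketch below. It measures end-to-end tokens per second (prompt processing included) for any generation callable; `measure_throughput` and its parameters are illustrative names, not part of mlx-lm.

```python
import time


def measure_throughput(generate_fn, prompt, count_tokens, runs=3):
    """Return the tokens-per-second rate of each run of generate_fn(prompt).

    generate_fn:  callable that takes a prompt and returns generated text
    count_tokens: callable that returns the number of tokens in that text
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        text = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        # Guard against a zero interval from a very fast stand-in function.
        rates.append(count_tokens(text) / elapsed if elapsed > 0 else float("inf"))
    return rates
```

With mlx-lm, `generate_fn` could wrap `generate(model, tokenizer, prompt=..., max_tokens=...)` and `count_tokens` could be `lambda text: len(tokenizer.encode(text))`. Running the same prompts at the context lengths you actually use gives more representative numbers than a single short prompt.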

Limitations

This model inherits the strengths and limitations of the original Qwopus3.5-4B-Coder model.

Quantization may introduce:

Minor reductions in reasoning quality
Small changes in generation behavior
Reduced performance on certain edge-case tasks

Results will vary depending on hardware and inference settings.

Credits

Original Model

Jackrong — Qwopus3.5-4B-Coder-MTP

Quantization

oMLX
MLX Ecosystem

Citation

If you use the original model in research, please cite the original Qwopus model authors and repository.

Disclaimer

This repository contains a community-generated quantized checkpoint and is not an official release from the original model authors.

Please evaluate the model carefully before deploying it in production environments.

Downloads last month: 151

Safetensors
Model size: 1B params
Tensor type: BF16, U32

MLX
Hardware compatibility: 6-bit

Model tree for yugeshkarunamurthy/Qwopus3.5-4B-Coder-oQ6

Base model: Qwen/Qwen3.5-4B-Base
Finetuned: Qwen/Qwen3.5-4B
Adapter: Jackrong/Qwopus3.5-4B-Coder
Quantized (3): this model