Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

This is an MLX 4-bit build of Qwen/Qwen3.6-35B-A3B packaged for fast local serving with lightning-mlx.

The model includes an MTPLX sidecar (mtp.safetensors) and runtime metadata (mtplx_runtime.json) so lightning-mlx can use its Qwen3.6 MTPLX serving path on Apple Silicon. The included runtime metadata was verified on Darwin arm64 with mtplx_version: 0.1.0rc3, mtp_depth_max: 1, and the performance-cold recommended profile.
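
Before serving from a local checkout, it can help to confirm that both MTPLX files are present. The sketch below is only an illustration: the helper name inspect_mtplx_checkout is hypothetical, and it assumes mtplx_version and mtp_depth_max are top-level keys in mtplx_runtime.json, which this card does not confirm.

```python
import json
from pathlib import Path


def inspect_mtplx_checkout(model_dir):
    """Report whether the MTPLX sidecar files exist in a local checkout."""
    root = Path(model_dir)
    sidecar = root / "mtp.safetensors"
    runtime_path = root / "mtplx_runtime.json"

    report = {
        "has_sidecar": sidecar.is_file(),
        "has_runtime_metadata": runtime_path.is_file(),
        "mtplx_version": None,
        "mtp_depth_max": None,
    }
    if report["has_runtime_metadata"]:
        metadata = json.loads(runtime_path.read_text(encoding="utf-8"))
        # Assumption: both fields are top-level keys. Using .get() leaves
        # them as None if the real file is laid out differently.
        if isinstance(metadata, dict):
            report["mtplx_version"] = metadata.get("mtplx_version")
            report["mtp_depth_max"] = metadata.get("mtp_depth_max")
    return report
```

Calling inspect_mtplx_checkout("/path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed") returns a small dictionary; a False value for either file suggests the checkout is incomplete.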

Refer to the original Qwen3.6-35B-A3B model card for base-model capabilities, license, and upstream details.

Install lightning-mlx

Install directly from GitHub:

python3 -m pip install git+https://github.com/samuelfaj/lightning-mlx.git

Or use the self-contained installer:

curl -fsSL https://raw.githubusercontent.com/samuelfaj/lightning-mlx/main/install.sh | bash

Verify the CLI:

lightning-mlx --help

Serve this model

Serve directly from Hugging Face:

lightning-mlx serve samuelfaj/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

Or serve from a local checkout:

lightning-mlx serve /path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

For long-running local use, start it as a daemon:

lightning-mlx serve samuelfaj/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed --daemon

Daemon mode starts a detached supervisor, writes logs under ~/.lightning-mlx/logs/, and can restart the server if the model process exits unexpectedly.

Useful daemon commands:

lightning-mlx status
lightning-mlx tui <PID-or-model-name>
lightning-mlx kill <PID-or-model-name>

Use status to list running daemons, tui to attach the live monitor to one of them, and kill to stop a daemon by supervisor PID, server PID, alias, or model name.
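
Because a daemon starts in the background, a script that launches it may need to wait until the server is reachable. The following is a minimal sketch, assuming the server listens on localhost port 8010 (the port used in the curl example in this card); wait_for_port is a hypothetical helper, not part of lightning-mlx. An open port only shows that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=8010, timeout=60.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt gets at most `interval` seconds to connect.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up at the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A launcher script could call wait_for_port() after starting the daemon and only send requests once it returns True.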

Use the OpenAI-compatible API

Once the server is running, send chat requests to the local OpenAI-compatible endpoint:

curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a tiny Python HTTP server."}
    ],
    "stream": true
  }'

lightning-mlx serves the model under the name local by default, so OpenAI-compatible clients can point at the local base URL and keep "model": "local" unless you override the served model name.
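
The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: it reuses the base URL and model name from the curl example, and it assumes the server streams OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"). The function names are illustrative and not part of lightning-mlx.

```python
import json
import urllib.request


def build_chat_request(prompt, model="local", base_url="http://localhost:8010/v1"):
    """Build a streaming chat-completions POST request (not sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker, as in the OpenAI API
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt, **kwargs):
    """Send the request and print the reply as it streams in."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        for text in iter_stream_text(response):
            print(text, end="", flush=True)
    print()
```

With the server running, stream_chat("Write a tiny Python HTTP server.") prints the reply incrementally. Keeping the parser separate from the network call makes it easy to check against recorded lines without a live server.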

Why use lightning-mlx

lightning-mlx is built for local agent workloads on Apple Silicon: short streamed turns, tool calls, growing context, and repeated low-latency interactions. With this model it can use the packaged MTPLX metadata and Qwen3.6 serving preset instead of treating the checkpoint as a generic MLX model.

The runtime focuses on:

  • OpenAI-compatible local serving
  • Fast streamed chat completions
  • Qwen3.6 reasoning and tool-use paths
  • MTPLX-style speculative decoding support
  • Daemon, status, TUI, and kill controls for local model servers

Convert similar local MTPLX models

If you have a local quantized Qwen3.6 model and the original full model to supply the MTP tensors, lightning-mlx can package a similar MTPLX model:

lightning-mlx convert-mtplx \
  /path/to/Qwen3.6-35B-A3B-4bit \
  --mtp-source /path/to/Qwen3.6-35B-A3B

By default, the output is written next to the source model as:

/path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

Then serve it normally:

lightning-mlx serve /path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

Use with mlx-vlm

This checkpoint remains an MLX model, so it can also be used for direct generation through mlx-vlm:

pip install -U mlx-vlm
python -m mlx_vlm.generate \
  --model samuelfaj/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed \
  --max-tokens 100 \
  --temperature 0.0 \
  --prompt "Describe this image." \
  --image <path_to_image>