Qwen3-8B — Squished for Apple Silicon

This is Qwen3-8B (8B parameters) compressed with Squish — a local inference engine for Apple Silicon.

Weights are INT4-quantized using Squish's ARM NEON-accelerated pipeline and load in under a second on M-series hardware.

Quick start

brew tap konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b

Model details

Property        Value
Parameters      8B
Family          Qwen3
Developer       Alibaba Cloud
Raw size        16.4 GB
Squished size   11.0 GB
Context window  131,072 tokens
Minimum RAM     16 GB unified memory
Quantization    INT4 (Squish pipeline)
Format          MLX-compatible safetensors

Use case

High-quality reasoning and coding with a 128k-token context window. Best suited to M2/M3 Macs with 16 GB of unified memory or more.
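How much memory a long prompt needs is dominated by the attention (KV) cache, so a rough estimate can help when sizing a machine. The sketch below is back-of-the-envelope arithmetic only: the layer count, KV-head count, and head dimension are assumptions taken from the published Qwen3-8B configuration, the helper name `kv_cache_bytes` is hypothetical, and the bit widths are illustrative rather than a statement of what Squish stores. Per-group scale overhead is ignored.

```python
def kv_cache_bytes(tokens, layers=36, kv_heads=8, head_dim=128, bits=16):
    """Approximate KV cache size in bytes for a given number of tokens.

    Defaults are assumed Qwen3-8B values (36 layers, 8 KV heads, head_dim 128).
    """
    per_token_elems = 2 * layers * kv_heads * head_dim  # 2 = keys + values
    return tokens * per_token_elems * bits // 8


GIB = 1024 ** 3
for bits in (16, 8, 4):
    size = kv_cache_bytes(131_072, bits=bits)
    print(f"{bits:>2}-bit cache at 131,072 tokens: {size / GIB:.1f} GiB")
```

Under these assumptions a full 131,072-token cache is about 18 GiB at 16 bits per element and shrinks in proportion to the bit width, which is why reduced-precision caching matters at long context lengths.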

Requirements

  • macOS 13.0 or later
  • Apple Silicon (M1, M2, M3, M4, M5)
  • 16 GB unified memory minimum

Intel Macs, Linux, and Windows are not supported.

How to use with Squish

# Pull and run
squish pull qwen3:8b
squish run qwen3:8b

# OpenAI-compatible API on port 11435
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"Hello"}]}'

# The same request with the OpenAI Python client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
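For scripts that should not depend on the openai package, the same endpoint can be called with the Python standard library. This is a minimal sketch, not part of Squish: `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the server returns the usual OpenAI chat-completions layout, as the client example above implies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the curl example


def build_chat_request(prompt, model="qwen3:8b", base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` needs the Squish server to be listening on port 11435; `build_chat_request` can be inspected without any server running.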

Load with mlx_lm directly

from mlx_lm import load, generate

model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
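# Qwen3 is a chat model, so wrap the message in its chat template before generating.
messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)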
print(response)

Compression details

This model was compressed using Squish's three-tier pipeline:

  • INT4 quantization via squish_quant_rs Rust extension with ARM NEON acceleration
  • Compressed weight loader — weights decompress directly into Metal-mapped memory at load time
  • KV cache quantization — attention cache stored at reduced precision during generation
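To make the idea of 4-bit quantization in the list above concrete, here is a generic sketch in plain Python. It is not Squish's implementation: the asymmetric min/max scheme, the nibble packing order, and every function name are assumptions chosen for illustration, and the actual squish_quant_rs format may differ. The same principle (store small integer codes plus a scale and offset per group) is what reduced-precision KV caching relies on as well.

```python
def quantize_group(values):
    """Map a group of floats to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes, scale, and offset."""
    return [lo + c * scale for c in codes]


weights = [0.0, 0.1, -0.25, 0.5, 0.33, -0.5, 0.2, 0.05]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)  # 8 weights stored in 4 bytes
restored = dequantize_group(unpack_nibbles(packed, len(weights)), scale, lo)
print(len(packed), max(abs(a - b) for a, b in zip(weights, restored)))
```

Each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), which is the trade-off paid for storing two weights per byte.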

Source weights: mlx-community/Qwen3-8B-bf16

License

The original model weights are subject to the license of the source model (Alibaba Cloud). The compression and tooling are MIT licensed. See Squish license for details.


Pre-compressed by Konjo AI · squish.run
