🗜 Konjo AI

Local AI infrastructure for Apple Silicon. We make models that already exist run faster on the hardware you already own.

🌐 squish.run · 💻 github.com/konjoai

squish — Local LLM inference for Apple Silicon

squish is an MLX-based local inference server with a block-level paged KV cache and INT3 quantization support for the Qwen3 family. Benchmarked against Ollama on a 16 GB M3 MacBook:

5.4× faster end-to-end response at 4000-token prompts (12.78s vs 69.6s)
1.5× faster end-to-end on 75-token prompts (5.50s vs 8.09s)
33% less RAM during inference (3.36 GB vs ~5 GB)
INT3 support for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)

The honest tradeoff: Ollama still wins first-token latency on short prompts. squish wins on total response time, and the gap widens with prompt length (1.5× at 75 tokens, 5.4× at 4000 tokens).
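The headline ratios can be re-derived from the raw figures in the benchmark list above. The short sketch below does only that arithmetic; the dictionary and function names are illustrative, and the Ollama RAM figure is taken as exactly 5 GB even though the card gives it as approximate.

```python
# Timings and memory figures copied from the benchmark list above.
LONG_PROMPT_S = {"squish": 12.78, "ollama": 69.6}  # 4000-token prompt, seconds
SHORT_PROMPT_S = {"squish": 5.50, "ollama": 8.09}  # 75-token prompt, seconds
RAM_GB = {"squish": 3.36, "ollama": 5.0}           # Ollama value is approximate (~5 GB)


def speedup(timings: dict[str, float]) -> float:
    """How many times faster squish finishes than Ollama."""
    return timings["ollama"] / timings["squish"]


def ram_saving_pct(ram: dict[str, float]) -> float:
    """Share of Ollama's RAM footprint that squish does not use, in percent."""
    return (1 - ram["squish"] / ram["ollama"]) * 100


print(f"long prompt:  {speedup(LONG_PROMPT_S):.1f}x")
print(f"short prompt: {speedup(SHORT_PROMPT_S):.1f}x")
print(f"RAM saving:   {ram_saving_pct(RAM_GB):.0f}%")
```

Because the Ollama memory figure is approximate, the RAM saving should be read as roughly a third rather than as a precise percentage.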

Install

brew tap konjoai/squish && brew install squish
# or
pip install squish-ai

Use

squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished

Full benchmarks · Repo · Issues

Pre-Compressed Models

This org hosts models pre-compressed by squish. Pull once, load instantly every time after.

Model	Squish ID	Quantization	Disk size	Context
Available after first publish batch

The format is mlx_lm-compatible — you can also use these models directly:

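from mlx_lm import load, generate

# Download (on first call) and load the weights and tokenizer from the Hub.
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
# Generate up to 100 new tokens from a plain-text prompt.
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)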

How models are compressed

squish uses a three-tier pipeline:

INT4/INT3 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration
Block-level paged KV cache: KV state is chunked into fixed-size blocks for prefix reuse across sessions
Quantization safeguards: squish hard-blocks INT3 on model families where accuracy collapses (e.g. Gemma-3 loses ~15 percentage points on common benchmarks); INT3 ships only for families that hold accuracy (specifically Qwen3)
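To make the prefix-reuse idea concrete, here is a minimal sketch of how fixed-size blocks can be keyed so that a later session recognises a prefix it shares with an earlier one. It is not squish's implementation: the block size, the chained SHA-256 keys, and the names `block_keys` and `PrefixBlockCache` are all assumptions made for illustration, and the stored "KV state" is just a placeholder string.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative, not squish's real value


def block_keys(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one chained hash per full block of the prompt.

    Each key covers the block's tokens plus the previous key, so two prompts
    share a key only if they agree on everything up to the end of that block.
    """
    keys: list[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class PrefixBlockCache:
    """Toy store mapping block keys to KV state (a placeholder string here)."""

    def __init__(self) -> None:
        self._blocks: dict[str, str] = {}

    def lookup(self, token_ids: list[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        reused = 0
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break  # first miss ends the reusable prefix
            reused += BLOCK_SIZE
        return reused

    def store(self, token_ids: list[int]) -> None:
        """Record an entry for every full block of the prompt."""
        for i, key in enumerate(block_keys(token_ids)):
            self._blocks.setdefault(key, f"kv-block-{i}")


cache = PrefixBlockCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]                 # two full blocks
cache.store(system_prompt + [9, 10])                     # first session
reused = cache.lookup(system_prompt + [11, 12, 13, 14])  # second session
print(reused)  # 8: both system-prompt blocks are reused, the new block is not
```

Chaining each key to the previous one is what makes this a prefix cache: a block with identical tokens at a different position, or after a different history, gets a different key and is never reused by mistake.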

Other projects

We also build squash, a security and EU AI Act compliance scanner for HuggingFace models. Independent codebase, related mission.

License

squish is licensed under BUSL-1.1. Compressed models inherit their base model's license: Qwen3 is Apache-2.0, Llama models use the Llama Community License, and so on. Check each model's card for specifics.

Requirements

macOS 13.0 or later
Apple Silicon (M1 / M2 / M3 / M4 / M5)
Enough unified memory for the model (see the Pre-Compressed Models table)

Intel Macs and Linux are not supported.
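The platform requirements above can be checked before installing. The sketch below uses only Python's standard `platform` module; the helper name `is_supported` is hypothetical and is not part of squish. It assumes a native arm64 Python, since an x86_64 interpreter running under Rosetta reports `x86_64` even on Apple Silicon.

```python
import platform


def is_supported(system: str, machine: str, mac_version: str) -> bool:
    """True for Apple Silicon Macs on macOS 13.0 or later."""
    if system != "Darwin" or machine != "arm64":
        return False  # Linux and Intel Macs are not supported
    try:
        major = int(mac_version.split(".")[0])
    except ValueError:
        return False  # version string missing or malformed
    return major >= 13


# platform.mac_ver()[0] is the macOS release string, or "" on other systems.
ok = is_supported(platform.system(), platform.machine(), platform.mac_ver()[0])
print("supported" if ok else "not supported")
```

This covers the operating system and CPU only; whether a given model fits in unified memory still has to be checked separately.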
