🗜 Konjo AI

Local AI infrastructure for Apple Silicon. We make models that already exist run faster on the hardware you already own.


squish: Local LLM inference for Apple Silicon

squish is an MLX-based local inference server with a block-level paged KV cache and INT3 quantization support for the Qwen3 family. Measured against Ollama on a 16 GB M3 MacBook:

  • 5.4× faster end-to-end response at 4000-token prompts (12.78s vs 69.6s)
  • 1.5× faster end-to-end on 75-token prompts (5.50s vs 8.09s)
  • 33% less RAM during inference (3.36 GB vs ~5 GB)
  • INT3 support for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)
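
The headline ratios follow directly from the timings and memory figures quoted above. A quick check in Python, using only those numbers (the Ollama RAM figure is given as roughly 5 GB, so the saving is approximate too):

```python
# Reproduce the headline ratios from the figures quoted in the list above.
squish_long_s, ollama_long_s = 12.78, 69.6    # 4000-token prompt, end-to-end seconds
squish_short_s, ollama_short_s = 5.50, 8.09   # 75-token prompt, end-to-end seconds
squish_ram_gb, ollama_ram_gb = 3.36, 5.0      # Ollama figure is approximate (~5 GB)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def saving(baseline: float, candidate: float) -> float:
    """Fractional reduction relative to the baseline."""
    return (baseline - candidate) / baseline


long_speedup = speedup(ollama_long_s, squish_long_s)
short_speedup = speedup(ollama_short_s, squish_short_s)
ram_saving = saving(ollama_ram_gb, squish_ram_gb)

print(f"long prompt:  {long_speedup:.1f}x faster")   # 5.4x
print(f"short prompt: {short_speedup:.1f}x faster")  # 1.5x
print(f"RAM saving:   {ram_saving:.0%}")             # 33%
```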

The honest tradeoff: Ollama still wins first-token latency on short prompts. squish wins when you care about total response time on real workloads.
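
To see this tradeoff on your own prompts, it helps to record first-token latency and total response time separately. The sketch below is a minimal, server-agnostic timer; `measure_stream` is a hypothetical helper (not part of squish or Ollama) that accepts any iterable yielding tokens as they arrive, so you would wrap whichever streaming client you use.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (time_to_first_token, total_time, token_count) for a token stream."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            # Time from request start until the first token arrived.
            first = clock() - start
        count += 1
    total = clock() - start
    # An empty stream has no first token; report the total instead.
    return (first if first is not None else total), total, count


def fake_stream(n: int):
    """Stand-in for a real streaming response (assumption for illustration)."""
    for i in range(n):
        yield f"tok{i}"


ttft, total, n_tokens = measure_stream(fake_stream(5))
print(f"first token: {ttft:.4f}s, total: {total:.4f}s, tokens: {n_tokens}")
```

Comparing both numbers across short and long prompts shows whether a server's advantage comes from startup latency or from sustained throughput.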

Install

brew tap konjoai/squish && brew install squish
# or
pip install squish-ai

Use

squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished

Pre-Compressed Models

This org hosts models pre-compressed by squish. Pull once, load instantly every time after.

Model | Squish ID | Quantization | Disk size | Context
(Available after first publish batch)

The format is mlx_lm-compatible, so you can also use these models directly:

from mlx_lm import load, generate

model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)

How models are compressed

squish uses a three-tier pipeline:

  • INT4/INT3 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration
  • Block-level paged KV cache: KV state is chunked into fixed-size blocks for prefix reuse across sessions
  • Quantization safeguards: squish hard-blocks INT3 on model families where it collapses (e.g. Gemma-3 loses ~15 percentage points on common benchmarks); INT3 ships only for families that hold accuracy (Qwen3 specifically)
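
The prefix-reuse idea behind a block-level paged KV cache can be sketched in a few lines. This is an illustration of the general technique, not squish's implementation: the block size, the hashing scheme, and the `BlockCache` class are all assumptions, and the KV tensors themselves are replaced by placeholders.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block; an assumed value, not squish's actual setting


def block_keys(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with everything before it.

    Chaining the previous key into the next hash means two prompts share the
    key for block i only if their first (i + 1) * block_size tokens match.
    """
    keys: List[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class BlockCache:
    """Toy block store: maps a block key to a placeholder for its KV state."""

    def __init__(self) -> None:
        self.blocks: Dict[str, object] = {}

    def store(self, token_ids: List[int]) -> None:
        """Record every full block of this prompt (KV tensors elided)."""
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, object())

    def lookup_prefix(self, token_ids: List[int]) -> int:
        """Return how many leading tokens already have cached blocks."""
        reused = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            reused += BLOCK_SIZE
        return reused


cache = BlockCache()
system_prompt = list(range(40))            # 40 shared tokens -> 2 full blocks
cache.store(system_prompt + [100, 101])    # first session
reused = cache.lookup_prefix(system_prompt + [200, 201, 202])  # second session
print(reused)  # 32: both full blocks of the shared prefix are reused
```

Only whole blocks are reusable in this sketch, so the last 8 tokens of the shared prompt would still be recomputed; smaller blocks reuse more of the prefix at the cost of more bookkeeping.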

Other projects

We also build squash, a security and EU AI Act compliance scanner for HuggingFace models. Independent codebase, related mission.


License

squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card for specifics.


Requirements

  • macOS 13.0 or later
  • Apple Silicon (M1 / M2 / M3 / M4 / M5)
  • Enough unified memory for the model (table above)

Intel Macs and Linux are not supported.
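
The OS and architecture requirements can be checked before installing with Python's standard `platform` module. The helper below is a hypothetical pre-flight check, not something squish ships; it does not check unified memory, and a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so treat a failure there as a hint rather than a verdict.

```python
import platform
from typing import Tuple


def meets_requirements(system: str, machine: str, macos_version: str) -> Tuple[bool, str]:
    """Check the stated requirements: macOS 13.0 or later on Apple Silicon."""
    if system != "Darwin":
        return False, "macOS is required"
    if machine != "arm64":
        return False, "Apple Silicon (arm64) is required"
    # macos_version looks like "14.5" or "13.0.1"; only the major part matters here.
    major = int(macos_version.split(".")[0]) if macos_version else 0
    if major < 13:
        return False, "macOS 13.0 or later is required"
    return True, "ok"


ok, reason = meets_requirements(
    platform.system(), platform.machine(), platform.mac_ver()[0]
)
print(ok, reason)
```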
