Expand model card with clean results and artifact guide
README.md CHANGED

@@ -3,34 +3,94 @@ license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- sparse-attention
- approximate-nearest-neighbor
- ann
- qwen3
- retrieval
- attention
- research-artifact
library_name: pytorch
---

# ANN Sparse Attention Checkpoints

This repository contains checkpoint artifacts for a research prototype that trains tiny per-layer search projections on a frozen LLM, so dense attention can be approximated by retrieving a small causal key set in a learned low-dimensional space.
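
To make this concrete, here is a minimal single-head sketch of the mechanism. The names (`sparse_attention_via_search`, `Wq`, `Wk`) are illustrative rather than the repo's API, and RoPE, GQA, and segment masking are omitted: queries and keys are projected into the small search space, each query retrieves its top-K causal keys there, and exact attention is then computed over only the retrieved set.

```python
import torch
import torch.nn.functional as F

def sparse_attention_via_search(q, k, v, Wq, Wk, K=128):
    """q, k, v: [L, d] per-head tensors from the frozen model.
    Wq, Wk: learned [d, d_search] search projections (the only trained weights)."""
    L, d = q.shape
    qs, ks = q @ Wq, k @ Wk                          # search space, [L, d_search]
    sims = qs @ ks.T                                 # cheap retrieval scores, [L, L]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    sims = sims.masked_fill(~causal, float("-inf"))
    topk = sims.topk(min(K, L), dim=-1).indices      # [L, K] retrieved key ids

    # Exact attention over the retrieved set, in the original head space.
    scores = (q.unsqueeze(1) * k[topk]).sum(-1) / d ** 0.5   # [L, K]
    valid = topk <= torch.arange(L).unsqueeze(1)     # drop picks past the causal frontier
    scores = scores.masked_fill(~valid, float("-inf"))
    return (F.softmax(scores, dim=-1).unsqueeze(-1) * v[topk]).sum(1)  # [L, d]
```

Only `Wq`/`Wk` (about 0.1% of the base model here) are trained; the frozen model's q/k/v and the final attention arithmetic are untouched, which is why the substitution can be evaluated as exact top-K retrieval.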

The associated source repo is [unixsysdev/ann-sparseattention](https://github.com/unixsysdev/ann-sparseattention). The GitHub repo contains the training/eval code; this Hugging Face repo stores the checkpoints and JSON result artifacts.

## Status

Research artifact, not a deployable inference package.

The clean result is narrow but real: on a 6-layer Qwen3-4B pilot with packed block-causal WikiText evaluation, the learned d128 search projections preserve full-attention perplexity under exact sparse substitution.

What survives clean methodology:

- Full-attention parity on the block-causal d128 pilot.
- Strong teacher-attention mass recovery with learned projections.
- Learned search projections recover more teacher attention mass than the Quest-style page heuristic at the same token budget on this slice.
- The earlier negative PPL gaps from packed-with-leakage runs do **not** survive as a clean denoising headline.

What is not established yet:

- Wall-clock speedup. The current runtime is a correctness prototype.
- Confidence intervals across seeds.
- LongBench/RULER/needle downstream task quality.
- Dynamic decode-mode index insertion.
- Whole-model / all-layer substitution.
- GPU-resident ANN or fused sparse-attention kernels.

## Base Model

- Base model: `Qwen/Qwen3-4B-Instruct-2507`
- Layers trained in pilot: `[4, 8, 12, 16, 20, 24]`
- Clean recommended checkpoint: `checkpoints_block_d128/search_step_1000.pt`
- Search dimension: `d_search=128`
- Trainable parameters: 3.93M total, about 0.1% of the base model
- Base model weights are **not** included here. These checkpoints contain only the learned search projection module and training metadata.

## Folder Guide

Use `checkpoints_block_d128/` for current clean claims.

| Folder | Meaning | Use for claims? |
|---|---|---|
| `checkpoints_block_d128/` | Clean packed block-causal d128 run and eval artifacts | Yes |
| `checkpoints_packed_d64/` | Packed d64 leakage-confounded capacity run | Capacity history only |
| `checkpoints_packed_d128/` | Packed d128 leakage-confounded capacity run | Capacity history only |
| `checkpoints_packed_d256/` | Packed d256 leakage-confounded capacity run | Capacity history only |
| `checkpoints_d64/` | Earlier unpacked d64 checkpoints | Debug/history |
| `checkpoints/` | Original pilot checkpoint and compare JSON | Debug/history |

The clean block-causal run fixed the core packing issue by assigning each packed document a `segment_id`, resetting `position_ids`, and supplying a 4D block-causal attention mask so tokens can only attend causally within their own document.
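
For concreteness, here is a minimal sketch of that construction. The helper names are illustrative, not the repo's utilities, and depending on the integration the boolean mask may need converting to an additive float mask (`0` where `True`, `-inf` where `False`):

```python
import torch
import torch.nn.functional as F

def block_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: [B, L] document id per packed token. Returns a
    [B, 1, L, L] bool mask, True where query i may attend to key j:
    same document and j <= i."""
    same_doc = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)  # [B, L, L]
    L = segment_ids.shape[-1]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=segment_ids.device))
    return (same_doc & causal).unsqueeze(1)

def reset_position_ids(segment_ids: torch.Tensor) -> torch.Tensor:
    """Positions restart at 0 at every document boundary."""
    idx = torch.arange(segment_ids.shape[-1], device=segment_ids.device).expand_as(segment_ids)
    starts = F.pad(segment_ids[:, 1:] != segment_ids[:, :-1], (1, 0))  # True at each doc start
    doc_start = (idx * starts).cummax(dim=-1).values  # start index of each token's document
    return idx - doc_start
```

The same segment-causal eligibility pattern gates retrieval, loss masking, mass@K, and recall@K, so packed documents stay fully isolated end to end.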

## Clean Block-Causal Result

Command used for the clean d128 checkpoint:

```bash
python train.py --config pilot_d128_block
python k_sweep.py \
  --ckpt /tmp/checkpoints_block_d128/search_step_1000.pt \
  --K 128,256,512 \
  --no-use-faiss
```

Evaluation slice: 16 packed block-causal WikiText batches at 4K context.

`PPL_full = 30.44`

| K | Recall@K | mass@K | PPL_ANN | PPL gap |
|---|---:|---:|---:|---:|
| 128 | 0.744 | 0.787 | 30.47 | +0.07% |
| 256 | 0.879 | 0.953 | 30.45 | +0.01% |
| 512 | n/a | n/a | 30.45 | +0.01% |

K=512 has no meaningful mass/recall average on this WikiText slice because almost no same-segment queries have 512 valid causal keys. The PPL value is still shown, but K=512 should not be used as a retrieval-quality point for this dataset slice.
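
For reference, here is a compact sketch of how the two retrieval metrics can be computed; it is illustrative rather than the repo's exact implementation. mass@K is the share of the teacher's attention probability captured by the retrieved keys, recall@K is the overlap with the teacher's own top-K, and the PPL gap is `(PPL_ANN - PPL_full) / PPL_full`.

```python
import torch

def mass_and_recall_at_k(teacher_probs, retrieved, K):
    """teacher_probs: [L, L] full-attention probabilities, already zeroed
    outside the segment-causal eligibility mask; retrieved: [L, K] indices
    chosen by the learned search. Queries with fewer than K eligible keys
    should be excluded before averaging (see the K=512 caveat above)."""
    mass = teacher_probs.gather(1, retrieved).sum(dim=1)            # [L]
    teacher_topk = teacher_probs.topk(K, dim=1).indices             # [L, K]
    hit = (retrieved.unsqueeze(2) == teacher_topk.unsqueeze(1)).any(2)
    recall = hit.float().mean(dim=1)                                # [L]
    return mass.mean().item(), recall.mean().item()
```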

Interpretation: the clean result supports **quality-preserving sparse substitution**, not a claim that sparse attention improves over full attention.

## Clean Per-layer Retrieval at K=128

From `checkpoints_block_d128/search_step_1000.compare_retrieval.json`:

@@ -44,11 +104,31 @@
| 24 | 0.978 | 0.984 |
| avg | 0.969 | 0.973 |

This changes the interpretation from the earlier leakage-confounded pilot. With segment isolation, early trained layers are not diffuse or uniquely hard. All six trained layers have high raw-QK oracle mass, and learned projections match or slightly exceed raw-QK retrieval across the tested set.

The next deployment hypothesis is therefore: substitute all tested layers, then validate on a broader all-layer run.

## Quest-style Page Baseline

`quest_sweep.py` implements a Quest-style min/max page selector for comparison:

- Page size: 16
- Native post-RoPE Q/K min/max metadata
- Same block-causal token eligibility mask
- Same sparse-attention gather path

This is a correctness baseline, not an optimized Quest runtime.
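
A minimal sketch of the page-selection rule, simplified to a single query and head with causal/segment filtering omitted (`quest_page_select` is an illustrative name; `quest_sweep.py` in the source repo is the reference):

```python
import torch

def quest_page_select(q, k, page_size=16, K=128):
    """q: [d] query; k: [L, d] post-RoPE keys, with L a multiple of
    page_size for brevity. Returns up to K token indices from the pages
    with the highest upper-bound scores."""
    pages = k.view(-1, page_size, k.shape[-1])              # [P, page, d]
    kmin = pages.min(dim=1).values                          # [P, d] per-page metadata
    kmax = pages.max(dim=1).values
    # Quest-style bound: per channel, take whichever extreme maximizes q_c * k_c.
    upper = torch.maximum(q * kmin, q * kmax).sum(dim=-1)   # [P]
    n_pages = max(1, K // page_size)
    top_pages = upper.topk(min(n_pages, upper.numel())).indices
    idx = top_pages.unsqueeze(1) * page_size + torch.arange(page_size)
    return idx.flatten()
```

Both selectors then share the same eligibility mask and the same sparse-attention gather path, so the comparison isolates the selection rule itself.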

Command:

```bash
python quest_sweep.py \
  --ckpt /tmp/checkpoints_block_d128/search_step_1000.pt \
  --K 128,256,512 \
  --page-size 16
```

Same 16-batch clean block-causal eval slice:

| Method | K | Recall@K | mass@K | PPL | PPL gap |
|---|---:|---:|---:|---:|---:|

@@ -61,27 +141,106 @@ Both methods are effectively full-attention parity on PPL. Learned projections recover more teacher attention mass than the page heuristic at the same token budget.

## Packed Leakage-confounded Ablations

The packed d64/d128/d256 runs are included because they are useful for understanding capacity scaling, but they should not be used for clean quality claims. Those runs allowed cross-document attention inside packed examples.

Packed d_search ablation at K=128:

| d_search | Params | learned mass@K=128 | raw-QK oracle | learned/oracle | final PPL gap |
|---|---:|---:|---:|---:|---:|
| 64 | 1.97M | 0.492 | 0.488 | 1.01x | +2.39% |
| 128 | 3.93M | 0.503 | 0.488 | 1.03x | -1.81% |
| 256 | 7.86M | 0.509 | 0.488 | 1.04x | -1.85% |

The packed leakage-confounded K-sweep showed large negative PPL gaps:

| K | Recall@K | mass@K | PPL_ANN | PPL gap |
|---|---:|---:|---:|---:|
| 128 | 0.166 | 0.256 | 203.63 | -9.36% |
| 256 | 0.233 | 0.318 | 207.06 | -7.83% |
| 512 | 0.339 | 0.409 | 211.93 | -5.66% |

A second leaked packed slice preserved the shape: K=128 `-8.78%`, K=256 `-7.59%`, K=512 `-6.21%`. These numbers are retained for transparency and debugging history. They should not be reported as the headline because the clean block-causal rerun shows parity, not denoising.

## What the Checkpoints Contain

Each `.pt` file is a PyTorch checkpoint with the learned search projection module and config metadata. The base LLM is loaded separately from Hugging Face.

Example loading pattern from the source repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from config import Config                    # from the source repo
from model import SearchProjectionModule     # from the source repo

# The checkpoint holds only the search projection weights plus training config.
ckpt = torch.load(
    "checkpoints_block_d128/search_step_1000.pt",
    map_location="cpu",
    weights_only=False,
)
ckpt_cfg = ckpt["config"]

# Rebuild the training Config, overriding defaults with checkpoint values.
cfg = Config()
for key, value in ckpt_cfg.items():
    if hasattr(cfg, key):
        setattr(cfg, key, value)

# Frozen base model; its weights are not stored in this repo.
tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name)
base = AutoModelForCausalLM.from_pretrained(
    cfg.base_model_name,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

# Layers that received trained search projections.
layers = [
    i for i in cfg.full_attention_layer_indices
    if i not in cfg.reserved_full_attention_indices
]
search = SearchProjectionModule(
    d_model=base.config.hidden_size,
    d_search=cfg.d_search,
    layer_indices=layers,
    use_mlp=cfg.use_mlp_proj,
).to(base.device).to(torch.bfloat16)
search.load_state_dict(ckpt["search_module"])
search.eval()
```

See the GitHub repo for full eval scripts and monkey-patched sparse-attention wrappers.

## Runtime Caveat

The current `inference.py` path is a correctness prototype:

- The exact top-K path materializes a dense `[B, L, L]` similarity matrix and is intended for analysis.
- The FAISS/HNSW path builds a CPU index per forward pass and transfers data across CPU/GPU.
- Gathered sparse attention still uses dense-style tensor expansion internally, as the sketch below illustrates.

Therefore, any FLOP/scoring reductions are algorithmic estimates, not measured wall-clock speedups. A deployable runtime needs GPU-resident retrieval and a fused sparse/paged attention kernel.
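
To make the last point concrete, here is an illustrative demonstration (not code from the repo) of why the gather path keeps dense-style memory traffic:

```python
import torch

# Small shapes so the sketch runs anywhere; the eval uses 4K contexts,
# where this expansion is proportionally far more expensive.
B, H, L, K, d = 1, 2, 1024, 128, 64
k = torch.randn(B, H, L, d)
idx = torch.randint(L, (B, H, L, K))            # per-query selected key ids

# Selecting K keys per query materializes a [B, H, L, K, d] copy instead of
# streaming through a fused kernel that never builds this tensor.
k_exp = k.unsqueeze(2).expand(B, H, L, L, d)    # broadcast view, no copy yet
idx_exp = idx.unsqueeze(-1).expand(B, H, L, K, d)
k_sel = torch.gather(k_exp, 3, idx_exp)         # the copy happens here
print(k_sel.shape, f"{k_sel.numel() * k_sel.element_size() / 2**20:.0f} MiB")
```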

## Recommended Use

Use this repo for:

- Reproducing the clean d128 block-causal result.
- Inspecting search projection checkpoints.
- Comparing learned search retrieval against raw-QK and Quest-style page retrieval.
- Building follow-up experiments such as dynamic-index insertion or all-layer substitution.

Do not use this repo as:

- A drop-in accelerated inference engine.
- Evidence that sparse attention beats full attention on clean methodology.
- A complete comparison against all sparse-attention baselines.

## Next Experiments

The most important follow-ups are:

1. Dynamic-index demonstration during long generation.
2. Multi-seed confidence intervals for block-causal d128.
3. LongBench/RULER/needle task evaluation.
4. All-layer substitution run.
5. GPU-resident retrieval and decode-mode KV-cache integration.

## Citation / Attribution

This is an in-progress research artifact. If you use it, cite the GitHub repo and this Hugging Face checkpoint repository.

Source: https://github.com/unixsysdev/ann-sparseattention
|