adaamko committed on
Commit 37964d9 · verified · 1 Parent(s): 2a03adb

Update model card: 2-decimal metrics, 27 tool types, qualitative examples, match paper

Files changed (1):
  1. README.md +95 -170

README.md CHANGED
@@ -2,95 +2,91 @@
  license: apache-2.0
  language:
  - en
- base_model: Qwen/Qwen3.5-2B
  tags:
- - tool-output-pruning
- - context-engineering
- - context-pruning
- - code-agent
- - squeez
- - qwen3.5
- pipeline_tag: text-generation
- library_name: transformers
  datasets:
  - KRLabsOrg/tool-output-extraction-swebench
  ---

- <p align="center">
- <img src="https://github.com/KRLabsOrg/squeez/blob/main/assets/squeez_mascot.png?raw=true" alt="Squeez" width="250"/>
- <br><em>Squeeze out the juice, leave the pulp behind.</em>
- </p>
-
  # Squeez-2B

- LLM coding agents spend 80-95% of their context window on irrelevant tool output. Squeez filters it down to the lines that actually matter, compressing tool output by ~91% while keeping 86% of the relevant information.
-
- ## What is Squeez?
-
- A tool output pruner for coding agents. When an agent runs a tool (pytest, grep, git log, npm build, kubectl, etc.), the output is often hundreds of lines but only a handful matter for the current task. Squeez sits between the tool and the agent's context window:

  ```
  Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
  ```

- Existing context pruning tools ([SWE-Pruner](https://github.com/Ayanami1314/swe-pruner), [Zilliz Semantic Highlight](https://huggingface.co/zilliz/semantic-highlight-bilingual-v1), [Provence](https://github.com/hotchpotch/open_provence)) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps).

- This model is [Qwen 3.5 2B](https://huggingface.co/Qwen/Qwen3.5-2B) fine-tuned to extract verbatim relevant lines from tool output given a task-specific query. It's trained specifically on 14 types of tool output from real SWE-bench workflows.

- - 2B parameters, runs on a single GPU, serves via vLLM
- - Outperforms Qwen 3.5 35B A3B zero-shot by +13% Span F1
- - Returns verbatim lines only, no rewriting or summarization
- - Works as a CLI pipe, Python library, or vLLM server

- ## Evaluation

- Evaluated on 617 held-out test samples from SWE-bench repositories, across 14 tool types:

- | Model | Precision | Recall | F1 | Compression |
- |-------|-----------|--------|------|-------------|
- | **Squeez-2B** | **0.8043** | **0.8624** | **0.7895** | 0.9150 |
- | Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
- | Kimi K2 (zero-shot) | 0.6128 | 0.5286 | 0.5344 | 0.9425 |
- | Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
- | BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
- | First-N (10%) | 0.0741 | 0.1445 | 0.0798 | 0.9055 |
- | Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
- | Last-N (10%) | 0.0496 | 0.0503 | 0.0407 | 0.9130 |

- Span-level precision, recall, and F1 measure strict line-level set overlap between predicted and gold relevant lines. Compression is the fraction of input removed.

  ## Quick Start

- ### With vLLM (recommended)

  ```bash
- # Start the server
- pip install vllm
- vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
-
- # Use from squeez CLI
  pip install squeez
  export SQUEEZ_SERVER_URL=http://localhost:8000/v1
- cat output.txt | squeez "find the bug"

- # Or pipe directly
- python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
  ```

- vLLM gives you batched inference, continuous batching, and high throughput — ideal when multiple agents or tools are running concurrently.

- ### With squeez (local, no server)

- ```bash
- pip install squeez
-
- # Downloads and runs the model locally (no GPU server needed)
- squeez "Find the failing traceback block" --input-file output.txt
- ```

- > **Note:** Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM — the model stays warm in memory.

- ### With transformers

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -114,13 +110,12 @@ messages = [
  "Do not rewrite, summarize, or invent lines."
  )},
  {"role": "user", "content": (
- "<query>\nFix the failing authentication test\n</query>\n"
  "<tool_output>\n"
  "PASSED tests/test_login.py::test_valid_credentials\n"
  "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
  "PASSED tests/test_login.py::test_logout\n"
- "PASSED tests/test_login.py::test_rate_limiting\n"
- "\n</tool_output>"
  )},
  ]

@@ -132,144 +127,74 @@ with torch.no_grad():

  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
  print(response)
  ```

- **Output:**
- ```xml
- <relevant_lines>
- FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
- </relevant_lines>
- ```
-
- ### Python API (with squeez)
-
- ```python
- from squeez.inference.extractor import ToolOutputExtractor
-
- # Loads this model locally
- extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")
-
- # Or connect to a vLLM server
- extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")
-
- filtered = extractor.extract(
- task="Find the referer validation block",
- tool_output=raw_output,
- )
- print(filtered)
- ```
-
- ## Input / Output Format
-
- **Input** — chat format with system prompt:

- ```
- System: You prune verbose tool output for a coding agent. Given a focused
- extraction query and one tool output, return only the smallest verbatim
- evidence block(s) the agent should read next. Return the kept text inside
- <relevant_lines> tags. Do not rewrite, summarize, or invent lines.
-
- User: <query>{task_description}</query>
- <tool_output>{raw_tool_output}</tool_output>
  ```
-
- **Output** — verbatim relevant lines wrapped in XML:
-
- ```xml
  <relevant_lines>
  {only the lines that matter, copied verbatim}
  </relevant_lines>
  ```

- If no lines are relevant, the model returns empty tags: `<relevant_lines>\n</relevant_lines>`.
-
- ## Supported Tool Types
-
- The model was trained on 14 tool types from SWE-bench repositories:
-
- | Tool type | Description | Example |
- |-----------|-------------|---------|
- | `test_output` | pytest / unittest output | Test failures, tracebacks, assertion errors |
- | `read_file` | File contents | Source code, config files |
- | `grep` | Search results | Pattern matches across files |
- | `git_diff` | Code changes | Diffs between commits or branches |
- | `git_log` | Commit history | Relevant commits |
- | `git_blame` | Line-level attribution | Who changed what |
- | `ls` | Directory listings | File structure |
- | `python` | Python REPL output | Script output, errors |
- | `curl` | HTTP responses | API responses, documentation |
- | `build_output` | Build logs | Compilation errors, warnings |
- | `lint_output` | Linter output | Style/type violations |
- | `pip_install` | Package manager output | Dependency errors |
- | `type_check` | Type checker output | mypy/pyright errors |
- | `coverage` | Coverage reports | Uncovered lines |
-
- ## Training Details

- | Parameter | Value |
- |-----------|-------|
- | Base model | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) |
- | Fine-tuning method | LoRA (r=16, alpha=32) via [Unsloth](https://github.com/unslothai/unsloth) |
- | Training data | Squeez v3 — 10,508 samples from [SWE-bench](https://swe-bench.github.io/) |
- | Epochs | 3 (best checkpoint at epoch 1.5) |
- | Max sequence length | 16,384 tokens |
- | Learning rate | 2e-4 |
- | Batch size | 8 (effective 32 with 4x gradient accumulation) |
- | Warmup | 5% of steps |
- | Weight decay | 0.01 |
- | Checkpoint selection | Best validation Span F1 |

- ### Data generation

- Training data was generated by running 14 types of tool calls on SWE-bench repositories and using a teacher model to label the relevant lines. Each sample contains:
- - A focused extraction query (what the agent needs to find)
- - Raw tool output (as the agent would see it)
- - Gold relevant lines (the minimal set the agent should read)

- Dataset: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)

- ## Limitations

- - Trained primarily on Python/SWE-bench data; works best on software engineering tool output, though the prompt format generalizes to other domains
- - Not designed for general-purpose text summarization or question answering
- - Very short outputs (<5 lines) may be returned unchanged
- - Max input length is 16,384 tokens — longer outputs should be chunked

- ## Use with coding agents

- Add to your agent's system instructions (e.g. `CLAUDE.md` for Claude Code):

- ```
- Always pipe shell commands through squeez and tell it exactly what you want to know.
-
- Examples:
- - `bun test 2>&1 | squeez "did the tests pass?"`
- - `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- - `cat src/auth/middleware.py | squeez "find the referer validation logic"`
-
- Do NOT use squeez when:
- - You need exact, uncompressed output (e.g. writing a patch)
- - The command is interactive
- ```

  ## Citation

  ```bibtex
- @software{kovacs2026squeez,
- title={Squeez: Compressing Tool Output for LLM Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
  }
  ```
-
- ## License
-
- Apache 2.0
-
- ## Acknowledgments
-
- - [Qwen](https://huggingface.co/Qwen) for the Qwen 3.5 2B base model
- - [Unsloth](https://github.com/unslothai/unsloth) for efficient LoRA training
- - [SWE-bench](https://swe-bench.github.io/) for the evaluation framework and source repositories
- - [Provence](https://arxiv.org/abs/2501.16214) and [SWE-Pruner](https://github.com/ayanami-kitasan/SWE-Pruner) for inspiration on context pruning approaches
  license: apache-2.0
  language:
  - en
  tags:
+ - code
+ - tool-output
+ - pruning
+ - coding-agents
+ - extraction
  datasets:
  - KRLabsOrg/tool-output-extraction-swebench
+ base_model: Qwen/Qwen3.5-2B
+ pipeline_tag: text-generation
  ---

  # Squeez-2B

+ **Squeez-2B** is a 2B parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

  ```
  Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
  ```

+ - Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
+ - Returns verbatim lines only (no rewriting or summarization)
+ - Works as a CLI pipe, Python library, or vLLM server
+ - Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs
+
+ **Resources:** [Paper (coming soon)]() | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

+ ## Results

+ Evaluated on 618 manually curated held-out examples spanning 27 tool types.

+ | Model | Prec. | Recall | F1 | Compression |
+ |-------|-------|--------|-----|-------------|
+ | **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
+ | Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
+ | Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
+ | Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

+ The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.

+ ### Qualitative patterns

+ | Pattern | Example | Squeez-2B | Baseline failure |
+ |---------|---------|-----------|-----------------|
+ | Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
+ | Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
+ | Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
+ | Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error plus nearby noise | Qwen 35B misses the Dockerfile error entirely |

+ On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.

  ## Quick Start

+ ### CLI (recommended)

  ```bash
  pip install squeez
+
+ # With vLLM server
+ vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
  export SQUEEZ_SERVER_URL=http://localhost:8000/v1
+
+ pytest -q 2>&1 | squeez "find the failure block"
+ git log --oneline -50 | squeez "find the commit that changed CSRF handling"
+ cat src/auth/middleware.py | squeez "find the referer validation logic"
  ```

+ ### Python API

+ ```python
+ from squeez.inference.extractor import ToolOutputExtractor
+
+ # vLLM server
+ extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")
+
+ # Or local
+ extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")
+
+ filtered = extractor.extract(
+ task="Find the failing test block",
+ tool_output=raw_output,
+ )
+ ```

+ ### With transformers directly

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  "Do not rewrite, summarize, or invent lines."
  )},
  {"role": "user", "content": (
+ "<query>\nFind the failing authentication test\n</query>\n"
  "<tool_output>\n"
  "PASSED tests/test_login.py::test_valid_credentials\n"
  "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
  "PASSED tests/test_login.py::test_logout\n"
+ "</tool_output>"
  )},
  ]

  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
  print(response)
+ # <relevant_lines>
+ # FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
+ # </relevant_lines>
  ```

+ ## Input/Output Format

+ **Input** — Chat messages with system prompt:
+ - System: extraction instructions (see above)
+ - User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`
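The user message can be assembled with a small helper. This is an illustrative sketch only (the `build_user_message` name is ours, not part of the squeez API); the tag layout follows the transformers example above:

```python
def build_user_message(task: str, raw_output: str) -> str:
    # Wrap the query and raw tool output in the newline-delimited tags
    # the model was trained on (see the format description above).
    return (
        f"<query>\n{task}\n</query>\n"
        f"<tool_output>\n{raw_output}\n</tool_output>"
    )
```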
 
 

+ **Output** — Verbatim lines in XML tags:

  ```
  <relevant_lines>
  {only the lines that matter, copied verbatim}
  </relevant_lines>
  ```
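Downstream code then recovers the kept lines from the tagged response. A minimal sketch (the `parse_relevant_lines` helper is ours, not part of the squeez API); an empty result corresponds to the model's empty-tags output on negative examples:

```python
import re

def parse_relevant_lines(response: str) -> str:
    # Extract the verbatim block between <relevant_lines> tags.
    # Returns "" when the model kept nothing (empty-tags output).
    m = re.search(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", response, re.DOTALL)
    return m.group(1) if m else ""
```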
 
+ ## Supported Tool Types (27)

+ **SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

+ **Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

+ ## Training Details

+ | Parameter | Value |
+ |-----------|-------|
+ | **Base model** | Qwen/Qwen3.5-2B |
+ | **Method** | LoRA (r=16, alpha=32) via Unsloth |
+ | **Training data** | 10,508 examples (SWE-bench + synthetic) |
+ | **Epochs** | 3 |
+ | **Max sequence length** | 20,000 tokens |
+ | **Learning rate** | 2e-4 |
+ | **Batch size** | 8 (effective 32 with 4x gradient accumulation) |
+ | **Hardware** | Single NVIDIA A100 80GB |
+ | **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |

+ ## Usage with Coding Agents

+ Add to your `CLAUDE.md` or agent system prompt:

+ ```
+ When you invoke a shell command, pipe it through `squeez` and describe what you need.
+ Examples:
+ - bun test 2>&1 | squeez "did the tests pass?"
+ - git log --oneline -50 | squeez "find the commit that broke CSRF"
+ - cat src/auth/middleware.py | squeez "find the referer validation logic"
+ ```

+ ## Limitations

+ - Best on software engineering tool output; not designed for general-purpose summarization
+ - Synthetic data generated by `openai/gpt-oss-120b` may not fully reflect real-world distributions for all ecosystems
+ - Evaluates single tool observations, not full agent trajectories
+ - Max input: 20,000 tokens (training length); can be extended at serving time
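For outputs beyond the input limit, one workable pattern is to split the text into line-aligned chunks, prune each chunk independently, and concatenate the kept lines. The sketch below is an assumption on our part, not part of squeez: `chunk_lines` is a hypothetical helper, and its character budget is only a crude stand-in for a real token count.

```python
def chunk_lines(text: str, max_chars: int = 60_000) -> list[str]:
    # Split tool output into line-aligned chunks, each under a character
    # budget (a rough proxy for the model's token limit).
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Each chunk would then be pruned independently, e.g.:
# kept = "\n".join(extractor.extract(task=q, tool_output=c) for c in chunks)
```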
186
 
187
+ ## License
 
 
 
188
 
189
+ Apache 2.0
 
 
 
190
 
191
  ## Citation

  ```bibtex
+ @misc{kovacs2026squeez,
+ title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
  }
  ```