adaamko committed on
Commit 37964d9 · verified · 1 Parent(s): 2a03adb

Update model card: 2-decimal metrics, 27 tool types, qualitative examples, match paper

Files changed (1):
  1. README.md +95 -170

README.md CHANGED
@@ -2,95 +2,91 @@
  license: apache-2.0
  language:
  - en
- base_model: Qwen/Qwen3.5-2B
  tags:
- - tool-output-pruning
- - context-engineering
- - context-pruning
- - code-agent
- - squeez
- - qwen3.5
- pipeline_tag: text-generation
- library_name: transformers
  datasets:
  - KRLabsOrg/tool-output-extraction-swebench
  ---

- <p align="center">
- <img src="https://github.com/KRLabsOrg/squeez/blob/main/assets/squeez_mascot.png?raw=true" alt="Squeez" width="250"/>
- <br><em>Squeeze out the juice, leave the pulp behind.</em>
- </p>
-
  # Squeez-2B

- LLM coding agents spend 80-95% of their context window on irrelevant tool output. Squeez filters it down to the lines that actually matter, compressing tool output by ~91% while keeping 86% of the relevant information.
-
- ## What is Squeez?
-
- A tool output pruner for coding agents. When an agent runs a tool (pytest, grep, git log, npm build, kubectl, etc.), the output is often hundreds of lines but only a handful matter for the current task. Squeez sits between the tool and the agent's context window:

  ```
  Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
  ```

- Existing context pruning tools ([SWE-Pruner](https://github.com/Ayanami1314/swe-pruner), [Zilliz Semantic Highlight](https://huggingface.co/zilliz/semantic-highlight-bilingual-v1), [Provence](https://github.com/hotchpotch/open_provence)) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps).

- This model is [Qwen 3.5 2B](https://huggingface.co/Qwen/Qwen3.5-2B) fine-tuned to extract verbatim relevant lines from tool output given a task-specific query. It's trained specifically on 14 types of tool output from real SWE-bench workflows.

- - 2B parameters, runs on a single GPU, serves via vLLM
- - Outperforms Qwen 3.5 35B A3B zero-shot by +13% Span F1
- - Returns verbatim lines only, no rewriting or summarization
- - Works as a CLI pipe, Python library, or vLLM server

- ## Evaluation

- Evaluated on 617 held-out test samples from SWE-bench repositories, across 14 tool types:

- | Model | Precision | Recall | F1 | Compression |
- |-------|-----------|--------|------|-------------|
- | **Squeez-2B** | **0.8043** | **0.8624** | **0.7895** | 0.9150 |
- | Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
- | Kimi K2 (zero-shot) | 0.6128 | 0.5286 | 0.5344 | 0.9425 |
- | Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
- | BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
- | First-N (10%) | 0.0741 | 0.1445 | 0.0798 | 0.9055 |
- | Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
- | Last-N (10%) | 0.0496 | 0.0503 | 0.0407 | 0.9130 |

- Span-level precision, recall, and F1 measure strict line-level set overlap between predicted and gold relevant lines. Compression is the fraction of input removed.

  ## Quick Start

- ### With vLLM (recommended)

  ```bash
- # Start the server
- pip install vllm
- vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
-
- # Use from squeez CLI
  pip install squeez
  export SQUEEZ_SERVER_URL=http://localhost:8000/v1
- cat output.txt | squeez "find the bug"

- # Or pipe directly
- python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
  ```

- vLLM gives you batched inference, continuous batching, and high throughput — ideal when multiple agents or tools are running concurrently.

- ### With squeez (local, no server)

- ```bash
- pip install squeez
-
- # Downloads and runs the model locally (no GPU server needed)
- squeez "Find the failing traceback block" --input-file output.txt
- ```

- > **Note:** Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM — the model stays warm in memory.

- ### With transformers

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -114,13 +110,12 @@ messages = [
  "Do not rewrite, summarize, or invent lines."
  )},
  {"role": "user", "content": (
- "<query>\nFix the failing authentication test\n</query>\n"
  "<tool_output>\n"
  "PASSED tests/test_login.py::test_valid_credentials\n"
  "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
  "PASSED tests/test_login.py::test_logout\n"
- "PASSED tests/test_login.py::test_rate_limiting\n"
- "\n</tool_output>"
  )},
  ]

@@ -132,144 +127,74 @@ with torch.no_grad():

  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
  print(response)
  ```

- **Output:**
- ```xml
- <relevant_lines>
- FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
- </relevant_lines>
- ```
-
- ### Python API (with squeez)
-
- ```python
- from squeez.inference.extractor import ToolOutputExtractor
-
- # Loads this model locally
- extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")
-
- # Or connect to a vLLM server
- extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")
-
- filtered = extractor.extract(
- task="Find the referer validation block",
- tool_output=raw_output,
- )
- print(filtered)
- ```
-
- ## Input / Output Format
-
- **Input** — chat format with system prompt:

- ```
- System: You prune verbose tool output for a coding agent. Given a focused
- extraction query and one tool output, return only the smallest verbatim
- evidence block(s) the agent should read next. Return the kept text inside
- <relevant_lines> tags. Do not rewrite, summarize, or invent lines.
-
- User: <query>{task_description}</query>
- <tool_output>{raw_tool_output}</tool_output>
  ```
-
- **Output** — verbatim relevant lines wrapped in XML:
-
- ```xml
  <relevant_lines>
  {only the lines that matter, copied verbatim}
  </relevant_lines>
  ```

- If no lines are relevant, the model returns empty tags: `<relevant_lines>\n</relevant_lines>`.
-
- ## Supported Tool Types
-
- The model was trained on 14 tool types from SWE-bench repositories:
-
- | Tool type | Description | Example |
- |-----------|-------------|---------|
- | `test_output` | pytest / unittest output | Test failures, tracebacks, assertion errors |
- | `read_file` | File contents | Source code, config files |
- | `grep` | Search results | Pattern matches across files |
- | `git_diff` | Code changes | Diffs between commits or branches |
- | `git_log` | Commit history | Relevant commits |
- | `git_blame` | Line-level attribution | Who changed what |
- | `ls` | Directory listings | File structure |
- | `python` | Python REPL output | Script output, errors |
- | `curl` | HTTP responses | API responses, documentation |
- | `build_output` | Build logs | Compilation errors, warnings |
- | `lint_output` | Linter output | Style/type violations |
- | `pip_install` | Package manager output | Dependency errors |
- | `type_check` | Type checker output | mypy/pyright errors |
- | `coverage` | Coverage reports | Uncovered lines |
-
- ## Training Details

- | Parameter | Value |
- |-----------|-------|
- | Base model | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) |
- | Fine-tuning method | LoRA (r=16, alpha=32) via [Unsloth](https://github.com/unslothai/unsloth) |
- | Training data | Squeez v3 — 10,508 samples from [SWE-bench](https://swe-bench.github.io/) |
- | Epochs | 3 (best checkpoint at epoch 1.5) |
- | Max sequence length | 16,384 tokens |
- | Learning rate | 2e-4 |
- | Batch size | 8 (effective 32 with 4x gradient accumulation) |
- | Warmup | 5% of steps |
- | Weight decay | 0.01 |
- | Checkpoint selection | Best validation Span F1 |

- ### Data generation

- Training data was generated by running 14 types of tool calls on SWE-bench repositories and using a teacher model to label the relevant lines. Each sample contains:
- - A focused extraction query (what the agent needs to find)
- - Raw tool output (as the agent would see it)
- - Gold relevant lines (the minimal set the agent should read)

- Dataset: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)

- ## Limitations

- - Trained primarily on Python/SWE-bench data; works best on software engineering tool output, though the prompt format generalizes to other domains
- - Not designed for general-purpose text summarization or question answering
- - Very short outputs (<5 lines) may be returned unchanged
- - Max input length is 16,384 tokens — longer outputs should be chunked

- ## Use with coding agents

- Add to your agent's system instructions (e.g. `CLAUDE.md` for Claude Code):

- ```
- Always pipe shell commands through squeez and tell it exactly what you want to know.
-
- Examples:
- - `bun test 2>&1 | squeez "did the tests pass?"`
- - `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- - `cat src/auth/middleware.py | squeez "find the referer validation logic"`
-
- Do NOT use squeez when:
- - You need exact, uncompressed output (e.g. writing a patch)
- - The command is interactive
- ```

  ## Citation

  ```bibtex
- @software{kovacs2026squeez,
- title={Squeez: Compressing Tool Output for LLM Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
  }
  ```
-
- ## License
-
- Apache 2.0
-
- ## Acknowledgments
-
- - [Qwen](https://huggingface.co/Qwen) for the Qwen 3.5 2B base model
- - [Unsloth](https://github.com/unslothai/unsloth) for efficient LoRA training
- - [SWE-bench](https://swe-bench.github.io/) for the evaluation framework and source repositories
- - [Provence](https://arxiv.org/abs/2501.16214) and [SWE-Pruner](https://github.com/ayanami-kitasan/SWE-Pruner) for inspiration on context pruning approaches
  license: apache-2.0
  language:
  - en
  tags:
+ - code
+ - tool-output
+ - pruning
+ - coding-agents
+ - extraction
  datasets:
  - KRLabsOrg/tool-output-extraction-swebench
+ base_model: Qwen/Qwen3.5-2B
+ pipeline_tag: text-generation
  ---

  # Squeez-2B

+ **Squeez-2B** is a 2B parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

  ```
  Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
  ```

+ - Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
+ - Returns verbatim lines only (no rewriting or summarization)
+ - Works as a CLI pipe, Python library, or vLLM server
+ - Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs
+
+ **Resources:** [Paper (coming soon)]() | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

+ ## Results

+ Evaluated on 618 manually curated held-out examples spanning 27 tool types.

+ | Model | Prec. | Recall | F1 | Compression |
+ |-------|-------|--------|-----|-------------|
+ | **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
+ | Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
+ | Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
+ | Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

+ The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.

+ ### Qualitative patterns

+ | Pattern | Example | Squeez-2B | Baseline failure |
+ |---------|---------|-----------|-----------------|
+ | Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
+ | Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
+ | Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
+ | Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error plus nearby noise | Qwen 35B misses the Dockerfile error entirely |

+ On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.

  ## Quick Start

+ ### CLI (recommended)

  ```bash
  pip install squeez
+
+ # With vLLM server
+ vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
  export SQUEEZ_SERVER_URL=http://localhost:8000/v1
+
+ pytest -q 2>&1 | squeez "find the failure block"
+ git log --oneline -50 | squeez "find the commit that changed CSRF handling"
+ cat src/auth/middleware.py | squeez "find the referer validation logic"
  ```

+ ### Python API

+ ```python
+ from squeez.inference.extractor import ToolOutputExtractor
+
+ # vLLM server
+ extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")
+
+ # Or local
+ extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")
+
+ filtered = extractor.extract(
+ task="Find the failing test block",
+ tool_output=raw_output,
+ )
+ ```

+ ### With transformers directly

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  "Do not rewrite, summarize, or invent lines."
  )},
  {"role": "user", "content": (
+ "<query>\nFind the failing authentication test\n</query>\n"
  "<tool_output>\n"
  "PASSED tests/test_login.py::test_valid_credentials\n"
  "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
  "PASSED tests/test_login.py::test_logout\n"
+ "</tool_output>"
  )},
  ]

  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
  print(response)
+ # <relevant_lines>
+ # FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
+ # </relevant_lines>
  ```

+ ## Input/Output Format

+ **Input** — Chat messages with system prompt:
+ - System: extraction instructions (see above)
+ - User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`
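The user message can be assembled with a small helper. This is an illustrative sketch only (the `build_user_message` name is ours, not part of the squeez API); the tag layout follows the transformers example above:

```python
def build_user_message(task: str, raw_output: str) -> str:
    # Wrap the query and raw tool output in the newline-delimited tags
    # the model was trained on (see the format description above).
    return (
        f"<query>\n{task}\n</query>\n"
        f"<tool_output>\n{raw_output}\n</tool_output>"
    )
```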
 
 

+ **Output** — Verbatim lines in XML tags:

  ```
  <relevant_lines>
  {only the lines that matter, copied verbatim}
  </relevant_lines>
  ```
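Downstream code then recovers the kept lines from the tagged response. A minimal sketch (the `parse_relevant_lines` helper is ours, not part of the squeez API); an empty result corresponds to the model's empty-tags output on negative examples:

```python
import re

def parse_relevant_lines(response: str) -> str:
    # Extract the verbatim block between <relevant_lines> tags.
    # Returns "" when the model kept nothing (empty-tags output).
    m = re.search(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", response, re.DOTALL)
    return m.group(1) if m else ""
```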
 
+ ## Supported Tool Types (27)

+ **SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

+ **Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

+ ## Training Details

+ | Parameter | Value |
+ |-----------|-------|
+ | **Base model** | Qwen/Qwen3.5-2B |
+ | **Method** | LoRA (r=16, alpha=32) via Unsloth |
+ | **Training data** | 10,508 examples (SWE-bench + synthetic) |
+ | **Epochs** | 3 |
+ | **Max sequence length** | 20,000 tokens |
+ | **Learning rate** | 2e-4 |
+ | **Batch size** | 8 (effective 32 with 4x gradient accumulation) |
+ | **Hardware** | Single NVIDIA A100 80GB |
+ | **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |

+ ## Usage with Coding Agents

+ Add to your `CLAUDE.md` or agent system prompt:

+ ```
+ When you invoke a shell command, pipe it through `squeez` and describe what you need.
+ Examples:
+ - bun test 2>&1 | squeez "did the tests pass?"
+ - git log --oneline -50 | squeez "find the commit that broke CSRF"
+ - cat src/auth/middleware.py | squeez "find the referer validation logic"
+ ```

+ ## Limitations

+ - Best on software engineering tool output; not designed for general-purpose summarization
+ - Synthetic data generated by `openai/gpt-oss-120b` may not fully reflect real-world distributions for all ecosystems
+ - Evaluates single tool observations, not full agent trajectories
+ - Max input: 20,000 tokens (training length); can be extended at serving time
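For outputs beyond the input limit, one workable pattern is to split the text into line-aligned chunks, prune each chunk independently, and concatenate the kept lines. The sketch below is an assumption on our part, not part of squeez: `chunk_lines` is a hypothetical helper, and its character budget is only a crude stand-in for a real token count.

```python
def chunk_lines(text: str, max_chars: int = 60_000) -> list[str]:
    # Split tool output into line-aligned chunks, each under a character
    # budget (a rough proxy for the model's token limit).
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Each chunk would then be pruned independently, e.g.:
# kept = "\n".join(extractor.extract(task=q, tool_output=c) for c in chunks)
```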
186
 
187
+ ## License
 
 
 
188
 
189
+ Apache 2.0
 
 
 
190
 
191
  ## Citation

  ```bibtex
+ @misc{kovacs2026squeez,
+ title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
  }
  ```