Update model card: 2-decimal metrics, 27 tool types, qualitative examples, match paper
README.md (changed):
---
license: apache-2.0
language:
- en
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
datasets:
- KRLabsOrg/tool-output-extraction-swebench
base_model: Qwen/Qwen3.5-2B
pipeline_tag: text-generation
---

# Squeez-2B

**Squeez-2B** is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```

- Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
- Returns verbatim lines only (no rewriting or summarization)
- Works as a CLI pipe, Python library, or vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs

**Resources:** Paper (coming soon) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

## Results

Evaluated on 618 manually curated held-out examples spanning 27 tool types.

| Model | Prec. | Recall | F1 | Compression |
|-------|-------|--------|-----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.53 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.41 | 0.82 |

The fine-tuned 2B model is also the most precise system in the comparison, indicating that it has learned a tool-specific extraction policy rather than relying on generic instruction following.

### Qualitative patterns

| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|------------------|
| Precise selection | `git_log`, 21 lines; find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines; two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines; no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines; Dockerfile error | Finds the right error plus nearby noise | Qwen 35B misses the Dockerfile error entirely |

On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time; Qwen 35B returns empty only 7% of the time.
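
Under a line-level reading of these metrics (an assumption for illustration; the paper defines the exact span matching), per-example precision, recall, F1, and compression can be sketched as:

```python
def line_metrics(predicted: str, gold: str, raw: str) -> dict:
    """Line-level precision/recall/F1 of the kept lines, plus compression
    (fraction of input lines removed). Assumes exact line matches."""
    pred = set(predicted.splitlines()) - {""}
    ref = set(gold.splitlines()) - {""}
    total = max(len(raw.splitlines()), 1)
    tp = len(pred & ref)
    # An empty prediction is perfect when the gold set is also empty.
    precision = tp / len(pred) if pred else (1.0 if not ref else 0.0)
    recall = tp / len(ref) if ref else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(pred) / total
    return {"precision": precision, "recall": recall, "f1": f1, "compression": compression}
```

Note that corpus-level numbers are averages of per-example scores, so the F1 column need not equal the F1 of the averaged precision and recall.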

## Quick Start

### CLI (recommended)

```bash
pip install squeez

# With a vLLM server
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1

pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```

### Python API

```python
from squeez.inference.extractor import ToolOutputExtractor

# Connect to a running vLLM server...
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# ...or load the model locally
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)
```

### With transformers directly

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": (
        "Given a query and a raw tool output, return only the "
        "evidence block(s) the agent should read next. Return the kept text inside "
        "<relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFind the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "</tool_output>"
    )},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# <relevant_lines>
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
# </relevant_lines>
```

## Input/Output Format

**Input:** chat messages with a system prompt:

- System: extraction instructions (see above)
- User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`

**Output:** verbatim lines in XML tags:

```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```
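
A minimal way to produce and consume this format (hypothetical helpers for illustration, not part of the `squeez` package):

```python
import re


def build_user_message(task: str, raw_output: str) -> str:
    """Format the query and raw tool output as the model expects."""
    return f"<query>\n{task}\n</query>\n<tool_output>\n{raw_output}\n</tool_output>"


def parse_relevant_lines(response: str) -> str:
    """Extract the verbatim kept lines; an empty string means 'no relevant evidence'."""
    match = re.search(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", response, re.DOTALL)
    return match.group(1) if match else ""
```

Because the model copies input lines verbatim, the parsed result can be checked against the raw output to reject any hallucinated lines before they reach the agent's context.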

## Supported Tool Types (27)

**SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

**Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

## Training Details

| Parameter | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) |
| **Method** | LoRA (r=16, alpha=32) via [Unsloth](https://github.com/unslothai/unsloth) |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |

## Usage with Coding Agents

Add to your `CLAUDE.md` or agent system prompt:

```
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```

## Limitations

- Best on software engineering tool output; not designed for general-purpose summarization
- Synthetic data was generated by `openai/gpt-oss-120b` and may not fully reflect real-world distributions for all ecosystems
- Evaluates single tool observations, not full agent trajectories
- Max input: 20,000 tokens (the training length); can be extended at serving time
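
For outputs beyond that limit, one option is a chunk-and-merge wrapper (a sketch; it splits by lines rather than tokens, and the `extract` callable stands in for whatever extraction call you use, e.g. a `ToolOutputExtractor.extract` wrapper):

```python
from typing import Callable, List


def squeeze_long_output(
    task: str,
    raw_output: str,
    extract: Callable[[str, str], str],  # (task, chunk) -> kept lines
    max_lines: int = 400,
    overlap: int = 20,
) -> str:
    """Split a long tool output into overlapping line chunks, extract from
    each chunk, and concatenate the non-empty results, deduplicating lines
    that appear in more than one overlapping chunk."""
    lines = raw_output.splitlines()
    step = max_lines - overlap
    kept: List[str] = []
    seen = set()
    for start in range(0, max(len(lines), 1), step):
        chunk = "\n".join(lines[start:start + max_lines])
        for line in extract(task, chunk).splitlines():
            if line and line not in seen:
                seen.add(line)
                kept.append(line)
    return "\n".join(kept)
```

The overlap guards against an evidence block being cut at a chunk boundary; tune `max_lines` so a chunk stays under the serving context after tokenization.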

## License

Apache 2.0

## Citation

```bibtex
@misc{kovacs2026squeez,
  title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
}
```