Deploy FlakyGym UI + inference updates (minimal upload)
Files changed:
- .gitignore +32 -0
- .openenv_push_ignore +26 -0
- Dockerfile +0 -1
- GRADING.md +31 -2
- README.md +31 -9
- env/environment.py +68 -1
- inference.py +619 -81
- inference_debug.py +207 -3
- server/app.py +1 -1
- server/inference_runner.py +2 -2
- server/ui.py +260 -8
.gitignore ADDED
@@ -0,0 +1,32 @@
+# Python caches
+__pycache__/
+*.py[cod]
+*.so
+
+# Virtual environments
+.venv/
+venv/
+openenv/
+
+# Local editor / tooling state
+.vscode/
+.codex/
+.agents/
+
+# Secrets and local config
+.env
+.env.*
+*.local
+
+# Logs and runtime artifacts
+*.log
+outputs/
+
+# Build artifacts
+build/
+dist/
+*.egg-info/
+
+# Dataset/generated outputs
+dataset/__pycache__/
+dataset/py_tasks.csv
.openenv_push_ignore ADDED
@@ -0,0 +1,26 @@
+.git/
+.agents/
+.codex/
+.vscode/
+openenv/
+openenv/**
+.venv/
+.venv/**
+venv/
+venv/**
+__pycache__/
+**/__pycache__/
+*.pyc
+*.pyo
+*.pyd
+*.log
+*.tmp
+*.cache
+agent_trace.log
+debug_trace.log
+debug_trace2.log
+debug_trace3.log
+debug_trace4.log
+debug_trace5.log
+outputs/
+.pytest_cache/
Dockerfile CHANGED
@@ -14,5 +14,4 @@ COPY . .
 
 EXPOSE 8000
 
-ENV ENABLE_WEB_INTERFACE=true
 CMD ["python", "-m", "server.app"]
GRADING.md CHANGED
@@ -73,8 +73,25 @@ reward = progress
 - else -> `0.01`
 
 ### `search_code`
--
--
+- base reward:
+  - if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
+  - otherwise -> `0.01`
+- spam penalties (all apply, then summed and capped):
+  - repeated same normalized search pattern in episode:
+    - `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
+  - repeated same search context (same normalized pattern + same extracted top `.py` hit files):
+    - `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
+  - long search-only streak:
+    - `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
+  - total spam penalty cap: `min(sum_penalties, 0.35)`
+- final `search_code` progress:
+
+```text
+progress = max(-0.25, base_reward - spam_penalty)
+```
+
+- environment appends `WARNING:` text to tool output when penalties fire.
+- `consecutive_searches` resets on any non-`search_code` action.
 
 ### `run_test`
 - if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`

@@ -251,3 +268,15 @@ reward = clamp(0.05 + 0.0, 0, 1) = 0.05
 - Timeout does not invoke grader; it only ends the episode.
 - Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
 
+## 9) Inference-side controls (not grader formulas)
+
+`inference.py` now includes policy/runtime controls that do not change grader math directly but change agent behavior:
+
+- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
+- explicit loop warning prompt when no-progress/duplicate patterns are detected
+- duplicate `read_file` attempts are overridden to targeted `search_code`
+- conversation compaction controls:
+  - `--history-prune-start-step` (default `12`)
+  - `--history-window-turns` (default `4`)
+  - `--history-max-chars` (default `50000`)
+- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug
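For a concrete sense of the arithmetic, here is a minimal standalone sketch of the `search_code` reward math specified above. Parameter names are illustrative and the counters are passed in explicitly; the shipped implementation is the `_search_spam_penalty` method in the `env/environment.py` diff below, which tracks these counters per episode.

```python
# Sketch of the search_code reward math from GRADING.md above.
def search_code_progress(
    base_reward: float,         # 0.04 for flaky-signal queries, else 0.01
    pattern_count: int,         # times this normalized pattern was searched
    context_count: int,         # times the same pattern + same top hits recurred
    consecutive_searches: int,  # current search-only streak length
) -> float:
    penalty = 0.0
    if pattern_count > 1:
        penalty += min(0.02 * (pattern_count - 1), 0.12)
    if context_count > 1:
        penalty += min(0.03 * (context_count - 1), 0.15)
    if consecutive_searches > 3:
        penalty += min(0.02 * (consecutive_searches - 3), 0.20)
    penalty = min(penalty, 0.35)  # total spam-penalty cap
    return max(-0.25, base_reward - penalty)

# Third identical search with identical hits during a 5-long search streak:
# 0.04 - (0.04 + 0.06 + 0.04) = -0.10
assert abs(search_code_progress(0.04, 3, 3, 5) + 0.10) < 1e-9
```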
README.md CHANGED
@@ -15,6 +15,8 @@ tags:
 
 OpenEnv-compatible RL environment for flaky-test investigation in real Python repos.
 
+Flaky tests are dangerous because they make CI results untrustworthy: real regressions can be ignored as "just flaky," while healthy code can fail randomly and block releases, wasting engineering time and eroding confidence in test signals. We are building this Gym-style RL environment so agents can practice flaky-test triage in realistic repositories, learn to separate true failures from nondeterministic noise, and generate faster, more reliable debugging and fix strategies at scale.
+
 ## Setup
 
 ```bash
@@ -58,14 +60,15 @@ curl -s http://localhost:8000/health
 
 ## Run Inference
 
-Recommended (OpenRouter):
+Recommended (HF Router/OpenRouter/OpenAI compatible):
 
 ```bash
-export …
-
-export …
+export HF_TOKEN=your_hf_token
+# optional:
+# export API_BASE_URL=https://router.huggingface.co/v1
+# export MODEL_NAME=openai/gpt-oss-120b:novita
 
-python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 5
+python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 2
 ```
 
 ### Run Inference From Space UI
@@ -73,22 +76,41 @@ python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 5
 When deployed, the Space homepage serves a UI at `/` (also `/web`) that starts
 `inference.py` in the background and streams logs live.
 
+UI defaults:
+- `episodes_per_task=1`
+- slider range up to `100`
+- live ETA estimator: `selected_tasks × episodes_per_task × 180s`
+- warning when ETA may exceed 20 minutes (hackathon guidance)
+
 ### `inference.py` flags
 
 | Flag | Type | Default | Description |
 |---|---|---|---|
 | `--dataset-path` | `str` | `dataset/py_tasks.csv` | Processed task CSV used by env |
-| `--episodes-per-task` | `int` | `…
+| `--episodes-per-task` | `int` | `2` | Episodes per selected task type |
 | `--task-types` | `str` | `classify,root_cause,fix_proposal` | Comma-separated task types |
 | `--max-steps` | `int` | `20` | Max steps per episode |
+| `--no-progress` | flag | `False` | Disable progress bars in non-compliance mode |
+| `--trace-agent` | flag | `False` | Print detailed action/model/tool trace |
+| `--trace-prompts` | flag | `False` | Include full prompts in trace |
+| `--trace-max-chars` | `int` | `2500` | Clip size for traced prompt/output blocks |
+| `--compliance-stdout` | flag | `True` | Strict `[START]/[STEP]/[END]` logs (default on) |
+| `--no-compliance-stdout` | flag | `False` | Switch to baseline summary/progress output |
 | `--benchmark-name` | `str` | `flakysleuth` | Label printed in `[START]` logs |
+| `--history-prune-start-step` | `int` | `12` | Start compacting history from this step |
+| `--history-window-turns` | `int` | `4` | Keep this many recent assistant/user turns on prune |
+| `--history-max-chars` | `int` | `50000` | Force prune when message history exceeds this size |
 
-…
+Detailed trace to log:
 ```bash
 python inference.py \
   --dataset-path dataset/py_tasks.csv \
-  --episodes-per-task …
-  --task-types classify,root_cause
+  --episodes-per-task 1 \
+  --task-types classify,root_cause \
+  --no-compliance-stdout \
+  --trace-agent \
+  --history-prune-start-step 12 \
+  --history-window-turns 4 > agent_trace.log 2>&1
 ```
 
 ## OpenEnv CLI
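As a quick check on the UI's ETA arithmetic above: 3 selected task types × 2 episodes each is budgeted at 3 × 2 × 180 s = 18 minutes, which is why the UI warns near the 20-minute guidance. Separately, for illustration, strict `--compliance-stdout` mode prints one `[START]` line, one `[STEP]` line per step, and one `[END]` line per episode; the format strings come from the `inference.py` diff below, while the values here are invented and intermediate steps are omitted:

```text
[START] task=classify env=flakysleuth model=openai/gpt-oss-120b:novita
[STEP] step=1 action={"action_type": "search_code", "argument": "sleep|random"} reward=0.04 done=false error=null
[END] success=true steps=2 score=0.85 rewards=0.04,0.85
```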
env/environment.py CHANGED
@@ -41,6 +41,9 @@ class FlakySleuthEnv:
         self.cumulative_progress = 0.0
         self.files_read: set[str] = set()
         self.episode_actions: list[FlakySleuthAction] = []
+        self.search_pattern_counts: dict[str, int] = {}
+        self.search_context_counts: dict[str, int] = {}
+        self.consecutive_searches = 0
 
     def reset(self) -> FlakySleuthObservation:
         if self.sandbox:
@@ -61,6 +64,9 @@ class FlakySleuthEnv:
         self.cumulative_progress = 0.0
         self.files_read = set()
         self.episode_actions = []
+        self.search_pattern_counts = {}
+        self.search_context_counts = {}
+        self.consecutive_searches = 0
 
         return self._make_obs()
 
@@ -153,6 +159,8 @@ class FlakySleuthEnv:
 
         progress = 0.0
         output = ""
+        if action.action_type != "search_code":
+            self.consecutive_searches = 0
 
         if action.action_type == "read_file":
            content = self.sandbox.read_file(action.argument)
@@ -168,8 +176,13 @@
             progress = self._file_relevance_reward(action.argument)
 
         elif action.action_type == "search_code":
+            self.consecutive_searches += 1
             output = self.sandbox.grep(action.argument)
-            …
+            base_progress = self._search_relevance_reward(action.argument)
+            spam_penalty, warnings = self._search_spam_penalty(action.argument, output)
+            progress = max(-0.25, base_progress - spam_penalty)
+            if warnings:
+                output = f"{output}\n\nWARNING: {' '.join(warnings)}"
 
         elif action.action_type == "run_test":
             output = self.sandbox.run_test(self.current_task.get("test_name", ""))
@@ -198,6 +211,60 @@
             return 0.04
         return 0.01
 
+    def _search_spam_penalty(self, pattern: str, output: str) -> tuple[float, list[str]]:
+        penalty = 0.0
+        warnings: list[str] = []
+
+        pattern_key = " ".join(pattern.lower().split())
+        if pattern_key:
+            pattern_count = self.search_pattern_counts.get(pattern_key, 0) + 1
+            self.search_pattern_counts[pattern_key] = pattern_count
+            if pattern_count > 1:
+                repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)
+                penalty += repeat_penalty
+                warnings.append(
+                    f"Repeated search pattern ({pattern_count}x) penalty={repeat_penalty:.2f}."
+                )
+
+        context_hits = self._extract_search_hits(output)
+        context_key = f"{pattern_key}::{','.join(context_hits)}"
+        context_count = self.search_context_counts.get(context_key, 0) + 1
+        self.search_context_counts[context_key] = context_count
+        if context_count > 1:
+            context_penalty = min(0.03 * (context_count - 1), 0.15)
+            penalty += context_penalty
+            warnings.append(
+                f"Same search context repeated ({context_count}x) penalty={context_penalty:.2f}."
+            )
+
+        if self.consecutive_searches > 3:
+            streak_penalty = min(0.02 * (self.consecutive_searches - 3), 0.20)
+            penalty += streak_penalty
+            warnings.append(
+                f"Search-only streak={self.consecutive_searches} penalty={streak_penalty:.2f}."
+            )
+
+        return min(penalty, 0.35), warnings
+
+    def _extract_search_hits(self, output: str) -> tuple[str, ...]:
+        files: list[str] = []
+        seen: set[str] = set()
+        for raw_line in output.splitlines():
+            line = raw_line.strip()
+            if not line or line.startswith("No matches found") or line.startswith("Search "):
+                continue
+            filepath = line.split(":", 1)[0].strip()
+            if filepath.startswith("./"):
+                filepath = filepath[2:]
+            if not filepath.endswith(".py"):
+                continue
+            if filepath not in seen:
+                seen.add(filepath)
+                files.append(filepath)
+            if len(files) >= 5:
+                break
+        return tuple(files)
+
     def _make_obs(self, tool_output: str | None = None) -> FlakySleuthObservation:
         if not self.current_task:
             raise RuntimeError("No current task available")
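To make the spam-penalty context key concrete, here is a standalone replica of `_extract_search_hits` from the diff above, run on invented grep-style output (the sample paths are illustrative):

```python
def extract_search_hits(output: str) -> tuple[str, ...]:
    # Standalone copy of FlakySleuthEnv._extract_search_hits above.
    files: list[str] = []
    seen: set[str] = set()
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("No matches found") or line.startswith("Search "):
            continue
        filepath = line.split(":", 1)[0].strip()
        if filepath.startswith("./"):
            filepath = filepath[2:]
        if not filepath.endswith(".py"):
            continue
        if filepath not in seen:
            seen.add(filepath)
            files.append(filepath)
        if len(files) >= 5:  # keep only the top five .py hits
            break
    return tuple(files)

# Invented grep output: duplicate and non-.py hits are dropped.
sample = (
    "./tests/test_io.py:12: time.sleep(0.5)\n"
    "./tests/test_io.py:40: time.sleep(1)\n"
    "./src/util.py:7: import time\n"
    "README.md:3: see sleep notes"
)
assert extract_search_hits(sample) == ("tests/test_io.py", "src/util.py")
```

The tuple is joined into the context key (`pattern::file1,file2`), so repeating the same normalized pattern against the same top hits is what triggers the context penalty.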
inference.py CHANGED
@@ -1,23 +1,44 @@
-"""FlakySleuth …
+"""FlakySleuth baseline inference script.
+
+Environment variables:
+  Preferred:
+    HF_TOKEN / HUGGINGFACE_HUB_TOKEN (or OPENROUTER_API_KEY / API_KEY)
+    API_BASE_URL (optional, defaults to https://openrouter.ai/api/v1 for router-style keys)
+    MODEL_NAME (optional, defaults to qwen/qwen3.6-plus:free on OpenRouter)
+
+  Optional fallback:
+    OPENAI_API_KEY
+    API_BASE_URL (defaults to https://api.openai.com/v1 when OpenAI key is used)
+    MODEL_NAME (defaults to gpt-4o-mini for OpenAI)
 """
 
 from __future__ import annotations
 
-import argparse
 import json
 import os
+import argparse
+import time
+from collections import defaultdict
+from pathlib import Path
 from typing import Any
 
 from openai import OpenAI
 
+try:
+    from tqdm import tqdm
+except Exception:  # pragma: no cover
+    tqdm = None
+
 from env.environment import FlakySleuthEnv
 from env.models import FlakySleuthAction, FlakySleuthObservation
 
-HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
 OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
 OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
+HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
+# Optional for environments created via from_docker_image(); kept for checklist parity.
+LOCAL_IMAGE_NAME = os.environ.get("LOCAL_IMAGE_NAME")
 RAW_API_KEY = os.environ.get("API_KEY")
-API_KEY = RAW_API_KEY or …
+API_KEY = RAW_API_KEY or OPENROUTER_API_KEY or OPENAI_API_KEY or HF_TOKEN or ""
 
 
 def _looks_like_openrouter_key(key: str | None) -> bool:
@@ -26,14 +47,19 @@ def _looks_like_openrouter_key(key: str | None) -> bool:
 
 DEFAULT_BASE_URL = (
     "https://router.huggingface.co/v1"
-    if (…)
+    if (
+        HF_TOKEN
+        and not RAW_API_KEY
+        and not OPENROUTER_API_KEY
+        and not OPENAI_API_KEY
+    )
     else (
-        …
-        …
-        …
-        …
-        …
-        …
+        "https://openrouter.ai/api/v1"
+        if (
+            (OPENROUTER_API_KEY and not RAW_API_KEY and not OPENAI_API_KEY)
+            or (_looks_like_openrouter_key(RAW_API_KEY) and not OPENAI_API_KEY)
+        )
+        else "https://api.openai.com/v1"
     )
 )
 API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)
@@ -41,13 +67,17 @@ API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)
 DEFAULT_MODEL = (
     "openai/gpt-oss-120b:novita"
     if API_BASE_URL.startswith("https://router.huggingface.co")
-    else (…)
+    else (
+        "qwen/qwen3.6-plus:free"
+        if API_BASE_URL.startswith("https://openrouter.ai")
+        else "gpt-4o-mini"
+    )
 )
 MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)
-…
-EPISODES_PER_TASK = …
+# Keep a conservative default to stay under common hackathon runtime limits.
+EPISODES_PER_TASK = 2
 MAX_STEPS = 20
-…
+MEMORY_MAX_CHARS = 900
 
 client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
 
@@ -81,34 +111,44 @@ Rules:
 """
 
 
-def …
+def _to_single_line(text: str) -> str:
     return " ".join(str(text).split())
 
 
-def …
-    print(f"[START] task={task} env={…
+def _compliance_log_start(task: str, benchmark: str, model: str) -> None:
+    print(f"[START] task={task} env={benchmark} model={model}", flush=True)
 
 
-def …
-    …
-    …
+def _compliance_log_step(
+    step: int,
+    action: str,
+    reward: float,
+    done: bool,
+    error: str | None,
+) -> None:
+    error_value = _to_single_line(error) if error else "null"
     print(
-        f"[STEP] step={step} action={…
-        f"reward={reward:.2f} done={…
+        f"[STEP] step={step} action={_to_single_line(action)} "
+        f"reward={reward:.2f} done={str(bool(done)).lower()} error={error_value}",
        flush=True,
    )
 
 
-def …
-    …
+def _compliance_log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+    rewards_value = ",".join(f"{r:.2f}" for r in rewards)
     print(
         f"[END] success={str(bool(success)).lower()} steps={steps} "
-        f"score={score:.2f} rewards={…
+        f"score={score:.2f} rewards={rewards_value}",
         flush=True,
     )
 
 
-def obs_to_prompt(…
+def obs_to_prompt(
+    obs: FlakySleuthObservation,
+    *,
+    memory_hint: str | None = None,
+    max_steps: int = MAX_STEPS,
+) -> str:
     tree_preview = "\n".join(obs.file_tree[:40])
     return f"""TASK: {obs.task_description}
 
@@ -125,7 +165,10 @@ Repository file tree:
 {tree_preview}
 
 Last tool output:
-{obs.tool_output or …
+{obs.tool_output or "(No action taken yet)"}
+
+Episode memory:
+{memory_hint or "(No memory yet.)"}
 
 Return only JSON action."""
 
@@ -157,10 +200,18 @@ def heuristic_action(obs: FlakySleuthObservation) -> FlakySleuthAction:
     )
 
 
-def llm_action(messages: list[dict[str, str]]) -> FlakySleuthAction | None:
+def llm_action(
+    messages: list[dict[str, str]],
+) -> tuple[FlakySleuthAction | None, dict[str, Any]]:
+    meta: dict[str, Any] = {
+        "attempted": False,
+        "raw_output": "",
+        "error": "",
+    }
     if not API_KEY:
-        return None
+        return None, meta
 
+    meta["attempted"] = True
     response = client.chat.completions.create(
         model=MODEL_NAME,
         messages=messages,
@@ -168,80 +219,407 @@ def llm_action(messages: list[dict[str, str]]) -> FlakySleuthAction | None:
         temperature=0.0,
     )
     raw = (response.choices[0].message.content or "").strip()
+    meta["raw_output"] = raw
     cleaned = raw.replace("```json", "").replace("```", "").strip()
     payload = json.loads(cleaned)
-    return FlakySleuthAction.model_validate(payload)
+    return FlakySleuthAction.model_validate(payload), meta
+
+
+def _clip_text(text: str, max_chars: int) -> str:
+    if max_chars <= 0:
+        return text
+    if len(text) <= max_chars:
+        return text
+    remaining = len(text) - max_chars
+    return f"{text[:max_chars]}\n...[truncated {remaining} chars]"
+
+
+def _trace_print(
+    enabled: bool,
+    message: str,
+    *,
+    text: str | None = None,
+    max_chars: int = 0,
+) -> None:
+    if not enabled:
+        return
+    print(message)
+    if text is not None:
+        print(_clip_text(text, max_chars))
+
+
+def _format_duration(seconds: float) -> str:
+    seconds = max(0.0, float(seconds))
+    mins, secs = divmod(int(round(seconds)), 60)
+    hrs, mins = divmod(mins, 60)
+    if hrs > 0:
+        return f"{hrs:d}h {mins:02d}m {secs:02d}s"
+    return f"{mins:02d}m {secs:02d}s"
+
+
+def _build_episode_memory(
+    *,
+    unique_read_files: list[str],
+    zero_gain_read_files: set[str],
+    search_patterns: list[str],
+    blocked_duplicate_reads: int,
+    no_progress_streak: int,
+    max_chars: int,
+) -> str:
+    read_tail = ", ".join(unique_read_files[-8:]) if unique_read_files else "none"
+    zero_tail = ", ".join(sorted(zero_gain_read_files)[-8:]) if zero_gain_read_files else "none"
+    search_tail = ", ".join(search_patterns[-6:]) if search_patterns else "none"
+    loop_warning = (
+        "WARNING: Possible loop detected. Stop repeating similar exploration. "
+        "Switch strategy or take a terminal action."
+        if no_progress_streak >= 3 or blocked_duplicate_reads >= 2
+        else "Status: exploration progress appears normal."
+    )
+    memory = (
+        f"Read files (recent): {read_tail}\n"
+        f"Zero-gain read files: {zero_tail}\n"
+        f"Search patterns (recent): {search_tail}\n"
+        f"Blocked duplicate reads: {blocked_duplicate_reads}\n"
+        f"No-progress streak: {no_progress_streak}\n"
+        f"{loop_warning}\n"
+        "Guidance: Avoid rereading zero-gain files unless there is new evidence. "
+        "Prefer targeted search_code or terminal action when confidence is enough."
+    )
+    return _clip_text(memory, max_chars=max_chars)
+
+
+def _duplicate_read_replacement_pattern(obs: FlakySleuthObservation) -> str:
+    test_hint = obs.test_name.split("::")[-1] if obs.test_name else "test"
+    return (
+        f"{test_hint}|random|sleep|time|timeout|retry|asyncio|thread|"
+        "fixture|global|shared|mock|patch"
+    )
+
+
+def _messages_char_count(messages: list[dict[str, str]]) -> int:
+    # Lightweight size heuristic to avoid unbounded context growth.
+    return sum(len(str(msg.get("content", ""))) + 32 for msg in messages)
+
+
+def _prune_messages_window(
+    messages: list[dict[str, str]],
+    *,
+    step_number: int,
+    prune_start_step: int,
+    window_turns: int,
+    max_chars: int,
+) -> tuple[list[dict[str, str]], dict[str, Any] | None]:
+    if len(messages) <= 2:
+        return messages, None
+
+    current_chars = _messages_char_count(messages)
+    exceeds_step_threshold = step_number >= prune_start_step
+    exceeds_char_budget = current_chars > max_chars
+    if not exceeds_step_threshold and not exceeds_char_budget:
+        return messages, None
+
+    base = messages[:2]  # system + initial prompt
+    tail = messages[2:]
+    keep_tail_items = max(2, window_turns * 2)
+    if len(tail) > keep_tail_items:
+        tail = tail[-keep_tail_items:]
+    pruned = base + tail
+
+    reason = "step_threshold" if exceeds_step_threshold else "char_budget"
+    return pruned, {
+        "reason": reason,
+        "before_messages": len(messages),
+        "after_messages": len(pruned),
+        "before_chars": current_chars,
+        "after_chars": _messages_char_count(pruned),
+        "step": step_number,
+    }
 
 
 def run_episode(
     env: FlakySleuthEnv,
     *,
-    …
-    …
-    …
-    …
+    print_terminal: bool = True,
+    trace_agent: bool = False,
+    trace_prompts: bool = False,
+    trace_max_chars: int = 2000,
+    episode_label: str = "",
+    compliance_stdout: bool = False,
+    benchmark_name: str = "flakysleuth",
+    compliance_task_name: str | None = None,
+    history_prune_start_step: int = 12,
+    history_window_turns: int = 4,
+    history_max_chars: int = 50000,
+) -> tuple[float, dict[str, Any]]:
     rewards: list[float] = []
     steps_taken = 0
-    score = 0.0
     success = False
-    …
-    …
-    …
+    episode_task_name = (compliance_task_name or episode_label.split(" ", 1)[0].strip() or "unknown")
+    exploration_reward_total = 0.0
+    final_episode_score = 0.0
+    terminal_meta: dict[str, Any] = {}
+    llm_steps = 0
+    heuristic_steps = 0
+    fallback_reasons: dict[str, int] = {}
+    prune_events = 0
+    read_attempt_counts: dict[str, int] = {}
+    unique_read_files: list[str] = []
+    zero_gain_read_files: set[str] = set()
+    search_patterns: list[str] = []
+    blocked_duplicate_reads = 0
+    no_progress_streak = 0
+    memory_hint = _build_episode_memory(
+        unique_read_files=unique_read_files,
+        zero_gain_read_files=zero_gain_read_files,
+        search_patterns=search_patterns,
+        blocked_duplicate_reads=blocked_duplicate_reads,
+        no_progress_streak=no_progress_streak,
+        max_chars=MEMORY_MAX_CHARS,
+    )
+    if compliance_stdout:
+        _compliance_log_start(episode_task_name, benchmark_name, MODEL_NAME)
     try:
         obs = env.reset()
-        …
+
+        initial_prompt = obs_to_prompt(obs, memory_hint=memory_hint, max_steps=env.max_steps)
+        messages = [
             {"role": "system", "content": SYSTEM_PROMPT},
-            {"role": "user", "content": …
+            {"role": "user", "content": initial_prompt},
         ]
 
-        …
+        if not compliance_stdout:
+            _trace_print(
+                trace_agent,
+                (
+                    f"\n[trace] {episode_label} "
+                    f"task={obs.task_type} repo={obs.repo_url} test={obs.test_name}"
+                ).strip(),
+            )
+        if trace_prompts and not compliance_stdout:
+            _trace_print(
+                trace_agent,
+                "[trace] system prompt:",
+                text=SYSTEM_PROMPT,
+                max_chars=trace_max_chars,
+            )
+            _trace_print(
+                trace_agent,
+                "[trace] initial user prompt:",
+                text=initial_prompt,
+                max_chars=trace_max_chars,
+            )
+
+        for step_idx in range(env.max_steps):
+            messages, prune_info = _prune_messages_window(
+                messages,
+                step_number=step_idx + 1,
+                prune_start_step=history_prune_start_step,
+                window_turns=history_window_turns,
+                max_chars=history_max_chars,
+            )
+            if prune_info:
+                prune_events += 1
+                if trace_agent and not compliance_stdout:
+                    print(
+                        "[trace] context_pruned "
+                        f"reason={prune_info['reason']} "
+                        f"step={prune_info['step']} "
+                        f"messages={prune_info['before_messages']}->{prune_info['after_messages']} "
+                        f"chars={prune_info['before_chars']}->{prune_info['after_chars']}"
+                    )
+
+            action: FlakySleuthAction
+            action_source = "heuristic"
+            llm_meta: dict[str, Any] = {"attempted": False, "raw_output": "", "error": ""}
            try:
-                …
-                …
+                candidate, llm_meta = llm_action(messages)
+                if candidate is not None:
+                    action = candidate
+                    action_source = "llm"
+                else:
+                    action = heuristic_action(obs)
+                    if llm_meta.get("attempted"):
+                        llm_meta["error"] = (
+                            "Model response unavailable, using heuristic fallback."
+                        )
+            except Exception as exc:
+                llm_meta["error"] = str(exc)
                action = heuristic_action(obs)
 
+            if action.action_type == "read_file":
+                prior_reads = read_attempt_counts.get(action.argument, 0)
+                if prior_reads >= 1:
+                    blocked_duplicate_reads += 1
+                    replacement = FlakySleuthAction(
+                        action_type="search_code",
+                        argument=_duplicate_read_replacement_pattern(obs),
+                    )
+                    if trace_agent and not compliance_stdout:
+                        print(
+                            "[trace] action_overridden "
+                            f"reason=duplicate_read file={action.argument} "
+                            f"replacement={replacement.action_type}"
+                        )
+                    action = replacement
+
+            if action_source == "llm":
+                llm_steps += 1
+            else:
+                heuristic_steps += 1
+                if not API_KEY:
+                    reason_key = "no_api_key"
+                elif llm_meta.get("error"):
+                    reason_key = "llm_error"
+                elif llm_meta.get("attempted"):
+                    reason_key = "empty_or_invalid_response"
+                else:
+                    reason_key = "heuristic_default"
+                fallback_reasons[reason_key] = fallback_reasons.get(reason_key, 0) + 1
+
+            if trace_agent and not compliance_stdout:
+                print(f"[trace] step={step_idx + 1} action_source={action_source}")
+                if llm_meta.get("attempted"):
+                    _trace_print(
+                        True,
+                        "[trace] raw model output:",
+                        text=str(llm_meta.get("raw_output", "")),
+                        max_chars=trace_max_chars,
+                    )
+                if llm_meta.get("error"):
+                    print(f"[trace] llm_error={llm_meta['error']}")
+                print(f"[trace] action={action.model_dump_json()}")
+
            obs, reward, done, info = env.step(action)
-            rewards.append(…
-            steps_taken = step_idx…
+            rewards.append(reward)
+            steps_taken = step_idx + 1
+
+            if action.action_type == "read_file":
+                read_attempt_counts[action.argument] = read_attempt_counts.get(action.argument, 0) + 1
+                if action.argument not in unique_read_files:
+                    unique_read_files.append(action.argument)
+                if reward <= 0:
+                    zero_gain_read_files.add(action.argument)
+            elif action.action_type == "search_code":
+                if action.argument not in search_patterns:
+                    search_patterns.append(action.argument)
+
+            if done:
+                no_progress_streak = 0
+            elif reward <= 0:
+                no_progress_streak += 1
+            else:
+                no_progress_streak = 0
+
+            memory_hint = _build_episode_memory(
+                unique_read_files=unique_read_files,
+                zero_gain_read_files=zero_gain_read_files,
+                search_patterns=search_patterns,
+                blocked_duplicate_reads=blocked_duplicate_reads,
+                no_progress_streak=no_progress_streak,
+                max_chars=MEMORY_MAX_CHARS,
+            )
 
             step_error: str | None = None
             if isinstance(info, dict):
-                …
-                if …
-                    step_error = str(…
-                …
-                …
-                …
-                …
-                …
-                …
-                …
-                …
+                raw_err = info.get("last_action_error")
+                if raw_err:
+                    step_error = str(raw_err)
+            if not step_error and obs.tool_output and str(obs.tool_output).startswith("ERROR:"):
+                step_error = str(obs.tool_output)
+
+            if compliance_stdout:
+                _compliance_log_step(
+                    step=steps_taken,
+                    action=action.model_dump_json(),
+                    reward=reward,
+                    done=done,
+                    error=step_error,
+                )
+
+            if trace_agent and not compliance_stdout:
+                print(
+                    f"[trace] step_result reward={reward:.3f} done={done} "
+                    f"step_count={obs.step_count}"
+                )
+                if obs.tool_output:
+                    _trace_print(
+                        True,
+                        "[trace] tool_output:",
+                        text=obs.tool_output,
+                        max_chars=trace_max_chars,
+                    )
 
             if done:
-                …
+                # Terminal reward already includes cumulative progress + terminal score.
+                final_episode_score = reward
+                terminal_meta = {
+                    "action_type": action.action_type,
+                    "terminal_score": float(info.get("terminal_score", 0) or 0),
+                    "progress_score": float(info.get("progress_score", 0) or 0),
+                    "explore_sum": exploration_reward_total,
+                    "episode_score": final_episode_score,
+                    "llm_steps": llm_steps,
+                    "heuristic_steps": heuristic_steps,
+                    "fallback_reasons": dict(fallback_reasons),
+                    "context_prune_events": prune_events,
+                    "duplicate_read_blocks": blocked_duplicate_reads,
+                }
+                success = final_episode_score > 0.0
+                if print_terminal:
+                    print(
+                        f"  Terminal: {action.action_type}({action.argument[:40]}) "
+                        f"-> terminal={info.get('terminal_score', 0):.2f} "
+                        f"progress={info.get('progress_score', 0):.2f} "
+                        f"explore_sum={exploration_reward_total:.3f} "
+                        f"episode_score={final_episode_score:.3f}"
+                    )
                 break
 
+            exploration_reward_total += reward
            messages.append({"role": "assistant", "content": action.model_dump_json()})
-            …
-            …
-            …
-            …
-            …
-            …
+            next_prompt = obs_to_prompt(obs, memory_hint=memory_hint, max_steps=env.max_steps)
+            messages.append({"role": "user", "content": next_prompt})
+            if trace_agent and trace_prompts and not compliance_stdout:
+                _trace_print(
+                    True,
+                    f"[trace] next user prompt (step={step_idx + 1}):",
+                    text=next_prompt,
+                    max_chars=trace_max_chars,
+                )
+    except Exception as exc:
+        terminal_meta["error"] = str(exc)
         success = False
+        if not compliance_stdout:
+            raise
     finally:
-        …
-        …
-        …
-        …
-        …
+        if compliance_stdout:
+            try:
+                env.close()
+            except Exception:
+                pass
+            _compliance_log_end(
+                success=success,
+                steps=steps_taken,
+                score=min(max(final_episode_score, 0.0), 1.0),
+                rewards=rewards,
+            )
+
+    return final_episode_score, terminal_meta
 
-    …
+
+def _looks_like_placeholder_dataset(dataset_path: str) -> bool:
+    path = Path(dataset_path)
+    if not path.exists():
+        return False
+    try:
+        text = path.read_text(encoding="utf-8", errors="replace")
+    except Exception:
+        return False
+    return "fixture://" in text
 
 
 def _parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Run FlakySleuth …
+    parser = argparse.ArgumentParser(description="Run FlakySleuth baseline inference.")
     parser.add_argument(
         "--dataset-path",
         default="dataset/py_tasks.csv",
@@ -264,34 +642,194 @@ def _parse_args() -> argparse.Namespace:
         default=MAX_STEPS,
         help="Max steps per episode.",
     )
+    parser.add_argument(
+        "--no-progress",
+        action="store_true",
+        help="Disable progress bars and print classic per-episode logs.",
+    )
+    parser.add_argument(
+        "--trace-agent",
+        action="store_true",
+        help=(
+            "Print detailed agent trace: model output, chosen action/tool call, and "
+            "step results for every episode."
+        ),
+    )
+    parser.add_argument(
+        "--trace-prompts",
+        action="store_true",
+        help="When tracing, also print full prompts sent to the model.",
+    )
+    parser.add_argument(
+        "--trace-max-chars",
+        type=int,
+        default=2500,
+        help="Max chars per traced text block (prompt/model output/tool output).",
+    )
+    parser.add_argument(
+        "--compliance-stdout",
+        dest="compliance_stdout",
+        action="store_true",
+        help=(
+            "Emit strict compliance logs to stdout using only [START]/[STEP]/[END] lines "
+            "for each episode."
+        ),
+    )
+    parser.add_argument(
+        "--no-compliance-stdout",
+        dest="compliance_stdout",
+        action="store_false",
+        help="Disable strict compliance logs and print baseline summaries/progress.",
+    )
     parser.add_argument(
         "--benchmark-name",
-        default=…
-        help="Benchmark …
+        default="flakysleuth",
+        help="Benchmark name used in [START] lines when --compliance-stdout is enabled.",
+    )
+    parser.add_argument(
+        "--history-prune-start-step",
+        type=int,
+        default=12,
+        help="Start pruning conversation history only from this step onward.",
     )
+    parser.add_argument(
+        "--history-window-turns",
+        type=int,
+        default=4,
+        help="When pruning is active, keep this many recent assistant/user turns.",
+    )
+    parser.add_argument(
+        "--history-max-chars",
+        type=int,
+        default=50000,
+        help="Approx max chars for messages before forced pruning by size.",
+    )
+    parser.set_defaults(compliance_stdout=True)
     return parser.parse_args()
 
 
 def main() -> None:
+    run_start = time.perf_counter()
     args = _parse_args()
     env = FlakySleuthEnv(dataset_path=args.dataset_path, max_steps=args.max_steps)
-    …
     allowed_task_types = {"classify", "root_cause", "fix_proposal"}
     task_types = [t.strip() for t in args.task_types.split(",") if t.strip()]
+    invalid = [t for t in task_types if t not in allowed_task_types]
+    if invalid:
+        raise ValueError(
+            f"Invalid task type(s): {invalid}. "
+            "Valid values: classify,root_cause,fix_proposal."
+        )
     if not task_types:
-        …
+        raise ValueError(
+            "No task types selected. Pass --task-types with at least one value."
+        )
+    results: dict[str, list[float]] = defaultdict(list)
+
+    if _looks_like_placeholder_dataset(args.dataset_path) and not args.compliance_stdout:
+        print(
+            "[warning] dataset appears to contain fixture rows (fixture://...). "
+            "Build real dataset from py-data.csv for real evaluation."
+        )
+
+    use_progress = (
+        (tqdm is not None)
+        and (not args.no_progress)
+        and (not args.compliance_stdout)
+        and os.isatty(1)
+    )
+    if args.trace_agent and use_progress and not args.compliance_stdout:
+        print(
+            "[info] --trace-agent enabled, disabling progress bars for readable trace logs."
+        )
+        use_progress = False
+    overall_bar = None
+    if use_progress:
+        overall_bar = tqdm(
+            total=len(task_types) * args.episodes_per_task,
+            desc="All tasks",
+            unit="ep",
+            dynamic_ncols=True,
+        )
 
     for task_type in task_types:
-        …
-        …
+        task_start = time.perf_counter()
+        if not args.compliance_stdout:
+            print(f"\n-- Task type: {task_type} --")
         env.loader.force_task_type(task_type)
-        …
-        …
+        task_bar = None
+        if use_progress:
+            task_bar = tqdm(
+                total=args.episodes_per_task,
+                desc=f"{task_type}",
+                unit="ep",
+                leave=False,
+                dynamic_ncols=True,
+            )
+        for episode in range(args.episodes_per_task):
+            score, meta = run_episode(
                 env,
-                …
+                print_terminal=(not use_progress) and (not args.compliance_stdout),
+                trace_agent=args.trace_agent,
+                trace_prompts=args.trace_prompts,
+                trace_max_chars=args.trace_max_chars,
+                episode_label=f"{task_type} ep={episode + 1}/{args.episodes_per_task}",
+                compliance_stdout=args.compliance_stdout,
                 benchmark_name=args.benchmark_name,
-                …
+                compliance_task_name=task_type,
+                history_prune_start_step=args.history_prune_start_step,
+                history_window_turns=args.history_window_turns,
+                history_max_chars=args.history_max_chars,
             )
+            results[task_type].append(score)
+            if use_progress and task_bar is not None:
+                task_bar.update(1)
+                task_avg = sum(results[task_type]) / len(results[task_type])
+                task_bar.set_postfix(
+                    score=f"{score:.3f}",
+                    avg=f"{task_avg:.3f}",
+                    term=f"{meta.get('terminal_score', 0):.2f}",
+                )
+                if overall_bar is not None:
+                    overall_bar.update(1)
+                    all_scores = [s for values in results.values() for s in values]
+                    overall_avg = sum(all_scores) / len(all_scores)
+                    overall_bar.set_postfix(task=task_type, avg=f"{overall_avg:.3f}")
+            elif not args.compliance_stdout:
+                print(f"  Episode {episode + 1}: {score:.3f}")
+        if task_bar is not None:
+            task_bar.close()
+        task_elapsed = time.perf_counter() - task_start
+        if not args.compliance_stdout:
+            avg_task = sum(results[task_type]) / max(1, len(results[task_type]))
+            print(
+                f"  [time] task={task_type} elapsed={_format_duration(task_elapsed)} "
+                f"avg_ep={task_elapsed / max(1, args.episodes_per_task):.2f}s "
+                f"avg_score={avg_task:.3f}"
+            )
+
+    if overall_bar is not None:
+        overall_bar.close()
+
+    if args.compliance_stdout:
+        return
+
+    total_elapsed = time.perf_counter() - run_start
+    print("\n== BASELINE RESULTS ==")
+    all_scores: list[float] = []
+    for task_type in task_types:
+        scores = results[task_type]
+        avg = sum(scores) / len(scores)
+        all_scores.extend(scores)
+        print(f"  {task_type:12s} avg={avg:.3f} scores={[round(s, 3) for s in scores]}")
+
+    overall = sum(all_scores) / len(all_scores)
+    print(f"  {'OVERALL':12s} avg={overall:.3f}")
+    print(
+        f"  {'RUNTIME':12s} total={_format_duration(total_elapsed)} "
+        f"episodes={len(all_scores)} "
+        f"avg_ep={(total_elapsed / max(1, len(all_scores))):.2f}s"
    )
 
 
 if __name__ == "__main__":
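A quick usage sketch of the `_prune_messages_window` compaction defined in the diff above (message contents are invented; only the shape matters): once the step threshold or char budget is hit, only the system prompt, the initial user prompt, and the last `window_turns` assistant/user pairs survive.

```python
# Invented history: system + initial prompt + ten assistant/user turns.
messages = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "initial task prompt"},
]
for i in range(10):
    messages.append({"role": "assistant", "content": f"action-{i}"})
    messages.append({"role": "user", "content": f"obs-{i}"})

pruned, info = _prune_messages_window(
    messages,
    step_number=12,      # >= prune_start_step, so pruning kicks in
    prune_start_step=12,
    window_turns=4,      # keep the last 4 assistant/user pairs
    max_chars=50_000,
)
assert len(pruned) == 2 + 4 * 2  # system + initial prompt + 8 tail messages
assert info["reason"] == "step_threshold"
```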
inference_debug.py
CHANGED
|
@@ -75,6 +75,7 @@ MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)
|
|
| 75 |
# Keep a conservative default to stay under common hackathon runtime limits.
|
| 76 |
EPISODES_PER_TASK = 2
|
| 77 |
MAX_STEPS = 20
|
|
|
|
| 78 |
|
| 79 |
client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
|
| 80 |
|
|
@@ -140,7 +141,7 @@ def _compliance_log_end(success: bool, steps: int, score: float, rewards: list[f
|
|
| 140 |
)
|
| 141 |
|
| 142 |
|
| 143 |
-
def obs_to_prompt(obs: FlakySleuthObservation) -> str:
|
| 144 |
tree_preview = "\n".join(obs.file_tree[:40])
|
| 145 |
return f"""TASK: {obs.task_description}
|
| 146 |
|
|
@@ -159,6 +160,9 @@ Repository file tree:
|
|
| 159 |
Last tool output:
|
| 160 |
{obs.tool_output or "(No action taken yet)"}
|
| 161 |
|
|
|
|
|
|
|
|
|
|
| 162 |
Return only JSON action."""
|
| 163 |
|
| 164 |
|
|
@@ -246,6 +250,85 @@ def _format_duration(seconds: float) -> str:
|
|
| 246 |
return f"{mins:02d}m {secs:02d}s"
|
| 247 |
|
| 248 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
def run_episode(
|
| 250 |
env: FlakySleuthEnv,
|
| 251 |
*,
|
|
```diff
@@ -257,6 +340,9 @@ def run_episode(
     compliance_stdout: bool = False,
     benchmark_name: str = "flakysleuth",
     compliance_task_name: str | None = None,
+    history_prune_start_step: int = 12,
+    history_window_turns: int = 4,
+    history_max_chars: int = 50000,
 ) -> tuple[float, dict[str, Any]]:
     rewards: list[float] = []
     steps_taken = 0
@@ -265,12 +351,30 @@ def run_episode(
     exploration_reward_total = 0.0
     final_episode_score = 0.0
     terminal_meta: dict[str, Any] = {}
+    llm_steps = 0
+    heuristic_steps = 0
+    fallback_reasons: dict[str, int] = {}
+    prune_events = 0
+    read_attempt_counts: dict[str, int] = {}
+    unique_read_files: list[str] = []
+    zero_gain_read_files: set[str] = set()
+    search_patterns: list[str] = []
+    blocked_duplicate_reads = 0
+    no_progress_streak = 0
+    memory_hint = _build_episode_memory(
+        unique_read_files=unique_read_files,
+        zero_gain_read_files=zero_gain_read_files,
+        search_patterns=search_patterns,
+        blocked_duplicate_reads=blocked_duplicate_reads,
+        no_progress_streak=no_progress_streak,
+        max_chars=MEMORY_MAX_CHARS,
+    )
     if compliance_stdout:
         _compliance_log_start(episode_task_name, benchmark_name, MODEL_NAME)
     try:
         obs = env.reset()

-        initial_prompt = obs_to_prompt(obs)
+        initial_prompt = obs_to_prompt(obs, memory_hint=memory_hint)
         messages = [
             {"role": "system", "content": SYSTEM_PROMPT},
             {"role": "user", "content": initial_prompt},
@@ -299,6 +403,24 @@ def run_episode(
         )

         for step_idx in range(MAX_STEPS):
+            messages, prune_info = _prune_messages_window(
+                messages,
+                step_number=step_idx + 1,
+                prune_start_step=history_prune_start_step,
+                window_turns=history_window_turns,
+                max_chars=history_max_chars,
+            )
+            if prune_info:
+                prune_events += 1
+                if trace_agent and not compliance_stdout:
+                    print(
+                        "[trace] context_pruned "
+                        f"reason={prune_info['reason']} "
+                        f"step={prune_info['step']} "
+                        f"messages={prune_info['before_messages']}->{prune_info['after_messages']} "
+                        f"chars={prune_info['before_chars']}->{prune_info['after_chars']}"
+                    )
+
             action: FlakySleuthAction
             action_source = "heuristic"
             llm_meta: dict[str, Any] = {"attempted": False, "raw_output": "", "error": ""}
@@ -317,6 +439,36 @@ def run_episode(
                 llm_meta["error"] = str(exc)
                 action = heuristic_action(obs)

+            if action.action_type == "read_file":
+                prior_reads = read_attempt_counts.get(action.argument, 0)
+                if prior_reads >= 1:
+                    blocked_duplicate_reads += 1
+                    replacement = FlakySleuthAction(
+                        action_type="search_code",
+                        argument=_duplicate_read_replacement_pattern(obs),
+                    )
+                    if trace_agent and not compliance_stdout:
+                        print(
+                            "[trace] action_overridden "
+                            f"reason=duplicate_read file={action.argument} "
+                            f"replacement={replacement.action_type}"
+                        )
+                    action = replacement
+
+            if action_source == "llm":
+                llm_steps += 1
+            else:
+                heuristic_steps += 1
+                if not API_KEY:
+                    reason_key = "no_api_key"
+                elif llm_meta.get("error"):
+                    reason_key = "llm_error"
+                elif llm_meta.get("attempted"):
+                    reason_key = "empty_or_invalid_response"
+                else:
+                    reason_key = "heuristic_default"
+                fallback_reasons[reason_key] = fallback_reasons.get(reason_key, 0) + 1
+
             if trace_agent and not compliance_stdout:
                 print(f"[trace] step={step_idx + 1} action_source={action_source}")
                 if llm_meta.get("attempted"):
```
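For a concrete sense of the duplicate-read override above, this is the replacement pattern it emits; the `FakeObs` stand-in below is illustrative, not part of the codebase:

```python
# Illustrative stand-in for a FlakySleuthObservation; only test_name is used here.
class FakeObs:
    test_name = "tests/test_io.py::test_retry"

test_hint = FakeObs.test_name.split("::")[-1]  # -> "test_retry"
pattern = (
    f"{test_hint}|random|sleep|time|timeout|retry|asyncio|thread|"
    "fixture|global|shared|mock|patch"
)
print(pattern)
# test_retry|random|sleep|time|timeout|retry|asyncio|thread|fixture|global|shared|mock|patch
```

So a second read of the same file is converted into a targeted search over the test name plus common flaky-signal keywords, rather than being wasted on a repeat.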
```diff
@@ -334,6 +486,32 @@ def run_episode(
             rewards.append(reward)
             steps_taken = step_idx + 1

+            if action.action_type == "read_file":
+                read_attempt_counts[action.argument] = read_attempt_counts.get(action.argument, 0) + 1
+                if action.argument not in unique_read_files:
+                    unique_read_files.append(action.argument)
+                if reward <= 0:
+                    zero_gain_read_files.add(action.argument)
+            elif action.action_type == "search_code":
+                if action.argument not in search_patterns:
+                    search_patterns.append(action.argument)
+
+            if done:
+                no_progress_streak = 0
+            elif reward <= 0:
+                no_progress_streak += 1
+            else:
+                no_progress_streak = 0
+
+            memory_hint = _build_episode_memory(
+                unique_read_files=unique_read_files,
+                zero_gain_read_files=zero_gain_read_files,
+                search_patterns=search_patterns,
+                blocked_duplicate_reads=blocked_duplicate_reads,
+                no_progress_streak=no_progress_streak,
+                max_chars=MEMORY_MAX_CHARS,
+            )
+
             step_error: str | None = None
             if isinstance(info, dict):
                 raw_err = info.get("last_action_error")
@@ -373,6 +551,11 @@ def run_episode(
                     "progress_score": float(info.get("progress_score", 0) or 0),
                     "explore_sum": exploration_reward_total,
                     "episode_score": final_episode_score,
+                    "llm_steps": llm_steps,
+                    "heuristic_steps": heuristic_steps,
+                    "fallback_reasons": dict(fallback_reasons),
+                    "context_prune_events": prune_events,
+                    "duplicate_read_blocks": blocked_duplicate_reads,
                 }
                 success = final_episode_score > 0.0
                 if print_terminal:
@@ -387,7 +570,7 @@ def run_episode(

             exploration_reward_total += reward
             messages.append({"role": "assistant", "content": action.model_dump_json()})
-            next_prompt = obs_to_prompt(obs)
+            next_prompt = obs_to_prompt(obs, memory_hint=memory_hint)
             messages.append({"role": "user", "content": next_prompt})
             if trace_agent and trace_prompts and not compliance_stdout:
                 _trace_print(
@@ -483,6 +666,24 @@ def _parse_args() -> argparse.Namespace:
         default="flakysleuth",
         help="Benchmark name used in [START] lines when --compliance-stdout is enabled.",
     )
+    parser.add_argument(
+        "--history-prune-start-step",
+        type=int,
+        default=12,
+        help="Start pruning conversation history only from this step onward.",
+    )
+    parser.add_argument(
+        "--history-window-turns",
+        type=int,
+        default=4,
+        help="When pruning is active, keep this many recent assistant/user turns.",
+    )
+    parser.add_argument(
+        "--history-max-chars",
+        type=int,
+        default=50000,
+        help="Approx max chars for messages before forced pruning by size.",
+    )
     return parser.parse_args()


@@ -550,6 +751,9 @@ def main() -> None:
                 compliance_stdout=args.compliance_stdout,
                 benchmark_name=args.benchmark_name,
                 compliance_task_name=task_type,
+                history_prune_start_step=args.history_prune_start_step,
+                history_window_turns=args.history_window_turns,
+                history_max_chars=args.history_max_chars,
             )
             results[task_type].append(score)
             if use_progress and task_bar is not None:
```
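To make the new "Episode memory" prompt section tangible, here is a minimal sketch of the hint `_build_episode_memory` produces for a made-up episode state (file paths are illustrative; `_clip_text`, defined elsewhere in the file, presumably just caps the string at `max_chars`):

```python
memory = _build_episode_memory(
    unique_read_files=["tests/test_io.py", "src/client.py"],  # illustrative paths
    zero_gain_read_files={"src/client.py"},
    search_patterns=["sleep|timeout"],
    blocked_duplicate_reads=2,   # >= 2, so the loop warning fires
    no_progress_streak=1,
    max_chars=900,
)
print(memory)
# Read files (recent): tests/test_io.py, src/client.py
# Zero-gain read files: src/client.py
# Search patterns (recent): sleep|timeout
# Blocked duplicate reads: 2
# No-progress streak: 1
# WARNING: Possible loop detected. Stop repeating similar exploration. Switch strategy or take a terminal action.
# Guidance: Avoid rereading zero-gain files unless there is new evidence. Prefer targeted search_code or terminal action when confidence is enough.
```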
server/app.py CHANGED

```diff
@@ -28,7 +28,7 @@ class FlakySleuthState(BaseModel):

 class InferenceRunRequest(BaseModel):
     dataset_path: str = Field(default="dataset/py_tasks.csv")
-    episodes_per_task: int = Field(default=1, ge=1, le=
+    episodes_per_task: int = Field(default=1, ge=1, le=100)
     task_types: str = Field(default="classify,root_cause,fix_proposal")
     max_steps: int = Field(default=20, ge=1, le=100)
     benchmark_name: str = Field(default="flakysleuth")
```
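Because the bound now lives on the request model, out-of-range values are rejected before a run ever starts. A minimal sketch of the Pydantic behavior (field definitions copied from the diff; no server required):

```python
from pydantic import BaseModel, Field, ValidationError

class InferenceRunRequest(BaseModel):  # same constrained fields as server/app.py
    dataset_path: str = Field(default="dataset/py_tasks.csv")
    episodes_per_task: int = Field(default=1, ge=1, le=100)
    max_steps: int = Field(default=20, ge=1, le=100)

InferenceRunRequest(episodes_per_task=100)   # ok: the upper bound is inclusive
try:
    InferenceRunRequest(episodes_per_task=101)
except ValidationError as exc:
    print(exc.errors()[0]["loc"])            # ('episodes_per_task',)
```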
server/inference_runner.py CHANGED

```diff
@@ -48,8 +48,8 @@ class InferenceRunner:

         if not dataset_rel:
             raise ValueError("dataset_path must not be empty.")
-        if episodes < 1 or episodes >
-            raise ValueError("episodes_per_task must be between 1 and
+        if episodes < 1 or episodes > 100:
+            raise ValueError("episodes_per_task must be between 1 and 100.")
         if max_steps < 1 or max_steps > 100:
             raise ValueError("max_steps must be between 1 and 100.")
         if not task_types:
```
server/ui.py CHANGED

```diff
@@ -8,7 +8,7 @@ def render_home_page() -> str:
   <head>
     <meta charset="utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
-    <title>
+    <title>FlakyGym Control Center</title>
    <link rel="preconnect" href="https://fonts.googleapis.com" />
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
    <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@500;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet" />
@@ -123,6 +123,60 @@ def render_home_page() -> str:
       letter-spacing: -0.01em;
     }

+    .brief-grid {
+      display: grid;
+      grid-template-columns: repeat(2, minmax(0, 1fr));
+      gap: 12px;
+    }
+
+    .brief-card {
+      border: 1px solid rgba(15, 139, 99, 0.22);
+      border-radius: 12px;
+      background: rgba(255, 255, 255, 0.84);
+      padding: 10px;
+      display: grid;
+      gap: 8px;
+    }
+
+    .brief-card h3 {
+      margin: 0;
+      font-size: 0.95rem;
+      letter-spacing: -0.01em;
+    }
+
+    .brief-card p {
+      margin: 0;
+      font-size: 12px;
+      color: #36564a;
+      line-height: 1.45;
+    }
+
+    .brief-list {
+      margin: 0;
+      padding-left: 16px;
+      font-size: 12px;
+      color: #2f4f43;
+      line-height: 1.45;
+      display: grid;
+      gap: 5px;
+    }
+
+    .header-chips {
+      display: flex;
+      flex-wrap: wrap;
+      gap: 6px;
+    }
+
+    .header-chips code {
+      background: #eef8f3;
+      border: 1px solid rgba(15, 139, 99, 0.26);
+      border-radius: 999px;
+      padding: 4px 8px;
+      font: 500 11px/1.1 var(--mono);
+      color: #20443a;
+      white-space: nowrap;
+    }
+
     .form-grid {
       display: grid;
       gap: 12px;
```
```diff
@@ -230,6 +284,61 @@ def render_home_page() -> str:
       line-height: 1.4;
     }

+    .slider-wrap {
+      display: grid;
+      gap: 8px;
+    }
+
+    .slider-value-row {
+      display: flex;
+      justify-content: space-between;
+      gap: 10px;
+      font: 500 12px/1.3 var(--mono);
+      color: #3c5950;
+    }
+
+    input[type="range"] {
+      width: 100%;
+      padding: 0;
+      accent-color: var(--accent);
+      cursor: pointer;
+    }
+
+    .eta-box {
+      border: 1px solid rgba(15, 139, 99, 0.22);
+      border-radius: 12px;
+      background: rgba(255, 255, 255, 0.85);
+      padding: 10px;
+      display: grid;
+      gap: 6px;
+    }
+
+    .eta-value {
+      font: 700 17px/1 var(--display);
+      letter-spacing: -0.01em;
+    }
+
+    .eta-warning {
+      font: 500 12px/1.45 var(--mono);
+      border-radius: 8px;
+      padding: 6px 8px;
+      display: none;
+    }
+
+    .eta-warning.warn {
+      display: block;
+      color: #752f1b;
+      background: #fff0d7;
+      border: 1px solid rgba(213, 114, 65, 0.45);
+    }
+
+    .eta-warning.ok {
+      display: block;
+      color: #20443a;
+      background: #e8f7ef;
+      border: 1px solid rgba(15, 139, 99, 0.28);
+    }
+
     .actions {
       margin-top: 6px;
       display: flex;
@@ -373,6 +482,10 @@ def render_home_page() -> str:
         grid-template-columns: 1fr;
       }

+      .brief-grid {
+        grid-template-columns: 1fr;
+      }
+
       .field.span-2 {
         grid-column: span 1;
       }
```
```diff
@@ -386,12 +499,37 @@ def render_home_page() -> str:
   <body>
     <main class="shell">
       <section class="hero">
-        <span class="eyebrow"><span class="dot"></span>
-        <h1>
-        <p>This
+        <span class="eyebrow"><span class="dot"></span>FlakyGym Space</span>
+        <h1>FlakyGym Control Center</h1>
+        <p>This console runs flaky-test benchmark episodes and streams live logs. Use it to configure runs, estimate runtime, and review grader outcomes quickly.</p>
       </section>

       <section class="panel-grid">
+        <div class="panel">
+          <h2>Quick Brief: Dataset + Graders</h2>
+          <div class="brief-grid">
+            <div class="brief-card">
+              <h3>Dataset: <code>dataset/py_tasks.csv</code></h3>
+              <p>Each row is one flaky-test investigation task created from <code>py-data.csv</code> (repo + SHA + target test + labels + optional known fix diff).</p>
+              <p class="field-note">Headers:</p>
+              <div class="header-chips">
+                <code>repo_url</code><code>sha</code><code>test_name</code><code>test_file</code>
+                <code>category</code><code>label</code><code>status</code><code>pr_link</code>
+                <code>task_types</code><code>test_code</code><code>known_fix_diff</code>
+              </div>
+            </div>
+
+            <div class="brief-card">
+              <h3>3 Graders (short)</h3>
+              <ul class="brief-list">
+                <li><strong>Task 1 (`classify`):</strong> exact-match flaky vs stable.</li>
+                <li><strong>Task 2 (`root_cause`):</strong> category similarity matrix (partial credit allowed).</li>
+                <li><strong>Task 3 (`fix_proposal`):</strong> weighted score from pattern match, patch applicability, and LLM judge.</li>
+              </ul>
+            </div>
+          </div>
+        </div>
+
         <div class="panel">
           <h2>Run Configuration</h2>
           <form id="run-form" class="form-grid">
@@ -402,12 +540,24 @@ def render_home_page() -> str:

             <div class="field">
               <label for="episodes_per_task">Episodes Per Task</label>
-              <
+              <div class="slider-wrap">
+                <input id="episodes_per_task" name="episodes_per_task" type="range" min="1" max="100" step="1" value="1" />
+                <div class="slider-value-row">
+                  <span><strong id="episodes_per_task_value">1</strong> episode(s)</span>
+                  <span>1-100</span>
+                </div>
+              </div>
             </div>

             <div class="field">
               <label for="max_steps">Max Steps</label>
-              <
+              <div class="slider-wrap">
+                <input id="max_steps" name="max_steps" type="range" min="1" max="100" step="1" value="20" />
+                <div class="slider-value-row">
+                  <span><strong id="max_steps_value">20</strong> step(s)</span>
+                  <span>1-100</span>
+                </div>
+              </div>
             </div>

             <div class="field span-2">
```
```diff
@@ -430,6 +580,15 @@ def render_home_page() -> str:
               <input id="benchmark_name" name="benchmark_name" value="flakysleuth" />
             </div>

+            <div class="field span-2">
+              <label>Runtime ETA</label>
+              <div class="eta-box">
+                <div class="eta-value" id="eta-value">~09m 00s</div>
+                <p class="field-note" id="eta-detail">3 task(s) × 1 episode(s) × 180s/episode</p>
+                <div class="eta-warning" id="eta-warning"></div>
+              </div>
+            </div>
+
             <div class="field span-2">
               <label for="api_base_url">API Base URL (optional)</label>
               <input id="api_base_url" name="api_base_url" placeholder="https://api.openai.com/v1 or provider endpoint" />
@@ -494,6 +653,13 @@ def render_home_page() -> str:
       const taskChipsEl = document.getElementById("task-chips");
       const taskSelectEl = document.getElementById("task-type-select");
      const taskAddButton = document.getElementById("btn-add-task");
+      const episodesInput = document.getElementById("episodes_per_task");
+      const episodesValueEl = document.getElementById("episodes_per_task_value");
+      const maxStepsInput = document.getElementById("max_steps");
+      const maxStepsValueEl = document.getElementById("max_steps_value");
+      const etaValueEl = document.getElementById("eta-value");
+      const etaDetailEl = document.getElementById("eta-detail");
+      const etaWarningEl = document.getElementById("eta-warning");

       const TASK_TYPE_ORDER = ["classify", "root_cause", "fix_proposal"];
       const TASK_TYPE_LABELS = {
```
```diff
@@ -501,6 +667,35 @@ def render_home_page() -> str:
         root_cause: "Root Cause",
         fix_proposal: "Fix Proposal",
       };
+      const ETA_SECONDS_PER_EPISODE = 180;
+      const HACKATHON_LIMIT_SECONDS = 20 * 60;
+
+      function clampInt(raw, min, max, fallback) {
+        const num = Number(raw);
+        if (!Number.isFinite(num)) return fallback;
+        return Math.max(min, Math.min(max, Math.trunc(num)));
+      }
+
+      function formatDuration(totalSeconds) {
+        const seconds = Math.max(0, Math.round(totalSeconds));
+        const mins = Math.floor(seconds / 60);
+        const secs = seconds % 60;
+        if (mins >= 60) {
+          const hrs = Math.floor(mins / 60);
+          const remMins = mins % 60;
+          return `${hrs}h ${String(remMins).padStart(2, "0")}m ${String(secs).padStart(2, "0")}s`;
+        }
+        return `${String(mins).padStart(2, "0")}m ${String(secs).padStart(2, "0")}s`;
+      }
+
+      function refreshSliderValues() {
+        const episodes = clampInt(episodesInput.value, 1, 100, 1);
+        const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
+        episodesInput.value = String(episodes);
+        maxStepsInput.value = String(maxSteps);
+        episodesValueEl.textContent = String(episodes);
+        maxStepsValueEl.textContent = String(maxSteps);
+      }

       function parseTaskTypes(raw) {
         const tokens = String(raw || "")
@@ -580,6 +775,7 @@ def render_home_page() -> str:
         taskInput.value = selectedTaskTypes.join(",");
         renderTaskChips();
         renderTaskSelect();
+        updateRuntimeEstimate();
       }

       function addSelectedTaskType() {
```
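The formatDuration helper rolls minutes over into hours past the 60-minute mark. A Python mirror of the same logic (the function name here is just for illustration), with the ETA box's default as a worked case:

```python
def format_duration(total_seconds: float) -> str:
    # Mirrors the JS formatDuration above: clamp at zero, round, then
    # render mm/ss, switching to h/mm/ss once minutes reach 60.
    seconds = max(0, round(total_seconds))
    mins, secs = divmod(seconds, 60)
    if mins >= 60:
        hrs, rem = divmod(mins, 60)
        return f"{hrs}h {rem:02d}m {secs:02d}s"
    return f"{mins:02d}m {secs:02d}s"

assert format_duration(540) == "09m 00s"      # default ETA: 3 tasks x 1 episode x 180s
assert format_duration(3700) == "1h 01m 40s"  # rolls over to hours past 60 minutes
```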
```diff
@@ -591,12 +787,58 @@ def render_home_page() -> str:
         syncTaskTypes();
       }

+      function updateRuntimeEstimate() {
+        const episodes = clampInt(episodesInput.value, 1, 100, 1);
+        const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
+        const taskCount = selectedTaskTypes.length;
+        const totalEpisodes = taskCount * episodes;
+        const etaSeconds = totalEpisodes * ETA_SECONDS_PER_EPISODE;
+
+        etaValueEl.textContent = `~${formatDuration(etaSeconds)}`;
+        etaDetailEl.textContent =
+          `${taskCount} task(s) × ${episodes} episode(s) × ${ETA_SECONDS_PER_EPISODE}s/episode`;
+
+        const notes = [];
+        if (episodes > 2) {
+          notes.push("Recommended: keep episodes per task at 1-2 for faster hackathon runs.");
+        }
+        if (etaSeconds > HACKATHON_LIMIT_SECONDS) {
+          notes.push("Warning: ETA exceeds 20 minutes, which may violate hackathon runtime guidance.");
+        }
+        if (maxSteps > 20) {
+          notes.push("Higher max steps can increase runtime beyond this ETA estimate.");
+        }
+        if (taskCount === 0) {
+          notes.push("Add at least one task chip to run inference.");
+        }
+
+        etaWarningEl.classList.remove("warn", "ok");
+        if (!notes.length) {
+          etaWarningEl.textContent = "Runtime looks within limits for a quick benchmark run.";
+          etaWarningEl.classList.add("ok");
+          return;
+        }
+
+        etaWarningEl.textContent = notes.join(" ");
+        if (etaSeconds > HACKATHON_LIMIT_SECONDS || episodes > 2 || taskCount === 0) {
+          etaWarningEl.classList.add("warn");
+        } else {
+          etaWarningEl.classList.add("ok");
+        }
+      }
+
       function readFormPayload() {
+        const episodes = clampInt(episodesInput.value, 1, 100, 1);
+        const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
+        episodesInput.value = String(episodes);
+        maxStepsInput.value = String(maxSteps);
+        refreshSliderValues();
+        updateRuntimeEstimate();
         return {
           dataset_path: form.dataset_path.value.trim(),
-          episodes_per_task:
+          episodes_per_task: episodes,
           task_types: form.task_types.value.trim(),
-          max_steps:
+          max_steps: maxSteps,
           benchmark_name: form.benchmark_name.value.trim(),
           api_base_url: form.api_base_url.value.trim() || null,
           model_name: form.model_name.value.trim() || null,
@@ -707,8 +949,18 @@ def render_home_page() -> str:
           addSelectedTaskType();
         }
       });
+      episodesInput.addEventListener("input", () => {
+        refreshSliderValues();
+        updateRuntimeEstimate();
+      });
+      maxStepsInput.addEventListener("input", () => {
+        refreshSliderValues();
+        updateRuntimeEstimate();
+      });

+      refreshSliderValues();
       syncTaskTypes();
+      updateRuntimeEstimate();

       fetchStatus();
       window.setInterval(fetchStatus, 2200);
```
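The ETA thresholds are easy to check by hand: the estimate is simply task count × episodes × the assumed 180s per episode, compared against the 20-minute guidance. A small worked example in Python:

```python
ETA_SECONDS_PER_EPISODE = 180        # same constant as the UI script
HACKATHON_LIMIT_SECONDS = 20 * 60    # 1200s

tasks = 3                            # classify, root_cause, fix_proposal
print(tasks * 2 * ETA_SECONDS_PER_EPISODE)  # 1080s (~18m): 2 episodes/task still fits
print(tasks * 3 * ETA_SECONDS_PER_EPISODE)  # 1620s (~27m): 3 episodes/task trips the warning
```

The 180s/episode figure is a fixed assumption baked into the UI, not a measurement, so the warning is guidance rather than a guarantee.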