vedkdev committed on
Commit f53d90b · verified · 1 Parent(s): 283dfb9

Deploy FlakyGym UI + inference updates (minimal upload)

Files changed (11)
  1. .gitignore +32 -0
  2. .openenv_push_ignore +26 -0
  3. Dockerfile +0 -1
  4. GRADING.md +31 -2
  5. README.md +31 -9
  6. env/environment.py +68 -1
  7. inference.py +619 -81
  8. inference_debug.py +207 -3
  9. server/app.py +1 -1
  10. server/inference_runner.py +2 -2
  11. server/ui.py +260 -8
.gitignore ADDED
@@ -0,0 +1,32 @@
+ # Python caches
+ __pycache__/
+ *.py[cod]
+ *.so
+
+ # Virtual environments
+ .venv/
+ venv/
+ openenv/
+
+ # Local editor / tooling state
+ .vscode/
+ .codex/
+ .agents/
+
+ # Secrets and local config
+ .env
+ .env.*
+ *.local
+
+ # Logs and runtime artifacts
+ *.log
+ outputs/
+
+ # Build artifacts
+ build/
+ dist/
+ *.egg-info/
+
+ # Dataset/generated outputs
+ dataset/__pycache__/
+ dataset/py_tasks.csv
.openenv_push_ignore ADDED
@@ -0,0 +1,26 @@
+ .git/
+ .agents/
+ .codex/
+ .vscode/
+ openenv/
+ openenv/**
+ .venv/
+ .venv/**
+ venv/
+ venv/**
+ __pycache__/
+ **/__pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ *.log
+ *.tmp
+ *.cache
+ agent_trace.log
+ debug_trace.log
+ debug_trace2.log
+ debug_trace3.log
+ debug_trace4.log
+ debug_trace5.log
+ outputs/
+ .pytest_cache/
Dockerfile CHANGED
@@ -14,5 +14,4 @@ COPY . .
 
  EXPOSE 8000
 
- ENV ENABLE_WEB_INTERFACE=true
  CMD ["python", "-m", "server.app"]
GRADING.md CHANGED
@@ -73,8 +73,25 @@ reward = progress
  - else -> `0.01`
 
  ### `search_code`
- - if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
- - otherwise -> `0.01`
+ - base reward:
+   - if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
+   - otherwise -> `0.01`
+ - spam penalties (all apply, then summed and capped):
+   - repeated same normalized search pattern in episode:
+     - `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
+   - repeated same search context (same normalized pattern + same extracted top `.py` hit files):
+     - `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
+   - long search-only streak:
+     - `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
+   - total spam penalty cap: `min(sum_penalties, 0.35)`
+ - final `search_code` progress:
+
+   ```text
+   progress = max(-0.25, base_reward - spam_penalty)
+   ```
+
+ - environment appends `WARNING:` text to tool output when penalties fire.
+ - `consecutive_searches` resets on any non-`search_code` action.
 
  ### `run_test`
  - if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
@@ -251,3 +268,15 @@ reward = clamp(0.05 + 0.0, 0, 1) = 0.05
  - Timeout does not invoke grader; it only ends the episode.
  - Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
 
+ ## 9) Inference-side controls (not grader formulas)
+
+ `inference.py` now includes policy/runtime controls that do not change grader math directly but change agent behavior:
+
+ - episode memory injected into every prompt (recent files, search patterns, no-progress streak)
+ - explicit loop warning prompt when no-progress/duplicate patterns are detected
+ - duplicate `read_file` attempts are overridden to targeted `search_code`
+ - conversation compaction controls:
+   - `--history-prune-start-step` (default `12`)
+   - `--history-window-turns` (default `4`)
+   - `--history-max-chars` (default `50000`)
+ - detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug
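As a sanity check on the `search_code` reward math in GRADING.md above, the penalty and progress formulas can be sketched as standalone helpers. Function names here are illustrative, not the environment's actual API (that lives in `env/environment.py`):

```python
def search_spam_penalty(pattern_count: int, context_count: int,
                        consecutive_searches: int) -> float:
    """Sum the three spam penalties from GRADING.md, then apply the 0.35 cap."""
    penalty = 0.0
    if pattern_count > 1:  # repeated same normalized pattern
        penalty += min(0.02 * (pattern_count - 1), 0.12)
    if context_count > 1:  # same pattern + same top .py hits
        penalty += min(0.03 * (context_count - 1), 0.15)
    if consecutive_searches > 3:  # long search-only streak
        penalty += min(0.02 * (consecutive_searches - 3), 0.20)
    return min(penalty, 0.35)


def search_progress(base_reward: float, spam_penalty: float) -> float:
    """Final search_code progress, floored at -0.25."""
    return max(-0.25, base_reward - spam_penalty)
```

Note that with the worst case (all three penalties at their individual caps, 0.12 + 0.15 + 0.20 = 0.47), the total cap of 0.35 still binds.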
README.md CHANGED
@@ -15,6 +15,8 @@ tags:
 
  OpenEnv-compatible RL environment for flaky-test investigation in real Python repos.
 
+ Flaky tests are dangerous because they make CI results untrustworthy: real regressions can be ignored as "just flaky," while healthy code can fail randomly and block releases, wasting engineering time and eroding confidence in test signals. We are building this Gym-style RL environment so agents can practice flaky-test triage in realistic repositories, learn to separate true failures from nondeterministic noise, and generate faster, more reliable debugging and fix strategies at scale.
+
  ## Setup
 
  ```bash
@@ -58,14 +60,15 @@ curl -s http://localhost:8000/health
 
  ## Run Inference
 
- Recommended (OpenRouter):
+ Recommended (HF Router/OpenRouter/OpenAI compatible):
 
  ```bash
- export OPENROUTER_API_KEY=your_openrouter_api_key
- export API_BASE_URL=https://openrouter.ai/api/v1
- export MODEL_NAME=qwen/qwen3.6-plus:free
+ export HF_TOKEN=your_hf_token
+ # optional:
+ # export API_BASE_URL=https://router.huggingface.co/v1
+ # export MODEL_NAME=openai/gpt-oss-120b:novita
 
- python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 5
+ python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 2
  ```
 
  ### Run Inference From Space UI
@@ -73,22 +76,41 @@ python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 5
  When deployed, the Space homepage serves a UI at `/` (also `/web`) that starts
  `inference.py` in the background and streams logs live.
 
+ UI defaults:
+ - `episodes_per_task=1`
+ - slider range up to `100`
+ - live ETA estimator: `selected_tasks × episodes_per_task × 180s`
+ - warning when ETA may exceed 20 minutes (hackathon guidance)
+
  ### `inference.py` flags
 
  | Flag | Type | Default | Description |
  |---|---|---|---|
  | `--dataset-path` | `str` | `dataset/py_tasks.csv` | Processed task CSV used by env |
- | `--episodes-per-task` | `int` | `5` | Episodes per selected task type |
+ | `--episodes-per-task` | `int` | `2` | Episodes per selected task type |
  | `--task-types` | `str` | `classify,root_cause,fix_proposal` | Comma-separated task types |
  | `--max-steps` | `int` | `20` | Max steps per episode |
+ | `--no-progress` | flag | `False` | Disable progress bars in non-compliance mode |
+ | `--trace-agent` | flag | `False` | Print detailed action/model/tool trace |
+ | `--trace-prompts` | flag | `False` | Include full prompts in trace |
+ | `--trace-max-chars` | `int` | `2500` | Clip size for traced prompt/output blocks |
+ | `--compliance-stdout` | flag | `True` | Strict `[START]/[STEP]/[END]` logs (default on) |
+ | `--no-compliance-stdout` | flag | `False` | Switch to baseline summary/progress output |
  | `--benchmark-name` | `str` | `flakysleuth` | Label printed in `[START]` logs |
+ | `--history-prune-start-step` | `int` | `12` | Start compacting history from this step |
+ | `--history-window-turns` | `int` | `4` | Keep this many recent assistant/user turns on prune |
+ | `--history-max-chars` | `int` | `50000` | Force prune when message history exceeds this size |
 
- Trace to log:
+ Detailed trace to log:
  ```bash
  python inference.py \
    --dataset-path dataset/py_tasks.csv \
-   --episodes-per-task 5 \
-   --task-types classify,root_cause > agent_trace.log 2>&1
+   --episodes-per-task 1 \
+   --task-types classify,root_cause \
+   --no-compliance-stdout \
+   --trace-agent \
+   --history-prune-start-step 12 \
+   --history-window-turns 4 > agent_trace.log 2>&1
  ```
 
  ## OpenEnv CLI
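The UI's live ETA heuristic described in the README changes (`selected_tasks × episodes_per_task × 180s`, with a warning past 20 minutes) can be sketched as a small helper. The names and message format here are illustrative, not the Space's actual code:

```python
ETA_SECONDS_PER_EPISODE = 180   # rough per-episode budget assumed by the UI
ETA_WARN_SECONDS = 20 * 60      # 20-minute hackathon guidance threshold


def estimate_eta_seconds(selected_tasks: int, episodes_per_task: int) -> int:
    """Total estimated runtime in seconds for the selected run."""
    return selected_tasks * episodes_per_task * ETA_SECONDS_PER_EPISODE


def eta_message(selected_tasks: int, episodes_per_task: int) -> str:
    """Human-readable ETA line, with a warning when the guidance may be exceeded."""
    eta = estimate_eta_seconds(selected_tasks, episodes_per_task)
    mins, secs = divmod(eta, 60)
    msg = f"Estimated runtime: {mins}m {secs:02d}s"
    if eta > ETA_WARN_SECONDS:
        msg += " (warning: may exceed the 20-minute guidance)"
    return msg
```

For example, 3 task types at 2 episodes each estimate to 18 minutes, just under the warning threshold.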
env/environment.py CHANGED
@@ -41,6 +41,9 @@ class FlakySleuthEnv:
          self.cumulative_progress = 0.0
          self.files_read: set[str] = set()
          self.episode_actions: list[FlakySleuthAction] = []
+         self.search_pattern_counts: dict[str, int] = {}
+         self.search_context_counts: dict[str, int] = {}
+         self.consecutive_searches = 0
 
      def reset(self) -> FlakySleuthObservation:
          if self.sandbox:
@@ -61,6 +64,9 @@
          self.cumulative_progress = 0.0
          self.files_read = set()
          self.episode_actions = []
+         self.search_pattern_counts = {}
+         self.search_context_counts = {}
+         self.consecutive_searches = 0
 
          return self._make_obs()
 
@@ -153,6 +159,8 @@
 
          progress = 0.0
          output = ""
+         if action.action_type != "search_code":
+             self.consecutive_searches = 0
 
          if action.action_type == "read_file":
              content = self.sandbox.read_file(action.argument)
@@ -168,8 +176,13 @@
              progress = self._file_relevance_reward(action.argument)
 
          elif action.action_type == "search_code":
+             self.consecutive_searches += 1
              output = self.sandbox.grep(action.argument)
-             progress = self._search_relevance_reward(action.argument)
+             base_progress = self._search_relevance_reward(action.argument)
+             spam_penalty, warnings = self._search_spam_penalty(action.argument, output)
+             progress = max(-0.25, base_progress - spam_penalty)
+             if warnings:
+                 output = f"{output}\n\nWARNING: {' '.join(warnings)}"
 
          elif action.action_type == "run_test":
              output = self.sandbox.run_test(self.current_task.get("test_name", ""))
@@ -198,6 +211,60 @@
              return 0.04
          return 0.01
 
+     def _search_spam_penalty(self, pattern: str, output: str) -> tuple[float, list[str]]:
+         penalty = 0.0
+         warnings: list[str] = []
+
+         pattern_key = " ".join(pattern.lower().split())
+         if pattern_key:
+             pattern_count = self.search_pattern_counts.get(pattern_key, 0) + 1
+             self.search_pattern_counts[pattern_key] = pattern_count
+             if pattern_count > 1:
+                 repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)
+                 penalty += repeat_penalty
+                 warnings.append(
+                     f"Repeated search pattern ({pattern_count}x) penalty={repeat_penalty:.2f}."
+                 )
+
+         context_hits = self._extract_search_hits(output)
+         context_key = f"{pattern_key}::{','.join(context_hits)}"
+         context_count = self.search_context_counts.get(context_key, 0) + 1
+         self.search_context_counts[context_key] = context_count
+         if context_count > 1:
+             context_penalty = min(0.03 * (context_count - 1), 0.15)
+             penalty += context_penalty
+             warnings.append(
+                 f"Same search context repeated ({context_count}x) penalty={context_penalty:.2f}."
+             )
+
+         if self.consecutive_searches > 3:
+             streak_penalty = min(0.02 * (self.consecutive_searches - 3), 0.20)
+             penalty += streak_penalty
+             warnings.append(
+                 f"Search-only streak={self.consecutive_searches} penalty={streak_penalty:.2f}."
+             )
+
+         return min(penalty, 0.35), warnings
+
+     def _extract_search_hits(self, output: str) -> tuple[str, ...]:
+         files: list[str] = []
+         seen: set[str] = set()
+         for raw_line in output.splitlines():
+             line = raw_line.strip()
+             if not line or line.startswith("No matches found") or line.startswith("Search "):
+                 continue
+             filepath = line.split(":", 1)[0].strip()
+             if filepath.startswith("./"):
+                 filepath = filepath[2:]
+             if not filepath.endswith(".py"):
+                 continue
+             if filepath not in seen:
+                 seen.add(filepath)
+                 files.append(filepath)
+                 if len(files) >= 5:
+                     break
+         return tuple(files)
+
      def _make_obs(self, tool_output: str | None = None) -> FlakySleuthObservation:
          if not self.current_task:
              raise RuntimeError("No current task available")
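The context-penalty key added above depends on which `.py` files a grep returns. Its filtering can be exercised standalone; `extract_py_hits` is a re-implementation for illustration, under the same `path:line:text` output assumption as the environment's `_extract_search_hits`:

```python
def extract_py_hits(output: str, limit: int = 5) -> tuple[str, ...]:
    """Collect up to `limit` unique .py paths from grep-style 'path:line:text' output."""
    files: list[str] = []
    seen: set[str] = set()
    for raw_line in output.splitlines():
        line = raw_line.strip()
        # Skip empty lines and grep status banners.
        if not line or line.startswith("No matches found") or line.startswith("Search "):
            continue
        filepath = line.split(":", 1)[0].strip()
        if filepath.startswith("./"):
            filepath = filepath[2:]
        if not filepath.endswith(".py"):
            continue
        if filepath not in seen:
            seen.add(filepath)
            files.append(filepath)
            if len(files) >= limit:
                break
    return tuple(files)
```

Duplicate hits in the same file collapse to one entry, and non-Python matches (e.g. `README.md`) are ignored, so repeated searches that surface the same test files produce the same context key.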
inference.py CHANGED
@@ -1,23 +1,44 @@
- """FlakySleuth compliance inference script.
+ """FlakySleuth baseline inference script.
+
+ Environment variables:
+     Preferred:
+         HF_TOKEN / HUGGINGFACE_HUB_TOKEN (or OPENROUTER_API_KEY / API_KEY)
+         API_BASE_URL (optional, defaults to https://openrouter.ai/api/v1 for router-style keys)
+         MODEL_NAME (optional, defaults to qwen/qwen3.6-plus:free on OpenRouter)
+
+     Optional fallback:
+         OPENAI_API_KEY
+         API_BASE_URL (defaults to https://api.openai.com/v1 when OpenAI key is used)
+         MODEL_NAME (defaults to gpt-4o-mini for OpenAI)
  """
 
  from __future__ import annotations
 
- import argparse
  import json
  import os
+ import argparse
+ import time
+ from collections import defaultdict
+ from pathlib import Path
  from typing import Any
 
  from openai import OpenAI
 
+ try:
+     from tqdm import tqdm
+ except Exception:  # pragma: no cover
+     tqdm = None
+
  from env.environment import FlakySleuthEnv
  from env.models import FlakySleuthAction, FlakySleuthObservation
 
- HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
  OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
  OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
+ HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
+ # Optional for environments created via from_docker_image(); kept for checklist parity.
+ LOCAL_IMAGE_NAME = os.environ.get("LOCAL_IMAGE_NAME")
  RAW_API_KEY = os.environ.get("API_KEY")
- API_KEY = RAW_API_KEY or HF_TOKEN or OPENROUTER_API_KEY or OPENAI_API_KEY or ""
+ API_KEY = RAW_API_KEY or OPENROUTER_API_KEY or OPENAI_API_KEY or HF_TOKEN or ""
 
 
  def _looks_like_openrouter_key(key: str | None) -> bool:
@@ -26,14 +47,19 @@ def _looks_like_openrouter_key(key: str | None) -> bool:
 
  DEFAULT_BASE_URL = (
      "https://router.huggingface.co/v1"
-     if (HF_TOKEN and not RAW_API_KEY and not OPENROUTER_API_KEY and not OPENAI_API_KEY)
+     if (
+         HF_TOKEN
+         and not RAW_API_KEY
+         and not OPENROUTER_API_KEY
+         and not OPENAI_API_KEY
+     )
      else (
-         "https://openrouter.ai/api/v1"
-         if (
-             (OPENROUTER_API_KEY and not RAW_API_KEY and not OPENAI_API_KEY)
-             or (_looks_like_openrouter_key(RAW_API_KEY) and not OPENAI_API_KEY)
-         )
-         else "https://api.openai.com/v1"
+         "https://openrouter.ai/api/v1"
+         if (
+             (OPENROUTER_API_KEY and not RAW_API_KEY and not OPENAI_API_KEY)
+             or (_looks_like_openrouter_key(RAW_API_KEY) and not OPENAI_API_KEY)
+         )
+         else "https://api.openai.com/v1"
      )
  )
  API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)
@@ -41,13 +67,17 @@ API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)
  DEFAULT_MODEL = (
      "openai/gpt-oss-120b:novita"
      if API_BASE_URL.startswith("https://router.huggingface.co")
-     else ("qwen/qwen3.6-plus:free" if API_BASE_URL.startswith("https://openrouter.ai") else "gpt-4o-mini")
+     else (
+         "qwen/qwen3.6-plus:free"
+         if API_BASE_URL.startswith("https://openrouter.ai")
+         else "gpt-4o-mini"
+     )
  )
  MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)
-
- EPISODES_PER_TASK = 5
+ # Keep a conservative default to stay under common hackathon runtime limits.
+ EPISODES_PER_TASK = 2
  MAX_STEPS = 20
- BENCHMARK_NAME = "flakysleuth"
+ MEMORY_MAX_CHARS = 900
 
  client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
 
@@ -81,34 +111,44 @@ Rules:
  """
 
 
- def _single_line(text: str) -> str:
+ def _to_single_line(text: str) -> str:
      return " ".join(str(text).split())
 
 
- def log_start(task: str, env_name: str, model: str) -> None:
-     print(f"[START] task={task} env={env_name} model={model}", flush=True)
+ def _compliance_log_start(task: str, benchmark: str, model: str) -> None:
+     print(f"[START] task={task} env={benchmark} model={model}", flush=True)
 
 
- def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
-     error_value = _single_line(error) if error else "null"
-     done_value = str(bool(done)).lower()
+ def _compliance_log_step(
+     step: int,
+     action: str,
+     reward: float,
+     done: bool,
+     error: str | None,
+ ) -> None:
+     error_value = _to_single_line(error) if error else "null"
      print(
-         f"[STEP] step={step} action={_single_line(action)} "
-         f"reward={reward:.2f} done={done_value} error={error_value}",
+         f"[STEP] step={step} action={_to_single_line(action)} "
+         f"reward={reward:.2f} done={str(bool(done)).lower()} error={error_value}",
          flush=True,
      )
 
 
- def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
-     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+ def _compliance_log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+     rewards_value = ",".join(f"{r:.2f}" for r in rewards)
      print(
          f"[END] success={str(bool(success)).lower()} steps={steps} "
-         f"score={score:.2f} rewards={rewards_str}",
+         f"score={score:.2f} rewards={rewards_value}",
          flush=True,
      )
 
 
- def obs_to_prompt(obs: FlakySleuthObservation, max_steps: int) -> str:
+ def obs_to_prompt(
+     obs: FlakySleuthObservation,
+     *,
+     memory_hint: str | None = None,
+     max_steps: int = MAX_STEPS,
+ ) -> str:
      tree_preview = "\n".join(obs.file_tree[:40])
      return f"""TASK: {obs.task_description}
 
@@ -125,7 +165,10 @@ Repository file tree:
  {tree_preview}
 
  Last tool output:
- {obs.tool_output or '(No action taken yet)'}
+ {obs.tool_output or "(No action taken yet)"}
+
+ Episode memory:
+ {memory_hint or "(No memory yet.)"}
 
  Return only JSON action."""
 
@@ -157,10 +200,18 @@ def heuristic_action(obs: FlakySleuthObservation) -> FlakySleuthAction:
      )
 
 
- def llm_action(messages: list[dict[str, str]]) -> FlakySleuthAction | None:
+ def llm_action(
+     messages: list[dict[str, str]],
+ ) -> tuple[FlakySleuthAction | None, dict[str, Any]]:
+     meta: dict[str, Any] = {
+         "attempted": False,
+         "raw_output": "",
+         "error": "",
+     }
      if not API_KEY:
-         return None
+         return None, meta
 
+     meta["attempted"] = True
      response = client.chat.completions.create(
          model=MODEL_NAME,
          messages=messages,
@@ -168,80 +219,407 @@ def llm_action(messages: list[dict[str, str]]) -> FlakySleuthAction | None:
          temperature=0.0,
      )
      raw = (response.choices[0].message.content or "").strip()
+     meta["raw_output"] = raw
      cleaned = raw.replace("```json", "").replace("```", "").strip()
      payload = json.loads(cleaned)
-     return FlakySleuthAction.model_validate(payload)
+     return FlakySleuthAction.model_validate(payload), meta
+
+
+ def _clip_text(text: str, max_chars: int) -> str:
+     if max_chars <= 0:
+         return text
+     if len(text) <= max_chars:
+         return text
+     remaining = len(text) - max_chars
+     return f"{text[:max_chars]}\n...[truncated {remaining} chars]"
+
+
+ def _trace_print(
+     enabled: bool,
+     message: str,
+     *,
+     text: str | None = None,
+     max_chars: int = 0,
+ ) -> None:
+     if not enabled:
+         return
+     print(message)
+     if text is not None:
+         print(_clip_text(text, max_chars))
+
+
+ def _format_duration(seconds: float) -> str:
+     seconds = max(0.0, float(seconds))
+     mins, secs = divmod(int(round(seconds)), 60)
+     hrs, mins = divmod(mins, 60)
+     if hrs > 0:
+         return f"{hrs:d}h {mins:02d}m {secs:02d}s"
+     return f"{mins:02d}m {secs:02d}s"
+
+
+ def _build_episode_memory(
+     *,
+     unique_read_files: list[str],
+     zero_gain_read_files: set[str],
+     search_patterns: list[str],
+     blocked_duplicate_reads: int,
+     no_progress_streak: int,
+     max_chars: int,
+ ) -> str:
+     read_tail = ", ".join(unique_read_files[-8:]) if unique_read_files else "none"
+     zero_tail = ", ".join(sorted(zero_gain_read_files)[-8:]) if zero_gain_read_files else "none"
+     search_tail = ", ".join(search_patterns[-6:]) if search_patterns else "none"
+     loop_warning = (
+         "WARNING: Possible loop detected. Stop repeating similar exploration. "
+         "Switch strategy or take a terminal action."
+         if no_progress_streak >= 3 or blocked_duplicate_reads >= 2
+         else "Status: exploration progress appears normal."
+     )
+     memory = (
+         f"Read files (recent): {read_tail}\n"
+         f"Zero-gain read files: {zero_tail}\n"
+         f"Search patterns (recent): {search_tail}\n"
+         f"Blocked duplicate reads: {blocked_duplicate_reads}\n"
+         f"No-progress streak: {no_progress_streak}\n"
+         f"{loop_warning}\n"
+         "Guidance: Avoid rereading zero-gain files unless there is new evidence. "
+         "Prefer targeted search_code or terminal action when confidence is enough."
+     )
+     return _clip_text(memory, max_chars=max_chars)
+
+
+ def _duplicate_read_replacement_pattern(obs: FlakySleuthObservation) -> str:
+     test_hint = obs.test_name.split("::")[-1] if obs.test_name else "test"
+     return (
+         f"{test_hint}|random|sleep|time|timeout|retry|asyncio|thread|"
+         "fixture|global|shared|mock|patch"
+     )
+
+
+ def _messages_char_count(messages: list[dict[str, str]]) -> int:
+     # Lightweight size heuristic to avoid unbounded context growth.
+     return sum(len(str(msg.get("content", ""))) + 32 for msg in messages)
+
+
+ def _prune_messages_window(
+     messages: list[dict[str, str]],
+     *,
+     step_number: int,
+     prune_start_step: int,
+     window_turns: int,
+     max_chars: int,
+ ) -> tuple[list[dict[str, str]], dict[str, Any] | None]:
+     if len(messages) <= 2:
+         return messages, None
+
+     current_chars = _messages_char_count(messages)
+     exceeds_step_threshold = step_number >= prune_start_step
+     exceeds_char_budget = current_chars > max_chars
+     if not exceeds_step_threshold and not exceeds_char_budget:
+         return messages, None
+
+     base = messages[:2]  # system + initial prompt
+     tail = messages[2:]
+     keep_tail_items = max(2, window_turns * 2)
+     if len(tail) > keep_tail_items:
+         tail = tail[-keep_tail_items:]
+     pruned = base + tail
+
+     reason = "step_threshold" if exceeds_step_threshold else "char_budget"
+     return pruned, {
+         "reason": reason,
+         "before_messages": len(messages),
+         "after_messages": len(pruned),
+         "before_chars": current_chars,
+         "after_chars": _messages_char_count(pruned),
+         "step": step_number,
+     }
 
 
  def run_episode(
      env: FlakySleuthEnv,
      *,
-     task_name: str,
-     benchmark_name: str,
-     max_steps: int,
- ) -> float:
+     print_terminal: bool = True,
+     trace_agent: bool = False,
+     trace_prompts: bool = False,
+     trace_max_chars: int = 2000,
+     episode_label: str = "",
+     compliance_stdout: bool = False,
+     benchmark_name: str = "flakysleuth",
+     compliance_task_name: str | None = None,
+     history_prune_start_step: int = 12,
+     history_window_turns: int = 4,
+     history_max_chars: int = 50000,
+ ) -> tuple[float, dict[str, Any]]:
      rewards: list[float] = []
      steps_taken = 0
-     score = 0.0
      success = False
-
-     log_start(task=task_name, env_name=benchmark_name, model=MODEL_NAME)
-
+     episode_task_name = (compliance_task_name or episode_label.split(" ", 1)[0].strip() or "unknown")
+     exploration_reward_total = 0.0
+     final_episode_score = 0.0
+     terminal_meta: dict[str, Any] = {}
+     llm_steps = 0
+     heuristic_steps = 0
+     fallback_reasons: dict[str, int] = {}
+     prune_events = 0
+     read_attempt_counts: dict[str, int] = {}
+     unique_read_files: list[str] = []
+     zero_gain_read_files: set[str] = set()
+     search_patterns: list[str] = []
+     blocked_duplicate_reads = 0
+     no_progress_streak = 0
+     memory_hint = _build_episode_memory(
+         unique_read_files=unique_read_files,
+         zero_gain_read_files=zero_gain_read_files,
+         search_patterns=search_patterns,
+         blocked_duplicate_reads=blocked_duplicate_reads,
+         no_progress_streak=no_progress_streak,
+         max_chars=MEMORY_MAX_CHARS,
+     )
+     if compliance_stdout:
+         _compliance_log_start(episode_task_name, benchmark_name, MODEL_NAME)
      try:
          obs = env.reset()
-         messages: list[dict[str, str]] = [
+
+         initial_prompt = obs_to_prompt(obs, memory_hint=memory_hint, max_steps=env.max_steps)
+         messages = [
              {"role": "system", "content": SYSTEM_PROMPT},
-             {"role": "user", "content": obs_to_prompt(obs, max_steps)},
+             {"role": "user", "content": initial_prompt},
          ]
 
+         if not compliance_stdout:
+             _trace_print(
+                 trace_agent,
+                 (
+                     f"\n[trace] {episode_label} "
+                     f"task={obs.task_type} repo={obs.repo_url} test={obs.test_name}"
+                 ).strip(),
+             )
+         if trace_prompts and not compliance_stdout:
+             _trace_print(
+                 trace_agent,
+                 "[trace] system prompt:",
+                 text=SYSTEM_PROMPT,
+                 max_chars=trace_max_chars,
+             )
+             _trace_print(
+                 trace_agent,
+                 "[trace] initial user prompt:",
+                 text=initial_prompt,
+                 max_chars=trace_max_chars,
+             )
+
-         for step_idx in range(1, max_steps + 1):
+         for step_idx in range(env.max_steps):
              try:
-                 action = llm_action(messages) or heuristic_action(obs)
-             except Exception:
                  action = heuristic_action(obs)
 
              obs, reward, done, info = env.step(action)
-             rewards.append(float(reward or 0.0))
-             steps_taken = step_idx
 
              step_error: str | None = None
              if isinstance(info, dict):
-                 last_action_error = info.get("last_action_error")
-                 if last_action_error:
-                     step_error = str(last_action_error)
-
-             log_step(
-                 step=step_idx,
-                 action=action.model_dump_json(),
-                 reward=float(reward or 0.0),
-                 done=bool(done),
-                 error=step_error,
-             )
 
              if done:
-                 score = float(reward or 0.0)
                  break
 
              messages.append({"role": "assistant", "content": action.model_dump_json()})
-             messages.append({"role": "user", "content": obs_to_prompt(obs, max_steps)})
-
-         score = min(max(score, 0.0), 1.0)
-         success = score > 0.0
-     except Exception:
-         score = 0.0
          success = False
      finally:
-         try:
-             env.close()
-         except Exception:
-             pass
-         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 
-     return score
 
 
  def _parse_args() -> argparse.Namespace:
-     parser = argparse.ArgumentParser(description="Run FlakySleuth compliance inference.")
      parser.add_argument(
          "--dataset-path",
          default="dataset/py_tasks.csv",
@@ -264,34 +642,194 @@ def _parse_args() -> argparse.Namespace:
          default=MAX_STEPS,
          help="Max steps per episode.",
      )
      parser.add_argument(
          "--benchmark-name",
-         default=BENCHMARK_NAME,
-         help="Benchmark label for [START] lines.",
      )
      return parser.parse_args()
 
 
  def main() -> None:
      args = _parse_args()
      env = FlakySleuthEnv(dataset_path=args.dataset_path, max_steps=args.max_steps)
-
      allowed_task_types = {"classify", "root_cause", "fix_proposal"}
      task_types = [t.strip() for t in args.task_types.split(",") if t.strip()]
      if not task_types:
-         return
 
      for task_type in task_types:
-         if task_type not in allowed_task_types:
-             continue
          env.loader.force_task_type(task_type)
-         for _ in range(args.episodes_per_task):
-             run_episode(
                  env,
-                 task_name=task_type,
                  benchmark_name=args.benchmark_name,
-                 max_steps=args.max_steps,
              )
 
 
  if __name__ == "__main__":
+ messages, prune_info = _prune_messages_window(
414
+ messages,
415
+ step_number=step_idx + 1,
416
+ prune_start_step=history_prune_start_step,
417
+ window_turns=history_window_turns,
418
+ max_chars=history_max_chars,
419
+ )
420
+ if prune_info:
421
+ prune_events += 1
422
+ if trace_agent and not compliance_stdout:
423
+ print(
424
+ "[trace] context_pruned "
425
+ f"reason={prune_info['reason']} "
426
+ f"step={prune_info['step']} "
427
+ f"messages={prune_info['before_messages']}->{prune_info['after_messages']} "
428
+ f"chars={prune_info['before_chars']}->{prune_info['after_chars']}"
429
+ )
430
+
431
+ action: FlakySleuthAction
432
+ action_source = "heuristic"
433
+ llm_meta: dict[str, Any] = {"attempted": False, "raw_output": "", "error": ""}
434
  try:
435
+ candidate, llm_meta = llm_action(messages)
436
+ if candidate is not None:
437
+ action = candidate
438
+ action_source = "llm"
439
+ else:
440
+ action = heuristic_action(obs)
441
+ if llm_meta.get("attempted"):
442
+ llm_meta["error"] = (
443
+ "Model response unavailable, using heuristic fallback."
444
+ )
445
+ except Exception as exc:
446
+ llm_meta["error"] = str(exc)
447
  action = heuristic_action(obs)
448
 
449
+ if action.action_type == "read_file":
450
+ prior_reads = read_attempt_counts.get(action.argument, 0)
451
+ if prior_reads >= 1:
452
+ blocked_duplicate_reads += 1
453
+ replacement = FlakySleuthAction(
454
+ action_type="search_code",
455
+ argument=_duplicate_read_replacement_pattern(obs),
456
+ )
457
+ if trace_agent and not compliance_stdout:
458
+ print(
459
+ "[trace] action_overridden "
460
+ f"reason=duplicate_read file={action.argument} "
461
+ f"replacement={replacement.action_type}"
462
+ )
463
+ action = replacement
464
+
465
+ if action_source == "llm":
466
+ llm_steps += 1
467
+ else:
468
+ heuristic_steps += 1
469
+ if not API_KEY:
470
+ reason_key = "no_api_key"
471
+ elif llm_meta.get("error"):
472
+ reason_key = "llm_error"
473
+ elif llm_meta.get("attempted"):
474
+ reason_key = "empty_or_invalid_response"
475
+ else:
476
+ reason_key = "heuristic_default"
477
+ fallback_reasons[reason_key] = fallback_reasons.get(reason_key, 0) + 1
478
+
479
+ if trace_agent and not compliance_stdout:
480
+ print(f"[trace] step={step_idx + 1} action_source={action_source}")
481
+ if llm_meta.get("attempted"):
482
+ _trace_print(
483
+ True,
484
+ "[trace] raw model output:",
485
+ text=str(llm_meta.get("raw_output", "")),
486
+ max_chars=trace_max_chars,
487
+ )
488
+ if llm_meta.get("error"):
489
+ print(f"[trace] llm_error={llm_meta['error']}")
490
+ print(f"[trace] action={action.model_dump_json()}")
491
+
492
  obs, reward, done, info = env.step(action)
493
+ rewards.append(reward)
494
+ steps_taken = step_idx + 1
495
+
496
+ if action.action_type == "read_file":
497
+ read_attempt_counts[action.argument] = read_attempt_counts.get(action.argument, 0) + 1
498
+ if action.argument not in unique_read_files:
499
+ unique_read_files.append(action.argument)
500
+ if reward <= 0:
501
+ zero_gain_read_files.add(action.argument)
502
+ elif action.action_type == "search_code":
503
+ if action.argument not in search_patterns:
504
+ search_patterns.append(action.argument)
505
+
506
+ if done:
507
+ no_progress_streak = 0
508
+ elif reward <= 0:
509
+ no_progress_streak += 1
510
+ else:
511
+ no_progress_streak = 0
512
+
513
+ memory_hint = _build_episode_memory(
514
+ unique_read_files=unique_read_files,
515
+ zero_gain_read_files=zero_gain_read_files,
516
+ search_patterns=search_patterns,
517
+ blocked_duplicate_reads=blocked_duplicate_reads,
518
+ no_progress_streak=no_progress_streak,
519
+ max_chars=MEMORY_MAX_CHARS,
520
+ )
521
 
522
  step_error: str | None = None
523
  if isinstance(info, dict):
524
+ raw_err = info.get("last_action_error")
525
+ if raw_err:
526
+ step_error = str(raw_err)
527
+ if not step_error and obs.tool_output and str(obs.tool_output).startswith("ERROR:"):
528
+ step_error = str(obs.tool_output)
529
+
530
+ if compliance_stdout:
531
+ _compliance_log_step(
532
+ step=steps_taken,
533
+ action=action.model_dump_json(),
534
+ reward=reward,
535
+ done=done,
536
+ error=step_error,
537
+ )
538
+
539
+ if trace_agent and not compliance_stdout:
540
+ print(
541
+ f"[trace] step_result reward={reward:.3f} done={done} "
542
+ f"step_count={obs.step_count}"
543
+ )
544
+ if obs.tool_output:
545
+ _trace_print(
546
+ True,
547
+ "[trace] tool_output:",
548
+ text=obs.tool_output,
549
+ max_chars=trace_max_chars,
550
+ )
551
 
552
  if done:
553
+ # Terminal reward already includes cumulative progress + terminal score.
554
+ final_episode_score = reward
555
+ terminal_meta = {
556
+ "action_type": action.action_type,
557
+ "terminal_score": float(info.get("terminal_score", 0) or 0),
558
+ "progress_score": float(info.get("progress_score", 0) or 0),
559
+ "explore_sum": exploration_reward_total,
560
+ "episode_score": final_episode_score,
561
+ "llm_steps": llm_steps,
562
+ "heuristic_steps": heuristic_steps,
563
+ "fallback_reasons": dict(fallback_reasons),
564
+ "context_prune_events": prune_events,
565
+ "duplicate_read_blocks": blocked_duplicate_reads,
566
+ }
567
+ success = final_episode_score > 0.0
568
+ if print_terminal:
569
+ print(
570
+ f" Terminal: {action.action_type}({action.argument[:40]}) "
571
+ f"-> terminal={info.get('terminal_score', 0):.2f} "
572
+ f"progress={info.get('progress_score', 0):.2f} "
573
+ f"explore_sum={exploration_reward_total:.3f} "
574
+ f"episode_score={final_episode_score:.3f}"
575
+ )
576
  break
577
 
578
+ exploration_reward_total += reward
579
  messages.append({"role": "assistant", "content": action.model_dump_json()})
580
+ next_prompt = obs_to_prompt(obs, memory_hint=memory_hint, max_steps=env.max_steps)
581
+ messages.append({"role": "user", "content": next_prompt})
582
+ if trace_agent and trace_prompts and not compliance_stdout:
583
+ _trace_print(
584
+ True,
585
+ f"[trace] next user prompt (step={step_idx + 1}):",
586
+ text=next_prompt,
587
+ max_chars=trace_max_chars,
588
+ )
589
+ except Exception as exc:
590
+ terminal_meta["error"] = str(exc)
591
  success = False
592
+ if not compliance_stdout:
593
+ raise
594
  finally:
595
+ if compliance_stdout:
596
+ try:
597
+ env.close()
598
+ except Exception:
599
+ pass
600
+ _compliance_log_end(
601
+ success=success,
602
+ steps=steps_taken,
603
+ score=min(max(final_episode_score, 0.0), 1.0),
604
+ rewards=rewards,
605
+ )
606
+
607
+ return final_episode_score, terminal_meta
608
 
609
+
610
+ def _looks_like_placeholder_dataset(dataset_path: str) -> bool:
611
+ path = Path(dataset_path)
612
+ if not path.exists():
613
+ return False
614
+ try:
615
+ text = path.read_text(encoding="utf-8", errors="replace")
616
+ except Exception:
617
+ return False
618
+ return "fixture://" in text
619
 
620
 
621
  def _parse_args() -> argparse.Namespace:
622
+ parser = argparse.ArgumentParser(description="Run FlakySleuth baseline inference.")
623
  parser.add_argument(
624
  "--dataset-path",
625
  default="dataset/py_tasks.csv",
 
642
  default=MAX_STEPS,
643
  help="Max steps per episode.",
644
  )
645
+ parser.add_argument(
646
+ "--no-progress",
647
+ action="store_true",
648
+ help="Disable progress bars and print classic per-episode logs.",
649
+ )
650
+ parser.add_argument(
651
+ "--trace-agent",
652
+ action="store_true",
653
+ help=(
654
+ "Print detailed agent trace: model output, chosen action/tool call, and "
655
+ "step results for every episode."
656
+ ),
657
+ )
658
+ parser.add_argument(
659
+ "--trace-prompts",
660
+ action="store_true",
661
+ help="When tracing, also print full prompts sent to the model.",
662
+ )
663
+ parser.add_argument(
664
+ "--trace-max-chars",
665
+ type=int,
666
+ default=2500,
667
+ help="Max chars per traced text block (prompt/model output/tool output).",
668
+ )
669
+ parser.add_argument(
670
+ "--compliance-stdout",
671
+ dest="compliance_stdout",
672
+ action="store_true",
673
+ help=(
674
+ "Emit strict compliance logs to stdout using only [START]/[STEP]/[END] lines "
675
+ "for each episode."
676
+ ),
677
+ )
678
+ parser.add_argument(
679
+ "--no-compliance-stdout",
680
+ dest="compliance_stdout",
681
+ action="store_false",
682
+ help="Disable strict compliance logs and print baseline summaries/progress.",
683
+ )
684
  parser.add_argument(
685
  "--benchmark-name",
686
+ default="flakysleuth",
687
+ help="Benchmark name used in [START] lines when --compliance-stdout is enabled.",
688
+ )
689
+ parser.add_argument(
690
+ "--history-prune-start-step",
691
+ type=int,
692
+ default=12,
693
+ help="Start pruning conversation history only from this step onward.",
694
  )
695
+ parser.add_argument(
696
+ "--history-window-turns",
697
+ type=int,
698
+ default=4,
699
+ help="When pruning is active, keep this many recent assistant/user turns.",
700
+ )
701
+ parser.add_argument(
702
+ "--history-max-chars",
703
+ type=int,
704
+ default=50000,
705
+ help="Approx max chars for messages before forced pruning by size.",
706
+ )
707
+ parser.set_defaults(compliance_stdout=True)
708
  return parser.parse_args()
709
 
710
 
711
  def main() -> None:
712
+ run_start = time.perf_counter()
713
  args = _parse_args()
714
  env = FlakySleuthEnv(dataset_path=args.dataset_path, max_steps=args.max_steps)
 
715
  allowed_task_types = {"classify", "root_cause", "fix_proposal"}
716
  task_types = [t.strip() for t in args.task_types.split(",") if t.strip()]
717
+ invalid = [t for t in task_types if t not in allowed_task_types]
718
+ if invalid:
719
+ raise ValueError(
720
+ f"Invalid task type(s): {invalid}. "
721
+ "Valid values: classify,root_cause,fix_proposal."
722
+ )
723
  if not task_types:
724
+ raise ValueError(
725
+ "No task types selected. Pass --task-types with at least one value."
726
+ )
727
+ results: dict[str, list[float]] = defaultdict(list)
728
+
729
+ if _looks_like_placeholder_dataset(args.dataset_path) and not args.compliance_stdout:
730
+ print(
731
+ "[warning] dataset appears to contain fixture rows (fixture://...). "
732
+ "Build real dataset from py-data.csv for real evaluation."
733
+ )
734
+
735
+ use_progress = (
736
+ (tqdm is not None)
737
+ and (not args.no_progress)
738
+ and (not args.compliance_stdout)
739
+ and os.isatty(1)
740
+ )
741
+ if args.trace_agent and use_progress and not args.compliance_stdout:
742
+ print(
743
+ "[info] --trace-agent enabled, disabling progress bars for readable trace logs."
744
+ )
745
+ use_progress = False
746
+ overall_bar = None
747
+ if use_progress:
748
+ overall_bar = tqdm(
749
+ total=len(task_types) * args.episodes_per_task,
750
+ desc="All tasks",
751
+ unit="ep",
752
+ dynamic_ncols=True,
753
+ )
754
 
755
  for task_type in task_types:
756
+ task_start = time.perf_counter()
757
+ if not args.compliance_stdout:
758
+ print(f"\n-- Task type: {task_type} --")
759
  env.loader.force_task_type(task_type)
760
+ task_bar = None
761
+ if use_progress:
762
+ task_bar = tqdm(
763
+ total=args.episodes_per_task,
764
+ desc=f"{task_type}",
765
+ unit="ep",
766
+ leave=False,
767
+ dynamic_ncols=True,
768
+ )
769
+ for episode in range(args.episodes_per_task):
770
+ score, meta = run_episode(
771
  env,
772
+ print_terminal=(not use_progress) and (not args.compliance_stdout),
773
+ trace_agent=args.trace_agent,
774
+ trace_prompts=args.trace_prompts,
775
+ trace_max_chars=args.trace_max_chars,
776
+ episode_label=f"{task_type} ep={episode + 1}/{args.episodes_per_task}",
777
+ compliance_stdout=args.compliance_stdout,
778
  benchmark_name=args.benchmark_name,
779
+ compliance_task_name=task_type,
780
+ history_prune_start_step=args.history_prune_start_step,
781
+ history_window_turns=args.history_window_turns,
782
+ history_max_chars=args.history_max_chars,
783
  )
784
+ results[task_type].append(score)
785
+ if use_progress and task_bar is not None:
786
+ task_bar.update(1)
787
+ task_avg = sum(results[task_type]) / len(results[task_type])
788
+ task_bar.set_postfix(
789
+ score=f"{score:.3f}",
790
+ avg=f"{task_avg:.3f}",
791
+ term=f"{meta.get('terminal_score', 0):.2f}",
792
+ )
793
+ if overall_bar is not None:
794
+ overall_bar.update(1)
795
+ all_scores = [s for values in results.values() for s in values]
796
+ overall_avg = sum(all_scores) / len(all_scores)
797
+ overall_bar.set_postfix(task=task_type, avg=f"{overall_avg:.3f}")
798
+ elif not args.compliance_stdout:
799
+ print(f" Episode {episode + 1}: {score:.3f}")
800
+ if task_bar is not None:
801
+ task_bar.close()
802
+ task_elapsed = time.perf_counter() - task_start
803
+ if not args.compliance_stdout:
804
+ avg_task = sum(results[task_type]) / max(1, len(results[task_type]))
805
+ print(
806
+ f" [time] task={task_type} elapsed={_format_duration(task_elapsed)} "
807
+ f"avg_ep={task_elapsed / max(1, args.episodes_per_task):.2f}s "
808
+ f"avg_score={avg_task:.3f}"
809
+ )
810
+
811
+ if overall_bar is not None:
812
+ overall_bar.close()
813
+
814
+ if args.compliance_stdout:
815
+ return
816
+
817
+ total_elapsed = time.perf_counter() - run_start
818
+ print("\n== BASELINE RESULTS ==")
819
+ all_scores: list[float] = []
820
+ for task_type in task_types:
821
+ scores = results[task_type]
822
+ avg = sum(scores) / len(scores)
823
+ all_scores.extend(scores)
824
+ print(f" {task_type:12s} avg={avg:.3f} scores={[round(s, 3) for s in scores]}")
825
+
826
+ overall = sum(all_scores) / len(all_scores)
827
+ print(f" {'OVERALL':12s} avg={overall:.3f}")
828
+ print(
829
+ f" {'RUNTIME':12s} total={_format_duration(total_elapsed)} "
830
+ f"episodes={len(all_scores)} "
831
+ f"avg_ep={(total_elapsed / max(1, len(all_scores))):.2f}s"
832
+ )
833
 
834
 
835
  if __name__ == "__main__":
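The `_prune_messages_window` helper added in this commit keeps the system message and the initial task prompt fixed and trims everything else down to the most recent turns. As a reviewer aid, here is a minimal standalone sketch of that behavior; the names and the demo history are simplified stand-ins, not the committed code:

```python
def messages_char_count(messages: list[dict[str, str]]) -> int:
    # Mirror of the patch's size heuristic: content length plus a per-message overhead.
    return sum(len(str(msg.get("content", ""))) + 32 for msg in messages)

def prune_messages_window(
    messages: list[dict[str, str]],
    *,
    step_number: int,
    prune_start_step: int = 12,
    window_turns: int = 4,
    max_chars: int = 50000,
) -> list[dict[str, str]]:
    # Keep system + initial prompt, then only the most recent assistant/user turns.
    if len(messages) <= 2:
        return messages
    if step_number < prune_start_step and messages_char_count(messages) <= max_chars:
        return messages
    keep_tail = max(2, window_turns * 2)
    return messages[:2] + messages[2:][-keep_tail:]

# 20 steps of assistant/user pairs on top of system + initial prompt.
history = [{"role": "system", "content": "sys"}, {"role": "user", "content": "task"}]
for i in range(20):
    history.append({"role": "assistant", "content": f"action {i}"})
    history.append({"role": "user", "content": f"obs {i}"})

pruned = prune_messages_window(history, step_number=21)
print(len(history), len(pruned))  # prints "42 10": 2 base messages + 4 recent turns
```

Past the step threshold the window collapses to a constant size, so prompt length stops growing with episode length regardless of how chatty the model is.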
inference_debug.py CHANGED
@@ -75,6 +75,7 @@ MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)
 # Keep a conservative default to stay under common hackathon runtime limits.
 EPISODES_PER_TASK = 2
 MAX_STEPS = 20
+MEMORY_MAX_CHARS = 900
 
 client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
 
@@ -140,7 +141,7 @@ def _compliance_log_end(success: bool, steps: int, score: float, rewards: list[f
     )
 
 
-def obs_to_prompt(obs: FlakySleuthObservation) -> str:
+def obs_to_prompt(obs: FlakySleuthObservation, *, memory_hint: str | None = None) -> str:
     tree_preview = "\n".join(obs.file_tree[:40])
     return f"""TASK: {obs.task_description}
 
@@ -159,6 +160,9 @@ Repository file tree:
 Last tool output:
 {obs.tool_output or "(No action taken yet)"}
 
+Episode memory:
+{memory_hint or "(No memory yet.)"}
+
 Return only JSON action."""
 
 
@@ -246,6 +250,85 @@ def _format_duration(seconds: float) -> str:
     return f"{mins:02d}m {secs:02d}s"
 
 
+def _build_episode_memory(
+    *,
+    unique_read_files: list[str],
+    zero_gain_read_files: set[str],
+    search_patterns: list[str],
+    blocked_duplicate_reads: int,
+    no_progress_streak: int,
+    max_chars: int,
+) -> str:
+    read_tail = ", ".join(unique_read_files[-8:]) if unique_read_files else "none"
+    zero_tail = ", ".join(sorted(zero_gain_read_files)[-8:]) if zero_gain_read_files else "none"
+    search_tail = ", ".join(search_patterns[-6:]) if search_patterns else "none"
+    loop_warning = (
+        "WARNING: Possible loop detected. Stop repeating similar exploration. "
+        "Switch strategy or take a terminal action."
+        if no_progress_streak >= 3 or blocked_duplicate_reads >= 2
+        else "Status: exploration progress appears normal."
+    )
+    memory = (
+        f"Read files (recent): {read_tail}\n"
+        f"Zero-gain read files: {zero_tail}\n"
+        f"Search patterns (recent): {search_tail}\n"
+        f"Blocked duplicate reads: {blocked_duplicate_reads}\n"
+        f"No-progress streak: {no_progress_streak}\n"
+        f"{loop_warning}\n"
+        "Guidance: Avoid rereading zero-gain files unless there is new evidence. "
+        "Prefer targeted search_code or terminal action when confidence is enough."
+    )
+    return _clip_text(memory, max_chars=max_chars)
+
+
+def _duplicate_read_replacement_pattern(obs: FlakySleuthObservation) -> str:
+    test_hint = obs.test_name.split("::")[-1] if obs.test_name else "test"
+    return (
+        f"{test_hint}|random|sleep|time|timeout|retry|asyncio|thread|"
+        "fixture|global|shared|mock|patch"
+    )
+
+
+def _messages_char_count(messages: list[dict[str, str]]) -> int:
+    # Lightweight size heuristic to avoid unbounded context growth.
+    return sum(len(str(msg.get("content", ""))) + 32 for msg in messages)
+
+
+def _prune_messages_window(
+    messages: list[dict[str, str]],
+    *,
+    step_number: int,
+    prune_start_step: int,
+    window_turns: int,
+    max_chars: int,
+) -> tuple[list[dict[str, str]], dict[str, Any] | None]:
+    if len(messages) <= 2:
+        return messages, None
+
+    current_chars = _messages_char_count(messages)
+    exceeds_step_threshold = step_number >= prune_start_step
+    exceeds_char_budget = current_chars > max_chars
+    if not exceeds_step_threshold and not exceeds_char_budget:
+        return messages, None
+
+    base = messages[:2]  # system + initial prompt
+    tail = messages[2:]
+    keep_tail_items = max(2, window_turns * 2)
+    if len(tail) > keep_tail_items:
+        tail = tail[-keep_tail_items:]
+    pruned = base + tail
+
+    reason = "step_threshold" if exceeds_step_threshold else "char_budget"
+    return pruned, {
+        "reason": reason,
+        "before_messages": len(messages),
+        "after_messages": len(pruned),
+        "before_chars": current_chars,
+        "after_chars": _messages_char_count(pruned),
+        "step": step_number,
+    }
+
+
 def run_episode(
     env: FlakySleuthEnv,
     *,
@@ -257,6 +340,9 @@ def run_episode(
     compliance_stdout: bool = False,
     benchmark_name: str = "flakysleuth",
     compliance_task_name: str | None = None,
+    history_prune_start_step: int = 12,
+    history_window_turns: int = 4,
+    history_max_chars: int = 50000,
 ) -> tuple[float, dict[str, Any]]:
     rewards: list[float] = []
     steps_taken = 0
@@ -265,12 +351,30 @@ def run_episode(
     exploration_reward_total = 0.0
     final_episode_score = 0.0
     terminal_meta: dict[str, Any] = {}
+    llm_steps = 0
+    heuristic_steps = 0
+    fallback_reasons: dict[str, int] = {}
+    prune_events = 0
+    read_attempt_counts: dict[str, int] = {}
+    unique_read_files: list[str] = []
+    zero_gain_read_files: set[str] = set()
+    search_patterns: list[str] = []
+    blocked_duplicate_reads = 0
+    no_progress_streak = 0
+    memory_hint = _build_episode_memory(
+        unique_read_files=unique_read_files,
+        zero_gain_read_files=zero_gain_read_files,
+        search_patterns=search_patterns,
+        blocked_duplicate_reads=blocked_duplicate_reads,
+        no_progress_streak=no_progress_streak,
+        max_chars=MEMORY_MAX_CHARS,
+    )
     if compliance_stdout:
         _compliance_log_start(episode_task_name, benchmark_name, MODEL_NAME)
     try:
         obs = env.reset()
 
-        initial_prompt = obs_to_prompt(obs)
+        initial_prompt = obs_to_prompt(obs, memory_hint=memory_hint)
         messages = [
             {"role": "system", "content": SYSTEM_PROMPT},
             {"role": "user", "content": initial_prompt},
@@ -299,6 +403,24 @@ def run_episode(
             )
 
         for step_idx in range(MAX_STEPS):
+            messages, prune_info = _prune_messages_window(
+                messages,
+                step_number=step_idx + 1,
+                prune_start_step=history_prune_start_step,
+                window_turns=history_window_turns,
+                max_chars=history_max_chars,
+            )
+            if prune_info:
+                prune_events += 1
+                if trace_agent and not compliance_stdout:
+                    print(
+                        "[trace] context_pruned "
+                        f"reason={prune_info['reason']} "
+                        f"step={prune_info['step']} "
+                        f"messages={prune_info['before_messages']}->{prune_info['after_messages']} "
+                        f"chars={prune_info['before_chars']}->{prune_info['after_chars']}"
+                    )
+
             action: FlakySleuthAction
             action_source = "heuristic"
             llm_meta: dict[str, Any] = {"attempted": False, "raw_output": "", "error": ""}
@@ -317,6 +439,36 @@ def run_episode(
                 llm_meta["error"] = str(exc)
                 action = heuristic_action(obs)
 
+            if action.action_type == "read_file":
+                prior_reads = read_attempt_counts.get(action.argument, 0)
+                if prior_reads >= 1:
+                    blocked_duplicate_reads += 1
+                    replacement = FlakySleuthAction(
+                        action_type="search_code",
+                        argument=_duplicate_read_replacement_pattern(obs),
+                    )
+                    if trace_agent and not compliance_stdout:
+                        print(
+                            "[trace] action_overridden "
+                            f"reason=duplicate_read file={action.argument} "
+                            f"replacement={replacement.action_type}"
+                        )
+                    action = replacement
+
+            if action_source == "llm":
+                llm_steps += 1
+            else:
+                heuristic_steps += 1
+                if not API_KEY:
+                    reason_key = "no_api_key"
+                elif llm_meta.get("error"):
+                    reason_key = "llm_error"
+                elif llm_meta.get("attempted"):
+                    reason_key = "empty_or_invalid_response"
+                else:
+                    reason_key = "heuristic_default"
+                fallback_reasons[reason_key] = fallback_reasons.get(reason_key, 0) + 1
+
             if trace_agent and not compliance_stdout:
                 print(f"[trace] step={step_idx + 1} action_source={action_source}")
                 if llm_meta.get("attempted"):
@@ -334,6 +486,32 @@ def run_episode(
             rewards.append(reward)
             steps_taken = step_idx + 1
 
+            if action.action_type == "read_file":
+                read_attempt_counts[action.argument] = read_attempt_counts.get(action.argument, 0) + 1
+                if action.argument not in unique_read_files:
+                    unique_read_files.append(action.argument)
+                if reward <= 0:
+                    zero_gain_read_files.add(action.argument)
+            elif action.action_type == "search_code":
+                if action.argument not in search_patterns:
+                    search_patterns.append(action.argument)
+
+            if done:
+                no_progress_streak = 0
+            elif reward <= 0:
+                no_progress_streak += 1
+            else:
+                no_progress_streak = 0
+
+            memory_hint = _build_episode_memory(
+                unique_read_files=unique_read_files,
+                zero_gain_read_files=zero_gain_read_files,
+                search_patterns=search_patterns,
+                blocked_duplicate_reads=blocked_duplicate_reads,
+                no_progress_streak=no_progress_streak,
+                max_chars=MEMORY_MAX_CHARS,
+            )
+
             step_error: str | None = None
             if isinstance(info, dict):
                 raw_err = info.get("last_action_error")
@@ -373,6 +551,11 @@ def run_episode(
                     "progress_score": float(info.get("progress_score", 0) or 0),
                     "explore_sum": exploration_reward_total,
                     "episode_score": final_episode_score,
+                    "llm_steps": llm_steps,
+                    "heuristic_steps": heuristic_steps,
+                    "fallback_reasons": dict(fallback_reasons),
+                    "context_prune_events": prune_events,
+                    "duplicate_read_blocks": blocked_duplicate_reads,
                 }
                 success = final_episode_score > 0.0
                 if print_terminal:
@@ -387,7 +570,7 @@ def run_episode(
 
             exploration_reward_total += reward
             messages.append({"role": "assistant", "content": action.model_dump_json()})
-            next_prompt = obs_to_prompt(obs)
+            next_prompt = obs_to_prompt(obs, memory_hint=memory_hint)
             messages.append({"role": "user", "content": next_prompt})
             if trace_agent and trace_prompts and not compliance_stdout:
                 _trace_print(
@@ -483,6 +666,24 @@ def _parse_args() -> argparse.Namespace:
         default="flakysleuth",
         help="Benchmark name used in [START] lines when --compliance-stdout is enabled.",
     )
+    parser.add_argument(
+        "--history-prune-start-step",
+        type=int,
+        default=12,
+        help="Start pruning conversation history only from this step onward.",
+    )
+    parser.add_argument(
+        "--history-window-turns",
+        type=int,
+        default=4,
+        help="When pruning is active, keep this many recent assistant/user turns.",
+    )
+    parser.add_argument(
+        "--history-max-chars",
+        type=int,
+        default=50000,
+        help="Approx max chars for messages before forced pruning by size.",
+    )
     return parser.parse_args()
 
 
@@ -550,6 +751,9 @@ def main() -> None:
             compliance_stdout=args.compliance_stdout,
             benchmark_name=args.benchmark_name,
             compliance_task_name=task_type,
+            history_prune_start_step=args.history_prune_start_step,
+            history_window_turns=args.history_window_turns,
+            history_max_chars=args.history_max_chars,
        )
         results[task_type].append(score)
         if use_progress and task_bar is not None:
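The episode memory threaded into `obs_to_prompt` in the patch above can be illustrated with a trimmed-down sketch. It keeps the same loop-detection thresholds (two blocked rereads or three fruitless steps) but omits `_clip_text` and the guidance lines; the helper name and sample paths here are illustrative only:

```python
def build_episode_memory(
    *,
    unique_read_files: list[str],
    zero_gain_read_files: set[str],
    search_patterns: list[str],
    blocked_duplicate_reads: int,
    no_progress_streak: int,
) -> str:
    # Same trigger as the patch: two blocked rereads or three no-progress steps.
    looping = no_progress_streak >= 3 or blocked_duplicate_reads >= 2
    status = (
        "WARNING: Possible loop detected."
        if looping
        else "Status: exploration progress appears normal."
    )
    return "\n".join([
        f"Read files (recent): {', '.join(unique_read_files[-8:]) or 'none'}",
        f"Zero-gain read files: {', '.join(sorted(zero_gain_read_files)) or 'none'}",
        f"Search patterns (recent): {', '.join(search_patterns[-6:]) or 'none'}",
        status,
    ])

memory = build_episode_memory(
    unique_read_files=["tests/test_api.py", "app/client.py"],
    zero_gain_read_files={"app/client.py"},
    search_patterns=["sleep|retry"],
    blocked_duplicate_reads=2,
    no_progress_streak=1,
)
print(memory)  # the two blocked rereads flip the status line to the loop warning
```

Because the memory string is rebuilt after every step, the prompt always carries the latest loop signal even when older turns have been pruned away.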
server/app.py CHANGED
@@ -28,7 +28,7 @@ class FlakySleuthState(BaseModel):
 
 class InferenceRunRequest(BaseModel):
     dataset_path: str = Field(default="dataset/py_tasks.csv")
-    episodes_per_task: int = Field(default=1, ge=1, le=50)
+    episodes_per_task: int = Field(default=1, ge=1, le=100)
     task_types: str = Field(default="classify,root_cause,fix_proposal")
     max_steps: int = Field(default=20, ge=1, le=100)
     benchmark_name: str = Field(default="flakysleuth")
server/inference_runner.py CHANGED
@@ -48,8 +48,8 @@ class InferenceRunner:
 
         if not dataset_rel:
             raise ValueError("dataset_path must not be empty.")
-        if episodes < 1 or episodes > 50:
-            raise ValueError("episodes_per_task must be between 1 and 50.")
+        if episodes < 1 or episodes > 100:
+            raise ValueError("episodes_per_task must be between 1 and 100.")
         if max_steps < 1 or max_steps > 100:
             raise ValueError("max_steps must be between 1 and 100.")
         if not task_types:
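The two edits above must agree: the request schema bound (`le=100` in server/app.py) and the runner's manual guard both move from 50 to 100 episodes. A small plain-Python sketch of the shared check (not the pydantic model itself; the function name is illustrative):

```python
def validate_run_request(episodes_per_task: int, max_steps: int) -> None:
    # Same bounds the commit sets in server/app.py (Field ge/le) and in
    # server/inference_runner.py's manual guard: both now allow 1..100.
    if not 1 <= episodes_per_task <= 100:
        raise ValueError("episodes_per_task must be between 1 and 100.")
    if not 1 <= max_steps <= 100:
        raise ValueError("max_steps must be between 1 and 100.")

validate_run_request(episodes_per_task=100, max_steps=20)  # accepted after this commit

rejected = False
try:
    validate_run_request(episodes_per_task=101, max_steps=20)
except ValueError:
    rejected = True  # out-of-range values still fail fast at the runner
```

Keeping the FastAPI schema and the runner guard on the same limits means a request rejected by one layer can never slip through the other.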
server/ui.py CHANGED
@@ -8,7 +8,7 @@ def render_home_page() -> str:
   <head>
     <meta charset="utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
-    <title>FlakySleuth Run Studio</title>
 
     <link rel="preconnect" href="https://fonts.googleapis.com" />
     <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
     <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@500;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet" />
@@ -123,6 +123,60 @@ def render_home_page() -> str:
       letter-spacing: -0.01em;
     }
 
     .form-grid {
       display: grid;
       gap: 12px;
@@ -230,6 +284,61 @@ def render_home_page() -> str:
       line-height: 1.4;
     }
 
     .actions {
       margin-top: 6px;
       display: flex;
@@ -373,6 +482,10 @@ def render_home_page() -> str:
       grid-template-columns: 1fr;
     }
 
     .field.span-2 {
       grid-column: span 1;
     }
@@ -386,12 +499,37 @@ def render_home_page() -> str:
   <body>
     <main class="shell">
       <section class="hero">
-        <span class="eyebrow"><span class="dot"></span>FlakySleuth Space</span>
-        <h1>Run Inference From The Browser</h1>
-        <p>This evaluator console runs benchmark episodes for <strong>classification</strong>, <strong>root-cause identification</strong>, and <strong>fix proposal</strong>. Use it to review logs, score trends, and reproducible run settings while judging submission quality.</p>
       </section>
 
       <section class="panel-grid">
         <div class="panel">
           <h2>Run Configuration</h2>
           <form id="run-form" class="form-grid">
@@ -402,12 +540,24 @@ def render_home_page() -> str:
 
             <div class="field">
               <label for="episodes_per_task">Episodes Per Task</label>
-              <input id="episodes_per_task" name="episodes_per_task" type="number" min="1" max="50" value="1" />
 
             </div>
 
             <div class="field">
               <label for="max_steps">Max Steps</label>
-              <input id="max_steps" name="max_steps" type="number" min="1" max="100" value="20" />
 
             </div>
 
             <div class="field span-2">
@@ -430,6 +580,15 @@ def render_home_page() -> str:
               <input id="benchmark_name" name="benchmark_name" value="flakysleuth" />
             </div>
 
 
 
 
 
 
 
 
 
 
433
  <div class="field span-2">
434
  <label for="api_base_url">API Base URL (optional)</label>
435
  <input id="api_base_url" name="api_base_url" placeholder="https://api.openai.com/v1 or provider endpoint" />
@@ -494,6 +653,13 @@ def render_home_page() -> str:
494
  const taskChipsEl = document.getElementById("task-chips");
495
  const taskSelectEl = document.getElementById("task-type-select");
496
  const taskAddButton = document.getElementById("btn-add-task");
 
 
 
 
 
 
 
497
 
498
  const TASK_TYPE_ORDER = ["classify", "root_cause", "fix_proposal"];
499
  const TASK_TYPE_LABELS = {
@@ -501,6 +667,35 @@ def render_home_page() -> str:
501
  root_cause: "Root Cause",
502
  fix_proposal: "Fix Proposal",
503
  };
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
504
 
505
  function parseTaskTypes(raw) {
506
  const tokens = String(raw || "")
@@ -580,6 +775,7 @@ def render_home_page() -> str:
580
  taskInput.value = selectedTaskTypes.join(",");
581
  renderTaskChips();
582
  renderTaskSelect();
 
583
  }
584
 
585
  function addSelectedTaskType() {
@@ -591,12 +787,58 @@ def render_home_page() -> str:
591
  syncTaskTypes();
592
  }
593
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
594
  function readFormPayload() {
 
 
 
 
 
 
595
  return {
596
  dataset_path: form.dataset_path.value.trim(),
597
- episodes_per_task: Number(form.episodes_per_task.value),
598
  task_types: form.task_types.value.trim(),
599
- max_steps: Number(form.max_steps.value),
600
  benchmark_name: form.benchmark_name.value.trim(),
601
  api_base_url: form.api_base_url.value.trim() || null,
602
  model_name: form.model_name.value.trim() || null,
@@ -707,8 +949,18 @@ def render_home_page() -> str:
707
  addSelectedTaskType();
708
  }
709
  });
 
 
 
 
 
 
 
 
710
 
 
711
  syncTaskTypes();
 
712
 
713
  fetchStatus();
714
  window.setInterval(fetchStatus, 2200);
 
8
  <head>
9
  <meta charset="utf-8" />
10
  <meta name="viewport" content="width=device-width, initial-scale=1" />
11
+ <title>FlakyGym Control Center</title>
12
  <link rel="preconnect" href="https://fonts.googleapis.com" />
13
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
14
  <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@500;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet" />
 
123
  letter-spacing: -0.01em;
124
  }
125
 
126
+ .brief-grid {
127
+ display: grid;
128
+ grid-template-columns: repeat(2, minmax(0, 1fr));
129
+ gap: 12px;
130
+ }
131
+
132
+ .brief-card {
133
+ border: 1px solid rgba(15, 139, 99, 0.22);
134
+ border-radius: 12px;
135
+ background: rgba(255, 255, 255, 0.84);
136
+ padding: 10px;
137
+ display: grid;
138
+ gap: 8px;
139
+ }
140
+
141
+ .brief-card h3 {
142
+ margin: 0;
143
+ font-size: 0.95rem;
144
+ letter-spacing: -0.01em;
145
+ }
146
+
147
+ .brief-card p {
148
+ margin: 0;
149
+ font-size: 12px;
150
+ color: #36564a;
151
+ line-height: 1.45;
152
+ }
153
+
154
+ .brief-list {
155
+ margin: 0;
156
+ padding-left: 16px;
157
+ font-size: 12px;
158
+ color: #2f4f43;
159
+ line-height: 1.45;
160
+ display: grid;
161
+ gap: 5px;
162
+ }
163
+
164
+ .header-chips {
165
+ display: flex;
166
+ flex-wrap: wrap;
167
+ gap: 6px;
168
+ }
169
+
170
+ .header-chips code {
171
+ background: #eef8f3;
172
+ border: 1px solid rgba(15, 139, 99, 0.26);
173
+ border-radius: 999px;
174
+ padding: 4px 8px;
175
+ font: 500 11px/1.1 var(--mono);
176
+ color: #20443a;
177
+ white-space: nowrap;
178
+ }
179
+
180
  .form-grid {
181
  display: grid;
182
  gap: 12px;
 
284
  line-height: 1.4;
285
  }
286
 
287
+ .slider-wrap {
288
+ display: grid;
289
+ gap: 8px;
290
+ }
291
+
292
+ .slider-value-row {
293
+ display: flex;
294
+ justify-content: space-between;
295
+ gap: 10px;
296
+ font: 500 12px/1.3 var(--mono);
297
+ color: #3c5950;
298
+ }
299
+
300
+ input[type="range"] {
301
+ width: 100%;
302
+ padding: 0;
303
+ accent-color: var(--accent);
304
+ cursor: pointer;
305
+ }
306
+
307
+ .eta-box {
308
+ border: 1px solid rgba(15, 139, 99, 0.22);
309
+ border-radius: 12px;
310
+ background: rgba(255, 255, 255, 0.85);
311
+ padding: 10px;
312
+ display: grid;
313
+ gap: 6px;
314
+ }
315
+
316
+ .eta-value {
317
+ font: 700 17px/1 var(--display);
318
+ letter-spacing: -0.01em;
319
+ }
320
+
321
+ .eta-warning {
322
+ font: 500 12px/1.45 var(--mono);
323
+ border-radius: 8px;
324
+ padding: 6px 8px;
325
+ display: none;
326
+ }
327
+
328
+ .eta-warning.warn {
329
+ display: block;
330
+ color: #752f1b;
331
+ background: #fff0d7;
332
+ border: 1px solid rgba(213, 114, 65, 0.45);
333
+ }
334
+
335
+ .eta-warning.ok {
336
+ display: block;
337
+ color: #20443a;
338
+ background: #e8f7ef;
339
+ border: 1px solid rgba(15, 139, 99, 0.28);
340
+ }
341
+
342
  .actions {
343
  margin-top: 6px;
344
  display: flex;
 
482
  grid-template-columns: 1fr;
483
  }
484
 
485
+ .brief-grid {
486
+ grid-template-columns: 1fr;
487
+ }
488
+
489
  .field.span-2 {
490
  grid-column: span 1;
491
  }
 
499
  <body>
500
  <main class="shell">
501
  <section class="hero">
502
+ <span class="eyebrow"><span class="dot"></span>FlakyGym Space</span>
503
+ <h1>FlakyGym Control Center</h1>
504
+ <p>This console runs flaky-test benchmark episodes and streams live logs. Use it to configure runs, estimate runtime, and review grader outcomes quickly.</p>
505
  </section>
506
 
507
  <section class="panel-grid">
508
+ <div class="panel">
509
+ <h2>Quick Brief: Dataset + Graders</h2>
510
+ <div class="brief-grid">
511
+ <div class="brief-card">
512
+ <h3>Dataset: <code>dataset/py_tasks.csv</code></h3>
513
+ <p>Each row is one flaky-test investigation task created from <code>py-data.csv</code> (repo + SHA + target test + labels + optional known fix diff).</p>
514
+ <p class="field-note">Headers:</p>
515
+ <div class="header-chips">
516
+ <code>repo_url</code><code>sha</code><code>test_name</code><code>test_file</code>
517
+ <code>category</code><code>label</code><code>status</code><code>pr_link</code>
518
+ <code>task_types</code><code>test_code</code><code>known_fix_diff</code>
519
+ </div>
520
+ </div>
521
+
522
+ <div class="brief-card">
523
+ <h3>3 Graders (short)</h3>
524
+ <ul class="brief-list">
525
+ <li><strong>Task 1 (`classify`):</strong> exact-match flaky vs stable.</li>
526
+ <li><strong>Task 2 (`root_cause`):</strong> category similarity matrix (partial credit allowed).</li>
527
+ <li><strong>Task 3 (`fix_proposal`):</strong> weighted score from pattern match, patch applicability, and LLM judge.</li>
528
+ </ul>
529
+ </div>
530
+ </div>
531
+ </div>
532
+
533
  <div class="panel">
534
  <h2>Run Configuration</h2>
535
  <form id="run-form" class="form-grid">
 
540
 
541
  <div class="field">
542
  <label for="episodes_per_task">Episodes Per Task</label>
543
+ <div class="slider-wrap">
544
+ <input id="episodes_per_task" name="episodes_per_task" type="range" min="1" max="100" step="1" value="1" />
545
+ <div class="slider-value-row">
546
+ <span><strong id="episodes_per_task_value">1</strong> episode(s)</span>
547
+ <span>1-100</span>
548
+ </div>
549
+ </div>
550
  </div>
551
 
552
  <div class="field">
553
  <label for="max_steps">Max Steps</label>
554
+ <div class="slider-wrap">
555
+ <input id="max_steps" name="max_steps" type="range" min="1" max="100" step="1" value="20" />
556
+ <div class="slider-value-row">
557
+ <span><strong id="max_steps_value">20</strong> step(s)</span>
558
+ <span>1-100</span>
559
+ </div>
560
+ </div>
561
  </div>
562
 
563
  <div class="field span-2">
 
580
  <input id="benchmark_name" name="benchmark_name" value="flakysleuth" />
581
  </div>
582
 
583
+ <div class="field span-2">
584
+ <label>Runtime ETA</label>
585
+ <div class="eta-box">
586
+ <div class="eta-value" id="eta-value">~09m 00s</div>
587
+ <p class="field-note" id="eta-detail">3 task(s) × 1 episode(s) × 180s/episode</p>
588
+ <div class="eta-warning" id="eta-warning"></div>
589
+ </div>
590
+ </div>
591
+
592
  <div class="field span-2">
593
  <label for="api_base_url">API Base URL (optional)</label>
594
  <input id="api_base_url" name="api_base_url" placeholder="https://api.openai.com/v1 or provider endpoint" />
 
653
  const taskChipsEl = document.getElementById("task-chips");
654
  const taskSelectEl = document.getElementById("task-type-select");
655
  const taskAddButton = document.getElementById("btn-add-task");
656
+ const episodesInput = document.getElementById("episodes_per_task");
657
+ const episodesValueEl = document.getElementById("episodes_per_task_value");
658
+ const maxStepsInput = document.getElementById("max_steps");
659
+ const maxStepsValueEl = document.getElementById("max_steps_value");
660
+ const etaValueEl = document.getElementById("eta-value");
661
+ const etaDetailEl = document.getElementById("eta-detail");
662
+ const etaWarningEl = document.getElementById("eta-warning");
663
 
664
  const TASK_TYPE_ORDER = ["classify", "root_cause", "fix_proposal"];
665
  const TASK_TYPE_LABELS = {
 
667
  root_cause: "Root Cause",
668
  fix_proposal: "Fix Proposal",
669
  };
670
+ const ETA_SECONDS_PER_EPISODE = 180;
671
+ const HACKATHON_LIMIT_SECONDS = 20 * 60;
672
+
673
+ function clampInt(raw, min, max, fallback) {
674
+ const num = Number(raw);
675
+ if (!Number.isFinite(num)) return fallback;
676
+ return Math.max(min, Math.min(max, Math.trunc(num)));
677
+ }
678
+
679
+ function formatDuration(totalSeconds) {
680
+ const seconds = Math.max(0, Math.round(totalSeconds));
681
+ const mins = Math.floor(seconds / 60);
682
+ const secs = seconds % 60;
683
+ if (mins >= 60) {
684
+ const hrs = Math.floor(mins / 60);
685
+ const remMins = mins % 60;
686
+ return `${hrs}h ${String(remMins).padStart(2, "0")}m ${String(secs).padStart(2, "0")}s`;
687
+ }
688
+ return `${String(mins).padStart(2, "0")}m ${String(secs).padStart(2, "0")}s`;
689
+ }
690
+
691
+ function refreshSliderValues() {
692
+ const episodes = clampInt(episodesInput.value, 1, 100, 1);
693
+ const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
694
+ episodesInput.value = String(episodes);
695
+ maxStepsInput.value = String(maxSteps);
696
+ episodesValueEl.textContent = String(episodes);
697
+ maxStepsValueEl.textContent = String(maxSteps);
698
+ }
699
 
700
  function parseTaskTypes(raw) {
701
  const tokens = String(raw || "")
 
775
  taskInput.value = selectedTaskTypes.join(",");
776
  renderTaskChips();
777
  renderTaskSelect();
778
+ updateRuntimeEstimate();
779
  }
780
 
781
  function addSelectedTaskType() {
 
787
  syncTaskTypes();
788
  }
789
 
790
+ function updateRuntimeEstimate() {
791
+ const episodes = clampInt(episodesInput.value, 1, 100, 1);
792
+ const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
793
+ const taskCount = selectedTaskTypes.length;
794
+ const totalEpisodes = taskCount * episodes;
795
+ const etaSeconds = totalEpisodes * ETA_SECONDS_PER_EPISODE;
796
+
797
+ etaValueEl.textContent = `~${formatDuration(etaSeconds)}`;
798
+ etaDetailEl.textContent =
799
+ `${taskCount} task(s) × ${episodes} episode(s) × ${ETA_SECONDS_PER_EPISODE}s/episode`;
800
+
801
+ const notes = [];
802
+ if (episodes > 2) {
803
+ notes.push("Recommended: keep episodes per task at 1-2 for faster hackathon runs.");
804
+ }
805
+ if (etaSeconds > HACKATHON_LIMIT_SECONDS) {
806
+ notes.push("Warning: ETA exceeds 20 minutes, which may violate hackathon runtime guidance.");
807
+ }
808
+ if (maxSteps > 20) {
809
+ notes.push("Higher max steps can increase runtime beyond this ETA estimate.");
810
+ }
811
+ if (taskCount === 0) {
812
+ notes.push("Add at least one task chip to run inference.");
813
+ }
814
+
815
+ etaWarningEl.classList.remove("warn", "ok");
816
+ if (!notes.length) {
817
+ etaWarningEl.textContent = "Runtime looks within limits for a quick benchmark run.";
818
+ etaWarningEl.classList.add("ok");
819
+ return;
820
+ }
821
+
822
+ etaWarningEl.textContent = notes.join(" ");
823
+ if (etaSeconds > HACKATHON_LIMIT_SECONDS || episodes > 2 || taskCount === 0) {
824
+ etaWarningEl.classList.add("warn");
825
+ } else {
826
+ etaWarningEl.classList.add("ok");
827
+ }
828
+ }
829
+
830
  function readFormPayload() {
831
+ const episodes = clampInt(episodesInput.value, 1, 100, 1);
832
+ const maxSteps = clampInt(maxStepsInput.value, 1, 100, 20);
833
+ episodesInput.value = String(episodes);
834
+ maxStepsInput.value = String(maxSteps);
835
+ refreshSliderValues();
836
+ updateRuntimeEstimate();
837
  return {
838
  dataset_path: form.dataset_path.value.trim(),
839
+ episodes_per_task: episodes,
840
  task_types: form.task_types.value.trim(),
841
+ max_steps: maxSteps,
842
  benchmark_name: form.benchmark_name.value.trim(),
843
  api_base_url: form.api_base_url.value.trim() || null,
844
  model_name: form.model_name.value.trim() || null,
 
949
  addSelectedTaskType();
950
  }
951
  });
952
+ episodesInput.addEventListener("input", () => {
953
+ refreshSliderValues();
954
+ updateRuntimeEstimate();
955
+ });
956
+ maxStepsInput.addEventListener("input", () => {
957
+ refreshSliderValues();
958
+ updateRuntimeEstimate();
959
+ });
960
 
961
+ refreshSliderValues();
962
  syncTaskTypes();
963
+ updateRuntimeEstimate();
964
 
965
  fetchStatus();
966
  window.setInterval(fetchStatus, 2200);
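The ETA widget's arithmetic is small enough to re-express outside the browser for quick verification. Below is a hedged Python sketch: `clamp_int`, `format_duration`, and `eta_seconds` are hypothetical analogues of the page's `clampInt`/`formatDuration`/`updateRuntimeEstimate` logic, under the same 180-seconds-per-episode assumption the UI hard-codes.

```python
def clamp_int(raw, lo, hi, fallback):
    # Analogue of the UI's clampInt(): non-numeric input falls back,
    # everything else is truncated and pinned to [lo, hi].
    try:
        num = int(float(raw))
    except (TypeError, ValueError):
        return fallback
    return max(lo, min(hi, num))

def format_duration(total_seconds):
    # Analogue of formatDuration(): "MMm SSs", switching to
    # "Hh MMm SSs" once the estimate crosses one hour.
    seconds = max(0, round(total_seconds))
    mins, secs = divmod(seconds, 60)
    if mins >= 60:
        hrs, rem = divmod(mins, 60)
        return f"{hrs}h {rem:02d}m {secs:02d}s"
    return f"{mins:02d}m {secs:02d}s"

ETA_SECONDS_PER_EPISODE = 180  # same constant the UI assumes

def eta_seconds(task_count, episodes_per_task):
    # Total estimate = tasks x episodes x per-episode budget.
    return task_count * episodes_per_task * ETA_SECONDS_PER_EPISODE

# Default run: 3 tasks x 1 episode x 180s -> the "~09m 00s" in the ETA box.
print(format_duration(eta_seconds(3, 1)))  # 09m 00s
print(format_duration(eta_seconds(3, 8)))  # 1h 12m 00s
print(clamp_int("250", 1, 100, 1))         # 100
```

Note one deliberate edge: `eta_seconds(3, 8)` is 4320s, well past the 20-minute `HACKATHON_LIMIT_SECONDS` threshold, which is exactly the case the warning banner flags.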