FlakySleuth Grading: Exact Scoring Formulas
This document describes the exact scoring logic implemented in code for:
- Task 1: classify (classify_flakiness)
- Task 2: root_cause (classify_root_cause)
- Task 3: fix_proposal (propose_fix)
It also explains how per-step rewards are combined inside the environment.
Source of Truth
- env/environment.py
- graders/__init__.py
- graders/task1_grader.py
- graders/task2_grader.py
- graders/task3_grader.py
- dataset/category_similarity.json
1) Dispatch: Which grader is used?
graders/grade_action() selects grader by task["task_type"]:
- classify -> Task 1 grader
- root_cause -> Task 2 grader
- fix_proposal -> Task 3 grader
- anything else -> 0.0
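The dispatch above can be sketched as follows; the per-task grader functions here are illustrative stubs standing in for the real graders, not the actual function names:

```python
# Stubs standing in for the real Task 1/2/3 graders (names are illustrative).
def grade_task1(action, task): return 0.999
def grade_task2(action, task): return 0.999
def grade_task3(action, task): return 0.999

def grade_action(action: dict, task: dict) -> float:
    # Select the grader by task_type; unknown types score 0.0.
    graders = {
        "classify": grade_task1,       # classify_flakiness
        "root_cause": grade_task2,     # classify_root_cause
        "fix_proposal": grade_task3,   # propose_fix
    }
    grader = graders.get(task.get("task_type"))
    return grader(action, task) if grader else 0.0
```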
2) Environment reward pipeline (applies to all tasks)
At each env.step(action):
- If the action is terminal (classify_flakiness, classify_root_cause, propose_fix):
  - compute terminal_score = grade_action(action, task)
  - compute penalties
  - final step reward:
    reward = clamp(cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty, 0.0, 1.0)
  - where:
    - late_penalty = max(0, step_count - 15) * 0.05
    - wrong_dir_penalty = 0.2 only when the action is classify_flakiness, the predicted argument is "stable", and the ground-truth label is "flaky"
  - set done = True
- If the action is non-terminal (exploration):
  - compute progress from the exploration action
  - update cumulative progress: cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
  - reward = progress
- Timeout rule:
  - if not already done and step_count >= max_steps, set done = True
  - no additional terminal score is applied at timeout.
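A minimal sketch of the terminal-reward computation described above; the function name and argument shapes are assumptions for illustration:

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress: float, terminal_score: float,
                    step_count: int, action_name: str,
                    predicted: str, truth: str) -> float:
    # Late penalty grows 0.05 per step beyond step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty only for "stable" predicted on a flaky truth.
    wrong_dir_penalty = 0.2 if (action_name == "classify_flakiness"
                                and predicted == "stable"
                                and truth == "flaky") else 0.0
    return clamp(cumulative_progress + terminal_score
                 - late_penalty - wrong_dir_penalty, 0.0, 1.0)
```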
3) Exploration progress rewards (exact values)
read_file
- file missing/unsafe -> progress = -0.05
- file already read in this episode -> progress = 0.0
- new file:
  - if the file path contains task["test_file"] -> 0.07
  - else if the file ends with .py -> 0.03
  - else -> 0.01
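The read_file rules above can be sketched as a pure function; the signature and the exists flag are illustrative, and the real code also performs path-safety checks:

```python
def read_file_progress(path: str, already_read: set, test_file: str,
                       exists: bool = True) -> float:
    if not exists:
        return -0.05          # missing/unsafe file
    if path in already_read:
        return 0.0            # re-read within the same episode
    if test_file in path:
        return 0.07           # the task's own test file
    if path.endswith(".py"):
        return 0.03           # any other Python file
    return 0.01               # non-Python file
```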
search_code
- base reward:
  - if the query contains any flaky-signal token (sleep, random, time, datetime, thread, asyncio, fixture, setup, teardown, global, shared, singleton, os.environ, socket, timeout, retry, mock, patch) -> 0.04
  - otherwise -> 0.01
- spam penalties (all apply, then summed and capped):
  - repeated same normalized search pattern in the episode: repeat_penalty = min(0.02 * (pattern_count - 1), 0.12) for pattern_count > 1
  - repeated same search context (same normalized pattern plus same extracted top .py hit files): context_penalty = min(0.03 * (context_count - 1), 0.15) for context_count > 1
  - long search-only streak: streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20) for consecutive_searches > 3
  - total spam penalty cap: min(sum_penalties, 0.35)
- final search_code progress: progress = max(-0.25, base_reward - spam_penalty)
- the environment appends WARNING: text to the tool output when penalties fire; consecutive_searches resets on any non-search_code action.
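The base reward, the three capped penalties, and the final floor combine as in this sketch (the signature is illustrative; the real code tracks the counters internally):

```python
def search_progress(base_reward: float, pattern_count: int,
                    context_count: int, consecutive_searches: int) -> float:
    # Each penalty only fires above its threshold and has its own cap.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = (min(0.02 * (consecutive_searches - 3), 0.20)
              if consecutive_searches > 3 else 0.0)
    spam = min(repeat + context + streak, 0.35)  # total spam cap
    return max(-0.25, base_reward - spam)        # progress floor
```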
run_test
- if the category is not one of OD, OD-Brit, OD-Vic -> 0.05
- if the category is order-dependent (OD, OD-Brit, OD-Vic) -> 0.0
unsupported action type
- progress = -0.05
4) Task 1 scorer (classify_flakiness)
Binary exact-match scorer:
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
Notes:
- In the current dataset builder, rows are written with label = "flaky" by default.
- Predicting "stable" on a flaky truth also triggers the environment's wrong_dir_penalty = 0.2.
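A runnable sketch of the Task 1 scorer; the action field names ("action_type", "prediction") are assumptions for illustration:

```python
def grade_classify(action: dict, task: dict) -> float:
    # Field names on the action dict are assumed for this sketch.
    if action.get("action_type") != "classify_flakiness":
        return 0.001
    predicted = action.get("prediction")
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")  # dataset default is "flaky"
    return 0.999 if predicted == truth else 0.001
```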
5) Task 2 scorer (classify_root_cause)
Matrix-based similarity scorer.
5.1 Category normalization
Prediction and truth are normalized by:
- trim whitespace
- replace _ with -
- replace spaces with -
- uppercase, then map through canonical aliases: OD-BRIT -> OD-Brit, OD-VIC -> OD-Vic, etc.
If normalized value is not in valid set, score is 0.001.
The truth category is the first entry when the stored category is semicolon-separated:
raw_truth = str(task["category"]).split(";")[0]
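Normalization can be sketched as below. The alias map is the partial one shown above, and the valid set is inferred from the category names appearing in this document:

```python
# Partial alias map from the doc; VALID is inferred from categories named here.
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}
VALID = {"OD", "OD-Brit", "OD-Vic", "NOD", "NIO", "NDOI", "TD", "TZD", "ID", "UD"}

def normalize_category(raw):
    value = str(raw).strip().replace("_", "-").replace(" ", "-").upper()
    value = ALIASES.get(value, value)
    return value if value in VALID else None  # None -> caller scores 0.001

def truth_category(task):
    # First entry when the stored category is semicolon-separated.
    return normalize_category(str(task["category"]).split(";")[0])
```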
5.2 Similarity scoring
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
The similarity matrix is loaded from dataset/category_similarity.json.
Current non-identity similarity entries:
- OD, OD-Brit: 0.7
- OD, OD-Vic: 0.7
- OD-Brit, OD-Vic: 0.8
- OD, NIO: 0.4
- OD, NDOI: 0.3
- NOD, TD: 0.6
- NOD, TZD: 0.5
- NOD, NDOI: 0.5
- TD, TZD: 0.7
- NOD, ID: 0.3
- UD, OD: 0.2
- UD, NOD: 0.2
- UD, NIO: 0.2
- UD, TD: 0.2
- UD, ID: 0.2
Any missing pair defaults to 0.0.
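A sketch of the symmetric lookup using the entries listed above:

```python
# Non-identity entries from dataset/category_similarity.json, as listed above.
SIMILARITY = {
    ("OD", "OD-Brit"): 0.7, ("OD", "OD-Vic"): 0.7, ("OD-Brit", "OD-Vic"): 0.8,
    ("OD", "NIO"): 0.4, ("OD", "NDOI"): 0.3, ("NOD", "TD"): 0.6,
    ("NOD", "TZD"): 0.5, ("NOD", "NDOI"): 0.5, ("TD", "TZD"): 0.7,
    ("NOD", "ID"): 0.3, ("UD", "OD"): 0.2, ("UD", "NOD"): 0.2,
    ("UD", "NIO"): 0.2, ("UD", "TD"): 0.2, ("UD", "ID"): 0.2,
}

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def similarity_score(predicted: str, truth: str) -> float:
    if predicted == truth:
        return 0.999
    # Try both orderings; any missing pair defaults to 0.0 (clamped to 0.001).
    sim = SIMILARITY.get((predicted, truth)) or SIMILARITY.get((truth, predicted)) or 0.0
    return clamp(sim, 0.001, 0.999)
```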
6) Task 3 scorer (propose_fix)
Hybrid weighted scorer:
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
6.1 pattern_score
Category-specific keyword patterns are checked against the proposed diff.
For category with pattern list:
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
If category has no pattern list:
pattern_score = 0.5
Current pattern lists:
- TD: freeze_time, mock, patch, utcnow, datetime, monkeypatch
- TZD: timezone, utc, pytz, zoneinfo, tzinfo, UTC
- NOD: seed, mock, patch, deterministic, sorted
- NIO: setup, teardown, fixture, yield, cleanup, autouse
- ID: sorted(, list(, frozenset, OrderedDict
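A sketch of the pattern_score formula, shown here with the TD list from above:

```python
TD_PATTERNS = ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"]

def pattern_score(diff: str, patterns: list) -> float:
    if not patterns:
        return 0.5  # category has no pattern list
    text = diff.lower()
    # Case-insensitive substring matching against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in text)
    # Full credit already at ~40% of the patterns matched, capped at 0.999.
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```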
6.2 apply_score (_check_diff_applies)
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
return 0.999 if patch exit code == 0
return 0.001 otherwise
on exception: return 0.3
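A sketch of _check_diff_applies; the temp-file handling and exact subprocess invocation are assumptions based on the command shown above:

```python
import os
import subprocess
import tempfile

def apply_score(diff: str, sandbox_root) -> float:
    # Cheap structural check: a unified diff needs both file markers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # No sandbox to test against: neutral-ish partial credit.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff)
            patch_path = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```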
6.3 judge_score (_llm_judge)
LLM judge behavior:
- if no API key is available -> judge_score = 0.5
- otherwise, sends a judge prompt asking for JSON {"score": 0..10, "reason": ...}
- parses the integer score, clamps it to [0, 10], then scales it to [0, 1]: judge_score = clamp(int_score, 0, 10) / 10
- on any judge exception or parse failure -> judge_score = 0.5
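The parse-and-clamp step can be sketched as follows; the reply format follows the JSON contract above, and the helper name is hypothetical:

```python
import json

def judge_score_from_reply(reply_text: str) -> float:
    # Parse the judge's JSON reply; any failure falls back to 0.5.
    try:
        int_score = int(json.loads(reply_text)["score"])
    except Exception:
        return 0.5
    # Clamp to [0, 10], then scale to [0, 1].
    return max(0, min(10, int_score)) / 10
```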
API/model resolution in the judge:
- API key preference: API_KEY -> OPENROUTER_API_KEY -> OPENAI_API_KEY
- base URL:
  - OpenRouter inferred -> https://openrouter.ai/api/v1
  - else -> https://api.openai.com/v1
- model default:
  - OpenRouter base URL -> qwen/qwen3.6-plus:free
  - else -> gpt-4o-mini
7) Worked examples
Example A: Task 1 correct classify early
- cumulative_progress = 0.05
- terminal_score = 0.999
- late_penalty = 0.0
- wrong_dir_penalty = 0.0
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = 1.0
Example B: Task 2 wrong category but some exploration
- cumulative_progress = 0.05
- terminal_score = 0.001 (no similarity match)
- penalties = 0
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
Example C: Task 3 with weak fix and no API key
- judge_score = 0.5 (fallback)
- apply_score and pattern_score depend on the diff contents
- the final weighted sum is then clamped and rounded to 4 decimals.
8) Important implementation notes
- cumulative_progress is capped at 0.30 and never drops below 0.0.
- Terminal reward can be reduced by the late penalty after step 15.
- Timeout does not invoke the grader; it only ends the episode.
- Dataset construction choices (especially label and category quality) heavily influence observed score behavior.
9) Inference-side controls (not grader formulas)
inference.py now includes policy/runtime controls that do not change grader math directly but do change agent behavior:
- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- an explicit loop-warning prompt when no-progress/duplicate patterns are detected
- duplicate read_file attempts are overridden to targeted search_code
- conversation compaction controls:
  - --history-prune-start-step (default 12)
  - --history-window-turns (default 4)
  - --history-max-chars (default 50000)
- detailed tracing options (--trace-agent, --trace-prompts) for audit/debug