
# Edits log — Round-2 environment extension

This file tracks every change made on top of the Round-1 submission, in order. Useful both as a journal and as a re-deploy checklist.

Round-1 baseline: commit bf77949 ("Update readme") on main. Single family (xlsx), 10 hand-curated Finch tasks, monolithic graders.py, heuristic step-rewards.

Round-2 target: unified office-document RL environment — xlsx + docx + pptx, real enterprise artifacts, gaming-resistant multi-layer grading, manifest-driven.


## State at Round-1 (baseline)

| Area | What was there |
|---|---|
| Task families | xlsx only |
| Number of tasks | 10 hand-curated, all from Finch |
| Task definitions | Hardcoded TASKS = {...} dict in tasks.py |
| Source data | data/<orig_id>/{src,ref}_0.xlsx — 10 dirs |
| Grading | One graders.py module, two functions: grade_qa (text) and grade_xlsx (cell-diff) |
| Step rewards | _compute_code_reward in server/financial_environment.py: heuristics on the code string (regex save(, count of substantive lines, length of stdout). Cap 0.10/step. |
| Sandboxing | None — agent's subprocess has full filesystem access |
| Reward components | 4 signals, all heuristic, partly gameable |
| Train/eval split | None |
| Deps | openpyxl only |

## Known weaknesses identified before changes

1. save( string match misses prs.save(), Document.save() — wouldn't generalize past xlsx.
2. No measurement of whether the file actually changed; just whether the code mentioned save.
3. No "moving toward gold" signal.
4. Hardcoded task table — can't scale past ~30 tasks without bloat.
5. Gold files reachable from sandbox via glob(data/**) → reward hacking.

## Phase 1 — Manifest loader + 50 stratified Finch tasks

Goal: scale beyond 10 hand-curated tasks; introduce a manifest the env loads at startup so future task families (docx, pptx) plug in cleanly.

### New files

- data_pipeline/finch_pull.py — stratified puller for the FinWorkBench/Finch HF dataset (172 tasks). Picks 50 xlsx-only MODIFY tasks across 7 tag buckets:

  | Tag | Picked of total |
  |---|---|
  | Calculation | 16 of 119 |
  | Structuring / Formatting | 11 of 86 |
  | Data Entry / Import | 6 of 44 |
  | Validation / Review | 5 of 37 |
  | Cross-sheet/file Retrieval | 5 of 36 |
  | Summary / Visualization | 4 of 33 |
  | Financial Modeling | 3 of 15 |

  Web Search dropped — all such tasks have non-xlsx sources. Slots reallocated to Calculation + Structuring.

- data/manifest.jsonl — 50 rows, schema:

  ```json
  {"id": "finch_10", "family": "xlsx", "origin": "finch", "orig_id": "10",
   "split": "eval", "primary_tag": "Calculation",
   "all_tags": ["Calculation", "Financial Modeling"],
   "business_type": "Predictive Modeling",
   "instruction": "...", "constraints": "...",
   "source_file": "data/finch_50/10/10_src_0.xlsx",
   "reference_file": "data/finch_50/10/10_ref_0.xlsx",
   "task_type": "MODIFY", "max_steps": 15}
  ```

- data/finch_50/<id>/{src,ref}.xlsx — ~42 MB, 50 tasks × 2 files.

### Train/eval split

- 40 train / 10 eval (stratified — at least 1 holdout per tag).
- Driven by per-tag EVAL_HOLDOUT budget in the puller.

### Modified files

- tasks.py — added _load_manifest() (sketched below) that reads data/manifest.jsonl and merges rows into TASKS (skipping any whose ID already exists, so the original 10 hand-curated tasks remain). Added list_tasks(split=, family=), split_ids() filters.
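A minimal sketch of what that loader could look like (names match the description above; the real implementation may differ in details such as path resolution):

```python
import json
from pathlib import Path

MANIFEST = Path("data/manifest.jsonl")

def _load_manifest(tasks: dict) -> dict:
    """Merge manifest rows into TASKS, keeping existing hand-curated entries."""
    if not MANIFEST.exists():
        return tasks
    with MANIFEST.open() as fh:
        for line in fh:
            row = json.loads(line)
            tasks.setdefault(row["id"], row)  # skip IDs that already exist
    return tasks
```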

### Resulting task counts

- 60 total (10 original + 50 Finch), 50 train / 10 eval.

## Phase 2 — Unified RewardTracker

Goal: replace heuristic code-string scoring with real file-state signals, generalizable across xlsx/pptx/docx.

### New file

- rewards.py — RewardTracker class, one instance per episode.

### Reward components (all per-step, summed and clamped to 0.10)

| Component | Range | What it actually checks |
|---|---|---|
| exec_health | 0–0.020 | Subprocess return code; bonus if stdout non-empty |
| lib_engagement | 0–0.010 | Code matches _LIB_PATTERNS[family] regex (xlsx → openpyxl/load_workbook/Workbook; pptx → Presentation; docx → Document) |
| mutation | 0–0.030 | SHA-256 of working file changed since last step |
| validity | 0–0.020 | Mutated file still parses with the family's loader |
| progress | 0–0.040 | Structural distance to gold decreased this step (gated by enable_progress) |

### Per-family structural distance (in rewards.py)

- _xlsx_distance — fraction of gold cells matched (mirrors final grader); sketched below
- _pptx_distance — fraction of gold (slide_idx, shape_idx) text-frames matched
- _docx_distance — fraction of gold paragraphs matched at same index
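A hedged sketch of the xlsx variant, assuming it mirrors the final grader's cell diff (the real function may weight sheets or formulas differently):

```python
from openpyxl import load_workbook

def _xlsx_distance(working_path: str, gold_path: str) -> float:
    """0.0 = every non-empty gold cell matched, 1.0 = none matched."""
    work = load_workbook(working_path)
    gold = load_workbook(gold_path)
    total = matched = 0
    for name in gold.sheetnames:
        gsheet = gold[name]
        wsheet = work[name] if name in work.sheetnames else None
        for row in gsheet.iter_rows():
            for cell in row:
                if cell.value is None:
                    continue
                total += 1
                if wsheet is not None and wsheet.cell(cell.row, cell.column).value == cell.value:
                    matched += 1
    return 1.0 - matched / total if total else 0.0
```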

### Modified files

- server/financial_environment.py:
  - Replaced _compute_code_reward with a delegate to RewardTracker
  - _compute_code_reward now returns (total, breakdown_dict) instead of just a float
  - Per-episode tracker stood up in reset() after copying source to workdir
  - FINANCIAL_ENV_PROGRESS=0 env var disables the progress signal (for clean eval)
  - Reward decomposition surfaced in feedback for debugging

### Smoke test results

- Read-only step: 0.030 (exec_health 0.020 + lib_engagement 0.010)
- Save+modify step: 0.080 (+ mutation 0.030 + validity 0.020)
- Failed code: 0.005 (exec_health_fail only)
- Decomposition logged in feedback, e.g.: Reward: total=0.080 (exec_health=0.020, lib_engagement=0.010, mutation=0.030, validity=0.020, progress=0.000)

## Phase 3 — DOCX family (OSWorld-Verified writer subset)

Goal: add Microsoft Word (.docx) tasks alongside xlsx, with real property-checking evaluators ported from OSWorld.

### New files

- data_pipeline/osworld_writer_pull.py — pulls 21 strict-docx tasks from xlang-ai/OSWorld (GitHub) and xlangai/ubuntu_osworld_file_cache (HF). Of the 23 published writer UUIDs, drops 2 (one .odt, one .pdf source), leaving 21 strict-docx.

  Schema normalization: OSWorld evaluators come in two shapes (single-string func vs. compound func: list[str] with conj: "or"|"and" and parallel expected/options lists). The puller normalizes everything to evaluator: {conj, checks: [{func, options, expected_files: [...]}, ...]}. Multi-gold (multi: true) tasks have expected_files as a list per check.

- graders/docx_metrics.py — port of 16 evaluator functions from OSWorld's desktop_env/evaluators/metrics/docs.py (Apache-2.0). Heavy deps (skimage, easyocr) imported lazily; one function (find_default_font) stubbed because it operates on a LibreOffice config XML that doesn't exist in our headless sandbox.

  Added infeasible handler: passes iff the agent didn't modify the source (the agent should refuse). The bb8ccc78 task ("Share this document with my team and let us edit it together in real-time") uses this — it's genuinely impossible from a code-execution sandbox.

  Dispatcher: run_evaluator(conj, checks, working_file, source_file) — and = min(scores), or = max(scores); sketched after the table below.

  | Evaluator | Tasks | Style |
  |---|---|---|
  | compare_docx_files | 7× | Content diff (with options: ignore_blanks, ignore_case, fuzzy_match, …) |
  | compare_line_spacing | 3× | Property |
  | compare_docx_tables | 3× | Structure |
  | check_tabstops | 1× | Property + position-distance |
  | compare_subscript_contains | 1× | Property |
  | has_page_numbers_in_footers | 1× | Single-file property |
  | compare_font_names | 1× | Single-file property |
  | is_first_line_centered | 1× | Single-file property |
  | compare_docx_images | 1× | Pixel-byte diff |
  | compare_unique_train_records | 1× | Multi-file domain logic |
  | evaluate_strike_through_last_paragraph | 1× | Property |
  | evaluate_colored_words_in_tables | 1× | Skimage CIE delta-E |
  | infeasible | 1× | Sentinel (file-unchanged check) |
  | check_italic_font_size_14 | 1× | Property |
  | contains_page_break | 1× | Property |
  | find_default_font | 1× | Stubbed (LO-config-dependent) |
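A hedged sketch of the dispatcher; METRICS is a registry name assumed here, and real signatures vary more than shown (the infeasible check, for example, also needs source_file):

```python
METRICS: dict = {}  # evaluator name -> ported function, populated in docx_metrics.py

def run_evaluator(conj: str, checks: list[dict],
                  working_file: str, source_file: str) -> float:
    scores = []
    for check in checks:
        func = METRICS[check["func"]]
        opts = check.get("options") or {}
        golds = check.get("expected_files") or []
        if golds:  # multi-gold: best score against any acceptable gold file
            score = max(float(func(working_file, g, **opts)) for g in golds)
        else:      # single-file property checks inspect the working file alone
            score = float(func(working_file, **opts))
        scores.append(score)
    return min(scores) if conj == "and" else max(scores)
```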

### File reorganization (mid-phase)

- Renamed graders.py (root module) → graders/__init__.py (package). Forced because graders/ (new dir for docx_metrics.py) collides with graders.py (old root file) — Python won't accept both. Existing from graders import grade_task imports still work transparently.

### New 3-layer DOCX grader

In graders/__init__.py:

```python
def grade_docx(task, output_path):
    if not _docx_validity(output_path):     # layer 1 — validity gate
        return 0.001
    diff_score    = _docx_diff(output_path, task["reference_file"])    # layer 2
    primary_score = run_evaluator(...)                                 # layer 3
    return 0.4 * diff_score + 0.6 * primary_score
```

The dispatcher (grade_task) routes by task["family"] — xlsx still uses the cell-diff path, docx uses the new 3-layer path.

### Modified files

- tasks.py — manifest loader now passes through evaluator, primary_tag, all_tags. Resolves the evaluator's expected_files to absolute paths (matters for the gold-stash dedup in Phase 4).

- pyproject.toml + Dockerfile — added python-docx>=1.1.0, rapidfuzz>=3.0.0, Pillow>=10.0.0.

### Resulting task counts

- 81 total (10 original + 50 Finch xlsx + 21 OSWorld docx).
- 17 docx train, 4 docx eval (stratified to cover 4 distinct evaluator funcs).

### Smoke test results

- Submit gold to compound and×2 task → 0.999 ✓
- Submit corrupted bytes → 0.001 (validity gate rejects) ✓
- Submit unmodified source → 0.400 (diff layer says similar, per-task says no-edit)

### OSWorld quirk noted

- osworld_0a0faba3 (check_tabstops task): the gold itself doesn't satisfy word_number_split_by_tabstop=3 for paragraph [2] ("Make payment\t..." has only 2 words before the tab). This is a faithful port of OSWorld's behavior, not a bug in our code. May want to relax the rule for training or move that task to eval-only.

## Phase 4 — Reward-hacking defenses

Goal: plug the two biggest hacking surfaces identified in the Q2 audit.

### Defense 1 — Gold file moved out of sandbox at episode start

Threat: glob('/app/env/data/**/*Gold*.docx') or glob('**/*_ref_*.xlsx') finds the gold; agent submits it for an instant 0.999.

Fix: at reset():

  1. Make a per-episode COPY of the global TASKS[id] dict (so episode-time path mutations don't pollute the global)
  2. Create a tmpdir at /tmp/oe_gold_<random>/
  3. Move (rename) every gold file from data/... into the tmpdir with a generic name (gold_ref<ext>, check_<i>_<j>_<random><ext>)
  4. Track the moves in self._gold_originals so close() can restore
  5. Rewrite the episode-task's reference_file and evaluator.checks[*].expected_files to point at the tmpdir paths

De-dup: when the same path appears as both reference_file and an evaluator expected_files entry (common — the puller sets reference_file = first check's first expected_file), the stasher uses a path_map to ensure both new paths point to the same stashed location.

Restore: close() renames stashed files back to their original data/ locations. reset() calls close() at the start of each episode in case the prior episode didn't end cleanly.
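A condensed sketch of the stash mechanics (free-function form with assumed names; the real code is a method on the env class):

```python
import os
import secrets
import shutil

def stash_gold_files(task: dict, stash_dir: str, gold_originals: dict) -> None:
    """Move gold files out of data/ and repoint the episode-task at the copies."""
    path_map: dict[str, str] = {}  # original -> stashed; dedups ref/evaluator overlap

    def stash(orig: str, hint: str) -> str:
        if orig not in path_map:
            dest = os.path.join(stash_dir, hint + os.path.splitext(orig)[1])
            shutil.move(orig, dest)        # move, not copy: the file leaves data/
            gold_originals[dest] = orig    # recorded so close() can restore it
            path_map[orig] = dest
        return path_map[orig]

    task["reference_file"] = stash(task["reference_file"], "gold_ref")
    for i, check in enumerate(task.get("evaluator", {}).get("checks", [])):
        check["expected_files"] = [
            stash(p, f"check_{i}_{j}_{secrets.token_hex(4)}")
            for j, p in enumerate(check.get("expected_files", []))
        ]
```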

### Defense 2 — Per-task evaluator as 6th reward signal

Threat: the previous 5 components rewarded "moved closer to gold via generic structural distance", which an agent could optimize without satisfying the actual property check the task is testing.

Fix: new eval_check component (0–0.020). Computes the per-task evaluator at episode start, then on each mutating step. Rewards increases in spec-aligned score.

```python
# rewards.py
if self.task_evaluator is not None and file_valid:
    cur_eval = self._safe_task_eval()
    if self._prev_eval is not None and cur_eval > self._prev_eval:
        delta = cur_eval - self._prev_eval
        sig.eval_check = min(EVAL_CHECK_MAX, EVAL_CHECK_MAX * delta)
    self._prev_eval = cur_eval
```

For docx, the env passes task_evaluator = run_evaluator(conj, checks, ...) into the tracker. xlsx/pptx pass None.

### Modified files

- rewards.py:
  - Added task_evaluator param to RewardTracker.__init__
  - Added eval_check field to StepSignals + recomputed total
  - Added EVAL_CHECK_MAX = 0.020 constant
  - Added _safe_task_eval() helper
- server/financial_environment.py:
  - __init__: added _gold_stash_dir, _gold_originals fields
  - reset(): copies task dict, creates stash dir, calls _stash_gold_files, builds the task_evaluator callable for docx, passes it to RewardTracker
  - New methods: _stash_gold_files(task, stash_dir), _make_task_evaluator()
  - close(): restores moved gold files to data/, removes stash dir

### Smoke test results

| Scenario | Score | Expected | Result |
|---|---|---|---|
| Compound and×2 docx, submit stashed gold | 0.999 | ~0.999 | ✓ |
| Single-check docx, submit stashed gold | 0.999 | ~0.999 | ✓ |
| Submit corrupted bytes | 0.001 | 0.001 | ✓ (validity gate) |
| Submit source (unmodified) | 0.400 | partial | ✓ (diff 1.0, per-task 0) |
| xlsx (no per-task evaluator), submit gold | 0.999 | ~0.999 | ✓ |
| Code step copies stashed gold to working file | total=0.090 | eval_check=0.020 should fire | ✓ |
| Original gold file present in data/ during episode | False on disk | False | ✓ (moved out) |
| Original restored after close() | True on disk | True | ✓ |

## Phase 5 — PPTX family (PPTArena ingest)

Goal: add Microsoft PowerPoint (.pptx) tasks. PPTArena chosen over TSBench because PPTArena ships actual gold .pptx files; TSBench only has ideal_description text and would need an LLM judge.

### Source

Local checkout of PPTArena unpacked at ~/Downloads/PPTArena-main. The repo's src/evaluation_pairs_refined.json has 100 well-curated task pairs:

```json
{
  "name": "Case 31: Fix Text Overflow",
  "prompt": "...",
  "style_target": "<detailed expected output spec>",
  "original": "Original/<file>.pptx",
  "ground_truth": "GroundTruth/<file>.pptx",
  "category": ["Content", "Layout"],
  "edit_type": "Text & Typography"
}
```

Distribution across the 100:

| edit_type | count |
|---|---|
| Text & Typography | 29 |
| Charts | 10 |
| Images & Pictures | 10 |
| Theme & Background | 9 |
| Alignment, Distribution & Z-order | 8 |
| Slide/Section Management & Footers | 8 |
| Tables | 8 |
| Shapes & Drawing | 4 |
| SmartArt & Diagrams | 4 |
| Slide Layout & Placeholders | 3 |
| Accessibility & Semantics | 2 |
| Long-tail singletons (Transitions, Hyperlinks, Master, Audio/Video, Animations) | 1 each |

### New file

- data_pipeline/pptarena_pull.py — reads evaluation_pairs_refined.json, picks 38 tasks stratified by edit_type. Sub-budget below; sum is 38 (close to the 40 target — the gap is from the long-tail edit_types having only 1 sample each).

  | edit_type | picked of total |
  |---|---|
  | Text & Typography | 6 of 29 |
  | Charts | 4 of 10 |
  | Images & Pictures | 4 of 10 |
  | Theme & Background | 3 of 9 |
  | Alignment, Distribution & Z-order | 3 of 8 |
  | Slide/Section Management & Footers | 3 of 8 |
  | Tables | 3 of 8 |
  | Shapes & Drawing | 2 of 4 |
  | SmartArt & Diagrams | 2 of 4 |
  | Slide Layout & Placeholders | 2 of 3 |
  | Accessibility & Semantics | 1 of 2 |
  | Long-tail singletons | 5 × 1 of 5 |

  Long-tail singletons all go to train (only 1 sample each — can't hold out). Eval holdout = 8: 2 from Text & Typography, 1 each from {Charts, Images, Theme, Alignment, Slide Mgmt, Tables}.

  The agent-facing instruction is prompt + "\n\nDetails:\n" + style_target — style_target carries the explicit spec PPTArena uses internally for evaluation, exposed to the agent as a "hidden but visible" constraint.

### Data layout

```text
data/pptarena/<slug>/
   <slug>_src.pptx   # copied from PPTArena-main/Original/
   <slug>_ref.pptx   # copied from PPTArena-main/GroundTruth/
```

Total disk: ~244 MB for 38 tasks (pptx files are larger than docx/xlsx — they contain embedded images and themes).

### Grader — grade_pptx (2-layer, no per-task evaluator)

In graders/__init__.py:

```python
def grade_pptx(task, output_path):
    if not _pptx_validity(output_path):  # layer 1
        return 0.001
    # layer 2: structural diff
    #   slide-count match (30%) + per-shape text-equality (70%, fuzzy 90%+ allowed)
    ...
```

Per-task evaluator is intentionally not wired. PPTArena's published evaluator is a VLM-as-judge pipeline (instruction-following + visual quality), which is expensive and non-deterministic. Skipping for v1; wiring it behind the optional PPTX_VLM_JUDGE=1 flag is in the Open Issues list.

### Modified files

- graders/__init__.py (grade_pptx), pyproject.toml + Dockerfile (python-pptx)

### Resulting task counts (cumulative)

| Family | Origin | Train | Eval | Total |
|---|---|---|---|---|
| xlsx | hand-curated | 10 | 0 | 10 |
| xlsx | Finch | 40 | 10 | 50 |
| docx | OSWorld | 17 | 4 | 21 |
| pptx | PPTArena | 30 | 8 | 38 |
| total | | 97 | 22 | 119 |

### Smoke test results

| Scenario | Score | Expected | Result |
|---|---|---|---|
| Submit stashed gold (eval task) | 0.999 | ~0.999 | ✓ |
| Submit corrupted .pptx bytes | 0.001 | 0.001 | ✓ (validity gate) |
| Code step that mutates + saves (add blank slide) | total=0.080 | ≥0.06 | ✓ (exec=0.020, lib=0.010, mutation=0.030, validity=0.020) |
| Gold-stash works for pptx (file moves out of data/) | True | True | ✓ |
| close() restores gold to data/ | True | True | ✓ |

### Known limitation: text-only diff is weak for layout tasks

For an Alignment / Layout task (e.g. Case 60: Fix Text Placement), source and ground-truth have near-identical text content — only shape positions differ. Our diff layer scores 0.999 on the unmodified source for this case, which is not what we want. Two paths to fix:

1. Extend grade_pptx with position+size diff (cheap; ~30 lines): for each (slide_idx, shape_idx) pair, compare (left, top, width, height) within tolerance. Recompose the score as 0.2 * slide_count + 0.8 * avg(0.5 * text_match + 0.25 * position_match + 0.25 * size_match).

2. Wire VLM judge behind PPTX_VLM_JUDGE=1 env var — render slides via headless LibreOffice → PNG, send (instruction, before, after, ref) to a VLM. Matches PPTArena's published methodology but is expensive.

Recommended: (1) before any RL training; (2) for the final eval scoreboard.

## Phase 5 follow-up: layout-aware diff (delivered)

Implemented option (1) above. The grader now loads every shape's (left, top, width, height) (in EMU) and computes a per-shape composite score:

- Text (50%) — exact match → 1.0; rapidfuzz partial credit otherwise.
- Position (25%) — _coord_match(left, denom=slide_w) averaged with the same for top. Tolerance: delta ≤ 2% of slide dim → 1.0; delta ≥ 20% → 0.0; linear in between. Both sides None (placeholder inheriting from layout) is treated as a match. See the sketch below.
- Size (25%) — same _coord_match for width/height.

Final score reweighted: 0.2 * slide_count + 0.8 * avg(per-shape composite).
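A minimal sketch of the tolerance ramp just described (assumed shape; coordinates are EMU ints from python-pptx):

```python
def _coord_match(a: int | None, b: int | None, denom: int) -> float:
    """1.0 within 2% of the slide dimension, 0.0 beyond 20%, linear between."""
    if a is None and b is None:
        return 1.0          # both inherit from the layout -> treated as a match
    if a is None or b is None:
        return 0.0
    frac = abs(a - b) / denom
    if frac <= 0.02:
        return 1.0
    if frac >= 0.20:
        return 0.0
    return 1.0 - (frac - 0.02) / 0.18  # linear ramp between the two thresholds
```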

### Smoke results on all 8 pptx eval tasks (source-vs-gold)

| Task | Before fix | After fix | Notes |
|---|---|---|---|
| case_36_add_speaker_notes | 0.999 | 0.683 | Big drop — entire shapes added in gold |
| case_32_arrange_image_and_text | 0.999 | 0.824 | Position diff captured |
| case_7_update_quarter_two_data_b | 0.999 | 0.948 | Chart text + size diff |
| case_60_fix_text_placement | 0.999 | 0.981 | Modest — positions in tolerance band |
| case_35_structural_fix | 0.999 | 0.971 | Modest |
| case_49_normalize_thousand_separators | 0.999 | 0.992 | Tiny text edit, no layout change |
| case_40_hindu_center_titles | 0.999 | 0.997 | Title-alignment only — small px shift |
| case_26_match_slide_colors_to_theme | 0.999 | 0.999 | Pure color/theme — geometry unchanged |

5 of 8 eval tasks now show a meaningful drop. The remaining 3 (case_40, case_49, case_26) still score ~0.99 because their edits are styling-only — color, font, fill — which our geometry-only diff doesn't see.

### Remaining gap: styling-only tasks (29 of 100 PPTArena tasks)

Styling tasks edit shape fill, line, font name/size/bold/italic/color, or theme — none of which are captured by text + geometry. Two ways to close the gap, both filed as new follow-ups:

a. Per-shape style diff: for each shape, compare fill.solid().fore_color.rgb, line.color.rgb, and for the first run in each text frame: font.name, font.size, font.bold, font.italic, font.color.rgb. Add as a 4th component in _shape_match_score. ~50 lines.

b. VLM judge (option 2 above) — catches styling for free since it compares rendered images. Defer to eval-time only because of cost.

For training, (a) is sufficient. For the final scoreboard, (b) is nicer.

## Phase 5 follow-up #2: style-aware diff (delivered)

Implemented option (a) above. New _shape_style() extractor (sketched below, after the weight table) pulls 7 attributes per shape, all None-tolerant — failures during read become None, which counts as a match against another None:

| Attribute | Weight | Source |
|---|---|---|
| fill_rgb | 0.30 | shape.fill.fore_color.rgb (solid fills only) |
| font_rgb | 0.20 | first-run font.color.rgb |
| font_size_pt | 0.15 | first-run font.size.pt |
| font_name | 0.10 | first-run font.name |
| line_rgb | 0.10 | shape.line.color.rgb |
| font_bold | 0.075 | first-run font.bold |
| font_italic | 0.075 | first-run font.italic |

Per-shape composite reweighted from 50% text + 25% pos + 25% size to:

40% text + 20% style + 20% position + 20% size

Why these weights? Text is still dominant because most edits affect text content. Style gets equal weight to position/size, reflecting that styling edits are common in PPTArena (~29 tasks).
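A hedged sketch of the extractor (python-pptx attribute paths as in the table; the real code may guard each read differently):

```python
def _shape_style(shape) -> dict:
    def safe(read):
        try:
            return read()
        except Exception:
            return None  # unreadable attribute -> None, which matches another None

    run = safe(lambda: shape.text_frame.paragraphs[0].runs[0])
    return {
        "fill_rgb":     safe(lambda: str(shape.fill.fore_color.rgb)),
        "font_rgb":     safe(lambda: str(run.font.color.rgb)),
        "font_size_pt": safe(lambda: run.font.size.pt),
        "font_name":    safe(lambda: run.font.name),
        "line_rgb":     safe(lambda: str(shape.line.color.rgb)),
        "font_bold":    safe(lambda: run.font.bold),
        "font_italic":  safe(lambda: run.font.italic),
    }
```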

### Smoke results across all 8 pptx eval tasks (source-vs-gold)

| Task | Phase-5 layout-only | Phase-5 + style | Discrimination (gold − source) |
|---|---|---|---|
| case_26_match_slide_colors_to_theme | 0.999 | 0.971 | 0.000 → 0.028 ✓ unblocked |
| case_36_add_speaker_notes | 0.683 | 0.715 | 0.316 → 0.284 |
| case_32_arrange_image_and_text | 0.824 | 0.855 | 0.175 → 0.144 |
| case_60_fix_text_placement | 0.981 | 0.985 | 0.018 → 0.014 |
| case_35_structural_fix | 0.971 | 0.975 | 0.028 → 0.024 |
| case_7_update_quarter_two_data_b | 0.948 | 0.951 | 0.051 → 0.048 |
| case_40_hindu_center_titles | 0.997 | 0.998 | tiny |
| case_49_normalize_thousand_separators | 0.992 | 0.994 | tiny |

Gold-vs-gold remained 0.999 on all 8 (no regression).

Trade-off observed: the styling task discrimination went from 0 → 0.028, but text/layout-heavy tasks lost a few percentage points of discrimination because the text weight dropped from 50% → 40%. Net positive but not dramatic.

### The dilution problem (now the binding limitation)

For tasks where only a few shapes out of many are edited (e.g. case_40_hindu_center_titles edits 1 title shape per slide), the diff averages across all shapes — the un-edited majority dominates and the score barely moves between source and gold. This is structural to average-based diff and not a bug.

Two follow-ups to consider:

a. Edit-zone masking — score only shapes whose attributes differ between source and gold (using task.source_file as the baseline). Changes scoring semantics: instead of "how close to gold", you measure "did the agent fix the parts that were supposed to change". ~30 lines, but more invasive than (b) below.

b. VLM judge — compares rendered images, naturally focuses on visible differences. The right long-term answer; expensive — defer to eval-time behind a flag.


## Phase 6 — Inference script v2 (manifest-aware benchmarking)

Goal: Round-1's inference.py was hardcoded to 5 xlsx tasks and produced stdout-only output. Round-2 needs a script that:

  1. Selects tasks from the manifest (filterable by split/family/ids)
  2. Picks the right system prompt per family (openpyxl / python-docx / python-pptx)
  3. Persists results to disk so we can produce reward curves and before/after plots for the judging story

### CLI (new)

```text
python inference.py [--split eval|train|all]
                    [--family xlsx|docx|pptx|all]
                    [--limit N]
                    [--task-ids id1,id2,…]
                    [--output-dir runs/<custom>]
                    [--model <name>]
                    [--api-base <url>] [--env-url <http://…>]
                    [--max-steps 15] [--task-timeout 360]
                    [--temperature 0.0] [--max-tokens 12000]
```

--task-ids overrides --split/--family. Selection is sorted deterministically by (family, primary_tag, id).

### Output structure (new)

Each run writes a runs/<timestamp>_<model_slug>/ directory:

```text
results.json             # summary + per-task records
summary.csv              # flat table for plotting
trajectories/<id>.jsonl  # full step trace per task (action, reward, feedback)
log.txt                  # mirrors stdout
```

results.json shape:

```json
{
  "model": "...",
  "split": "eval", "family": "all",
  "n_tasks": 22, "avg_score": 0.456, "success_rate": 0.318,
  "total_elapsed_s": 1840.5,
  "by_family": {
    "xlsx": {"n": 10, "avg": 0.521},
    "docx": {"n": 4, "avg": 0.402},
    "pptx": {"n": 8, "avg": 0.388}
  },
  "results": [{"task_id": ..., "score": ..., "step_rewards": [...], ...}]
}
```

summary.csv columns: task_id, family, primary_tag, split, score, success, steps, elapsed_s, error — feeds straight into matplotlib/seaborn for the hero plot in the README.

### Family-aware system prompts (new)

The single prompt mentioning openpyxl is replaced by three:

| Family | Prompt mentions |
|---|---|
| xlsx | openpyxl.load_workbook, wb.save(path) |
| docx | from docx import Document, doc.save(path), common imports for shared/enum |
| pptx | from pptx import Presentation, prs.save(path), color/util imports |

Selection is by obs["family"] (env-provided, with fallback to the manifest's family field).

### Other changes

- MAX_STEPS default raised from 10 → 15 to match the env's actual cap (was undercutting agents on hard tasks)
- TASK_TIMEOUT raised from 240s → 360s — pptx tasks have larger files and need more inspection time
- Task selection auto-injects the 10 hand-curated task_1..task_10 (which live in tasks.py, not the manifest) so they remain runnable via --task-ids
- Action extractor now also recognizes docx/pptx strings as code-block hints (was openpyxl-only)
- Trajectory persistence: every (action, reward, feedback) tuple is saved per task — useful as input to SFT warm-start in the eventual training loop

### Smoke validation

- --help prints clean usage
- Loads 119 tasks from manifest + injects 10 hand-curated; selects:
  - --split eval → 22 tasks (10 xlsx + 4 docx + 8 pptx) ✓
  - --task-ids finch_10,osworld_0a0faba3,pptarena_case_60_fix_text_placement → 3 tasks ✓
- Output writers (json/csv/jsonl) round-trip cleanly via synthetic test

A full live benchmark (with model API + env server) is the user's next action β€” costs ~$0.50-2 in API tokens for a 22-task eval depending on model.

### Modified files

- inference.py (all changes described above)

### Files unchanged in Phase 6

- All env-server code, graders, manifest, data, deps

## Phase 7 — Live-discovered exploit + anti-exploit fix

Trigger: during Kimi-K2.5 eval (Apr 25, 2026), the model submitted the unmodified source file in step 1 for two tasks and scored very high:

| Task | Edit type | Score on src-unchanged submit | Why it worked |
|---|---|---|---|
| pptarena_case_40_hindu_center_titles | Title alignment | 0.998 | Paragraph-level alignment wasn't in _shape_style; everything else (text, position, size, font attrs) was identical between source and gold |
| pptarena_case_26_match_slide_colors_to_theme | Theme color | 0.971 | Gold uses theme-color references (None RGB); source uses explicit RGB. The mismatch dilutes across 30 shapes for only ~3% drop |

This is genuine reward hacking by an inference-time agent, exactly what the "hard to game" criterion in the judging guide warns about. Two fixes delivered:

### Fix 1: extended _shape_style (catches the per-attribute gaps)

Added two new attributes to the per-shape style extractor:

| Attribute | Source | Catches |
|---|---|---|
| para_alignment | shape.text_frame.paragraphs[0].alignment | "Center the title" / "right-align" tasks |
| fill_theme | shape.fill.fore_color.theme_color (when fill is solid but .rgb raises) | "Match colors to theme" tasks where gold uses theme refs and source uses explicit RGB |

Reweighted _STYLE_WEIGHTS from 7 attrs → 9 attrs:

| Attribute | Weight |
|---|---|
| fill_rgb | 0.22 |
| fill_theme | 0.08 |
| font_rgb | 0.17 |
| para_alignment | 0.15 |
| font_size_pt | 0.12 |
| line_rgb | 0.08 |
| font_name | 0.08 |
| font_bold | 0.05 |
| font_italic | 0.05 |

Status: improves shape-level discrimination, but the dilution problem still wins when only 2 of 55 shapes change (case_40 src-vs-gold went from 0.998 → 0.997 — basically unchanged because of averaging). This is why we need Fix 2.

### Fix 2: byte-equality anti-exploit at grade time (the actual fix)

Added in graders/__init__.py's grade_task: if the agent's submitted file is byte-identical to the source AND the task isn't OSWorld's infeasible sentinel, return 0.001 immediately.

```python
if src_file_exists and not is_infeasible_task:
    if same_bytes(output_path, source_file):
        return 0.001  # agent didn't actually do anything
```
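same_bytes above is shorthand; a minimal implementation (assumed, not necessarily what the repo does) is a byte-by-byte file compare:

```python
import filecmp

def same_bytes(a: str, b: str) -> bool:
    # shallow=False forces byte-by-byte comparison instead of a stat()-only check
    return filecmp.cmp(a, b, shallow=False)
```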

This kills the entire class of "submit source unchanged" exploits across all three families, regardless of which specific attribute the diff misses. Validation:

| Test | Before fix | After fix |
|---|---|---|
| Submit unmodified source on case_40 | 0.998 | 0.001 ✓ |
| Submit unmodified source on case_26 | 0.971 | 0.001 ✓ |
| Submit gold on case_40 | 0.999 | 0.999 ✓ no regression |
| Submit gold on case_26 | 0.999 | 0.999 ✓ no regression |
| All 8 pptx eval tasks, gold-vs-gold | 0.999 | 0.999 ✓ no regression |

The OSWorld infeasible task (where not modifying is the correct answer) is correctly excluded — that path uses the existing infeasible evaluator function, which already does its own equality check and credits the agent.

### Important implication for SFT corpus building

When we eventually filter trajectories for the SFT corpus, drop any trajectory where n_steps == 1 and the only action was submit_file, even after this fix. Reasons:

1. Defense in depth — if a future grader gap appears, we don't want the student model trained on "submit unchanged" wins
2. A real solve takes at least one code step; a 1-step submit_file is structurally suspicious

This filter is documented as a TODO for the SFT collection script.

### Re-eval needed

The Kimi-K2.5 baseline numbers from runs/baseline_kimi_k25_eval/ were collected with the pre-fix grader. The two exploited tasks are now correctly graded at 0.001 instead of 0.998/0.971, lowering the run's average. Either re-run Kimi on those two tasks with --resume, or recompute the average locally:

```python
# Quick local recompute (no re-inference) — assumes you already pushed
# updated graders. The OLD numbers are inflated; the NEW numbers reflect
# what Kimi actually solved.
```

(Recommendation: re-run with --resume --task-ids pptarena_case_40_hindu_center_titles,pptarena_case_26_match_slide_colors_to_theme. Costs <$0.10.)
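For the local recompute, a sketch (assuming the Phase-6 summary.csv schema; the two exploited task IDs get the post-fix score):

```python
import csv

OVERRIDES = {
    "pptarena_case_40_hindu_center_titles": 0.001,
    "pptarena_case_26_match_slide_colors_to_theme": 0.001,
}

with open("runs/baseline_kimi_k25_eval/summary.csv") as fh:
    rows = list(csv.DictReader(fh))

scores = [OVERRIDES.get(r["task_id"], float(r["score"])) for r in rows]
print(f"corrected avg over {len(scores)} tasks: {sum(scores) / len(scores):.3f}")
```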


## Phase 8 — SFT corpus builder (trajectory → messages-format JSONL)

Goal: turn teacher trajectories (collected on the train split via inference.py --split train) into an SFT-ready corpus for warm-starting a small student model (Qwen2.5-Coder-3B-Instruct) before GRPO.

### New file

- data_pipeline/build_sft_corpus.py — reads a runs/<dir>/{summary.csv, trajectories/*.jsonl} produced by inference.py, applies six filters, and emits a JSONL where each row is one accepted episode in the TRL SFTTrainer messages format:

  ```json
  {"task_id": "...", "family": "xlsx", "primary_tag": "Calculation",
   "split": "train", "score": 0.94, "n_steps": 6,
   "messages": [
     {"role": "system",    "content": <SYSTEM_PROMPTS[family]>},
     {"role": "user",      "content": <task instruction + source path + family>},
     {"role": "assistant", "content": "```python\n…\n```"},
     {"role": "user",      "content": "Code execution result (step 1/15):\n…"},
     {"role": "assistant", "content": "SUBMIT_FILE: /…"},
     ...
   ]}
  ```

### Filters (in order)

| # | Filter | What it drops | Why |
|---|---|---|---|
| 1 | error column non-empty | Failed runs (timeouts, model crashes) | No useful signal |
| 2 | n_steps < --min-steps (default 2) | Trivial 1-step runs | Real solves take ≥1 code step |
| 3 | 1-step submit_file | Trajectories where the only action is submit_file | Defense in depth against grader exploits — Phase 7 proved a model can submit source unchanged and beat the diff threshold; even with the byte-equality check, future grader gaps could re-open this. A real solve takes ≥1 code step; we never want to teach the student "skip the work". Always dropped, regardless of score. |
| 4 | final_score < --score-threshold (default 0.4) | Low-quality solves | Don't train on partial-fail patterns |
| 5 | Malformed action types | Action types outside {code, submit, submit_file} | Schema enforcement |
| 6 | No real work | Trajectories with no successful code step (reward > 0.005) | Drops "model only made syntax errors" cases |

The --min-steps 2 and the explicit 1-step-submit-file check are redundant by design — both catch the same exploit class, so a future refactor that loosens one doesn't open the door. A sketch of the acceptance predicate follows.
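A hedged sketch (filter numbering matches the table; field and key names are assumed from the Phase-6 output schema):

```python
ALLOWED_ACTIONS = {"code", "submit", "submit_file"}

def accept(row: dict, steps: list[dict],
           min_steps: int = 2, score_threshold: float = 0.4) -> bool:
    if row.get("error"):                                               # 1: failed run
        return False
    if int(row["steps"]) < min_steps:                                  # 2: trivial run
        return False
    if len(steps) == 1 and steps[0]["action_type"] == "submit_file":   # 3: exploit shape
        return False
    if float(row["score"]) < score_threshold:                          # 4: low quality
        return False
    if any(s["action_type"] not in ALLOWED_ACTIONS for s in steps):    # 5: schema
        return False
    if not any(s["action_type"] == "code" and s["reward"] > 0.005
               for s in steps):                                        # 6: no real work
        return False
    return True
```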

### Message reconstruction details

- System prompt: imported verbatim from inference.SYSTEM_PROMPTS[family] so the SFT corpus matches what the model sees at deployment.
- First user message: task instruction + constraints + source-file path (extracted from the trajectory's first code action via regex, falls back to manifest's source_file) + family + task type. The env's xlsx-summary section is intentionally skipped to avoid re-opening files at corpus-build time.
- Assistant turns: action content wrapped in the format the extract_action() parser expects:
  - code → ```python\n{content}\n```
  - submit → SUBMIT_ANSWER: {content}
  - submit_file → SUBMIT_FILE: {content}
- User turns: mirror inference.py's per-step feedback message:

  ```text
  Code execution result (step {n}/{max_steps}):
  {feedback}

  Source file: {path}
  ```

### Smoke test (against the MiniMax-M2.1 eval run)

```text
Input rows    : 22
Accepted      : 10
Drops:
  low_score                    12

Accepted breakdown:
  docx      2
  pptx      4
  xlsx      4
Avg steps   : 10.8
Avg score   : 0.794
```

For the actual SFT corpus we'll use train-split teacher trajectories from Kimi-K2.5, not the eval baseline. With 97 train tasks at a ~30–50% retention rate, that's ~30–50 high-quality episodes — enough for a meaningful SFT warm-start before GRPO.

### Modified files

- None (new file only)

### Files unchanged in Phase 8

- env server, graders, manifest, data, deps

## Phase 9 — Hard early-submit gate at the env layer

Trigger: during Phase-2 trajectory collection on the train split, Kimi-K2.5 was still trying to submit the unmodified source file at step 1 (e.g., pptarena_case_91_add_qr_code), even though the Phase-7 grader correctly scored it 0.001. Post-grading defense alone wasn't enough — every wasted "submit at step 1" episode was lost training data and burned API budget.

### Fix: refuse the action before grading

server/financial_environment.py now tracks _code_steps_taken (incremented in _handle_code regardless of success — even a failed code attempt counts). Both submit handlers (_handle_submit_file, _handle_submit_text) check _code_steps_taken >= _min_code_steps_before_submit (default 1) and return early with explanatory feedback if not; a sketch follows below.

Crucially, the rejection does NOT end the episode:

- The agent gets back a feedback message: ❌ Submit rejected: you must execute at least 1 code step before submitting...
- The reward for the rejected step is 0.001
- done=False — the agent has its remaining steps (15 - n_used) to recover

This shape is exactly right for an RL agent: ending the episode would make a single bad attempt catastrophic; keeping it open turns it into a corrective signal.

The minimum is overridable via FINANCIAL_ENV_MIN_CODE_STEPS env var. Set to 0 to disable the gate (useful only for debugging).
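A condensed sketch of the gate (return shape simplified; the real handlers build a full step response):

```python
import os

MIN_CODE_STEPS = int(os.environ.get("FINANCIAL_ENV_MIN_CODE_STEPS", "1"))

def gate_submit(code_steps_taken: int) -> dict | None:
    """Return a rejection result if the gate fails, else None to proceed to grading."""
    if code_steps_taken >= MIN_CODE_STEPS:  # also passes when the gate is disabled (0)
        return None
    feedback = ("❌ Submit rejected: you must execute at least 1 code step "
                "before submitting. Read or modify the file with code first.")
    return {"reward": 0.001, "done": False, "feedback": feedback}
```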

### Belt-and-suspenders: prompt also tells the model

inference.py's _BASE_RULES now includes:

  1. You MUST execute at least one code step before submitting. The environment will reject SUBMIT_ANSWER and SUBMIT_FILE on step 1 β€” you need to read or modify the file with code first. Submitting the source file unchanged is never a correct solve and will be rejected.

Defense in depth: the prompt prevents wasted retries on models that follow instructions; the env layer enforces the rule on models that don't.

### Smoke test results

```text
Reset:          code_steps_taken = 0, min_required = 1

Step 1: submit_file (early)       → reward=0.001, done=False   ✓ rejected
Step 2: code (any code)           → counter increments to 1    ✓
Step 3: submit_file (after code)  → reward=normal, done=True   ✓ allowed
Step 1: submit (QA, early)        → reward=0.001, done=False   ✓ same gate
Disabled (env var=0)              → submit goes through        ✓
```

### Stack of defenses against the "submit unchanged" exploit class

This is now the third independent defense, all targeting the same exploit class:

| Layer | Phase | What it does |
|---|---|---|
| Env action gate | 9 (this one) | Refuse the submit action itself if no code step has been taken |
| Grader byte-equality | 7 | If submit happens AND output is byte-identical to source → 0.001 |
| SFT corpus filter | 8 | Drop trajectories with n_steps==1 and submit_file even at high score |

Layer 9 prevents the trajectory from existing in the first place. Layer 7 catches it if Layer 9 is somehow bypassed (e.g., FINANCIAL_ENV_MIN_CODE_STEPS=0). Layer 8 prevents future grader gaps from leaking into SFT training data.

### Modified files

- server/financial_environment.py (submit gate), inference.py (_BASE_RULES rule)

### Files unchanged in Phase 9

- graders, manifest, data, deps

## Phase 9.1 — --skip-completed for cheap re-runs

After Phase 9 landed, the natural question was: "do I just run with --resume and the env will sort it out?" Answer: no — --resume alone re-runs every selected task and merges. To save API spend on already-good trajectories, added a --skip-completed flag to inference.py.

When set with --resume, drops tasks whose prior result is clean:

- error column empty
- score >= --skip-completed-threshold (default 0.05)
- steps > 1 — single-step results are the Phase-7 exploit pattern; always retried regardless of score

Re-runs only tasks that errored, scored low, or were single-step (predicate sketched below). Concretely, for the existing MiniMax baseline run: 13 skipped (clean), 9 retried (low score). For a Kimi train-split run with 1-step submit_file exploits, those all fall into the "steps ≤ 1" bucket and get correctly re-tried under the new Phase-9 env gate.
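A minimal sketch of the skip predicate (column names from the Phase-6 summary.csv):

```python
def is_clean(prior: dict, threshold: float = 0.05) -> bool:
    """True if a prior result can be skipped under --skip-completed."""
    return (not prior.get("error")
            and float(prior["score"]) >= threshold
            and int(prior["steps"]) > 1)  # single-step runs are always retried
```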

Usage:

```bash
python3 inference.py \
  --split train \
  --resume --skip-completed \
  --output-dir runs/teacher_kimi_k25_train \
  --model moonshotai/Kimi-K2.5 ...
```

If everything's already clean, the script prints "Nothing to do" and exits without spending a cent.


## Phase 10 — SFT training script

Goal: warm-start Qwen2.5-Coder-3B-Instruct on the SFT corpus built in Phase 8, before GRPO. Per the $45 budget plan (1× A100 80GB on HF Jobs @ $2.50/hr), SFT runs ~6h ≈ $15, leaving ~$30 for GRPO + eval.

### New file

- train_sft.py — TRL SFTTrainer driver. Loads the messages-format JSONL, applies the model's chat template, masks loss on user/system tokens (assistant-only loss), trains a LoRA adapter, optionally pushes to HF Hub.

### Key choices

| Decision | Why |
|---|---|
| assistant_only_loss=True | Multi-turn agent SFT — we don't want to train on env-generated user feedback, only on assistant turns (the things the model produces) |
| LoRA r=32, alpha=64, all-linear targets | Sweet spot for 3B+ models; full-FT memory cost is unjustified for a $45 budget |
| bf16 + gradient checkpointing + 8K seq len | Fits a 3B model + 32-rank LoRA + 8K context comfortably on A100 80GB; can be dropped to 4K + r=16 for L40S 48GB |
| packing=False | Multi-turn examples are too varied to pack cleanly; each episode is its own sample |
| CLI: --push-to-hub | Optional push so the GRPO step can pull the SFT adapter from Hub instead of local disk |
| CLI: --use-qlora | 4-bit quantization fallback for tighter VRAM (e.g. consumer GPU dev) |
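A condensed sketch of the trainer wiring (assumes a recent TRL where SFTConfig exposes assistant_only_loss; argument names shift between TRL versions, e.g. max_length vs. max_seq_length):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="data/sft_kimi_k25.jsonl", split="train")

config = SFTConfig(
    output_dir="/tmp/qwen3b-sft",
    assistant_only_loss=True,   # mask loss on system/user turns
    max_length=8192,            # 8K context for multi-turn episodes
    bf16=True,
    gradient_checkpointing=True,
    packing=False,              # one episode per sample
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=32, lora_alpha=64, target_modules="all-linear"),
)
trainer.train()
```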

### Command (HF Jobs)

```bash
hf jobs run \
  --hardware "Nvidia A100 - large" \
  --timeout 8h \
  --image "huggingface/transformers-pytorch-gpu:latest" \
  --secrets HF_TOKEN \
  -- \
  bash -c "pip install -U 'trl>=0.11' peft accelerate bitsandbytes && \
           python train_sft.py \
             --dataset data/sft_kimi_k25.jsonl \
             --output-dir /tmp/qwen3b-sft \
             --push-to-hub bpHigh/qwen3b-office-sft"
```

### Local smoke test

The argparse layer imports cleanly without a GPU. The full training requires a GPU + the trl/peft/accelerate stack — not run locally as part of CI; the real validation is the HF Jobs run.

### Modified files

- None (new file only)

### Files unchanged in Phase 10

- env server, graders, manifest, data, deps

## Current state (post-Phase 10)

### Repo layout

```text
openenv_financial_task_env/
├── data/
│   ├── manifest.jsonl                # 109 rows: 50 Finch + 21 OSWorld + 38 PPTArena
│   ├── 0/, 21/, 24/, …               # original 10 hand-curated task dirs (xlsx)
│   ├── finch_50/<orig_id>/{src,ref}.xlsx
│   ├── osworld_writer/<uuid>/<src + N gold files>.docx
│   └── pptarena/<slug>/{<slug>_src,<slug>_ref}.pptx
├── data_pipeline/
│   ├── finch_pull.py                 # Phase 1
│   ├── osworld_writer_pull.py        # Phase 3
│   └── pptarena_pull.py              # Phase 5
├── graders/
│   ├── __init__.py                   # grade_xlsx + grade_docx + grade_pptx + dispatcher
│   └── docx_metrics.py               # 16 OSWorld evaluator functions
├── rewards.py                        # Phase 2; updated in Phase 4
├── server/financial_environment.py   # gold stash + per-task eval signal wired in
├── tasks.py                          # manifest loader; absolute-path resolution
├── models.py                         # unchanged
├── client.py                         # unchanged
├── inference.py                      # rewritten in Phase 6; gate/flag tweaks in Phases 9, 9.1
├── pyproject.toml                    # +python-docx, +python-pptx, +rapidfuzz, +Pillow
├── Dockerfile                        # +python-docx, +python-pptx, +rapidfuzz, +Pillow
├── openenv.yaml                      # unchanged from Round 1
└── edits.md                          # this file
```

### Task inventory

| Family | Source | Train | Eval | Total |
|---|---|---|---|---|
| xlsx | hand-curated | 10 | 0 | 10 |
| xlsx | Finch | 40 | 10 | 50 |
| docx | OSWorld writer | 17 | 4 | 21 |
| pptx | PPTArena | 30 | 8 | 38 |
| total | | 97 | 22 | 119 |

### Reward signal stack

| Layer | Purpose | Mode |
|---|---|---|
| Per-step RewardTracker | Dense process reward (6 components) | Always on |
| progress | Structural distance to gold ↓ | On for training, off for eval (FINANCIAL_ENV_PROGRESS=0) |
| eval_check | Per-task evaluator score ↑ | Auto-enabled when task has an evaluator block (currently docx only) |
| Final grade — xlsx | 30% sheet-name + 70% cell-level diff | Submit-only |
| Final grade — docx | Validity gate + 40% diff + 60% per-task evaluator | Submit-only |
| Final grade — pptx | Validity gate + 20% slide-count + 80% avg(40% text + 20% style + 20% position + 20% size) | Submit-only |

### Defenses against reward hacking

| Vector | Status | Details |
|---|---|---|
| Persistent globals | ✅ | Each step is a fresh subprocess.run |
| Time runaway | ✅ | 30s subprocess timeout |
| Memory runaway | ⚠️ | No ulimit yet (TODO) |
| Glob the gold via data/ | ✅ | Gold moved out of data/ for the episode |
| Read manifest.jsonl to find gold path | ⚠️ | Still reachable; would need full sandbox isolation (TODO) |
| Generic-distance gaming | ✅ | eval_check rewards spec-aligned progress |
| Submit-source-unchanged (Phase 7) | ✅ | Byte-equality check at grade time → 0.001 |
| 1-step-submit-file in SFT corpus (Phase 8) | ✅ | Builder drops these even at high score |
| Early submit before any code step (Phase 9) | ✅ | Env refuses the action itself; episode stays open for recovery |
| lib_engagement regex gaming | 🟡 | Trivial cap (0.010); AST-based check would harden (TODO) |
| mutation spam | 🟡 | Capped per-step but could spam-save garbage; could couple to progress (TODO) |

## Open issues / next steps

1. Layout-aware pptx diff — DONE in Phase 5 follow-up. Position + size matching with tolerance now active. 5 of 8 eval tasks meaningfully degrade source-vs-gold; 3 styling-only tasks still don't (see #2).

2. Style-aware pptx diff — DONE (Phase 5 follow-up #2). 7-attribute style match (fill/line color, first-run font name/size/bold/italic/color). Unblocked the pure-styling task case_26 (discrimination 0 → 0.028).

3. Edit-zone masking for pptx — current diff averages over all shapes, so small targeted edits get diluted. Mask the score to shapes whose attributes differ between source and gold. Changes semantics: "did the agent fix the parts that were supposed to change" instead of "how close to gold overall". ~30 lines. Priority: medium — biggest improvement on tasks where the edit surface is <5% of the deck.

4. PPTX VLM judge (optional, behind PPTX_VLM_JUDGE=1): render slides via headless LibreOffice → PNG, send (instruction, before, after, ref) to a VLM. Matches PPTArena's published methodology. Expensive — defer to final eval-time only, not the training inner loop.

5. TSBench — skipped this round because it ships only ideal_description text (no gold files). Could add later as an LLM-judge family. Would need a separate grader; structurally similar to a per-task evaluator that calls Claude/GPT-4o with (diff_summary, ideal_description).

  6. Memory cgroup on agent subprocess: prevent OOM-bomb step from killing the env server.

  7. AST-based library check in rewards.py: replace regex with real call detection so import openpyxl # decoy doesn't earn the bonus.

8. Couple mutation reward to progress: only credit mutation if progress > 0 in the same step OR the last N steps — kills the spam-save strategy while preserving exploration credit.

9. Manifest hiding for full sandbox isolation: at server startup, also move/redact data/manifest.jsonl so a determined agent can't read it from the subprocess. Better: deploy with the data tree mounted at a path the agent's cwd subtree can't reach (bwrap, or docker bind-mounts to e.g. /var/lib/openenv_data).

10. Test more docx evaluator types end-to-end. Currently smoke-tested compare_docx_files (single + compound and) and compare_docx_tables. Should sweep all 16 evaluators with synthetic agent outputs.

11. osworld_0a0faba3 quirk — gold doesn't self-pass its check_tabstops constraint due to a 2-words-before-tab paragraph. Either move to eval-only or relax the constraint.

12. Inference baseline — re-run the Round-1 inference script across all 119 tasks (or a stratified subset) to refresh the README scoreboard.

13. README rewrite — current README is Round-1. Needs the cross-format pitch (xlsx + docx + pptx), the multi-layer grader story, the gaming-resistance angle.

14. Training script — TRL/Unsloth GRPO with LoRA on Qwen2.5-Coder-3B, trajectory collection from a teacher (Claude Haiku 4.5), + SFT warm-start. Per the earlier $100-budget plan.


## Phase 13 — GRPO rollout fix: custom rollout_func for markdown JSON tool calls

**Symptom:** the first GRPO run, started with `environment_factory=OfficeDocumentEnv`, showed reward stuck at 0.0 across every step in Trackio. A completion sample captured mid-run confirmed the model was emitting:

```json
{"name": "run_python_code", "arguments": {"code": "..."}}
```

…but TRL's `environment_factory` path runs `add_response_schema(tokenizer)` →
`qwen3_schema`, whose regex only matches `<tool_call>...</tool_call>` XML.
The parser found 0 tool calls per completion, the env never received a
step, reward stayed 0, advantage was 0, and gradient flow through the
model was effectively zero.  ~5 min of A100 time burned learning nothing.

**Root cause:** the SFT'd model (`bpHigh/qwen3b-office-sft-kimi`) was
trained on 53 Kimi-K2.5 trajectories where the assistant emits markdown
JSON blocks.  The SFT overwrote Qwen2.5-Coder's native `<tool_call>` XML
behavior.  TRL's tool-call parser is hardcoded to one of five known
schemas (glm4, gptoss, llama3, qwen3, qwen3_5) — none of which match
markdown blocks.

### Fix: bypass the parser by writing our own rollout

Switched `train_grpo.py` from `environment_factory=OfficeDocumentEnv` to
`rollout_func=rollout_func`.  TRL's two rollout paths:

| Mode | Who drives the loop | Tool-call format | Used here? |
|---|---|---|---|
| `environment_factory` | TRL's internal parser | `<tool_call>...</tool_call>` XML only | ❌ broken for our SFT model |
| `rollout_func` | User callback | Anything you want — you parse it | ✅ |

### New `rollout_func(prompts, trainer)` — ~150 LOC in [`train_grpo.py`](train_grpo.py)

For each `prompt × num_generations`:

1. Spawn an `OfficeDocumentEnv` and reset it with the task's `task_id`
   (recovered from a `<task_id:...>` marker we now embed in the user
   prompt — TRL doesn't pass dataset columns to `rollout_func`).
2. Apply the chat template to the initial `[system, user]` messages,
   tokenize → `prompt_ids`.
3. Loop up to 12 turns:
   a. Batch-call `trainer.vllm_generation.generate()` for every alive
      rollout in parallel (one generation per rollout per turn).
   b. Decode each completion → text.
   c. Parse via `parse_tool_call(text)` (sketched after this list):
      - First try ```` ```json {"name": ..., "arguments": ...} ``` ````
        (primary SFT format).
      - Fall back to ```` ```python ... ``` ```` → `run_python_code`.
      - Fall back to Kimi K2.5 `<|tool_call_begin|>` markers.
   d. Dispatch to `env.run_python_code` / `env.submit_file` /
      `env.submit_text_answer`.
   e. Tokenize the env feedback as a user-message wire format
      (chat-template diff: `tok.apply_chat_template(after) − before`),
      append to `completion_ids` with `logprob=0` and `env_mask=0`.
4. After the loop, return per-rollout:
   - `prompt_ids`, `completion_ids`, `logprobs`, `env_mask`
   - `env_reward_value` (extra field) — TRL forwards this as a kwarg
     to the reward function
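A hedged sketch of the three-format parser from step 3c (regexes assumed; the Kimi marker payload shape in particular is a guess):

```python
import json
import re

def parse_tool_call(text: str) -> dict | None:
    """Return {"name": ..., "arguments": {...}} or None if nothing parses."""
    # 1. Primary SFT format: a ```json block with name/arguments keys
    m = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            call = json.loads(m.group(1))
            if "name" in call:
                return call
        except json.JSONDecodeError:
            pass
    # 2. Bare python block -> treat as run_python_code
    m = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    if m:
        return {"name": "run_python_code", "arguments": {"code": m.group(1)}}
    # 3. Kimi K2.5 tool-call markers (payload assumed to be JSON)
    m = re.search(r"<\|tool_call_begin\|>(.*?)<\|tool_call_end\|>", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            return None
    return None
```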

### Reward function update

Old: `def env_reward(environments, **kwargs)` — read from TRL-managed env
instances.

New: `def env_reward(prompts=None, completions=None, env_reward_value=None, **kwargs)`
— read directly from the extra field returned by `rollout_func`.

### Why `env_mask` matters

The `env_mask` field tells TRL "these tokens are NOT model-emitted, don't
flow loss through them."  Without it, GRPO would compute loss on env
feedback tokens too, which is meaningless (the model didn't pick those
tokens — the env did).

### Modified files

- [`train_grpo.py`](train_grpo.py):
  - SYSTEM_PROMPT rewritten to instruct the model in its native markdown
    JSON format (not XML).
  - User prompt now prefixes `<task_id:NAME>\n\n` so `rollout_func` can
    recover task identity.
  - Added `parse_tool_call(text) -> dict | None` — three-format parser.
  - Added `rollout_func(prompts, trainer) -> dict` — the new rollout.
  - Removed `tokenizer.response_schema = qwen3_schema` (no longer
    needed — we don't go through TRL's parser).
  - Removed `max_tool_calling_iterations` from `GRPOConfig` (we cap
    turns ourselves at 12).
  - GRPOTrainer constructor: `environment_factory=...` → `rollout_func=...`.

### Files unchanged in Phase 13

- [`server/financial_environment.py`](server/financial_environment.py)
- [`server/app.py`](server/app.py)
- [`client.py`](client.py)
- All SFT artifacts and dashboard code

The env-side concurrent-session work from the prior commits
(`SUPPORTS_CONCURRENT_SESSIONS=True`, `max_concurrent_envs=16`,
`FINANCIAL_ENV_GOLD_STASH=copy`) is still required — `rollout_func`
opens batch_size × num_generations env sessions in parallel within each
gradient step.

### Risks / things to watch

1. **Token alignment fragility**: tokenizing the env-feedback "wire
   format" via a chat-template diff assumes the template doesn't insert
   anything weird mid-conversation.  If Qwen2.5-Coder's template ever
   changes, the diff approach could mis-attribute boundary tokens.
   Mitigation: print sample completions from the first training step
   and verify env_mask boundaries by hand.

2. **Concurrency on the env Space**: with `num_generations=2` and
   `gradient_accumulation_steps=8`, each gradient step opens 16 env
   sessions in parallel — exactly at the Space's `max_concurrent_envs=16`
   limit.  If we bump `num_generations` to 4, also bump
   `max_concurrent_envs` to 32.

3. **Per-turn cap of 1024 tokens**: `_ROLLOUT_MAX_TOKENS_PER_TURN` was
   chosen for safety, but if the model wants to emit a long python block
   it gets truncated.  Tune up if we see long-code tasks failing.

### Trackio run hygiene

The first (failed) GRPO run logged `office-doc-grpo` to
`bpHigh/trackio-office-grpo`.  Renamed/archived rather than deleted —
it's evidence of the parser-format mismatch.  The post-fix run logs to
the same project name; the failed run is suffixed `-attempt1` for
provenance.

---

## Re-deploy checklist

If a fresh contributor wants to reproduce the current state from
commit `bf77949`:

1. `pip install -e ".[dev]"` (now pulls python-docx, python-pptx, rapidfuzz, Pillow)
2. `python data_pipeline/finch_pull.py` — ~3 min, downloads ~42 MB
3. `python data_pipeline/osworld_writer_pull.py` — ~30 s, downloads ~10 MB
4. Download/clone PPTArena to a local path (e.g. `~/Downloads/PPTArena-main`),
   then `python data_pipeline/pptarena_pull.py --root ~/Downloads/PPTArena-main`
   — copies ~244 MB
5. Check `data/manifest.jsonl` has 109 lines (50 + 21 + 38)
6. `python -c "from tasks import TASKS; print(len(TASKS))"` should print 119
7. Smoke test: `python -c "from server.financial_environment import FinancialEnvironment; e = FinancialEnvironment(); o = e.reset(task_id='finch_10'); print(o.task_id)"`
8. Docker build: `docker build -t financial-task-env:latest .` — should complete cleanly with the new deps

For training (RL):
- Set `FINANCIAL_ENV_PROGRESS=1` (default) for dense gradient
- Ensure each rollout worker uses its own `FinancialEnvironment` instance — gold-stash is single-tenant per task