# Edits log: Round-2 environment extension
This file tracks every change made on top of the Round-1 submission, in order. Useful both as a journal and as a re-deploy checklist.
Round-1 baseline: commit `bf77949` ("Update readme") on main. Single
family (xlsx), 10 hand-curated Finch tasks, a monolithic `graders.py`,
heuristic step-rewards.

Round-2 target: a unified office-document RL environment (xlsx + docx + pptx), real enterprise artifacts, gaming-resistant multi-layer grading, manifest-driven.
## State at Round-1 (baseline)
| Area | What was there |
|---|---|
| Task families | xlsx only |
| Number of tasks | 10 hand-curated, all from Finch |
| Task definitions | Hardcoded `TASKS = {...}` dict in `tasks.py` |
| Source data | `data/<orig_id>/{src,ref}_0.xlsx`, 10 dirs |
| Grading | One `graders.py` module, two functions: `grade_qa` (text) and `grade_xlsx` (cell-diff) |
| Step rewards | `_compute_code_reward` in `server/financial_environment.py`: heuristics on the code string (regex `save(`, count of substantive lines, length of stdout). Cap 0.10/step. |
| Sandboxing | None; the agent's subprocess has full filesystem access |
| Reward components | 4 signals, all heuristic, partly gameable |
| Train/eval split | None |
| Deps | openpyxl only |
## Known weaknesses identified before changes
- `save(` string match misses `prs.save()`, `Document.save()`; wouldn't generalize past xlsx.
- No measurement of whether the file actually changed, just whether the code mentioned save.
- No "moving toward gold" signal.
- Hardcoded task table; can't scale past ~30 tasks without bloat.
- Gold files reachable from the sandbox via `glob("data/**")`: reward hacking.
## Phase 1: Manifest loader + 50 stratified Finch tasks
Goal: scale beyond 10 hand-curated tasks; introduce a manifest the env loads at startup so future task families (docx, pptx) plug in cleanly.
### New files

- `data_pipeline/finch_pull.py`: stratified puller for the `FinWorkBench/Finch` HF dataset (172 tasks). Picks 50 xlsx-only MODIFY tasks across 7 tag buckets:

  | Tag | Picked of total |
  |---|---|
  | Calculation | 16 of 119 |
  | Structuring / Formatting | 11 of 86 |
  | Data Entry / Import | 6 of 44 |
  | Validation / Review | 5 of 37 |
  | Cross-sheet/file Retrieval | 5 of 36 |
  | Summary / Visualization | 4 of 33 |
  | Financial Modeling | 3 of 15 |

  Web Search was dropped: all such tasks have non-xlsx sources. Its slots were reallocated to Calculation + Structuring.
- `data/manifest.jsonl`: 50 rows, schema:

  ```json
  {"id": "finch_10", "family": "xlsx", "origin": "finch", "orig_id": "10",
   "split": "eval", "primary_tag": "Calculation",
   "all_tags": ["Calculation", "Financial Modeling"],
   "business_type": "Predictive Modeling",
   "instruction": "...", "constraints": "...",
   "source_file": "data/finch_50/10/10_src_0.xlsx",
   "reference_file": "data/finch_50/10/10_ref_0.xlsx",
   "task_type": "MODIFY", "max_steps": 15}
  ```
- `data/finch_50/<id>/{src,ref}.xlsx`: ~42 MB, 50 tasks × 2 files.
### Train/eval split

- 40 train / 10 eval (stratified: at least 1 holdout per tag).
- Driven by a per-tag `EVAL_HOLDOUT` budget in the puller.
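A minimal sketch of that per-tag holdout logic; the names (`EVAL_HOLDOUT`, `assign_splits`, `tasks_by_tag`) mirror the description above and are illustrative, not the actual `finch_pull.py` source:

```python
import random

# Hypothetical per-tag eval budget; values are examples, not the real ones.
EVAL_HOLDOUT = {"Calculation": 3, "Structuring / Formatting": 2}

def assign_splits(tasks_by_tag: dict[str, list[dict]], seed: int = 0) -> None:
    """Mark a per-tag holdout as eval; everything else becomes train."""
    rng = random.Random(seed)
    for tag, tasks in tasks_by_tag.items():
        holdout = min(EVAL_HOLDOUT.get(tag, 1), len(tasks))  # >=1 holdout per tag
        for task in rng.sample(tasks, holdout):
            task["split"] = "eval"
        for task in tasks:
            task.setdefault("split", "train")
```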
### Modified files

- `tasks.py`: added `_load_manifest()`, which reads `data/manifest.jsonl` and merges rows into `TASKS` (skipping any whose ID already exists, so the original 10 hand-curated tasks remain). Added `list_tasks(split=, family=)` and `split_ids()` filters.
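A minimal sketch of that merge rule (the file path and skip-on-existing-ID behavior come from the description above; the function body is illustrative, not the actual `tasks.py` source):

```python
import json
from pathlib import Path

MANIFEST = Path(__file__).parent / "data" / "manifest.jsonl"

def _load_manifest(tasks: dict) -> None:
    """Merge manifest rows into the task table without clobbering
    the hand-curated entries that are already registered."""
    if not MANIFEST.exists():
        return
    with MANIFEST.open() as fh:
        for line in fh:
            row = json.loads(line)
            if row["id"] in tasks:  # hand-curated tasks win
                continue
            tasks[row["id"]] = row
```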
### Resulting task counts
- 60 total (10 original + 50 Finch), 50 train / 10 eval.
## Phase 2: Unified RewardTracker
Goal: replace heuristic code-string scoring with real file-state signals, generalizable across xlsx/pptx/docx.
### New file

- `rewards.py`: `RewardTracker` class, one instance per episode.
### Reward components (all per-step, summed and clamped to 0.10)

| Component | Range | What it actually checks |
|---|---|---|
| `exec_health` | 0-0.020 | Subprocess return code; bonus if stdout non-empty |
| `lib_engagement` | 0-0.010 | Code matches `_LIB_PATTERNS[family]` regex (xlsx → openpyxl/load_workbook/Workbook; pptx → Presentation; docx → Document) |
| `mutation` | 0-0.030 | SHA-256 of working file changed since last step |
| `validity` | 0-0.020 | Mutated file still parses with the family's loader |
| `progress` | 0-0.040 | Structural distance to gold decreased this step (gated by `enable_progress`) |
### Per-family structural distance (in `rewards.py`)

- `_xlsx_distance`: fraction of gold cells matched (mirrors the final grader)
- `_pptx_distance`: fraction of gold (slide_idx, shape_idx) text-frames matched
- `_docx_distance`: fraction of gold paragraphs matched at the same index
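The text describes `_xlsx_distance` via the fraction of gold cells matched; read as a distance, the natural form is that fraction's complement. A sketch under that assumption, using openpyxl (iteration details are illustrative, not the actual `rewards.py` source):

```python
from openpyxl import load_workbook

def _xlsx_distance(working_path: str, gold_path: str) -> float:
    """Distance = 1 - fraction of populated gold cells matched.
    0.0 means the working file matches gold on every populated gold cell."""
    gold_wb = load_workbook(gold_path, data_only=True)
    work_wb = load_workbook(working_path, data_only=True)
    total = matched = 0
    for sheet in gold_wb.sheetnames:
        gold_ws = gold_wb[sheet]
        work_ws = work_wb[sheet] if sheet in work_wb.sheetnames else None
        for row in gold_ws.iter_rows():
            for cell in row:
                if cell.value is None:
                    continue
                total += 1
                if work_ws is not None and work_ws[cell.coordinate].value == cell.value:
                    matched += 1
    return 1.0 - (matched / total) if total else 0.0
```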
### Modified files

- `server/financial_environment.py`:
  - Replaced `_compute_code_reward` with a delegate to `RewardTracker`; it now returns `(total, breakdown_dict)` instead of just `float`
  - Per-episode tracker stood up in `reset()` after copying the source to the workdir
  - `FINANCIAL_ENV_PROGRESS=0` env var disables the progress signal (for clean eval)
  - Reward decomposition surfaced in feedback for debugging
### Smoke test results
- Read-only step: 0.030 (exec_health 0.020 + lib_engagement 0.010)
- Save+modify step: 0.080 (+ mutation 0.030 + validity 0.020)
- Failed code: 0.005 (exec_health_fail only)
- Decomposition logged in feedback, e.g.:

```
Reward: total=0.080 (exec_health=0.020, lib_engagement=0.010, mutation=0.030, validity=0.020, progress=0.000)
```
## Phase 3: DOCX family (OSWorld-Verified writer subset)
Goal: add Microsoft Word (.docx) tasks alongside xlsx, with real property-checking evaluators ported from OSWorld.
### New files

- `data_pipeline/osworld_writer_pull.py`: pulls 21 strict-docx tasks from `xlang-ai/OSWorld` (GitHub) and `xlangai/ubuntu_osworld_file_cache` (HF). Of the 23 published writer UUIDs, drops 2 (one `.odt`, one `.pdf` source), leaving 21 strict-docx.
  - Schema normalization: OSWorld evaluators come in two shapes (single-string `func` vs. compound `func: list[str]` with `conj: "or"|"and"` and parallel `expected`/`options` lists). The puller normalizes everything to `evaluator: {conj, checks: [{func, options, expected_files: [...]}, ...]}`. Multi-gold (`multi: true`) tasks have `expected_files` as a list per check.
- `graders/docx_metrics.py`: port of 16 evaluator functions from OSWorld's `desktop_env/evaluators/metrics/docs.py` (Apache-2.0). Heavy deps (`skimage`, `easyocr`) imported lazily; one function (`find_default_font`) stubbed because it operates on a LibreOffice config XML that doesn't exist in our headless sandbox.
  - Added an `infeasible` handler: passes iff the agent didn't modify the source (the agent should refuse). The `bb8ccc78` task ("Share this document with my team and let us edit it together in real-time") uses this; it's genuinely impossible from a code-execution sandbox.
  - Dispatcher: `run_evaluator(conj, checks, working_file, source_file)`; `and` = min(scores), `or` = max(scores).

| Evaluator | Tasks | Style |
|---|---|---|
| `compare_docx_files` | 7× | Content diff (with options: ignore_blanks, ignore_case, fuzzy_match, …) |
| `compare_line_spacing` | 3× | Property |
| `compare_docx_tables` | 3× | Structure |
| `check_tabstops` | 1× | Property + position-distance |
| `compare_subscript_contains` | 1× | Property |
| `has_page_numbers_in_footers` | 1× | Single-file property |
| `compare_font_names` | 1× | Single-file property |
| `is_first_line_centered` | 1× | Single-file property |
| `compare_docx_images` | 1× | Pixel-byte diff |
| `compare_unique_train_records` | 1× | Multi-file domain logic |
| `evaluate_strike_through_last_paragraph` | 1× | Property |
| `evaluate_colored_words_in_tables` | 1× | Skimage CIE delta-E |
| `infeasible` | 1× | Sentinel (file-unchanged check) |
| `check_italic_font_size_14` | 1× | Property |
| `contains_page_break` | 1× | Property |
| `find_default_font` | 1× | Stubbed (LO-config-dependent) |
### File reorganization (mid-phase)

- Renamed `graders.py` (root module) → `graders/__init__.py` (package). Forced because `graders/` (the new dir for `docx_metrics.py`) collides with `graders.py` (the old root file); Python won't accept both. Existing `from graders import grade_task` imports still work transparently.
### New 3-layer DOCX grader

```python
def grade_docx(task, output_path):
    if not _docx_validity(output_path):          # layer 1: validity gate
        return 0.001
    diff_score = _docx_diff(output_path, task["reference_file"])  # layer 2
    primary_score = run_evaluator(...)           # layer 3
    return 0.4 * diff_score + 0.6 * primary_score
```
The dispatcher (`grade_task`) routes by `task["family"]`: xlsx still uses
the cell-diff path, docx uses the new 3-layer path.
### Modified files

- `tasks.py`: the manifest loader now passes through `evaluator`, `primary_tag`, `all_tags`. Resolves the evaluator's `expected_files` to absolute paths (matters for the gold-stash dedup in Phase 4).
- `pyproject.toml` + `Dockerfile`: added `python-docx>=1.1.0`, `rapidfuzz>=3.0.0`, `Pillow>=10.0.0`.
### Resulting task counts
- 81 total (10 original + 50 Finch xlsx + 21 OSWorld docx).
- 17 docx train, 4 docx eval (stratified to cover 4 distinct evaluator funcs).
### Smoke test results

- Submit gold to compound `and`×2 task → 0.999 ✅
- Submit corrupted bytes → 0.001 (validity gate rejects) ✅
- Submit unmodified source → 0.400 (diff layer says similar, per-task says no-edit)
### OSWorld quirk noted

`osworld_0a0faba3` (the `check_tabstops` task): the gold itself doesn't satisfy `word_number_split_by_tabstop=3` for paragraph [2] (`"Make payment\t..."` has only 2 words before the tab). This is a faithful port of OSWorld's behavior, not a bug in our code. May want to relax the rule for training or move that task to eval-only.
## Phase 4: Reward-hacking defenses
Goal: plug the two biggest hacking surfaces identified in the Q2 audit.
### Defense 1: gold file moved out of the sandbox at episode start

Threat: `glob('/app/env/data/**/*Gold*.docx')` or `glob('**/*_ref_*.xlsx')`
finds the gold; the agent submits it for an instant 0.999.
Fix: at `reset()`:
- Make a per-episode COPY of the global `TASKS[id]` dict (so episode-time path mutations don't pollute the global)
- Create a tmpdir at `/tmp/oe_gold_<random>/`
- Move (rename) every gold file from `data/...` into the tmpdir with a generic name (`gold_ref<ext>`, `check_<i>_<j>_<random><ext>`)
- Track the moves in `self._gold_originals` so `close()` can restore
- Rewrite the episode-task's `reference_file` and `evaluator.checks[*].expected_files` to point at the tmpdir paths
De-dup: when the same path appears as both `reference_file` and an
evaluator `expected_files` entry (common: the puller sets `reference_file` =
the first check's first expected_file), the stasher uses a `path_map` to ensure
both new paths point to the same stashed location.
Restore: `close()` renames stashed files back to their original `data/`
locations. `reset()` calls `close()` at the start of each episode in case
the prior episode didn't end cleanly.
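A condensed sketch of the stash/restore mechanics described above; attribute and naming conventions follow the text, but the class is a stand-in, not the real `server/financial_environment.py` code:

```python
import os, tempfile, uuid
from pathlib import Path

class GoldStash:
    """Illustrative stand-in for the env's _stash_gold_files/close() pair."""

    def __init__(self):
        self._gold_originals: dict[str, str] = {}  # stashed path -> original path
        self._stash_dir: str | None = None

    def stash(self, task: dict) -> None:
        self._stash_dir = tempfile.mkdtemp(prefix="oe_gold_")
        path_map: dict[str, str] = {}  # de-dup: same source path -> same stash path

        def move(orig: str, generic: str) -> str:
            if orig in path_map:
                return path_map[orig]
            dest = os.path.join(self._stash_dir, generic)
            os.rename(orig, dest)  # move, not copy: the gold is gone from data/
            self._gold_originals[dest] = orig
            path_map[orig] = dest
            return dest

        ext = Path(task["reference_file"]).suffix
        task["reference_file"] = move(task["reference_file"], f"gold_ref{ext}")
        for i, check in enumerate(task.get("evaluator", {}).get("checks", [])):
            check["expected_files"] = [
                move(p, f"check_{i}_{j}_{uuid.uuid4().hex}{Path(p).suffix}")
                for j, p in enumerate(check.get("expected_files", []))
            ]

    def restore(self) -> None:
        """What close() does: put every gold file back where it came from."""
        for dest, orig in self._gold_originals.items():
            if os.path.exists(dest):
                os.rename(dest, orig)
        self._gold_originals.clear()
```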
### Defense 2: per-task evaluator as a 6th reward signal
Threat: the previous 5 components rewarded "moved closer to gold via generic structural distance", which an agent could optimize without satisfying the actual property check the task is testing.
Fix: new `eval_check` component (0-0.020). Computes the per-task
evaluator at episode start, then on each mutating step. Rewards
increases in spec-aligned score.

```python
# rewards.py
if self.task_evaluator is not None and file_valid:
    cur_eval = self._safe_task_eval()
    if self._prev_eval is not None and cur_eval > self._prev_eval:
        delta = cur_eval - self._prev_eval
        sig.eval_check = min(EVAL_CHECK_MAX, EVAL_CHECK_MAX * delta)
    self._prev_eval = cur_eval
```

For docx, the env passes `task_evaluator = run_evaluator(conj, checks, ...)`
into the tracker. xlsx/pptx pass `None`.
### Modified files

- `rewards.py`:
  - Added `task_evaluator` param to `RewardTracker.__init__`
  - Added `eval_check` field to `StepSignals` + recomputed `total`
  - Added `EVAL_CHECK_MAX = 0.020` constant
  - Added `_safe_task_eval()` helper
- `server/financial_environment.py`:
  - `__init__`: added `_gold_stash_dir`, `_gold_originals` fields
  - `reset()`: copies the task dict, creates the stash dir, calls `_stash_gold_files`, builds the `task_evaluator` callable for docx, passes it to `RewardTracker`
  - New methods: `_stash_gold_files(task, stash_dir)`, `_make_task_evaluator()`
  - `close()`: restores moved gold files to `data/`, removes the stash dir
### Smoke test results

| Scenario | Score | Expected | Result |
|---|---|---|---|
| Compound `and`×2 docx, submit stashed gold | 0.999 | ~0.999 | ✅ |
| Single-check docx, submit stashed gold | 0.999 | ~0.999 | ✅ |
| Submit corrupted bytes | 0.001 | 0.001 | ✅ (validity gate) |
| Submit source (unmodified) | 0.400 | partial | ✅ (diff 1.0, per-task 0) |
| xlsx (no per-task evaluator), submit gold | 0.999 | ~0.999 | ✅ |
| Code step copies stashed gold to working file | total=0.090 with eval_check=0.020 | should fire | ✅ |
| Original gold file present in data/ during episode | False on disk | False | ✅ (moved out) |
| Original restored after close() | True on disk | True | ✅ |
## Phase 5: PPTX family (PPTArena ingest)

Goal: add Microsoft PowerPoint (.pptx) tasks. PPTArena was chosen over TSBench
because PPTArena ships actual gold `.pptx` files; TSBench only has
`ideal_description` text and would need an LLM judge.
### Source

A local checkout of PPTArena unpacked at `~/Downloads/PPTArena-main`. The repo's
`src/evaluation_pairs_refined.json` has 100 well-curated task pairs:

```json
{
  "name": "Case 31: Fix Text Overflow",
  "prompt": "...",
  "style_target": "<detailed expected output spec>",
  "original": "Original/<file>.pptx",
  "ground_truth": "GroundTruth/<file>.pptx",
  "category": ["Content", "Layout"],
  "edit_type": "Text & Typography"
}
```
Distribution across the 100:
| edit_type | count |
|---|---|
| Text & Typography | 29 |
| Charts | 10 |
| Images & Pictures | 10 |
| Theme & Background | 9 |
| Alignment, Distribution & Z-order | 8 |
| Slide/Section Management & Footers | 8 |
| Tables | 8 |
| Shapes & Drawing | 4 |
| SmartArt & Diagrams | 4 |
| Slide Layout & Placeholders | 3 |
| Accessibility & Semantics | 2 |
| Long-tail singletons (Transitions, Hyperlinks, Master, Audio/Video, Animations) | 1 each |
### New file

- `data_pipeline/pptarena_pull.py`: reads `evaluation_pairs_refined.json`, picks 38 tasks stratified by `edit_type`. Sub-budget below; the sum is 38 (close to the 40 target; the gap comes from the long-tail edit_types having only 1 sample each).

| edit_type | picked of total |
|---|---|
| Text & Typography | 6 of 29 |
| Charts | 4 of 10 |
| Images & Pictures | 4 of 10 |
| Theme & Background | 3 of 9 |
| Alignment, Distribution & Z-order | 3 of 8 |
| Slide/Section Management & Footers | 3 of 8 |
| Tables | 3 of 8 |
| Shapes & Drawing | 2 of 4 |
| SmartArt & Diagrams | 2 of 4 |
| Slide Layout & Placeholders | 2 of 3 |
| Accessibility & Semantics | 1 of 2 |
| Long-tail singletons | 5 × 1 of 5 |

Long-tail singletons all go to train (only 1 sample each, so nothing can be held out). Eval holdout = 8: 2 from Text & Typography, 1 each from {Charts, Images, Theme, Alignment, Slide Mgmt, Tables}.
The agent-facing instruction is `prompt + "\n\nDetails:\n" + style_target`; `style_target` carries the explicit spec PPTArena uses internally for evaluation, exposed to the agent as a "hidden but visible" constraint.
### Data layout

```
data/pptarena/<slug>/
    <slug>_src.pptx   # copied from PPTArena-main/Original/
    <slug>_ref.pptx   # copied from PPTArena-main/GroundTruth/
```

Total disk: ~244 MB for 38 tasks (pptx files are larger than docx/xlsx; they contain embedded images and themes).
### Grader: `grade_pptx` (2-layer, no per-task evaluator)

```python
def grade_pptx(task, output_path):
    if not _pptx_validity(output_path):  # layer 1
        return 0.001
    # layer 2: structural diff
    # slide-count match (30%) + per-shape text-equality (70%, fuzzy 90%+ allowed)
    ...
```

The per-task evaluator is intentionally not wired. PPTArena's published
evaluator is a VLM-as-judge pipeline (instruction-following + visual quality),
which is expensive and non-deterministic. Skipping for v1; wiring it behind an
optional `RENDER_FOR_VLM=1` flag is in the Open Issues list.
### Modified files

- `graders/__init__.py`: added `_pptx_validity`, `_pptx_load_shape_text`, `grade_pptx`. The dispatcher now routes pptx → `grade_pptx`.
- `pyproject.toml` + `Dockerfile`: added `python-pptx>=1.0.0`.
### Resulting task counts (cumulative)
| Family | Origin | Train | Eval | Total |
|---|---|---|---|---|
| xlsx | hand-curated | 10 | 0 | 10 |
| xlsx | Finch | 40 | 10 | 50 |
| docx | OSWorld | 17 | 4 | 21 |
| pptx | PPTArena | 30 | 8 | 38 |
| total | | 97 | 22 | 119 |
Smoke test results
| Scenario | Score | Expected | Result |
|---|---|---|---|
| Submit stashed gold (eval task) | 0.999 | ~0.999 | β |
| Submit corrupted .pptx bytes | 0.001 | 0.001 | β (validity gate) |
| Code step that mutates + saves (add blank slide) | total=0.080 | β₯0.06 | β (exec=0.020, lib=0.010, mutation=0.030, validity=0.020) |
Gold-stash works for pptx (file moves out of data/) |
True | True | β |
close() restores gold to data/ |
True | True | β |
### Known limitation: text-only diff is weak for layout tasks

For an Alignment / Layout task (e.g. Case 60: Fix Text Placement), source and ground-truth have near-identical text content; only the shape positions differ. Our diff layer scores 0.999 on the unmodified source for this case, which is not what we want. Two paths to fix:

1. Extend `grade_pptx` with a position+size diff (cheap; ~30 lines): for each (slide_idx, shape_idx) pair, compare `(left, top, width, height)` within tolerance. Recompose the score as `0.2 * slide_count + 0.8 * avg(0.5 * text_match + 0.25 * position_match + 0.25 * size_match)`.
2. Wire a VLM judge behind a `PPTX_VLM_JUDGE=1` env var: render slides via headless LibreOffice → PNG, send (instruction, before, after, ref) to a VLM. Matches PPTArena's published methodology but is expensive.
Recommended: (1) before any RL training; (2) for the final eval scoreboard.
## Phase 5 follow-up: layout-aware diff (delivered)

Implemented option (1) above. The grader now loads every shape's
`(left, top, width, height)` (in EMU) and computes a per-shape composite
score:

- Text (50%): exact match → 1.0; rapidfuzz partial credit otherwise.
- Position (25%): `_coord_match(left, denom=slide_w)` averaged with the same for `top`. Tolerance: delta ≤ 2% of the slide dimension → 1.0; delta ≥ 20% → 0.0; linear in between. Both sides being `None` (a placeholder inheriting from its layout) is treated as a match.
- Size (25%): the same `_coord_match` for width/height.

Final score reweighted: `0.2 * slide_count + 0.8 * avg(per-shape composite)`.
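A sketch of that tolerance ramp; the `_coord_match` name and the 2%/20% constants come from the description above, the body is illustrative:

```python
def _coord_match(a: int | None, b: int | None, denom: int) -> float:
    """Compare two EMU coordinates relative to a slide dimension.
    <=2% off -> 1.0, >=20% off -> 0.0, linear in between."""
    if a is None and b is None:  # both inherit from the layout: a match
        return 1.0
    if a is None or b is None:
        return 0.0
    delta = abs(a - b) / denom
    if delta <= 0.02:
        return 1.0
    if delta >= 0.20:
        return 0.0
    return 1.0 - (delta - 0.02) / (0.20 - 0.02)  # linear ramp between the bands
```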
### Smoke results on all 8 pptx eval tasks (source-vs-gold)

| Task | Before fix | After fix | Notes |
|---|---|---|---|
| `case_36_add_speaker_notes` | 0.999 | 0.683 | Big drop: entire shapes added in gold |
| `case_32_arrange_image_and_text` | 0.999 | 0.824 | Position diff captured |
| `case_7_update_quarter_two_data_b` | 0.999 | 0.948 | Chart text + size diff |
| `case_60_fix_text_placement` | 0.999 | 0.981 | Modest: positions in tolerance band |
| `case_35_structural_fix` | 0.999 | 0.971 | Modest |
| `case_49_normalize_thousand_separators` | 0.999 | 0.992 | Tiny text edit, no layout change |
| `case_40_hindu_center_titles` | 0.999 | 0.997 | Title-alignment only: small px shift |
| `case_26_match_slide_colors_to_theme` | 0.999 | 0.999 | Pure color/theme: geometry unchanged |
5 of 8 eval tasks now show a meaningful drop. The remaining 3 (case_40,
case_49, case_26) still score ~0.99 because their edits are
styling-only (color, font, fill), which our geometry-only diff
doesn't see.
### Remaining gap: styling-only tasks (29 of 100 PPTArena tasks)

Styling tasks edit shape fill, line, font name/size/bold/italic/color,
or theme, none of which are captured by text + geometry. Two ways to
close the gap, both filed as new follow-ups:

a. Per-shape style diff: for each shape, compare `fill.solid().fore_color.rgb`, `line.color.rgb`, and for the first run in each text frame: `font.name`, `font.size`, `font.bold`, `font.italic`, `font.color.rgb`. Add as a 4th component in `_shape_match_score`. ~50 lines.

b. VLM judge (option 2 above): catches styling for free since it compares rendered images. Defer to eval-time only because of cost.
For training, (a) is sufficient. For the final scoreboard, (b) is nicer.
## Phase 5 follow-up #2: style-aware diff (delivered)

Implemented option (a) above. A new `_shape_style()` extractor pulls 7
attributes per shape (all None-tolerant: failures during read become
`None`, which counts as a match against another `None`):

| Attribute | Weight | Source |
|---|---|---|
| `fill_rgb` | 0.30 | `shape.fill.fore_color.rgb` (solid fills only) |
| `font_rgb` | 0.20 | first-run `font.color.rgb` |
| `font_size_pt` | 0.15 | first-run `font.size.pt` |
| `font_name` | 0.10 | first-run `font.name` |
| `line_rgb` | 0.10 | `shape.line.color.rgb` |
| `font_bold` | 0.075 | first-run `font.bold` |
| `font_italic` | 0.075 | first-run `font.italic` |

Per-shape composite reweighted from 50% text + 25% position + 25% size to:
40% text + 20% style + 20% position + 20% size.
Why these weights? Text is still dominant because most edits affect text content. Style gets equal weight to position/size, reflecting that styling edits are common in PPTArena (~29 tasks).
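A sketch of the None-tolerant extractor using python-pptx (attribute names per the table above; the `_safe`/`_first_run` helpers are illustrative, not the actual grader source):

```python
def _first_run(shape):
    """First text run of the shape, or None if it has no text frame/runs."""
    try:
        return shape.text_frame.paragraphs[0].runs[0]
    except (AttributeError, IndexError):
        return None

def _safe(getter):
    """Any failure while reading a style attribute becomes None;
    None-vs-None counts as a match downstream."""
    try:
        return getter()
    except Exception:
        return None

def _shape_style(shape) -> dict:
    run = _first_run(shape)
    return {
        "fill_rgb": _safe(lambda: str(shape.fill.fore_color.rgb)),
        "font_rgb": _safe(lambda: str(run.font.color.rgb)) if run else None,
        "font_size_pt": _safe(lambda: run.font.size.pt) if run else None,
        "font_name": _safe(lambda: run.font.name) if run else None,
        "line_rgb": _safe(lambda: str(shape.line.color.rgb)),
        "font_bold": _safe(lambda: run.font.bold) if run else None,
        "font_italic": _safe(lambda: run.font.italic) if run else None,
    }
```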
### Smoke results across all 8 pptx eval tasks (source-vs-gold)

| Task | Phase-5 layout-only | Phase-5+style | Discrimination (gold minus source) |
|---|---|---|---|
| `case_26_match_slide_colors_to_theme` | 0.999 | 0.971 | 0.000 → 0.028, unblocked |
| `case_36_add_speaker_notes` | 0.683 | 0.715 | 0.316 → 0.284 |
| `case_32_arrange_image_and_text` | 0.824 | 0.855 | 0.175 → 0.144 |
| `case_60_fix_text_placement` | 0.981 | 0.985 | 0.018 → 0.014 |
| `case_35_structural_fix` | 0.971 | 0.975 | 0.028 → 0.024 |
| `case_7_update_quarter_two_data_b` | 0.948 | 0.951 | 0.051 → 0.048 |
| `case_40_hindu_center_titles` | 0.997 | 0.998 | tiny |
| `case_49_normalize_thousand_separators` | 0.992 | 0.994 | tiny |
Gold-vs-gold remained 0.999 on all 8 (no regression).
Trade-off observed: styling-task discrimination went from 0 → 0.028, but text/layout-heavy tasks lost a few percentage points of discrimination because the text weight dropped from 50% to 40%. Net positive but not dramatic.
### The dilution problem (now the binding limitation)

For tasks where only a few shapes out of many are edited (e.g.
`case_40_hindu_center_titles` edits 1 title shape per slide), the diff
averages across all shapes: the un-edited majority dominates and
the score barely moves between source and gold. This is structural to
an average-based diff, not a bug.
Two follow-ups to consider:
a. Edit-zone masking: score only shapes whose attributes differ
between source and gold (using `task.source_file` as the baseline).
This changes the scoring semantics: instead of "how close to gold", you measure
"did the agent fix the parts that were supposed to change". ~30 lines,
but more invasive than (b) below.

b. VLM judge: compares rendered images, so it naturally focuses on visible differences. The right long-term answer; expensive, so defer to eval-time behind a flag.
## Phase 6: Inference script v2 (manifest-aware benchmarking)

Goal: Round-1's `inference.py` was hardcoded to 5 xlsx
tasks and produced stdout-only output. Round-2 needs a script that:

- Selects tasks from the manifest (filterable by split/family/ids)
- Picks the right system prompt per family (openpyxl / python-docx / python-pptx)
- Persists results to disk so we can produce reward curves and before/after plots for the judging story
### CLI (new)

```
python inference.py [--split eval|train|all]
                    [--family xlsx|docx|pptx|all]
                    [--limit N]
                    [--task-ids id1,id2,…]
                    [--output-dir runs/<custom>]
                    [--model <name>]
                    [--api-base <url>] [--env-url <http://…>]
                    [--max-steps 15] [--task-timeout 360]
                    [--temperature 0.0] [--max-tokens 12000]
```

`--task-ids` overrides `--split`/`--family`. Selection is sorted
deterministically by (family, primary_tag, id).
### Output structure (new)

Each run writes a `runs/<timestamp>_<model_slug>/` directory:

```
results.json             # summary + per-task records
summary.csv              # flat table for plotting
trajectories/<id>.jsonl  # full step trace per task (action, reward, feedback)
log.txt                  # mirrors stdout
```

`results.json` shape:

```json
{
  "model": "...",
  "split": "eval", "family": "all",
  "n_tasks": 22, "avg_score": 0.456, "success_rate": 0.318,
  "total_elapsed_s": 1840.5,
  "by_family": {
    "xlsx": {"n": 10, "avg": 0.521},
    "docx": {"n": 4, "avg": 0.402},
    "pptx": {"n": 8, "avg": 0.388}
  },
  "results": [{ "task_id": ..., "score": ..., "step_rewards": [...], ...}]
}
```

`summary.csv` columns: `task_id, family, primary_tag, split, score, success, steps, elapsed_s, error`. This feeds straight into matplotlib/seaborn for the hero plot in the README.
### Family-aware system prompts (new)

The single prompt mentioning openpyxl is replaced by three:

| Family | Prompt mentions |
|---|---|
| xlsx | `openpyxl.load_workbook`, `wb.save(path)` |
| docx | `from docx import Document`, `doc.save(path)`, common imports for shared/enum |
| pptx | `from pptx import Presentation`, `prs.save(path)`, color/util imports |

Selection is by `obs["family"]` (env-provided, with fallback to the
manifest's `family` field).
### Other changes

- `MAX_STEPS` default raised from 10 → 15 to match the env's actual cap (was undercutting agents on hard tasks)
- `TASK_TIMEOUT` raised from 240s → 360s; pptx tasks have larger files and need more inspection time
- Task selection auto-injects the 10 hand-curated `task_1..task_10` (which live in `tasks.py`, not the manifest) so they remain runnable via `--task-ids`
- The action extractor now also recognizes `docx`/`pptx` strings as code-block hints (was openpyxl-only)
- Trajectory persistence: every (action, reward, feedback) tuple is saved per task; useful as input to an SFT warm-start in the eventual training loop
### Smoke validation

- `--help` prints clean usage
- Loads 119 tasks from manifest + injects 10 hand-curated; selection checks:
  - `--split eval` → 22 tasks (10 xlsx + 4 docx + 8 pptx) ✅
  - `--task-ids finch_10,osworld_0a0faba3,pptarena_case_60_fix_text_placement` → 3 tasks ✅
- Output writers (json/csv/jsonl) round-trip cleanly via a synthetic test

A full live benchmark (with a model API + env server) is the user's next action; it costs ~$0.50-2 in API tokens for a 22-task eval, depending on the model.
### Modified files

- `inference.py`: full rewrite (~400 lines, was ~350)

### Files unchanged in Phase 6
- All env-server code, graders, manifest, data, deps
## Phase 7: Live-discovered exploit + anti-exploit fix

Trigger: during the Kimi-K2.5 eval (Apr 25, 2026), the model submitted the unmodified source file at step 1 for two tasks and scored very high:
| Task | Edit type | Score on src-unchanged submit | Why it worked |
|---|---|---|---|
| `pptarena_case_40_hindu_center_titles` | Title alignment | 0.998 | Paragraph-level alignment wasn't in `_shape_style`; everything else (text, position, size, font attrs) was identical between source and gold |
| `pptarena_case_26_match_slide_colors_to_theme` | Theme color | 0.971 | Gold uses theme-color references (None RGB); source uses explicit RGB. The mismatch dilutes across 30 shapes for only a ~3% drop |
This is genuine reward hacking by an inference-time agent, exactly what the "hard to game" criterion in the judging guide warns about. Two fixes delivered:
### Fix 1: extended `_shape_style` (catches the per-attribute gaps)
Added two new attributes to the per-shape style extractor:
| Attribute | Source | Catches |
|---|---|---|
| `para_alignment` | `shape.text_frame.paragraphs[0].alignment` | "Center the title" / "right-align" tasks |
| `fill_theme` | `shape.fill.fore_color.theme_color` (when the fill is solid but `.rgb` raises) | "Match colors to theme" tasks where gold uses theme refs and source uses explicit RGB |
Reweighted `_STYLE_WEIGHTS` from 7 attrs to 9:

```
fill_rgb     0.22 | fill_theme  0.08 | font_rgb  0.17 | para_alignment 0.15
font_size_pt 0.12 | line_rgb    0.08 | font_name 0.08
font_bold    0.05 | font_italic 0.05
```

Status: improves shape-level discrimination, but the dilution problem still wins when only 2 of 55 shapes change (case_40 src-vs-gold went from 0.998 to 0.997, basically unchanged because of averaging). This is why we need Fix 2.
### Fix 2: byte-equality anti-exploit at grade time (the actual fix)

Added in `graders/__init__.py`'s `grade_task`:
if the agent's submitted file is byte-identical to the source AND the
task isn't OSWorld's `infeasible` sentinel, return 0.001 immediately.
```python
if src_file_exists and not is_infeasible_task:
    if same_bytes(output_path, source_file):
        return 0.001  # agent didn't actually do anything
```
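`same_bytes` could be as simple as the following sketch (stdlib `filecmp` with a content compare; the helper name comes from the snippet above, the body is illustrative):

```python
import filecmp

def same_bytes(a: str, b: str) -> bool:
    """True iff the two files have identical contents, byte-for-byte."""
    try:
        return filecmp.cmp(a, b, shallow=False)
    except OSError:
        return False
```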
This kills the entire class of "submit source unchanged" exploits across all three families, regardless of which specific attribute the diff misses. Validation:
| Test | Before fix | After fix |
|---|---|---|
| Submit unmodified source on `case_40` | 0.998 | 0.001 ✅ |
| Submit unmodified source on `case_26` | 0.971 | 0.001 ✅ |
| Submit gold on `case_40` | 0.999 | 0.999 ✅ no regression |
| Submit gold on `case_26` | 0.999 | 0.999 ✅ no regression |
| All 8 pptx eval tasks, gold-vs-gold | 0.999 | 0.999 ✅ no regression |
The OSWorld `infeasible` task (where not modifying is the correct
answer) is correctly excluded: that path uses the existing `infeasible`
evaluator function, which already does its own equality check and credits
the agent.
### Important implication for SFT corpus building

When we eventually filter trajectories for the SFT corpus, drop any
trajectory where `n_steps == 1` and the only action was `submit_file`,
even after this fix. Reasons:

- Defense in depth: if a future grader gap appears, we don't want the student model trained on "submit unchanged" wins
- A real solve takes at least one code step; a 1-step `submit_file` is structurally suspicious
This filter is documented as a TODO for the SFT collection script.
### Re-eval needed

The Kimi-K2.5 baseline numbers from `runs/baseline_kimi_k25_eval/` were
collected with the pre-fix grader. The two exploited tasks are now
correctly graded at 0.001 instead of 0.998/0.971, lowering the run's
average. Either re-run Kimi on those two tasks with `--resume`, or
recompute the average locally with no re-inference (assuming the updated
graders are already pushed; the old numbers are inflated, the new numbers
reflect what Kimi actually solved).

(Recommendation: re-run with `--resume --task-ids pptarena_case_40_hindu_center_titles,pptarena_case_26_match_slide_colors_to_theme`. Costs <$0.10.)
## Phase 8: SFT corpus builder (trajectory → messages-format JSONL)

Goal: turn teacher trajectories (collected on the train split via
`inference.py --split train`) into an SFT-ready corpus for warm-starting
a small student model (Qwen2.5-Coder-3B-Instruct) before GRPO.
### New file

- `data_pipeline/build_sft_corpus.py`: reads the `runs/<dir>/{summary.csv, trajectories/*.jsonl}` produced by `inference.py`, applies six filters, and emits a JSONL where each row is one accepted episode in the TRL `SFTTrainer` messages format:

```json
{"task_id": "...", "family": "xlsx", "primary_tag": "Calculation",
 "split": "train", "score": 0.94, "n_steps": 6,
 "messages": [
   {"role": "system", "content": "<SYSTEM_PROMPTS[family]>"},
   {"role": "user", "content": "<task instruction + source path + family>"},
   {"role": "assistant", "content": "```python\n…\n```"},
   {"role": "user", "content": "Code execution result (step 1/15):\n…"},
   {"role": "assistant", "content": "SUBMIT_FILE: /…"},
   ...
 ]}
```
### Filters (in order)

| # | Filter | What it drops | Why |
|---|---|---|---|
| 1 | `error` column non-empty | Failed runs (timeouts, model crashes) | No useful signal |
| 2 | `n_steps < --min-steps` (default 2) | Trivial 1-step runs | Real solves take ≥1 code step |
| 3 | 1-step `submit_file` | Trajectories where the only action is `submit_file` | Defense in depth against grader exploits: Phase 7 proved a model can submit the source unchanged and beat the diff threshold; even with the byte-equality check, future grader gaps could re-open this. A real solve takes ≥1 code step; we never want to teach the student "skip the work". Always dropped, regardless of score. |
| 4 | `final_score < --score-threshold` (default 0.4) | Low-quality solves | Don't train on partial-fail patterns |
| 5 | Malformed action types | Action types outside {code, submit, submit_file} | Schema enforcement |
| 6 | No real work | Trajectories with no successful code step (reward > 0.005) | Drops "model only made syntax errors" cases |
The `--min-steps 2` and the explicit 1-step-submit-file check are
redundant by design: both catch the same exploit class, so a future
refactor that loosens one doesn't open the door.
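A condensed sketch of the six-filter chain over the column names documented above (the `accept` predicate and the step-record field names are illustrative, not the builder's actual source):

```python
ALLOWED_ACTIONS = {"code", "submit", "submit_file"}

def accept(row: dict, steps: list[dict],
           min_steps: int = 2, score_threshold: float = 0.4) -> bool:
    """One accepted episode per True; filters applied in documented order."""
    if row.get("error"):                                    # 1: failed run
        return False
    if row["n_steps"] < min_steps:                          # 2: trivial
        return False
    if row["n_steps"] == 1 and steps[0]["action"] == "submit_file":
        return False                                        # 3: exploit pattern
    if row["score"] < score_threshold:                      # 4: low quality
        return False
    if any(s["action"] not in ALLOWED_ACTIONS for s in steps):
        return False                                        # 5: malformed
    if not any(s["action"] == "code" and s["reward"] > 0.005 for s in steps):
        return False                                        # 6: no real work
    return True
```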
### Message reconstruction details

- System prompt: imported verbatim from `inference.SYSTEM_PROMPTS[family]` so the SFT corpus matches what the model sees at deployment.
- First user message: task instruction + constraints + source-file path (extracted from the trajectory's first code action via regex, falling back to the manifest's `source_file`) + family + task type. The env's xlsx-summary section is intentionally skipped to avoid re-opening files at corpus-build time.
- Assistant turns: action content wrapped in the format the `extract_action()` parser expects:
  - code → ```` ```python\n{content}\n``` ````
  - submit → `SUBMIT_ANSWER: {content}`
  - submit_file → `SUBMIT_FILE: {content}`
- User turns: mirror `inference.py`'s per-step feedback message: `Code execution result (step {n}/{max_steps}):\n{feedback}\nSource file: {path}`
### Smoke test (against the MiniMax-M2.1 eval run)

```
Input rows : 22
Accepted   : 10
Drops:
    low_score 12
Accepted breakdown:
    docx 2
    pptx 4
    xlsx 4
Avg steps  : 10.8
Avg score  : 0.794
```

For the actual SFT corpus we'll use train-split teacher trajectories from Kimi-K2.5, not the eval baseline. With 97 train tasks at a ~30-50% retention rate, that's ~30-50 high-quality episodes: enough for a meaningful SFT warm-start before GRPO.
### Modified files
- None (new file only)
### Files unchanged in Phase 8
- env server, graders, manifest, data, deps
## Phase 9: Hard early-submit gate at the env layer

Trigger: during Phase-2 trajectory collection on the train split,
Kimi-K2.5 was still trying to submit the unmodified source file at
step 1 (e.g., `pptarena_case_91_add_qr_code`), even though the Phase-7
grader correctly scored it 0.001. Post-grading defense alone wasn't
enough: every wasted "submit at step 1" episode was lost training data
and burned API budget.
### Fix: refuse the action before grading

`server/financial_environment.py` now
tracks `_code_steps_taken` (incremented in `_handle_code` regardless of
success; even a failed code attempt counts). Both submit handlers
(`_handle_submit_file`, `_handle_submit_text`) check
`_code_steps_taken >= _min_code_steps_before_submit` (default 1) and
return early with explanatory feedback if not.
Crucially, the rejection does NOT end the episode:
- The agent gets back a feedback message: "Submit rejected: you must execute at least 1 code step before submitting..."
- The reward for the rejected step is 0.001 with `done=False`; the agent has its remaining steps (15 - n_used) to recover
This shape is exactly right for an RL agent: ending the episode would make a single bad attempt catastrophic; keeping it open turns it into a corrective signal.
The minimum is overridable via the `FINANCIAL_ENV_MIN_CODE_STEPS` env var.
Set it to 0 to disable the gate (useful only for debugging).
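A minimal sketch of the gate as described; the attribute and method names come from the text, the feedback string is paraphrased, and the class is an illustrative excerpt, not the real env code:

```python
import os

class FinancialEnvironment:  # illustrative excerpt only
    def __init__(self):
        self._code_steps_taken = 0
        self._min_code_steps_before_submit = int(
            os.environ.get("FINANCIAL_ENV_MIN_CODE_STEPS", "1"))

    def _early_submit_rejected(self):
        """Return a rejection step result if no code has run yet, else None."""
        if self._code_steps_taken >= self._min_code_steps_before_submit:
            return None
        return dict(
            reward=0.001,
            done=False,  # keep the episode open so the agent can recover
            feedback=("Submit rejected: you must execute at least "
                      f"{self._min_code_steps_before_submit} code step(s) "
                      "before submitting."),
        )
```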
### Belt-and-suspenders: the prompt also tells the model

`inference.py`'s `_BASE_RULES` now includes:

- You MUST execute at least one code step before submitting. The environment will reject SUBMIT_ANSWER and SUBMIT_FILE on step 1; you need to read or modify the file with code first. Submitting the source file unchanged is never a correct solve and will be rejected.
Defense in depth: the prompt prevents wasted retries on models that follow instructions; the env layer enforces the rule on models that don't.
### Smoke test results

```
Reset: code_steps_taken = 0, min_required = 1
Step 1: submit_file (early)      → reward=0.001, done=False ✅ rejected
Step 2: code (any code)          → counter increments to 1  ✅
Step 3: submit_file (after code) → reward=normal, done=True ✅ allowed
Step 1: submit (QA, early)       → reward=0.001, done=False ✅ same gate
Disabled (env var=0)             → submit goes through      ✅
```
### Stack of defenses against the "submit unchanged" exploit class
This is now the third independent defense, all targeting the same exploit class:
| Layer | Phase | What it does |
|---|---|---|
| Env action gate | 9 (this one) | Refuse the submit action itself if no code step has been taken |
| Grader byte-equality | 7 | If submit happens AND the output is byte-identical to the source → 0.001 |
| SFT corpus filter | 8 | Drop trajectories with n_steps==1 and submit_file even at high score |
Layer 9 prevents the trajectory from existing in the first place.
Layer 7 catches it if Layer 9 is somehow bypassed (e.g.,
`FINANCIAL_ENV_MIN_CODE_STEPS=0`).
Layer 8 prevents future grader gaps from leaking into SFT training data.
### Modified files

- `server/financial_environment.py`: added `_code_steps_taken`, `_min_code_steps_before_submit`, `_early_submit_rejected()`. Both submit handlers gated.
- `inference.py`: added rule #6 to `_BASE_RULES`.
### Files unchanged in Phase 9
- graders, manifest, data, deps
## Phase 9.1: `--skip-completed` for cheap re-runs

After Phase 9 landed, the natural question was: "do I just run with `--resume`
and the env will sort it out?" Answer: no; `--resume` alone re-runs every
selected task and merges. To save API spend on already-good trajectories,
added a `--skip-completed` flag to `inference.py`.

When set together with `--resume`, it drops tasks whose prior result is clean:

- `error` column empty
- `score >= --skip-completed-threshold` (default 0.05)
- `steps > 1` (single-step results are the Phase-7 exploit pattern; always retried regardless of score)

Only tasks that errored, scored low, or were single-step get re-run. Concretely, for the existing MiniMax baseline run: 13 skipped (clean), 9 retried (low score). For a Kimi train-split run with 1-step submit_file exploits, those all fall into the "steps ≤ 1" bucket and are correctly re-tried under the new Phase-9 env gate.
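The skip predicate, as a sketch over the summary.csv columns (flag and column names per the text; the `should_skip` helper name is illustrative):

```python
def should_skip(row: dict, threshold: float = 0.05) -> bool:
    """True iff a prior result is clean enough to skip re-running."""
    return (
        not row.get("error")                  # didn't error out
        and float(row["score"]) >= threshold  # above --skip-completed-threshold
        and int(row["steps"]) > 1             # not the 1-step exploit pattern
    )
```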
Usage:
```bash
python3 inference.py \
    --split train \
    --resume --skip-completed \
    --output-dir runs/teacher_kimi_k25_train \
    --model moonshotai/Kimi-K2.5 ...
```
If everything's already clean, the script prints "Nothing to do" and exits without spending a cent.
## Phase 10: SFT training script

Goal: warm-start Qwen2.5-Coder-3B-Instruct on the SFT corpus built
in Phase 8, before GRPO. Per the $45 budget plan (1× A100 80GB on HF Jobs
@ $2.50/hr), SFT runs ~6h ≈ $15, leaving ~$30 for GRPO + eval.
### New file

- `train_sft.py`: TRL `SFTTrainer` driver. Loads the messages-format JSONL, applies the model's chat template, masks loss on user/system tokens (assistant-only loss), trains a LoRA adapter, optionally pushes to HF Hub.
### Key choices

| Decision | Why |
|---|---|
| `assistant_only_loss=True` | Multi-turn agent SFT: we don't want to train on env-generated user feedback, only on assistant turns (the things the model produces) |
| LoRA r=32, alpha=64, all-linear targets | Sweet spot for 3B+ models; full-FT memory cost is unjustified for a $45 budget |
| bf16 + gradient checkpointing + 8K seq len | Fits a 3B model + 32-rank LoRA + 8K context comfortably on A100 80GB; can be dropped to 4K + r=16 for L40S 48GB |
| `packing=False` | Multi-turn examples are too varied to pack cleanly; each episode is its own sample |
| CLI: `--push-to-hub` | Optional push so the GRPO step can pull the SFT adapter from Hub instead of local disk |
| CLI: `--use-qlora` | 4-bit quantization fallback for tighter VRAM (e.g. consumer GPU dev) |
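A minimal sketch of the trainer wiring under those choices, assuming a recent trl where `SFTConfig` exposes `assistant_only_loss` (exact kwarg names vary across trl versions, e.g. `max_length` vs `max_seq_length`); CLI parsing, QLoRA, and the Hub push are omitted:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="data/sft_kimi_k25.jsonl", split="train")

config = SFTConfig(
    output_dir="/tmp/qwen3b-sft",
    assistant_only_loss=True,   # mask loss on system/user turns
    packing=False,              # one multi-turn episode per sample
    max_length=8192,            # 8K context on A100 80GB
    bf16=True,
    gradient_checkpointing=True,
)
peft_config = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    args=config,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```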
### Command (HF Jobs)

```bash
hf jobs run \
  --hardware "Nvidia A100 - large" \
  --timeout 8h \
  --image "huggingface/transformers-pytorch-gpu:latest" \
  --secrets HF_TOKEN \
  -- \
  bash -c "pip install -U 'trl>=0.11' peft accelerate bitsandbytes && \
           python train_sft.py \
             --dataset data/sft_kimi_k25.jsonl \
             --output-dir /tmp/qwen3b-sft \
             --push-to-hub bpHigh/qwen3b-office-sft"
```
### Local smoke test

The argparse layer imports cleanly without a GPU. The full training run requires a GPU plus the trl/peft/accelerate stack, so it is not run locally as part of CI; the real validation is the HF Jobs run.
### Modified files
- None (new file only)
### Files unchanged in Phase 10
- env server, graders, manifest, data, deps
## Current state (post-Phase 10)

### Repo layout

```
openenv_financial_task_env/
├── data/
│   ├── manifest.jsonl        # 109 rows: 50 Finch + 21 OSWorld + 38 PPTArena
│   ├── 0/, 21/, 24/, …       # original 10 hand-curated task dirs (xlsx)
│   ├── finch_50/<orig_id>/{src,ref}.xlsx
│   ├── osworld_writer/<uuid>/<src + N gold files>.docx
│   └── pptarena/<slug>/{<slug>_src,<slug>_ref}.pptx
├── data_pipeline/
│   ├── finch_pull.py             # Phase 1
│   ├── osworld_writer_pull.py    # Phase 3
│   └── pptarena_pull.py          # Phase 5
├── graders/
│   ├── __init__.py           # grade_xlsx + grade_docx + grade_pptx + dispatcher
│   └── docx_metrics.py       # 16 OSWorld evaluator functions
├── rewards.py                # Phase 2; updated in Phase 4
├── server/financial_environment.py  # gold stash + per-task eval signal wired in
├── tasks.py                  # manifest loader; absolute-path resolution
├── models.py                 # unchanged
├── client.py                 # unchanged
├── inference.py              # rewritten in Phase 6
├── pyproject.toml            # +python-docx, +python-pptx, +rapidfuzz, +Pillow
├── Dockerfile                # +python-docx, +python-pptx, +rapidfuzz, +Pillow
├── openenv.yaml              # unchanged from Round 1
└── edits.md                  # this file
```
### Task inventory
| Family | Source | Train | Eval | Total |
|---|---|---|---|---|
| xlsx | hand-curated | 10 | 0 | 10 |
| xlsx | Finch | 40 | 10 | 50 |
| docx | OSWorld writer | 17 | 4 | 21 |
| pptx | PPTArena | 30 | 8 | 38 |
| total | | 97 | 22 | 119 |
### Reward signal stack

| Layer | Purpose | Mode |
|---|---|---|
| Per-step `RewardTracker` | Dense process reward (6 components) | Always on |
| `progress` | Structural distance to gold decreased | On for training, off for eval (`FINANCIAL_ENV_PROGRESS=0`) |
| `eval_check` | Per-task evaluator score increased | Auto-enabled when the task has an evaluator block (currently docx only) |
| Final grade, xlsx | 30% sheet-name + 70% cell-level diff | Submit-only |
| Final grade, docx | Validity gate + 40% diff + 60% per-task evaluator | Submit-only |
| Final grade, pptx | Validity gate + 20% slide-count + 80% avg(40% text + 20% style + 20% position + 20% size) | Submit-only |
### Defenses against reward hacking

| Vector | Status | Details |
|---|---|---|
| Persistent globals | ✅ | Each step is a fresh `subprocess.run` |
| Time runaway | ✅ | 30s subprocess timeout |
| Memory runaway | ⚠️ | No ulimit yet (TODO) |
| Glob the gold via `data/` | ✅ | Gold moved out of `data/` for the episode |
| Read manifest.jsonl to find gold path | ⚠️ | Still reachable; would need full sandbox isolation (TODO) |
| Generic-distance gaming | ✅ | `eval_check` rewards spec-aligned progress |
| Submit-source-unchanged (Phase 7) | ✅ | Byte-equality check at grade time → 0.001 |
| 1-step-submit-file in SFT corpus (Phase 8) | ✅ | Builder drops these even at high score |
| Early submit before any code step (Phase 9) | ✅ | Env refuses the action itself; episode stays open for recovery |
| `lib_engagement` regex gaming | 🟡 | Trivial cap (0.010); an AST-based check would harden (TODO) |
| `mutation` spam | 🟡 | Capped per-step but could spam-save garbage; could couple to `progress` (TODO) |
## Open issues / next steps (not yet done)

1. Layout-aware pptx diff: DONE in the Phase 5 follow-up. Position + size matching with tolerance is now active. 5 of 8 eval tasks meaningfully degrade source-vs-gold; 3 styling-only tasks still don't (see #2).
2. Style-aware pptx diff: DONE (Phase 5 follow-up #2). 7-attribute style match (fill/line color, first-run font name/size/bold/italic/color). Unblocked the pure-styling task `case_26` (discrimination 0 → 0.028).
3. Edit-zone masking for pptx: the current diff averages over all shapes, so small targeted edits get diluted. Mask the score to shapes whose attributes differ between source and gold. Changes semantics: "did the agent fix the parts that were supposed to change" instead of "how close to gold overall". ~30 lines. Priority: medium; biggest improvement on tasks where the edit surface is <5% of the deck.
4. PPTX VLM judge (optional, behind `PPTX_VLM_JUDGE=1`): render slides via headless LibreOffice → PNG, send (instruction, before, after, ref) to a VLM. Matches PPTArena's published methodology. Expensive; defer to final eval-time only, not the training inner loop.
5. TSBench: skipped this round because it ships only `ideal_description` text (no gold files). Could add later as an LLM-judge family. Would need a separate grader; structurally similar to a per-task evaluator that calls Claude/GPT-4o with `(diff_summary, ideal_description)`.
6. Memory cgroup on the agent subprocess: prevent an OOM-bomb step from killing the env server.
7. AST-based library check in `rewards.py`: replace the regex with real call detection so `import openpyxl  # decoy` doesn't earn the bonus.
8. Couple the mutation reward to progress: only credit `mutation` if `progress > 0` in the same step OR the last N steps; kills the spam-save strategy while preserving exploration credit.
9. Manifest hiding for full sandbox isolation: at server startup, also move/redact `data/manifest.jsonl` so a determined agent can't read it from the subprocess. Better: deploy with the data tree mounted at a path the agent's cwd subtree can't reach (bwrap, or docker bind-mounts to e.g. `/var/lib/openenv_data`).
10. Test more docx evaluator types end-to-end. Currently smoke-tested: `compare_docx_files` (single + compound `and`) and `compare_docx_tables`. Should sweep all 16 evaluators with synthetic agent outputs.
11. `osworld_0a0faba3` quirk: the gold doesn't self-pass its `check_tabstops` constraint due to a 2-words-before-tab paragraph. Either move to eval-only or relax the constraint.
12. Inference baseline: re-run the Round-1 inference script across all 119 tasks (or a stratified subset) to refresh the README scoreboard.
13. README rewrite: the current README is Round-1. Needs the cross-format pitch (xlsx + docx + pptx), the multi-layer grader story, the gaming-resistance angle.
14. Training script: TRL/Unsloth GRPO with LoRA on Qwen2.5-Coder-3B, trajectory collection from a teacher (Claude Haiku 4.5), plus SFT warm-start. Per the earlier $100-budget plan.
## Phase 13: GRPO rollout fix (custom `rollout_func` for markdown JSON tool calls)

**Symptom:** The first GRPO run, started with `environment_factory=OfficeDocumentEnv`,
showed reward stuck at 0.0 across every step in Trackio. A
completion sample captured mid-run confirmed the model was emitting:

```json
{"name": "run_python_code", "arguments": {"code": "..."}}
```

…but TRL's `environment_factory` path runs `add_response_schema(tokenizer)` →
`qwen3_schema`, whose regex only matches `<tool_call>...</tool_call>` XML.
The parser found 0 tool calls per completion, the env never received a
step, reward stayed 0, advantage was 0, and gradient flow through the
model was effectively zero. ~5 min of A100 time burned learning nothing.
**Root cause:** the SFT'd model (`bpHigh/qwen3b-office-sft-kimi`) was
trained on 53 Kimi-K2.5 trajectories where the assistant emits markdown
JSON blocks. The SFT overwrote Qwen2.5-Coder's native `<tool_call>` XML
behavior. TRL's tool-call parser is hardcoded to one of five known
schemas (glm4, gptoss, llama3, qwen3, qwen3_5), none of which match
markdown blocks.
### Fix: bypass the parser by writing our own rollout
Switched `train_grpo.py` from `environment_factory=OfficeDocumentEnv` to
`rollout_func=rollout_func`. TRL's two rollout paths:
| Mode | Who drives the loop | Tool-call format | Used here? |
|---|---|---|---|
| `environment_factory` | TRL's internal parser | `<tool_call>...</tool_call>` XML only | β broken for our SFT model |
| `rollout_func` | User callback | Anything you want; you parse it | ✅ |
### New `rollout_func(prompts, trainer)`: ~150 LOC in [`train_grpo.py`](train_grpo.py)
For each `prompt Γ num_generations`:
1. Spawn an `OfficeDocumentEnv` and reset it with the task's `task_id`
(recovered from a `<task_id:...>` marker we now embed in the user
prompt; TRL doesn't pass dataset columns to `rollout_func`).
2. Apply the chat template to the initial `[system, user]` messages,
tokenize → `prompt_ids`.
3. Loop up to 12 turns:
a. Batch-call `trainer.vllm_generation.generate()` for every alive
rollout in parallel (one generation per rollout per turn).
b. Decode each completion to text.
c. Parse via `parse_tool_call(text)`, a three-format parser (sketch after this list):
- First try ```` ```json {"name": ..., "arguments": ...} ``` ````
(primary SFT format).
- Fall back to ```` ```python ... ``` ```` → `run_python_code`.
- Fall back to Kimi K2.5 `<|tool_call_begin|>` markers.
d. Dispatch to `env.run_python_code` / `env.submit_file` /
`env.submit_text_answer`.
e. Tokenize the env feedback as a user-message wire format
(chat-template diff: `tok.apply_chat_template(after)` minus `before`),
append to `completion_ids` with `logprob=0` and `env_mask=0`.
4. After loop, return per-rollout:
- `prompt_ids`, `completion_ids`, `logprobs`, `env_mask`
- `env_reward_value` (extra field); TRL forwards this as a kwarg
to the reward function
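A sketch of the three-format parser referenced in step 3c; the regexes and fallback ordering are illustrative, and the real `parse_tool_call` in `train_grpo.py` may differ in detail:

```python
import json
import re

def parse_tool_call(text: str) -> dict | None:
    """Return {"name": ..., "arguments": {...}} or None if no call found."""
    # 1. Primary SFT format: a ```json block with name/arguments.
    m = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            call = json.loads(m.group(1))
            if "name" in call:
                return {"name": call["name"],
                        "arguments": call.get("arguments", {})}
        except json.JSONDecodeError:
            pass
    # 2. Bare ```python block -> treat as run_python_code.
    m = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    if m:
        return {"name": "run_python_code", "arguments": {"code": m.group(1)}}
    # 3. Kimi K2.5 marker format (marker names per the list above).
    m = re.search(r"<\|tool_call_begin\|>(.*?)<\|tool_call_end\|>", text, re.DOTALL)
    if m:
        try:
            call = json.loads(m.group(1))
            return {"name": call.get("name"),
                    "arguments": call.get("arguments", {})}
        except json.JSONDecodeError:
            return None
    return None
```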
### Reward function update
Old: `def env_reward(environments, **kwargs)`, reading from TRL-managed env
instances.

New: `def env_reward(prompts=None, completions=None, env_reward_value=None, **kwargs)`,
reading directly from the extra field returned by `rollout_func`.
### Why `env_mask` matters
The `env_mask` field tells TRL "these tokens are NOT model-emitted, don't
flow loss through them." Without it, GRPO would compute loss on env
feedback tokens too, which is meaningless (the model didn't pick those
tokens; the env did).
### Modified files
- [`train_grpo.py`](train_grpo.py):
- SYSTEM_PROMPT rewritten to instruct the model in its native markdown
JSON format (not XML).
- User prompt now prefixes `<task_id:NAME>\n\n` so `rollout_func` can
recover task identity.
- Added `parse_tool_call(text) -> dict | None`, a three-format parser.
- Added `rollout_func(prompts, trainer) -> dict`, the new rollout.
- Removed `tokenizer.response_schema = qwen3_schema` (no longer
needed; we don't go through TRL's parser).
- Removed `max_tool_calling_iterations` from `GRPOConfig` (we cap
turns ourselves at 12).
- GRPOTrainer constructor: `environment_factory=...` → `rollout_func=...`.
### Files unchanged in Phase 13
- [`server/financial_environment.py`](server/financial_environment.py)
- [`server/app.py`](server/app.py)
- [`client.py`](client.py)
- All SFT artifacts and dashboard code
The env-side concurrent-session work from the prior commits
(`SUPPORTS_CONCURRENT_SESSIONS=True`, `max_concurrent_envs=16`,
`FINANCIAL_ENV_GOLD_STASH=copy`) is still required: `rollout_func`
opens batch_size × num_generations env sessions in parallel within each
gradient step.
### Risks / things to watch
1. **Token alignment fragility**: tokenizing the env-feedback "wire
format" via a chat-template diff assumes the template doesn't insert
anything weird mid-conversation. If Qwen2.5-Coder's template ever
changes, the diff approach could mis-attribute boundary tokens.
Mitigation: print sample completions from the first training step
and verify env_mask boundaries by hand.
2. **Concurrency on the env Space**: with `num_generations=2` and
`gradient_accumulation_steps=8`, each gradient step opens 16 env
sessions in parallel, exactly at the Space's `max_concurrent_envs=16`
limit. If we bump `num_generations` to 4, also bump
`max_concurrent_envs` to 32.
3. **Per-turn cap of 1024 tokens**: `_ROLLOUT_MAX_TOKENS_PER_TURN` was
chosen for safety, but if the model wants to emit a long python block
it gets truncated. Tune up if we see long-code tasks failing.
### Trackio run hygiene
The first (failed) GRPO run logged `office-doc-grpo` to
`bpHigh/trackio-office-grpo`. Renamed/archived rather than deleted:
it's evidence of the parser-format mismatch. The post-fix run logs to
the same project name; the failed run is suffixed `-attempt1` for
provenance.
---
## Re-deploy checklist
If a fresh contributor wants to reproduce the current state from
commit `bf77949`:
1. `pip install -e ".[dev]"` (now pulls python-docx, python-pptx, rapidfuzz, Pillow)
2. `python data_pipeline/finch_pull.py` (~3 min, downloads ~42 MB)
3. `python data_pipeline/osworld_writer_pull.py` (~30 s, downloads ~10 MB)
4. Download/clone PPTArena to a local path (e.g. `~/Downloads/PPTArena-main`),
   then `python data_pipeline/pptarena_pull.py --root ~/Downloads/PPTArena-main`
   (copies ~244 MB)
5. Check `data/manifest.jsonl` has 109 lines (50 + 21 + 38)
6. `python -c "from tasks import TASKS; print(len(TASKS))"` should print 119
7. Smoke test: `python -c "from server.financial_environment import FinancialEnvironment; e = FinancialEnvironment(); o = e.reset(task_id='finch_10'); print(o.task_id)"`
8. Docker build: `docker build -t financial-task-env:latest .` should complete cleanly with the new deps
For training (RL):
- Set `FINANCIAL_ENV_PROGRESS=1` (default) for dense gradient
- Ensure each rollout worker uses its own `FinancialEnvironment` instance; the gold-stash is single-tenant per task