# Evolving Evaluation: A Framework-Agnostic Design for Diagnostic Feedback in Code Evolution
## The Problem

Standard evolutionary code optimization loops look like this:

```python
for generation in range(N):
    code = mutate(parent_code)
    score = evaluate(code)
    if score > best_score:
        best, best_score = code, score
```
The evolution LLM receives `score=68.1` for 40 consecutive generations with no direction to improve. This is a dead signal: it tells you how well the code performs but nothing about why it fails or what to change.

For simple optimization targets (minimize X, maximize Y), the score gradient is sufficient. For complex programs (competitive programming, simulation solvers, algorithmic implementations), the bottleneck is almost never obvious from the score alone:
- Is it a timeout? On which inputs? Which function?
- Is it wrong output? On what edge case? Where does it diverge?
- Is it a suboptimal heuristic? Which decision point is the bottleneck?
The score cannot answer these questions. An agent that analyzes the code and its behavior can.
## The Core Idea

**Evolve the evaluation alongside the code.**

Introduce a second agent, the eval agent, that runs periodically alongside the evolution loop. Its job is not to evaluate correctness but to produce diagnostic feedback: specific, evidence-backed insights about why the current best code is not scoring higher, and what direction would improve it.

This feedback is injected into the evolution agent's prompt, turning a dead scalar signal into an actionable diagnosis.
```
               ┌───────────────────────────────┐
┌──────────┐   │ Evolution Loop                │
│   Eval   │◄──┤ mutate → evaluate → select    │
│   Agent  ├──►│ every K generations:          │
└──────────┘   │   inject diagnostic_report    │
               └───────────────────────────────┘
```
The eval agent produces two outputs:

- `diagnostic_report.md`: a text analysis injected into the evolution agent's next prompt. Contains: what the bottleneck is, what was measured, and what direction to try.
- `auxiliary_metrics.py`: a small script that compiles and runs the code on specific inputs to measure properties the primary evaluator does not capture. Runs automatically on every future generation, producing trend data.
## Architecture

### Minimal Interface Requirements

To integrate this into any evolution framework, you need three hooks:

**Hook 1: Feedback injection point.** The evolution agent's prompt must accept a text field that gets injected before generation. In most frameworks this is a "system message" or "context" field.
```python
# In your evolution prompt builder:
diagnostic = read_file("eval_agent_memory/diagnostic_report.md")
prompt = base_prompt + f"\n\n## Diagnostic Feedback\n{diagnostic}"
```
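Before the eval agent's first trigger, no report exists on disk. A minimal sketch of a prompt builder that tolerates that case (`build_prompt` is a hypothetical helper name; the `eval_agent_memory/` path follows the example above):

```python
import os

def build_prompt(base_prompt, memory_dir="eval_agent_memory"):
    """Inject the latest diagnostic into the evolution prompt.
    Falls back to the bare prompt before the first trigger."""
    path = os.path.join(memory_dir, "diagnostic_report.md")
    if not os.path.exists(path):
        return base_prompt  # no diagnosis written yet
    with open(path) as f:
        diagnostic = f.read()
    return base_prompt + f"\n\n## Diagnostic Feedback\n{diagnostic}"
```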
**Hook 2: Periodic trigger.** The eval agent is triggered every K generations (K=5 works well). Pass it:
- Path to the current best code
- Per-generation score history
- Previous diagnostic report (for continuity)
```python
if generation % K == 0:
    eval_agent.trigger(
        best_code_path=f"gen_{best_gen}/main.cpp",
        score_history=get_score_history(),
        results_dir=experiment_dir,
    )
```
**Hook 3: Aux metric execution.** After each evaluation, run `auxiliary_metrics.py` and merge its output into the generation's metrics.
```python
# In your evaluator wrapper:
primary_metrics = primary_evaluator(code)
aux_metrics = run_aux_metrics(f"gen_{gen}/", primary_metrics)
full_metrics = {**primary_metrics, **aux_metrics}
```
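One possible shape for `run_aux_metrics` (a sketch, not the reference implementation: the script path is a parameter here, and any failure in the agent-written script yields an empty dict, so a buggy aux metric can never crash the evolution run):

```python
import importlib.util
import os

def run_aux_metrics(gen_dir, primary_metrics,
                    script="eval_agent_memory/auxiliary_metrics.py"):
    """Load the eval agent's current auxiliary_metrics.py and run its
    evaluate_aux() on this generation's directory."""
    if not os.path.exists(script):
        return {}  # the agent has not written a script yet
    spec = importlib.util.spec_from_file_location("aux_metrics", script)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
        return module.evaluate_aux(gen_dir, primary_result=primary_metrics)
    except Exception:
        return {}  # a broken aux script must not kill the run
```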
### Aux Metrics Contract

```python
import os

def evaluate_aux(results_dir, primary_result=None):
    """
    results_dir: path to gen_N/ (code is at results_dir/main.cpp)
    primary_result: dict of primary metrics (optional)
    returns: dict of metric_name -> float
    """
    code_path = os.path.join(results_dir, 'main.cpp')
    # compile, run, measure, then return a flat dict of floats
    return {"metric_name": float_value}
```
Critical implementation note: `results_dir` must point to the generation directory containing the source file (e.g. `gen_10/`), not a subdirectory. If your framework stores results in `gen_10/results/`, pass `gen_10/` to `evaluate_aux`, not `gen_10/results/`. Getting this wrong causes silent compilation failures.
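An illustrative metric following this contract (a sketch: the `case1.out` file name and the `?`-prefixed query convention are assumptions about your layout, not part of the design):

```python
import os

def evaluate_aux(results_dir, primary_result=None):
    """Count interactive queries in a saved program transcript,
    a structural property the scalar score does not expose."""
    out_path = os.path.join(results_dir, "case1.out")
    try:
        with open(out_path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        return {}  # nothing to measure for this generation
    queries = sum(1 for line in lines if line.startswith("?"))
    return {"query_count": float(queries)}
```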
### Eval Agent Task Message

The eval agent receives a task message containing:

```
Code to diagnose:    /path/to/gen_N/main.cpp
Metrics history:     [per-generation score and aux metric values]
Previous diagnostic: [last report, if any]
Compile command:     g++ -O2 -std=gnu++17 {code_path} -o /tmp/binary
Test data:           /path/to/testdata/

Your outputs:
- diagnostic_report.md: Hypothesis / Experiment / Verdict / Direction
- auxiliary_metrics.py: automated experiment that runs every generation
```
## What Works

### The Core Insight Holds

A well-targeted diagnostic can break a 40-generation plateau in a single intervention. Evidence from experiments on 172 competitive programming problems:
- p79: Vanilla evolution never solved it (50 gens, score = 0). The diagnosis identified "bit-by-bit correlation attack fails for non-independent bits → use beam search". The agent then solved it.
- p30: The diagnosis found a "deduplication bug in candidate tracking" → bitset redesign → score 100.
- p15: The diagnosis identified "greedy optimizes connections instead of lexicographic order" → single-line priority fix → +55 points.
These wins share a pattern: the diagnosis identified a specific, actionable algorithmic insight that the evolution LLM would be unlikely to discover by random mutation.
### When It Helps
- Problems where the current approach is fundamentally wrong (wrong algorithm, wrong optimization objective)
- Problems where the bottleneck is non-obvious from the score (TLE on large inputs only, wrong answer on specific edge cases)
- Problems where the direction requires structural code changes that the LLM won't stumble into by gradient descent on the score
### Diagnostic Report Quality Matters More Than Aux Metrics

The text diagnosis is more impactful than any numeric aux metric. In an A/B comparison, pure code review captured 80-90% of the diagnostic value. Aux metrics provide evidence for the diagnosis and trend tracking, but the bottleneck is the quality of the directional advice.
## What Doesn't Work (Honest Assessment)

### Aggregate Results Are Weak

Across 9 experiments on 20-problem subsets (50 generations each), the average delta between agent and vanilla is +0.52 points, within random variance. Individual experiments range from -1.0 to +3.2.

The signal is real but small relative to noise. At 20 problems × 50 generations, random variance dominates.
### The Transmission Gap

The fundamental bottleneck: the eval agent produces a text direction, and the evolution LLM must translate it into specific code changes.

Given a direction like "Expand the search space for parameter a and relax the early exit condition", the LLM still has to figure out what values to use, which lines to change, and whether the change is correct. This is not much better than the LLM exploring randomly.
Compare this to the winning cases (p30, p79): those diagnoses were specific enough that the LLM could act on them directly ("you have a deduplication bug on line X, use a bitset").
### Aux Metrics Rarely Change Behavior

Showing the evolution LLM `runtime_ms=4500` doesn't help it write faster code. The metric needs to be paired with a specific direction ("the bottleneck is `solveFullDP`, replace it with this approach") to be useful.
## Design Principles

### 1. Diagnose the best attempt, not the latest

The latest generation may be a regression (compilation error, bad mutation). Always diagnose the highest-scoring generation since the last analysis; that represents the true algorithmic frontier.
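A sketch of best-gen targeting over a plain list of per-generation scores (`pick_diagnosis_target` is a hypothetical helper; ties resolve to the earliest generation):

```python
def pick_diagnosis_target(score_history, last_trigger_gen):
    """Index of the highest-scoring generation since the last trigger,
    i.e. the true algorithmic frontier rather than the latest mutation."""
    window = score_history[last_trigger_gen:]
    best_offset = max(range(len(window)), key=lambda i: window[i])
    return last_trigger_gen + best_offset
```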
### 2. Evidence before direction

"The sort is O(n²), causing TLE" is more useful than "try a faster sort". Back every direction with a measured number.

Good: "Query count = 8800 (1.4× optimal of 6268). Eliminate redundant queries in the staircase walk by batching adjacent cells."

Bad: "The query count might be too high."
### 3. Run code, don't read code

Static analysis of code is limited. Compile the code, run it on a specific input, measure the output. Ground truth beats inference.

Backward experiment: "run on the failing test case, measure where it breaks." Forward experiment: "run on n=10, n=100, n=1000; find the frontier."
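A forward experiment can be automated as a scaling probe. A sketch, assuming the compiled binary reads its input on stdin (the input generator for each size is yours to supply):

```python
import subprocess
import time

def scaling_probe(binary, inputs, timeout=10):
    """Run the binary on inputs of increasing size and record
    wall-clock time per size, to locate the performance frontier."""
    timings = {}
    for n, text in sorted(inputs.items()):
        start = time.perf_counter()
        subprocess.run([binary], input=text, capture_output=True,
                       text=True, timeout=timeout)
        timings[n] = time.perf_counter() - start
    return timings
```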
### 4. The Trivial Metric Test

Before writing any aux metric, ask: "Does this require running the code?"

If you can compute it from the primary evaluator's output alone (score, per-case results), it adds zero information. Only measure things the primary evaluator cannot.

- Valuable: parse program stdout, run on custom inputs, measure structural output properties.
- Trivial: repackage the score, count lines of code, restate pass/fail counts.
### 5. Direction richness over metric richness
Five precise diagnostic sentences beat twenty auxiliary metrics. Write the diagnosis first; add metrics only if they provide evidence for the diagnosis.
## Known Limitations and Open Problems

### Problem 1: Text direction is insufficient

Current state: the agent writes text. The evolution LLM reads the text and still has to figure out the code change.

Potential fix: have the agent generate code-level suggestions (specific function replacements, pseudocode for the suggested approach, or even a complete rewrite of the identified bottleneck function). This closes the gap between diagnosis and implementation.
### Problem 2: K=5 trigger frequency may be too sparse
With K=5, an agent that correctly diagnoses a problem at gen 5 has 45 remaining generations to benefit from the feedback. But if the LLM ignores or misinterprets the direction, the next correction only comes at gen 10.
Potential fix: trigger on plateau detection (no improvement in last K generations) rather than fixed interval.
### Problem 3: Single direction per trigger
Each trigger produces one diagnostic report. If the diagnosis is wrong, all K following generations are misguided.
Potential fix: generate multiple competing hypotheses; use aux metrics to select the correct one on the next trigger.
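One way to arbitrate between competing hypotheses at the next trigger (a sketch under the assumption that each hypothesis names one aux metric expected to move if it is correct; the dict shape here is hypothetical):

```python
def select_hypothesis(hypotheses, metric_history):
    """Rank hypotheses by how much their predicted aux metric actually
    moved since the last trigger. metric_history maps metric name to a
    list of per-generation values."""
    def improvement(h):
        values = metric_history.get(h["metric"], [])
        if len(values) < 2:
            return float("-inf")  # no evidence either way
        delta = values[-1] - values[0]
        return delta if h["goal"] == "max" else -delta
    return max(hypotheses, key=improvement)
```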
### Problem 4: No memory of what was tried

The agent sees score history but not which specific changes were tried and what happened. A code change that moved the score from 68 to 72 might have been partially correct; the agent should know to push further in that direction.

Potential fix: maintain a change log alongside the score history, e.g. `gen_12: replaced linear scan with binary search → +4 points`.
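A change log can be an append-only JSONL file next to the score history (a sketch; the entry fields are assumptions):

```python
import json

def log_change(log_path, gen, description, delta):
    """Append one tried-change record for the eval agent to read later."""
    entry = {"gen": gen, "change": description, "delta": delta}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def read_change_log(log_path):
    """Return all recorded changes, or [] before the first write."""
    try:
        with open(log_path) as f:
            return [json.loads(line) for line in f]
    except FileNotFoundError:
        return []
```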
## Integration Checklist

To add evolving evaluation to your framework:

- Feedback injection: evolution prompt has a slot for diagnostic text
- Periodic trigger: eval agent called every K generations with best code + score history
- Aux metric runner: `evaluate_aux(gen_dir)` called after each primary evaluation; output merged into metrics
- Path consistency: `results_dir` passed to `evaluate_aux` is the directory containing the source file
- Best-gen targeting: eval agent analyzes the highest-scoring generation since the last trigger, not the latest
- Aux metric persistence: agent updates `auxiliary_metrics.py` each trigger; framework runs it on every generation
- Trend data in context: score history AND aux metric history both visible to the eval agent at trigger time
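The checklist items compose into a loop like the following (a sketch with the framework components injected as callables; the `combined_score` metric key is an assumption, not part of the design):

```python
def evolve(mutate, evaluate, trigger_eval_agent, read_diagnostic,
           parent, n_generations=50, K=5):
    """Minimal loop wiring the three hooks together. `evaluate` is the
    wrapper that already merges aux metrics into its returned dict."""
    best, best_score, history = parent, float("-inf"), []
    for gen in range(n_generations):
        diagnostic = read_diagnostic()                # Hook 1
        child = mutate(best, diagnostic)
        metrics = evaluate(child, gen)                # Hook 3 runs inside
        score = metrics["combined_score"]
        history.append(score)
        if score > best_score:
            best, best_score = child, score
        if gen > 0 and gen % K == 0:                  # Hook 2
            trigger_eval_agent(best, history)
    return best, best_score
```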
## Summary
| Aspect | Status |
|---|---|
| Core insight (scalar score is insufficient) | Validated |
| Text diagnosis improves evolution | Validated on specific cases; aggregate signal small |
| Aux metrics provide independent value | Marginal: evidence for diagnosis, not a standalone signal |
| Agent helps on "fundamentally wrong approach" problems | Yes, clear wins |
| Agent helps on "already correct approach, needs tuning" problems | No clear signal |
| Text direction alone closes the gap | No: code-level suggestions needed |
| Works with any evolution framework | Yes, 3 integration hooks |
The idea is sound. The initial implementation left the hardest translation step ("text direction → specific code change") to the evolution LLM, which limited effectiveness (+0.52, within noise).

Breakthrough (Exp 10): making the agent generate code directly (`agent_candidate.cpp`) instead of only text advice yielded +4.24 avg under fair comparison (same total LLM calls). The agent went from "advisor" to "participant": it competes in the evolution pool alongside random mutations. Big wins (p59 +27, p254 +21, p245 +17, p42 +31) come from the agent identifying and implementing specific fixes, bypassing the broken "text → code" transmission chain.
## Reference Implementation

This design is implemented in ShinkaEvolve for competitive programming (Frontier-CS benchmark).

### Key Source Files
| Component | File | Description |
|---|---|---|
| Eval agent prompt | `eval_agent/ev2_prompt.j2` | Jinja2 template for the eval agent's task. Defines workflow, output format, aux_metrics template. |
| Eval service | `eval_agent/ev2_service_standalone.py` | FastAPI service wrapping the eval agent. Handles trigger logic, aux metric execution, state persistence, feedback computation. |
| Feedback module | `eval_agent/feedback.py` | Computes the metric effectiveness report: which aux metrics improved, plateaued, or degraded since the last trigger. |
| Evolution runner | `shinka/core/runner.py` | Calls the eval service at trigger points; injects the diagnostic into the evolution prompt. |
| Evolution sampler | `shinka/core/sampler.py` | Reads `diagnostic_report.md` from disk and includes it in the LLM prompt. |
| Primary evaluator (example) | `tasks/frontier_cs_entry/evaluate_algorithmic.py` | Example evaluator for competitive programming. Shows correct definition and how metrics are returned. |
| Fork tool | `tasks/frontier_cs_entry/fork_experiment.py` | Forks a completed vanilla run at generation N for controlled A/B comparison. |
| Run scripts | `scripts/ev2_agentic/` | Shell scripts for launching parallel agentic experiments. |
## Experiment Results

`results/frontier_cs_algorithmic/experiment_summary.md`: full history of 9 experiments with scores, bugs found, and per-problem comparison tables.
## Design Evolution Log

Experiments in chronological order, each corresponding to a prompt/code change:

| Version | Dir | Key Change |
|---|---|---|
| v1 (broken) | `agent_g50_20260326` | First attempt. "Metric inventor" prompt. Agent harmful. |
| v2 (pipeline broken) | `agent_fork_g5_20260402_073345` | Diagnostic analyst prompt. Diagnosis never reached the evolution LLM. |
| v3 (pipeline fixed) | `agent_fork_g5_20260407_220308` | Fixed injection bug. First positive result (+0.9). |
| v4 (aux metrics) | `agent_aux_test_20260408` | Added aux metric workflow. Most metrics still trivial. |
| v5 (trivial test) | `agent_aux_v2_20260409` | Added the Trivial Metric Test. +3.2 avg, best W:L ratio. |
| v6 (context cleanup) | `agent_aux_v3_20260409` | Best-gen targeting, dedup fixes. High variance. |
| v7 (code exec env) | `agent_aux_v4_20260414` | Compile commands + testdata paths in task message. |
| v8 (hypothesis-driven) | `agent_hyp_v5b_20260414` | Hypothesis → Experiment → Verdict format. |
| v9 (correct fix) | `agent_v3_fork_g5_20260415_175746` | Fixed correct definition + path bug. +0.52, no systematic improvement. |