Evolving Evaluation: A Framework-Agnostic Design for Diagnostic Feedback in Code Evolution

The Problem

Standard evolutionary code optimization loops look like this:

for generation in range(N):
    code = mutate(parent_code)
    score = evaluate(code)
    if score > best_score:
        best_score, best = score, code

The evolution LLM receives score=68.1 for 40 consecutive generations with no direction to improve. This is a dead signal: it tells you how well the code performs but nothing about why it fails or what to change.

For simple optimization targets (minimize X, maximize Y), the score gradient is sufficient. For complex programs such as competitive programming, simulation solvers, and algorithmic implementations, the bottleneck is almost never obvious from the score alone:

  • Is it a timeout? On which inputs? Which function?
  • Is it wrong output? On what edge case? Where does it diverge?
  • Is it a suboptimal heuristic? Which decision point is the bottleneck?

The score cannot answer these questions. An agent that analyzes the code and its behavior can.


The Core Idea

Evolve the evaluation alongside the code.

Introduce a second agent, the eval agent, that runs periodically alongside the evolution loop. Its job is not to evaluate correctness but to produce diagnostic feedback: specific, evidence-backed insights about why the current best code is not scoring higher, and what direction would improve it.

This feedback is injected into the evolution agent's prompt, turning a dead scalar signal into an actionable diagnosis.

                    ┌──────────────────────────────┐
                    │       Evolution Loop         │
                    │                              │
  ┌──────────┐      │  mutate → evaluate → select  │
  │ Eval     │◄──────                              │
  │ Agent    │      │  every K generations:        │
  │          │─────►│  inject diagnostic_report    │
  └──────────┘      │                              │
                    └──────────────────────────────┘

The eval agent produces two outputs:

  1. diagnostic_report.md: a text analysis injected into the evolution agent's next prompt. Contains what the bottleneck is, what was measured, and what direction to try.

  2. auxiliary_metrics.py: a small script that compiles and runs the code on specific inputs to measure properties the primary evaluator does not capture. Runs automatically on every future generation, producing trend data.


Architecture

Minimal Interface Requirements

To integrate this into any evolution framework, you need three hooks:

Hook 1: Feedback injection point

The evolution agent's prompt must accept a text field that gets injected before generation. In most frameworks this is a "system message" or "context" field.

# In your evolution prompt builder:
diagnostic = open("eval_agent_memory/diagnostic_report.md").read()
prompt = base_prompt + f"\n\n## Diagnostic Feedback\n{diagnostic}"

Hook 2: Periodic trigger

The eval agent is triggered every K generations (K=5 works well). Pass it:

  • Path to the current best code
  • Per-generation score history
  • Previous diagnostic report (for continuity)
if generation % K == 0:
    eval_agent.trigger(
        best_code_path=f"gen_{best_gen}/main.cpp",
        score_history=get_score_history(),
        results_dir=experiment_dir,
    )

Hook 3: Aux metric execution

After each evaluation, run auxiliary_metrics.py and merge its output into the generation's metrics.

# In your evaluator wrapper:
primary_metrics = primary_evaluator(code)
aux_metrics = run_aux_metrics(f"gen_{gen}/", primary_metrics)
full_metrics = {**primary_metrics, **aux_metrics}
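
Here run_aux_metrics is the framework's own glue. A minimal sketch of it, assuming the agent's latest auxiliary_metrics.py is stored under eval_agent_memory/ (the helper name and script location are illustrative, not part of the contract):

# Illustrative glue: load the agent-written auxiliary_metrics.py and call its
# evaluate_aux() entry point. The script location is an assumption.
import importlib.util
import os

def run_aux_metrics(gen_dir, primary_metrics,
                    script_path="eval_agent_memory/auxiliary_metrics.py"):
    # Before the first eval-agent trigger there is no aux script yet.
    if not os.path.exists(script_path):
        return {}
    spec = importlib.util.spec_from_file_location("auxiliary_metrics", script_path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
        return module.evaluate_aux(gen_dir, primary_result=primary_metrics)
    except Exception:
        # Aux metrics must never break the evolution loop; fail soft.
        return {}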

Aux Metrics Contract

import os

def evaluate_aux(results_dir, primary_result=None):
    """
    results_dir: path to gen_N/ (code is at results_dir/main.cpp)
    primary_result: dict of primary metrics (optional)
    returns: dict of metric_name -> float
    """
    code_path = os.path.join(results_dir, 'main.cpp')
    # compile, run, measure, then return float-valued metrics
    return {"metric_name": 0.0}

Critical implementation note: results_dir must point to the generation directory containing the source file (e.g. gen_10/), not a subdirectory. If your framework stores results in gen_10/results/, pass gen_10/ to evaluate_aux, not gen_10/results/. Getting this wrong causes silent compilation failures.
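
For illustration, a hedged example of the kind of evaluate_aux the eval agent might write: it compiles main.cpp from results_dir and times the binary on one large custom input. The testdata path and metric names are placeholders, not part of the interface.

# Example aux metric (sketch): wall-clock runtime on one large custom input.
import os
import subprocess
import time

def evaluate_aux(results_dir, primary_result=None):
    code_path = os.path.join(results_dir, "main.cpp")
    binary = os.path.join(results_dir, "aux_binary")
    build = subprocess.run(["g++", "-O2", "-std=gnu++17", code_path, "-o", binary],
                           capture_output=True)
    if build.returncode != 0:
        return {"aux_compile_ok": 0.0}
    start = time.time()
    try:
        # "testdata/large_case.txt" is a hypothetical input the agent generated.
        with open("testdata/large_case.txt") as f:
            subprocess.run([binary], stdin=f, capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return {"aux_compile_ok": 1.0, "aux_runtime_large_ms": 10000.0}
    return {"aux_compile_ok": 1.0,
            "aux_runtime_large_ms": (time.time() - start) * 1000.0}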

Eval Agent Task Message

The eval agent receives a task message containing:

Code to diagnose: /path/to/gen_N/main.cpp
Metrics history: [per-generation score and aux metric values]
Previous diagnostic: [last report, if any]
Compile command: g++ -O2 -std=gnu++17 {code_path} -o /tmp/binary
Test data: /path/to/testdata/

Your outputs:
- diagnostic_report.md: Hypothesis / Experiment / Verdict / Direction
- auxiliary_metrics.py: automated experiment that runs every generation
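
One way the trigger could assemble that message, sketched below; the field layout mirrors the listing above, and the function and parameter names are illustrative rather than part of the framework.

# Sketch: build the eval agent's task message from the trigger inputs.
def build_eval_task_message(code_path, metrics_history, previous_report, testdata_dir):
    return "\n".join([
        f"Code to diagnose: {code_path}",
        f"Metrics history: {metrics_history}",
        f"Previous diagnostic: {previous_report or '[none]'}",
        "Compile command: g++ -O2 -std=gnu++17 {code_path} -o /tmp/binary",
        f"Test data: {testdata_dir}",
        "",
        "Your outputs:",
        "- diagnostic_report.md: Hypothesis / Experiment / Verdict / Direction",
        "- auxiliary_metrics.py: automated experiment that runs every generation",
    ])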

What Works

The Core Insight Holds

A well-targeted diagnostic can break a 40-generation plateau in a single intervention. Evidence from experiments on 172 competitive programming problems:

  • p79: Vanilla evolution never solved it (50 gens, score=0). The diagnosis identified "bit-by-bit correlation attack fails for non-independent bits → use beam search". The agent solved it.
  • p30: Diagnosis found a "deduplication bug in candidate tracking" → bitset redesign → score 100.
  • p15: Diagnosis identified "greedy optimizes connections instead of lexicographic order" → single-line priority fix → +55 points.

These wins share a pattern: the diagnosis identified a specific, actionable algorithmic insight that the evolution LLM would be unlikely to discover by random mutation.

When It Helps

  • Problems where the current approach is fundamentally wrong (wrong algorithm, wrong optimization objective)
  • Problems where the bottleneck is non-obvious from the score (TLE on large inputs only, wrong answer on specific edge cases)
  • Problems where the direction requires structural code changes that the LLM won't stumble into by gradient descent on the score

Diagnostic Report Quality Matters More Than Aux Metrics

The text diagnosis is more impactful than any numeric aux metric. In A/B comparison, pure code review captured 80-90% of the diagnostic value. Aux metrics provide evidence for the diagnosis and trend tracking, but the bottleneck is the quality of the directional advice.


What Doesn't Work (Honest Assessment)

Aggregate Results Are Weak

Across 9 experiments on 20-problem subsets (50 generations each), the average delta between agent and vanilla is +0.52 points, which is within random variance. Individual experiments range from -1.0 to +3.2.

The signal is real but small relative to noise. At 20 problems × 50 generations, random variance dominates.

The Transmission Gap

The fundamental bottleneck: the eval agent produces text direction, and the evolution LLM must translate that into specific code changes.

Given a direction like "Expand the search space for parameter a and relax the early exit condition", the LLM still has to figure out what values to use, which lines to change, and whether the change is correct. This is not much better than the LLM exploring randomly.

Compare this to the winning cases (p30, p79): those diagnoses were specific enough that the LLM could act on them directly ("you have a deduplication bug on line X, use a bitset").

Aux Metrics Rarely Change Behavior

Showing the evolution LLM runtime_ms=4500 doesn't help it write faster code. The metric needs to be paired with a specific direction ("the bottleneck is solveFullDP, replace it with this approach") to be useful.


Design Principles

1. Diagnose the best attempt, not the latest

The latest generation may be a regression (compilation error, bad mutation). Always diagnose the highest-scoring generation since the last analysis; that represents the true algorithmic frontier.
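
For example, a small sketch of picking that target, assuming score_history is a list of (generation, score) pairs (names are illustrative):

def pick_diagnosis_target(score_history, last_trigger_gen):
    # Highest-scoring generation since the last analysis, not the latest one.
    since_last = [(gen, score) for gen, score in score_history if gen > last_trigger_gen]
    if not since_last:
        since_last = score_history  # nothing new since the last trigger; fall back to all
    best_gen, _ = max(since_last, key=lambda item: item[1])
    return best_gen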

2. Evidence before direction

"The sort is O(nΒ²), causing TLE" is more useful than "try a faster sort". Back every direction with a measured number.

Good: "Query count = 8800 (1.4× optimal of 6268). Eliminate redundant queries
      in the staircase walk by batching adjacent cells."
Bad:  "The query count might be too high."

3. Run code, don't read code

Static analysis of code is limited. Compile the code, run it on a specific input, measure the output. Ground truth beats inference.

Backward experiment: "Run on failing test case, measure where it breaks." Forward experiment: "Run on n=10, n=100, n=1000. Find the frontier."

4. The Trivial Metric Test

Before writing any aux metric, ask: "Does this require running the code?"

If you can compute it from the primary evaluator's output alone (score, per-case results), it adds zero information. Only measure things the primary evaluator cannot.

Valuable: Parse program stdout, run on custom inputs, measure structural output properties.
Trivial:  Repackage score, count lines of code, restate pass/fail counts.

5. Direction richness over metric richness

Five precise diagnostic sentences beat twenty auxiliary metrics. Write the diagnosis first; add metrics only if they provide evidence for the diagnosis.


Known Limitations and Open Problems

Problem 1: Text direction is insufficient

Current state: agent writes text. Evolution LLM reads text and still has to figure out the code change.

Potential fix: the agent generates code-level suggestions (specific function replacements, pseudocode for the suggested approach, or even a complete rewrite of the identified bottleneck function). This closes the gap between diagnosis and implementation.

Problem 2: K=5 trigger frequency may be too sparse

With K=5, an agent that correctly diagnoses a problem at gen 5 has 45 remaining generations to benefit from the feedback. But if the LLM ignores or misinterprets the direction, the next correction only comes at gen 10.

Potential fix: trigger on plateau detection (no improvement in last K generations) rather than fixed interval.
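
A sketch of that plateau check, assuming scores holds the best score seen at each generation (window and tolerance are illustrative):

def should_trigger(scores, window=5, eps=1e-9):
    # Trigger when the best score has not improved over the last `window` generations.
    if len(scores) <= window:
        return False
    return max(scores[-window:]) <= max(scores[:-window]) + eps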

Problem 3: Single direction per trigger

Each trigger produces one diagnostic report. If the diagnosis is wrong, all K following generations are misguided.

Potential fix: generate multiple competing hypotheses; use aux metrics to select the correct one on the next trigger.

Problem 4: No memory of what was tried

The agent sees score history but not "which specific changes were tried and what happened". A code change that moved from 68 to 72 might have been partially correct; the agent should know to push further in that direction.

Potential fix: maintain a change log alongside score history, e.g. gen_12: replaced linear scan with binary search → +4 points.
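
A minimal sketch of such a log, one JSON line per generation, appended by the framework after each mutation is scored (field names are illustrative):

import json

def log_change(log_path, generation, change_summary, score_before, score_after):
    # Record what was tried and how the score moved, so the eval agent can
    # push further on partially correct directions.
    entry = {"gen": generation,
             "change": change_summary,
             "delta": round(score_after - score_before, 3)}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")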


Integration Checklist

To add evolving evaluation to your framework:

  • Feedback injection: Evolution prompt has a slot for diagnostic text
  • Periodic trigger: Eval agent called every K generations with best code + score history
  • Aux metric runner: evaluate_aux(gen_dir) called after each primary evaluation; output merged into metrics
  • Path consistency: results_dir passed to evaluate_aux is the directory containing the source file
  • Best-gen targeting: Eval agent analyzes highest-scoring generation since last trigger, not the latest
  • Aux metric persistence: Agent updates auxiliary_metrics.py each trigger; framework runs it on every generation
  • Trend data in context: Score history AND aux metric history both visible to eval agent at trigger time

Summary

| Aspect | Status |
| --- | --- |
| Core insight (scalar score is insufficient) | Validated |
| Text diagnosis improves evolution | Validated on specific cases; aggregate signal small |
| Aux metrics provide independent value | Marginal: evidence for diagnosis, not standalone signal |
| Agent helps on "fundamentally wrong approach" problems | Yes, clear wins |
| Agent helps on "already correct approach, needs tuning" problems | No clear signal |
| Text direction alone closes the gap | No; code-level suggestions needed |
| Works with any evolution framework | Yes, 3 integration hooks |

The idea is sound. The initial implementation left the hardest translation step ("text direction → specific code change") to the evolution LLM, which limited effectiveness (+0.52, within noise).

Breakthrough (Exp 10): Making the agent generate code directly (agent_candidate.cpp) instead of just text advice yielded +4.24 avg under fair comparison (same total LLM calls). The agent went from "advisor" to "participant": it competes in the evolution pool alongside random mutations. Big wins (p59 +27, p254 +21, p245 +17, p42 +31) come from the agent identifying AND implementing specific fixes, bypassing the broken "text → code" transmission chain.


Reference Implementation

This design is implemented in ShinkaEvolve for competitive programming (Frontier-CS benchmark).

Key Source Files

| Component | File | Description |
| --- | --- | --- |
| Eval agent prompt | eval_agent/ev2_prompt.j2 | Jinja2 template for the eval agent's task. Defines workflow, output format, aux_metrics template. |
| Eval service | eval_agent/ev2_service_standalone.py | FastAPI service wrapping the eval agent. Handles trigger logic, aux metric execution, state persistence, feedback computation. |
| Feedback module | eval_agent/feedback.py | Computes metric effectiveness report: which aux metrics improved, plateaued, or degraded since last trigger. |
| Evolution runner | shinka/core/runner.py | Calls eval service at trigger points; injects diagnostic into evolution prompt. |
| Evolution sampler | shinka/core/sampler.py | Reads diagnostic_report.md from disk and includes it in the LLM prompt. |
| Primary evaluator (example) | tasks/frontier_cs_entry/evaluate_algorithmic.py | Example evaluator for competitive programming. Shows correct definition and how metrics are returned. |
| Fork tool | tasks/frontier_cs_entry/fork_experiment.py | Forks a completed vanilla run at generation N for controlled A/B comparison. |
| Run scripts | scripts/ev2_agentic/ | Shell scripts for launching parallel agentic experiments. |

Experiment Results

results/frontier_cs_algorithmic/experiment_summary.md contains the full history of 9 experiments with scores, bugs found, and per-problem comparison tables.

Design Evolution Log

Experiments in chronological order, each corresponding to a prompt/code change:

| Version | Dir | Key Change |
| --- | --- | --- |
| v1 (broken) | agent_g50_20260326 | First attempt. "Metric inventor" prompt. Agent harmful. |
| v2 (pipeline broken) | agent_fork_g5_20260402_073345 | Diagnostic analyst prompt. Diagnosis never reached evolution LLM. |
| v3 (pipeline fixed) | agent_fork_g5_20260407_220308 | Fixed injection bug. First positive result (+0.9). |
| v4 (aux metrics) | agent_aux_test_20260408 | Added aux metric workflow. Most metrics still trivial. |
| v5 (trivial test) | agent_aux_v2_20260409 | Added Trivial Metric Test. +3.2 avg, best W:L ratio. |
| v6 (context cleanup) | agent_aux_v3_20260409 | Best-gen targeting, dedup fixes. High variance. |
| v7 (code exec env) | agent_aux_v4_20260414 | Compile commands + testdata paths in task message. |
| v8 (hypothesis-driven) | agent_hyp_v5b_20260414 | Hypothesis → Experiment → Verdict format. |
| v9 (correct fix) | agent_v3_fork_g5_20260415_175746 | Fixed correct definition + path bug. +0.52, no systematic improvement. |