
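Evaluation Agent Design Proposal

📋 Feasibility Analysis

Conclusion: feasible. Converting the evaluation script into an agent is technically feasible, and it lets the evaluation strategy adapt to context instead of following one fixed pipeline.

🏗️ Current Architecture

Current Evaluation Workflow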

┌─────────────────────────────────────────────────────┐
│  EvolutionRunner (controller)                       │
│  ├─ Generate new code (gen_N/main.py)               │
│  ├─ Submit job to JobScheduler                      │
│  └─ Wait for results                                │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  JobScheduler runs the command:                     │
│  python evaluate_with_auxiliary.py \                │
│    --program_path gen_N/main.py \                   │
│    --results_dir gen_N/results                      │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  Evaluation Script (separate process)               │
│  ├─ Load program                                    │
│  ├─ Run experiment (run_packing)                    │
│  ├─ Validate results (validate_packing)             │
│  ├─ Compute metrics (7 fixed auxiliary metrics)     │
│  ├─ Generate text feedback                          │
│  └─ Save results to files:                          │
│      • metrics.json                                 │
│      • correct.json                                 │
│      • extra.npz                                    │
│      • auxiliary_analysis.json                      │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  EvolutionRunner reads the results                  │
│  ├─ Parse metrics.json                              │
│  ├─ Extract combined_score, public_metrics          │
│  ├─ Write to database (ProgramDatabase)             │
│  └─ Used to select next-generation parents          │
└─────────────────────────────────────────────────────┘

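Key Interface Contracts

Input interface (command-line arguments):

--program_path: str      # Path of the program to evaluate
--results_dir: str       # Directory where results are saved
--aux_config: str        # (Optional) auxiliary evaluation config

Output interface (file system):

# metrics.json
{
  "combined_score": 2.635,        # Primary score (required)
  "public": {                      # Public metrics (visible to the LLM)
    "centers_str": "...",
    "num_circles": 26,
    "aux_packing_efficiency": 0.842,
    "aux_gap_analysis": 0.756,
    ...
  },
  "private": {                     # Private metrics (logged only)
    "reported_sum_of_radii": 2.635
  },
  "text_feedback": "..."          # (Optional) text feedback
}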

# correct.json
{
  "correct": true,
  "error": null
}
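As a minimal sketch of the output contract above, the following helper writes both files and rejects a metrics dict without `combined_score`. The name `write_results` and its signature are assumptions made for illustration; they are not taken from the existing evaluation script.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_results(
    results_dir: str,
    metrics: Dict[str, Any],
    correct: bool,
    error: Optional[str] = None,
) -> None:
    """Write metrics.json and correct.json in the layout described above."""
    # combined_score is the one metrics field the contract marks as required.
    if "combined_score" not in metrics:
        raise ValueError("metrics must contain 'combined_score'")

    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (out / "correct.json").write_text(
        json.dumps({"correct": correct, "error": error}, indent=2)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_results(
            tmp,
            {"combined_score": 2.635, "public": {"num_circles": 26}, "private": {}},
            correct=True,
        )
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Failing early on a missing `combined_score` keeps a malformed result from reaching the runner, which depends on that field.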

Database schema (Program table):

from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class Program:
    # Identity
    id: str
    code: str
    generation: int
    parent_id: Optional[str]
    
    # Evaluation results (produced by the evaluation)
    combined_score: float
    public_metrics: Dict[str, Any]
    private_metrics: Dict[str, Any]
    text_feedback: str
    correct: bool
    
    # Auxiliary data
    embedding: List[float]
    metadata: Dict[str, Any]
    
    # Evolutionary lineage
    archive_inspiration_ids: List[str]
    top_k_inspiration_ids: List[str]
    children_count: int

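🤖 Agent Conversion Plan

Core Design Philosophy

How an agent differs from a script:

Autonomous decisions: the agent chooses its analysis strategy based on context
Dynamic tool use: the agent can call different tools and generate new code
History awareness: the agent can query the database to understand the evolutionary history
Meta-learning: the agent can improve its own evaluation strategy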

Agent Architecture

┌─────────────────────────────────────────────────────────────┐
│              EvaluationAgent (main controller)              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Core Components:                                     │  │
│  │  • LLM (decision maker)                               │  │
│  │  • Tool Registry (callable tool set)                  │  │
│  │  • Database Access (read/write history data)          │  │
│  │  • Code Executor (safely runs generated code)         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Workflow:                                            │  │
│  │  1. Receive request (program_path, results_dir)       │  │
│  │  2. Query database for context                        │  │
│  │  3. LLM plans the evaluation strategy                 │  │
│  │  4. Execute steps (call tools / generate code)        │  │
│  │  5. Aggregate results and generate feedback           │  │
│  │  6. (Optional) Update database metadata               │  │
│  │  7. Save the standard output files                    │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

            Tools available to the agent:
            
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│  Ground Truth        │  │  Auxiliary           │  │  Dynamic Metric      │
│  Evaluation          │  │  Metrics             │  │  Generator           │
│  • Run program       │  │  • Predefined metrics│  │  • LLM writes code   │
│  • Check constraints │  │  • Registry system   │  │  • Compile and run   │
│  • Compute main score│  │                      │  │  • Secure sandbox    │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘

┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│  Database Query      │  │  Visualization       │  │  Meta Analysis       │
│  • Query history     │  │  • Generate charts   │  │  • Trend analysis    │
│  • Statistics        │  │  • Save figures      │  │  • Strategy advice   │
│  • Compare programs  │  │                      │  │                      │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘

            Database access permissions:
            
┌──────────────────────────────────────────────────────────┐
│  Database Interface (ProgramDatabase)                    │
│  ┌────────────────────────────────────────────────────┐  │
│  │  READ operations (available to the agent):         │  │
│  │  • get_all_programs()                              │  │
│  │  • get_programs_by_generation(gen)                 │  │
│  │  • get_top_programs(n, metric)                     │  │
│  │  • get_best_program(metric)                        │  │
│  │  • get_program(id)                                 │  │
│  │  • Custom SQL queries (restricted)                 │  │
│  └────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────┐  │
│  │  WRITE operations (use with caution):              │  │
│  │  • May only write to the metadata field            │  │
│  │  • Cannot modify combined_score, correct, etc.     │  │
│  │  • May add extra analysis results to metadata      │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
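One way the "restricted custom SQL" entry above could be enforced is sketched below, assuming the database is a plain SQLite file. The connection is opened read-only through an SQLite URI, and a simple guard rejects anything that is not a single SELECT. The helper names `open_read_only` and `run_restricted_query`, and the two-column `programs` table in the demo, are illustrative assumptions rather than the real schema.

```python
import os
import sqlite3
import tempfile


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an SQLite file in read-only mode using a URI filename."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def run_restricted_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Run one SELECT statement and reject everything else."""
    statement = sql.strip().rstrip(";")
    # Conservative guard: a semicolon anywhere else is treated as a second statement.
    if not statement.lower().startswith("select") or ";" in statement:
        raise PermissionError("Only single SELECT statements are allowed")
    return conn.execute(statement, params).fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "evolution_db.sqlite")
        # Build a tiny stand-in database (the real schema may differ).
        writer = sqlite3.connect(path)
        writer.execute("CREATE TABLE programs (id TEXT, combined_score REAL)")
        writer.execute("INSERT INTO programs VALUES ('gen_1', 2.5)")
        writer.commit()
        writer.close()

        conn = open_read_only(path)
        print(run_restricted_query(conn, "SELECT id, combined_score FROM programs"))
        conn.close()
```

The read-only connection is the actual safety boundary here; the string check only gives the agent a clearer error message before SQLite refuses the write.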

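🔌 Agent External Interface Design

1. Command-Line Interface (backward compatible)

# Basic interface (fully compatible with the current script)
python evaluation_agent.py \
    --program_path gen_42/main.py \
    --results_dir gen_42/results

# Extended interface (new options)
#   --db_path                 database the agent may access
#   --agent_mode              evaluation mode: static | adaptive | exploratory
#   --enable_dynamic_metrics  allow generating new metrics
#   --feedback_style          feedback style: minimal | normal | detailed
python evaluation_agent.py \
    --program_path gen_42/main.py \
    --results_dir gen_42/results \
    --db_path path/to/evolution.sqlite \
    --agent_mode adaptive \
    --enable_dynamic_metrics \
    --feedback_style detailed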
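The flags above could be parsed with `argparse` roughly as follows. This is a sketch: the defaults (`static` mode, `normal` feedback) are assumptions chosen so that a call with only the two required flags behaves like the current script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the basic and extended flags listed above."""
    parser = argparse.ArgumentParser(description="Evaluation agent (sketch)")
    # Required flags, identical to the existing evaluation script.
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    # Extended flags; the defaults below are assumptions.
    parser.add_argument("--db_path", default=None)
    parser.add_argument(
        "--agent_mode",
        choices=["static", "adaptive", "exploratory"],
        default="static",
    )
    parser.add_argument("--enable_dynamic_metrics", action="store_true")
    parser.add_argument(
        "--feedback_style",
        choices=["minimal", "normal", "detailed"],
        default="normal",
    )
    return parser


if __name__ == "__main__":
    demo_args = build_parser().parse_args(
        ["--program_path", "gen_42/main.py", "--results_dir", "gen_42/results"]
    )
    print(demo_args.agent_mode, demo_args.enable_dynamic_metrics)
```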

2. Python API

from shinka.evaluation import EvaluationAgent, AgentConfig

# Configure the agent
agent_config = AgentConfig(
    # LLM settings
    llm_model="native-gemini-2.5-pro",
    llm_temperature=0.7,
    
    # Evaluation mode
    mode="adaptive",  # static | adaptive | exploratory
    
    # Tool permissions
    enable_ground_truth=True,         # Required
    enable_auxiliary_metrics=True,    # Predefined auxiliary metrics
    enable_dynamic_metrics=True,      # LLM-generated metrics
    enable_database_read=True,        # Read historical data
    enable_database_write_metadata=False,  # Write metadata
    
    # Safety settings
    code_execution_timeout=30,        # Timeout for running generated code
    max_tool_calls=20,                # Maximum number of tool calls
    sandboxed_execution=True,         # Run generated code in a sandbox
    
    # Output settings
    generate_text_feedback=True,
    save_detailed_analysis=True,
    visualization=True,
)

# Create the agent
agent = EvaluationAgent(
    config=agent_config,
    db_path="path/to/evolution.sqlite"  # Optional
)

# Run the evaluation
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)

# The agent saves the standard output files automatically:
# - metrics.json
# - correct.json
# - auxiliary_analysis.json
# - (optional) agent_reasoning.json  # The agent's decision process
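The configuration object used above might be modelled as a small dataclass that validates itself on construction. The sketch below covers only a subset of the fields, and both the defaults and the rule tying exploratory mode to `enable_dynamic_metrics` are assumptions made for illustration.

```python
from dataclasses import dataclass

VALID_MODES = ("static", "adaptive", "exploratory")


@dataclass
class AgentConfig:
    """Sketch of a subset of the configuration fields shown above."""

    mode: str = "static"
    enable_dynamic_metrics: bool = False
    enable_database_read: bool = False
    enable_database_write_metadata: bool = False
    code_execution_timeout: int = 30
    max_tool_calls: int = 20

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {self.mode!r}")
        if self.max_tool_calls < 1:
            raise ValueError("max_tool_calls must be at least 1")
        # Assumed rule: exploratory mode generates metrics, so it needs the flag.
        if self.mode == "exploratory" and not self.enable_dynamic_metrics:
            raise ValueError("exploratory mode requires enable_dynamic_metrics=True")


if __name__ == "__main__":
    print(AgentConfig(mode="adaptive", enable_database_read=True))
```

Validating in `__post_init__` surfaces a bad configuration before any LLM call is made.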

3. EvolutionRunner Integration

from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Configure the job to use the agent evaluator
job_config = LocalJobConfig(
    eval_program_path="shinka/evaluation/agent_main.py",  # Agent entry point
    extra_cmd_args={
        "agent_mode": "adaptive",
        "enable_dynamic_metrics": True,
        "db_path": "auto",  # Pass the database path automatically
    }
)

# Database configuration
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

# Evolution configuration
evo_config = EvolutionConfig(
    # ... other settings
    use_text_feedback=True,  # Receive the feedback generated by the agent
)

# At run time the agent automatically receives:
# 1. The program path of the current generation
# 2. Database access (via the --db_path argument)
# 3. Information about historical programs
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()

🛠️ Agent Tool System Design

Tool Interface Specification

from typing import Any, Dict, List, Optional
from dataclasses import dataclass

@dataclass
class ToolResult:
    """Result of a tool execution."""
    success: bool
    data: Any = None   # Defaults to None so failure results can omit it
    error: Optional[str] = None
    cost: float = 0.0  # API cost

class Tool:
    """Base class for tools."""
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema
    
    def execute(self, **kwargs) -> ToolResult:
        """Run the tool logic."""
        raise NotImplementedError
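A tool registry built on this interface could look like the sketch below. `ToolRegistry` and `EchoTool` are hypothetical names, and the minimal `ToolResult` and `Tool` definitions are repeated only so the example runs on its own. The registry also enforces a call budget in the spirit of the `max_tool_calls` setting.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0


class Tool:
    name: str = ""

    def execute(self, **kwargs) -> ToolResult:
        raise NotImplementedError


class ToolRegistry:
    """Hypothetical registry: maps names to tools and enforces a call budget."""

    def __init__(self, max_tool_calls: int = 20) -> None:
        self._tools: Dict[str, Tool] = {}
        self.max_tool_calls = max_tool_calls
        self.calls_made = 0

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"Tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> ToolResult:
        if self.calls_made >= self.max_tool_calls:
            return ToolResult(False, error="Tool call budget exhausted")
        if name not in self._tools:
            return ToolResult(False, error=f"Unknown tool: {name}")
        self.calls_made += 1
        try:
            return self._tools[name].execute(**kwargs)
        except Exception as exc:  # a failing tool must not crash the agent loop
            return ToolResult(False, error=str(exc))


class EchoTool(Tool):
    """Trivial tool used only to exercise the registry."""

    name = "echo"

    def execute(self, **kwargs) -> ToolResult:
        return ToolResult(True, data=kwargs)


if __name__ == "__main__":
    registry = ToolRegistry(max_tool_calls=2)
    registry.register(EchoTool())
    print(registry.call("echo", metric_name="gap_analysis"))
```

Returning a failed `ToolResult` instead of raising lets the planner see the error text and decide how to continue.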

Core Tool List

# ============================================================================
# 1. GROUND TRUTH EVALUATION (required tools)
# ============================================================================

class RunProgramTool(Tool):
    """Run the program under evaluation and collect its raw results."""
    name = "run_program"
    description = "Execute the program and get raw results (centers, radii, score)"
    
    def execute(self, program_path: str, num_runs: int = 1) -> ToolResult:
        # Calls the underlying logic of run_shinka_eval
        # Returns: centers, radii, reported_score
        pass

class ValidateResultsTool(Tool):
    """Check whether the program output satisfies the constraints."""
    name = "validate_results"
    description = "Validate if results satisfy all constraints"
    
    def execute(self, centers, radii) -> ToolResult:
        # Calls adapted_validate_packing
        # Returns: is_valid, error_message
        pass

# ============================================================================
# 2. AUXILIARY METRICS (predefined analysis tools)
# ============================================================================

class ComputeMetricTool(Tool):
    """Compute a predefined auxiliary metric."""
    name = "compute_metric"
    description = "Compute a predefined auxiliary metric"
    parameters = {
        "metric_name": {
            "type": "string",
            "enum": ["packing_efficiency", "gap_analysis", "edge_utilization", ...]
        }
    }
    
    def execute(self, metric_name: str, centers, radii) -> ToolResult:
        # Calls METRIC_REGISTRY.get(metric_name)
        pass

class ListMetricsTool(Tool):
    """List all available predefined metrics."""
    name = "list_metrics"
    
    def execute(self) -> ToolResult:
        return ToolResult(
            success=True,
            data=METRIC_REGISTRY.list_metrics()
        )

# ============================================================================
# 3. DATABASE ACCESS (historical data tools)
# ============================================================================

class QueryDatabaseTool(Tool):
    """Query the database for information about historical programs."""
    name = "query_database"
    description = "Query historical programs from database"
    parameters = {
        "query_type": {
            "type": "string",
            "enum": ["top_programs", "by_generation", "best_program", "all"]
        },
        "filters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "n": {"type": "integer"},
                "generation": {"type": "integer"}
            }
        }
    }
    
    def execute(self, query_type: str, filters: Dict) -> ToolResult:
        if query_type == "top_programs":
            programs = self.db.get_top_programs(
                n=filters.get("n", 10),
                metric=filters.get("metric", "combined_score")
            )
        elif query_type == "by_generation":
            programs = self.db.get_programs_by_generation(filters["generation"])
        # ... (remaining query types are elided in this sketch)
        else:
            # Without this branch, `programs` would be undefined below
            return ToolResult(
                success=False,
                data=None,
                error=f"Unsupported query_type: {query_type}"
            )
        
        return ToolResult(
            success=True,
            data=[p.to_dict() for p in programs]
        )

class CompareWithHistoryTool(Tool):
    """Compare the current program with historical programs."""
    name = "compare_with_history"
    
    def execute(self, current_metrics: Dict, comparison_type: str) -> ToolResult:
        # comparison_type: "best" | "parent" | "generation_average"
        # Returns the comparison analysis
        pass

# ============================================================================
# 4. DYNAMIC METRIC GENERATION (LLM-generated metrics)
# ============================================================================

class GenerateMetricCodeTool(Tool):
    """Ask the LLM to generate code for a new evaluation metric."""
    name = "generate_metric_code"
    description = "Generate Python code for a new evaluation metric"
    parameters = {
        "metric_purpose": {"type": "string"},
        "inspiration_from": {"type": "string"}  # Existing metric to use as reference
    }
    
    def execute(self, metric_purpose: str, inspiration_from: Optional[str] = None) -> ToolResult:
        # Calls the LLM to generate new metric code
        # Uses the LLMGeneratedMetric framework
        prompt = f"""
        Generate a Python function to compute a new auxiliary metric for circle packing.
        
        Purpose: {metric_purpose}
        
        Requirements:
        1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
        2. Return MetricResult with name, value, interpretation, description, details
        3. Use numpy for computations
        4. Handle edge cases gracefully
        
        Example structure:
        ```python
        def my_metric(centers, radii):
            # Your analysis logic here
            score = ...
            
            return MetricResult(
                name="my_metric",
                value=float(score),
                interpretation="higher_better",
                description="What this metric measures",
                details={{"key": "value"}}
            )
        ```
        """
        
        llm_response = self.llm.query(prompt)
        code = extract_code_from_response(llm_response)
        
        return ToolResult(
            success=True,
            data={"code": code},
            cost=llm_response.cost  # Reported through ToolResult.cost
        )

class CompileAndTestMetricTool(Tool):
    """Compile and test metric code generated by the LLM."""
    name = "compile_and_test_metric"
    
    def execute(self, code: str, test_data: Dict) -> ToolResult:
        metric = LLMGeneratedMetric(
            name="llm_metric",
            code=code,
            description="LLM generated metric",
            interpretation="higher_better"
        )
        
        if not metric.compile():
            return ToolResult(success=False, data=None, error="Compilation failed")
        
        # Trial run on the test data
        try:
            result = metric.evaluate(
                centers=test_data["centers"],
                radii=test_data["radii"]
            )
            return ToolResult(success=True, data=result)
        except Exception as e:
            return ToolResult(success=False, data=None, error=str(e))

# ============================================================================
# 5. VISUALIZATION & ANALYSIS (analysis tools)
# ============================================================================

class VisualizeTool(Tool):
    """Generate visualizations."""
    name = "visualize"
    
    def execute(self, vis_type: str, data: Dict, output_path: str) -> ToolResult:
        # vis_type: "packing" | "metrics_trend" | "comparison"
        pass

class StatisticalAnalysisTool(Tool):
    """Statistical analysis tool."""
    name = "statistical_analysis"
    
    def execute(self, data: List[float], analysis_type: str) -> ToolResult:
        # analysis_type: "trend" | "distribution" | "correlation"
        pass

# ============================================================================
# 6. META OPERATIONS
# ============================================================================

class UpdateMetadataTool(Tool):
    """Update the metadata field of a program."""
    name = "update_metadata"
    description = "Add analysis results to program metadata (write to DB)"
    
    def execute(self, program_id: str, metadata: Dict) -> ToolResult:
        # Only the metadata field may be written; core evaluation fields are off limits
        program = self.db.get_program(program_id)
        if program is None:
            return ToolResult(success=False, data=None, error=f"Unknown program: {program_id}")
        program.metadata.update(metadata)
        # Write back to the database
        # Note: requires extending ProgramDatabase with update_program_metadata
        self.db.update_program_metadata(program_id, program.metadata)
        return ToolResult(success=True, data=program.metadata)
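`GenerateMetricCodeTool` relies on `extract_code_from_response`, which the tool list does not define. A minimal sketch is given below under two assumptions: the helper receives the response text as a plain string, and the LLM wraps its code in a triple-backtick fence as the prompt requests.

```python
import re
import textwrap
from typing import Optional

# The fence marker is built programmatically so it never appears literally here.
FENCE = "`" * 3

_CODE_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code_from_response(response_text: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = _CODE_BLOCK.search(response_text)
    if match is None:
        return None
    # Dedent because LLMs sometimes indent the whole block.
    return textwrap.dedent(match.group(1)).strip()


if __name__ == "__main__":
    reply = (
        "Here is the metric:\n"
        + FENCE + "python\n"
        + "def my_metric(centers, radii):\n"
        + "    return 1.0\n"
        + FENCE + "\nLet me know if it helps."
    )
    print(extract_code_from_response(reply))
```

Returning `None` rather than the raw text forces the caller to treat a reply without a code block as a failed generation.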

🧠 Agent Decision Flow

Mode 1: Static Mode (compatibility mode)

def static_evaluation(agent, program_path, results_dir):
    """
    Fully reproduces the behavior of the existing evaluation script.
    """
    # 1. Run the program
    result = agent.tools["run_program"].execute(program_path)
    centers, radii, score = result.data
    
    # 2. Validate the results
    validation = agent.tools["validate_results"].execute(centers, radii)
    correct = validation.data["is_valid"]
    
    # 3. Compute the predefined auxiliary metrics
    auxiliary_results = {}
    for metric_name in agent.config.enabled_metrics:
        metric_result = agent.tools["compute_metric"].execute(
            metric_name, centers, radii
        )
        auxiliary_results[metric_name] = metric_result.data.value
    
    # 4. Generate the standard feedback
    feedback = generate_standard_feedback(auxiliary_results, score)
    
    # 5. Save the results
    metrics = {
        "combined_score": score,
        "public": {
            "centers_str": format_centers_string(centers),
            "num_circles": len(centers),
            **{f"aux_{k}": v for k, v in auxiliary_results.items()}
        },
        "private": {"reported_sum_of_radii": score},
        "text_feedback": feedback
    }
    
    save_metrics(results_dir, metrics, correct)
    return metrics, correct

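Mode 2: Adaptive Mode (intelligent mode)

def adaptive_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent chooses its evaluation strategy based on context.
    """
    # 1. Gather context
    context = agent.gather_context(program_path, db_path)
    
    # 2. The LLM plans the evaluation strategy
    plan = agent.llm.plan_evaluation(context)
    
    # Example plan:
    # {
    #   "steps": [
    #     {"action": "run_program", "params": {...}},
    #     {"action": "validate_results", "params": {...}},
    #     {"action": "query_database", "params": {"query_type": "best_program"}},
    #     {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
    #     {"action": "compare_with_history", "params": {"comparison_type": "best"}},
    #     {"action": "generate_feedback", "params": {...}}
    #   ]
    # }
    
    # 3. Execute the plan from a queue, so that a replan changes what runs next
    execution_log = []
    pending = list(plan["steps"])
    correct = False  # Stays False unless validate_results succeeds
    while pending and len(execution_log) < agent.config.max_tool_calls:
        step = pending.pop(0)
        tool = agent.tools[step["action"]]
        result = tool.execute(**step["params"])
        execution_log.append(result)
        
        if step["action"] == "validate_results" and result.success:
            correct = result.data["is_valid"]
        
        # If a step fails, the LLM can revise the strategy
        # (assumes replan returns a plan holding only the remaining steps)
        if not result.success:
            plan = agent.llm.replan(plan, execution_log, result.error)
            pending = list(plan["steps"])
    
    # 4. The LLM aggregates the results and generates feedback
    final_metrics, feedback = agent.llm.aggregate_results(execution_log, context)
    
    # 5. Save the results (keeps the interface compatible)
    save_metrics(results_dir, final_metrics, correct)
    
    # 6. (Optional) Save the agent's reasoning
    save_agent_reasoning(results_dir, plan, execution_log)
    
    return final_metrics, correct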
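The execute-and-replan loop of adaptive mode can be isolated into a small, testable function. In the sketch below, tools are modelled as plain callables returning `(success, payload)` and the replanner as a callback that returns the new remaining steps; both are simplifying assumptions, not the agent's actual interfaces.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Tuple

Step = Dict[str, Any]
# A tool is modelled as a callable returning (success, payload).
ToolFn = Callable[..., Tuple[bool, Any]]


def execute_plan(
    steps: List[Step],
    tools: Dict[str, ToolFn],
    replan: Callable[[Step, Any], List[Step]],
    max_tool_calls: int = 20,
) -> List[Dict[str, Any]]:
    """Run steps in order; after a failure, replace the remaining steps."""
    pending = deque(steps)
    log: List[Dict[str, Any]] = []
    # The budget bounds the loop even if every replanned step fails again.
    while pending and len(log) < max_tool_calls:
        step = pending.popleft()
        success, payload = tools[step["action"]](**step.get("params", {}))
        log.append({"action": step["action"], "success": success, "payload": payload})
        if not success:
            # The replanner sees the failed step and returns the new remaining steps.
            pending = deque(replan(step, payload))
    return log


if __name__ == "__main__":
    demo_tools = {
        "flaky": lambda: (False, "metric not found"),
        "fallback": lambda: (True, 0.84),
    }
    demo_log = execute_plan(
        steps=[{"action": "flaky"}, {"action": "never_reached"}],
        tools=demo_tools,
        replan=lambda failed_step, error: [{"action": "fallback"}],
    )
    print([entry["action"] for entry in demo_log])
```

Consuming steps from a queue matters: reassigning the plan while iterating over the old step list would leave the loop running the stale steps.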

Mode 3: Exploratory Mode

def exploratory_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent actively explores new evaluation methods.
    """
    # 1. Standard evaluation
    base_metrics, correct = adaptive_evaluation(agent, program_path, results_dir, db_path)
    
    # The raw packing is needed below to test a new metric; it is re-run here
    # because adaptive_evaluation does not return centers and radii
    run = agent.tools["run_program"].execute(program_path)
    centers, radii, _ = run.data
    
    # 2. Analyze historical trends
    trend_analysis = agent.tools["statistical_analysis"].execute(
        data=get_historical_scores(agent.db),
        analysis_type="trend"
    )
    
    # 3. If an evaluation blind spot is found, generate a new metric
    if agent.detect_evaluation_gap(trend_analysis):
        # The LLM generates code for the new metric
        new_metric_code = agent.tools["generate_metric_code"].execute(
            metric_purpose="Identify patterns missed by existing metrics"
        )
        
        # Compile and test it
        test_result = agent.tools["compile_and_test_metric"].execute(
            code=new_metric_code.data["code"],
            test_data={"centers": centers, "radii": radii}
        )
        
        if test_result.success:
            # Register the new metric in the global registry
            register_new_metric(new_metric_code.data["code"])
            
            # Re-evaluate with the new metric included
            extended_metrics = compute_with_new_metric(centers, radii)
            base_metrics["public"].update(extended_metrics)
    
    # 4. Save the extended results
    save_metrics(results_dir, base_metrics, correct)
    
    return base_metrics, correct
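The document leaves `detect_evaluation_gap` unspecified. One simple heuristic it could build on is plateau detection over the best-so-far score per generation, sketched below. The function name `is_plateau`, the window of 10 generations, and the `min_gain` threshold are illustrative assumptions, and the sketch assumes that higher scores are better.

```python
from typing import Sequence


def is_plateau(
    best_scores: Sequence[float],
    window: int = 10,
    min_gain: float = 1e-3,
) -> bool:
    """Return True if the best score gained less than min_gain over `window` generations.

    best_scores[i] is assumed to be the best combined_score seen up to generation i.
    """
    if len(best_scores) <= window:
        return False  # not enough history to judge
    recent_gain = best_scores[-1] - best_scores[-1 - window]
    return recent_gain < min_gain


if __name__ == "__main__":
    stalled = [2.60] * 5 + [2.63] * 12
    print(is_plateau(stalled))
```

A plateau alone does not prove that the metrics miss something, so a signal like this is better treated as a trigger for exploration than as evidence of a blind spot.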

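🔒 Security Design

Code Execution Sandbox

from typing import Any, Dict

import numpy

class SecurityError(Exception):
    """Raised when generated code uses a forbidden operation."""

class SafeCodeExecutor:
    """Restricted environment for executing generated code."""
    
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.allowed_imports = {
            'numpy', 'scipy', 'math', 'statistics'
        }
        self.forbidden_operations = {
            '__import__', 'eval', 'exec', 'compile',
            'open', 'file', 'input', 'raw_input'
        }
    
    def execute(self, code: str, inputs: Dict) -> Any:
        """Execute code in a restricted environment."""
        # 1. Static analysis check
        #    (has_forbidden_operations is assumed to be implemented elsewhere)
        if self.has_forbidden_operations(code):
            raise SecurityError("Forbidden operations detected")
        
        # 2. Build a restricted namespace
        #    Note: exec still exposes the default builtins here, so the static
        #    check above is the only guard; stronger isolation needs a separate process
        namespace = {
            'np': numpy,
            'MetricResult': MetricResult,
            # ... only the modules that are needed
        }
        namespace.update(inputs)
        
        # 3. Execute with a timeout
        #    (timeout is assumed to be a context manager defined elsewhere)
        with timeout(self.timeout):
            exec(code, namespace)
        
        return namespace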
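The static check that `SafeCodeExecutor` calls is not defined in the class above. A sketch of one possible implementation with Python's `ast` module follows; it reuses the allowed-import and forbidden-name sets from the constructor, while the function name `find_violations` and the extra rule against dunder attribute access are assumptions. A check like this reduces risk but is not a complete sandbox on its own.

```python
import ast
from typing import Set

ALLOWED_IMPORTS = {"numpy", "scipy", "math", "statistics"}
FORBIDDEN_NAMES = {
    "__import__", "eval", "exec", "compile",
    "open", "file", "input", "raw_input",
}


def find_violations(code: str) -> Set[str]:
    """Return descriptions of forbidden constructs found in the code.

    Raises SyntaxError if the code cannot be parsed at all.
    """
    violations: Set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.add(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            # Relative imports (level > 0) are rejected outright.
            if node.level > 0 or root not in ALLOWED_IMPORTS:
                violations.add(f"from {node.module or '.'} import ...")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            violations.add(f"name {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Assumed extra rule: block dunder access such as obj.__class__.
            violations.add(f"attribute {node.attr}")
    return violations


if __name__ == "__main__":
    print(find_violations("import os\nopen('secrets.txt')"))
    print(find_violations("import numpy as np\nscore = float(np.sqrt(4.0))"))
```

Walking the syntax tree avoids the false positives of substring matching, for example a variable named `reopened` or the word `open` inside a comment.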

Database Access Control

class RestrictedDatabaseAccess:
    """Restricted database access interface."""
    
    def __init__(self, db: ProgramDatabase):
        self.db = db
        self.read_only_methods = [
            'get_all_programs', 'get_programs_by_generation',
            'get_top_programs', 'get_best_program', 'get_program'
        ]
        self.write_allowed_fields = ['metadata']  # Only metadata may be written
    
    def __getattr__(self, name):
        # Called only for attributes not found normally, i.e. forwarded db methods
        if name in self.read_only_methods:
            return getattr(self.db, name)
        else:
            raise PermissionError(f"Method {name} not allowed for agent")
    
    def update_metadata(self, program_id: str, metadata: Dict):
        """The only write operation that is allowed."""
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # This method needs to be added to ProgramDatabase
            self.db.update_program_metadata(program_id, program.metadata)
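The `update_program_metadata` method that still has to be added to `ProgramDatabase` might look like the following at the SQL level. This is a sketch that assumes a `programs` table whose `metadata` column stores a JSON string; the real schema and storage format may differ.

```python
import json
import sqlite3
from typing import Any, Dict


def update_program_metadata(
    conn: sqlite3.Connection,
    program_id: str,
    metadata: Dict[str, Any],
) -> bool:
    """Merge `metadata` into one program's stored metadata.

    Only the metadata column is touched. Returns False for an unknown program.
    """
    row = conn.execute(
        "SELECT metadata FROM programs WHERE id = ?", (program_id,)
    ).fetchone()
    if row is None:
        return False
    merged = json.loads(row[0] or "{}")
    merged.update(metadata)
    conn.execute(
        "UPDATE programs SET metadata = ? WHERE id = ?",
        (json.dumps(merged), program_id),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE programs (id TEXT PRIMARY KEY, combined_score REAL, metadata TEXT)"
    )
    demo.execute("INSERT INTO programs VALUES ('gen_1', 2.5, NULL)")
    print(update_program_metadata(demo, "gen_1", {"agent_analysis": {"note": "ok"}}))
```

Because the UPDATE statement names only the `metadata` column, core fields such as `combined_score` cannot be changed through this path.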

📊 Data Flow Between the Agent and the Outside World

┌─────────────────────────────────────────────────────────────┐
│  EvolutionRunner (main system)                              │
│                                                             │
│  [Each generation]                                          │
│  ├─ Generate new code: gen_N/main.py                        │
│  ├─ Call agent evaluation ───────────────────┐              │
│  │                                           ▼              │
│  │                            ┌──────────────────────────┐  │
│  │                            │  EvaluationAgent         │  │
│  │                            │  (separate process)      │  │
│  │                            │                          │  │
│  │                            │  Input:                  │  │
│  │                            │  • program_path          │  │
│  │                            │  • results_dir           │  │
│  │                            │  • db_path (optional)    │  │
│  │                            │                          │  │
│  │                            │  Agent internal flow:    │  │
│  │                            │  1. Load program         │  │
│  │                            │  2. Run evaluation       │  │
│  │    Read database ◄─────────┼  3. Query DB ────┐       │  │
│  │                            │  4. LLM planning │       │  │
│  │                            │  5. Tool calls   │       │  │
│  │                            │  6. Aggregate    │       │  │
│  │    Write metadata (opt.) ◄─┼  7. Save output  │       │  │
│  │                            │                  │       │  │
│  │                            │  Output files:   │       │  │
│  │                            │  • metrics.json  │       │  │
│  │                            │  • correct.json  │       │  │
│  │                            │  • agent_log.json│       │  │
│  │                            └──────────────────┼───────┘  │
│  │                                               │          │
│  ├─ Read evaluation results ◄────────────────────┘          │
│  │   • combined_score                                       │
│  │   • public_metrics (incl. aux metrics)                   │
│  │   • text_feedback                                        │
│  │                                                          │
│  ├─ Write to database (ProgramDatabase)                     │
│  │   • Create new Program record                            │
│  │   • Save all metrics                                     │
│  │   • Update archive                                       │
│  │                                                          │
│  └─ Select parents → next generation ────────►              │
└─────────────────────────────────────────────────────────────┘

           Database schema (shared state):
           
┌──────────────────────────────────────────────────────────┐
│  SQLite: evolution_db.sqlite                             │
│  ┌────────────────────────────────────────────────────┐  │
│  │  programs table                                    │  │
│  │  ├─ id (gen_N)                                     │  │
│  │  ├─ code                                           │  │
│  │  ├─ generation (N)                                 │  │
│  │  ├─ combined_score  ◄── written by EvolutionRunner │  │
│  │  ├─ public_metrics  ◄── written by EvolutionRunner │  │
│  │  ├─ text_feedback   ◄── written by EvolutionRunner │  │
│  │  ├─ correct         ◄── written by EvolutionRunner │  │
│  │  │                                                 │  │
│  │  └─ metadata        ◄── writable by agent (opt.)   │  │
│  │     {                                              │  │
│  │       "agent_analysis": {...},                     │  │
│  │       "custom_metrics": {...},                     │  │
│  │       "evaluation_reasoning": "..."                │  │
│  │     }                                              │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  Agent can read all history; writes only to metadata     │
└──────────────────────────────────────────────────────────┘

🎯 Summary of the Agent's External Interfaces

Required Interface (backward compatible)

# 1. Command-line arguments
--program_path: str      # Required
--results_dir: str       # Required

# 2. Output files (standard contract)
metrics.json: {
    "combined_score": float,      # Required
    "public": dict,                # Required
    "private": dict,               # Optional
    "text_feedback": str           # Optional (when use_text_feedback=True)
}
correct.json: {
    "correct": bool,               # Required
    "error": str | null            # Required
}

Extended Interface (agent features)

# 1. Database access
--db_path: str           # Optional; when given, the agent can read historical data

# 2. Agent mode configuration
--agent_mode: str        # static | adaptive | exploratory
--enable_dynamic_metrics: bool
--max_tool_calls: int

# 3. Additional output files
agent_reasoning.json: {  # The agent's decision process (for debugging and analysis)
    "plan": [...],
    "execution_log": [...],
    "tool_costs": {...},
    "total_cost": float
}

auxiliary_analysis.json  # Detailed auxiliary analysis (already exists)

visualizations/          # Visualization files (optional)
├─ packing_viz.png
├─ metrics_trend.png
└─ comparison.png
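The `agent_reasoning.json` record could be assembled from the plan and the execution log as sketched below. The function name and the assumption that each log entry is a dict with a `tool` key and an optional `cost` key are illustrative; the document does not fix the entry format.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def build_reasoning_record(
    plan: List[Dict[str, Any]],
    execution_log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Aggregate per-tool costs and assemble the agent_reasoning.json payload."""
    tool_costs: Dict[str, float] = defaultdict(float)
    for entry in execution_log:
        # Entries without a cost (local tools) count as free.
        tool_costs[entry["tool"]] += float(entry.get("cost", 0.0))
    return {
        "plan": plan,
        "execution_log": execution_log,
        "tool_costs": dict(tool_costs),
        "total_cost": sum(tool_costs.values()),
    }


if __name__ == "__main__":
    record = build_reasoning_record(
        plan=[{"action": "run_program"}, {"action": "generate_metric_code"}],
        execution_log=[
            {"tool": "run_program", "success": True},
            {"tool": "generate_metric_code", "success": True, "cost": 0.5},
        ],
    )
    print(json.dumps(record, indent=2))
```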

Python API

from typing import Dict, Optional, Tuple

# 1. Agent class interface
class EvaluationAgent:
    def __init__(
        self, 
        config: AgentConfig, 
        db_path: Optional[str] = None
    ):
        pass
    
    def evaluate(
        self, 
        program_path: str, 
        results_dir: str
    ) -> Tuple[Dict, bool, Optional[str]]:
        """
        Returns: (metrics, correct, error)
        Fully compatible with run_shinka_eval
        """
        pass

# 2. Tool interface (used internally by the agent)
class Tool:
    def execute(self, **kwargs) -> ToolResult:
        pass

# 3. Database interface extension
class ProgramDatabase:
    # New method for use by the agent
    def update_program_metadata(
        self, 
        program_id: str, 
        metadata: Dict
    ) -> bool:
        pass

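🚀 Implementation Roadmap

Phase 1: Basic Agent Framework (2-3 days)

✓ 1. Create the EvaluationAgent class skeleton
✓ 2. Implement the Tool base class and the tool registry
✓ 3. Refactor the existing evaluation code into tools
    - RunProgramTool
    - ValidateResultsTool
    - ComputeMetricTool
✓ 4. Implement static_mode (fully compatible with existing behavior)
✓ 5. Unit tests

Phase 2: Database Integration (1-2 days)

✓ 1. Create the RestrictedDatabaseAccess interface
✓ 2. Implement the database query tools
    - QueryDatabaseTool
    - CompareWithHistoryTool
✓ 3. Add ProgramDatabase.update_program_metadata()
✓ 4. Integration tests

Phase 3: Adaptive Mode (3-4 days)

✓ 1. Implement the LLM planning logic
✓ 2. Context gathering (analysis of historical data)
✓ 3. Dynamic tool calls
✓ 4. Result aggregation and feedback generation
✓ 5. End-to-end tests

Phase 4: Dynamic Metrics (2-3 days)

✓ 1. Implement GenerateMetricCodeTool
✓ 2. SafeCodeExecutor sandbox
✓ 3. Dynamic metric registration and validation
✓ 4. Implement exploratory mode
✓ 5. Security tests

Phase 5: Visualization and Analysis (1-2 days)

✓ 1. VisualizeTool
✓ 2. StatisticalAnalysisTool
✓ 3. Visualization of the agent's reasoning process

Phase 6: Production Readiness (2-3 days)

✓ 1. Performance optimization
✓ 2. Error handling and recovery
✓ 3. Logging and monitoring
✓ 4. Documentation
✓ 5. Integration into EvolutionRunner

Total: 11-17 days of development time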

📝 Usage Examples

Example 1: Static Mode (fully compatible)

from shinka.evaluation import EvaluationAgent, AgentConfig

config = AgentConfig(mode="static")
agent = EvaluationAgent(config)

metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)

# The output is identical to that of the existing evaluate_with_auxiliary.py

Example 2: Adaptive Mode (intelligent evaluation)

config = AgentConfig(
    mode="adaptive",
    enable_database_read=True,
    llm_model="native-gemini-2.5-pro"
)

agent = EvaluationAgent(
    config=config,
    db_path="evolution_db.sqlite"
)

metrics, correct, error = agent.evaluate(
    program_path="gen_100/main.py",
    results_dir="gen_100/results"
)

# The agent will:
# 1. Query the best programs of the previous 99 generations
# 2. Analyze how the current program improves on that history
# 3. Select the most relevant auxiliary metrics
# 4. Generate feedback tailored to this program

Example 3: Exploratory Mode (automatic discovery of new metrics)

config = AgentConfig(
    mode="exploratory",
    enable_dynamic_metrics=True,
    enable_database_read=True
)

agent = EvaluationAgent(config, db_path="evolution_db.sqlite")

metrics, correct, error = agent.evaluate(
    program_path="gen_150/main.py",
    results_dir="gen_150/results"
)

# The agent might:
# 1. Notice that all existing metrics have plateaued
# 2. Generate a new metric to detect a "corner circle size pattern"
# 3. Check how the new metric correlates with the main score
# 4. If it proves useful, register it in the global registry for later use
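Step 3 of the exploratory example, checking whether a new metric correlates with the main score, can be done with a Pearson coefficient over the historical programs. The sketch below is a plain-Python version; treating a constant series as zero correlation is a design assumption, since a constant metric carries no signal.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: a constant series has no correlation
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Illustrative numbers only, not real experiment data.
    new_metric_values = [0.10, 0.25, 0.40, 0.55]
    combined_scores = [2.40, 2.48, 2.55, 2.63]
    print(round(pearson(new_metric_values, combined_scores), 3))
```

A metric that correlates almost perfectly with `combined_score` adds little new information, so a moderate correlation may be the more interesting outcome when deciding whether to register it.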

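💡 Benefits and Impact

Improvements to the Evolution System

Smarter evaluation: the agent can adjust its evaluation strategy to the stage of evolution
Adaptive feedback: targeted suggestions for the specific problems of the current generation
Automatic discovery: explores new evaluation dimensions beyond what was designed by hand
Explainability: the agent's reasoning is traceable, which makes debugging easier

Preserving Compatibility

Interface compatibility: fully honors the existing input/output contract
Incremental adoption: start in static mode and enable advanced features step by step
Controllable cost: the agent's compute budget is configurable
Non-destructive: does not affect the reproducibility of existing experiments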

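🎓 Summary

The Agent's Core External Interfaces

Input interface:
├─ Required: program_path, results_dir
└─ Optional: db_path, agent_config

Output interface:
├─ Required: metrics.json, correct.json
└─ Optional: agent_reasoning.json, visualizations/

Database interface:
├─ READ: can read all historical program data
└─ WRITE: can write only to the program.metadata field

Tool interface:
├─ Ground Truth: run and validate the program
├─ Auxiliary Metrics: predefined analysis metrics
├─ Database: query historical data
├─ Dynamic: generate new metrics
└─ Visualization: analysis and visualization

Key Design Principles

Interface compatibility first: the agent must be a drop-in replacement for the existing evaluation script
Security: code execution sandbox and database permission control
Extensibility: the tool system allows new capabilities to be added over time
Observability: the agent's decision process can be traced and debugged
Controllable cost: configuration balances the level of intelligence against compute cost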

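Implementation Feasibility

✅ Technically feasible: every component has a well-established implementation approach
✅ Architecture-friendly: integrates with the existing system through the existing contract
✅ Incremental: can be implemented and deployed in phases
✅ Backward compatible: does not break existing experiments

The agent turns evaluation from a fixed pipeline into a context-driven decision process while staying fully compatible with the existing system. 🚀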