Evaluation Agent Design Proposal
Feasibility Analysis
Conclusion: fully feasible. Turning the evaluation script into an agent is not only technically possible, it can also noticeably improve the system's adaptability and intelligence.
Current Architecture Analysis
Current evaluation workflow
┌───────────────────────────────────────────────────────
│ EvolutionRunner (controller)
│  ├─ Generates new code (gen_N/main.py)
│  ├─ Submits a job to the JobScheduler
│  └─ Waits for the results
└──────────────────────────┬────────────────────────────
                           │
                           ▼
┌───────────────────────────────────────────────────────
│ JobScheduler executes the command:
│    python evaluate_with_auxiliary.py \
│      --program_path gen_N/main.py \
│      --results_dir gen_N/results
└──────────────────────────┬────────────────────────────
                           │
                           ▼
┌───────────────────────────────────────────────────────
│ Evaluation Script (separate process)
│  ├─ Loads the program
│  ├─ Runs the experiment (run_packing)
│  ├─ Validates the result (validate_packing)
│  ├─ Computes metrics (the fixed set of 7 auxiliary metrics)
│  ├─ Generates text feedback
│  └─ Saves the results to files:
│       • metrics.json
│       • correct.json
│       • extra.npz
│       • auxiliary_analysis.json
└──────────────────────────┬────────────────────────────
                           │
                           ▼
┌───────────────────────────────────────────────────────
│ EvolutionRunner reads the results
│  ├─ Parses metrics.json
│  ├─ Extracts combined_score, public_metrics
│  ├─ Writes them to the database (ProgramDatabase)
│  └─ Uses them to select the next generation's parent
└───────────────────────────────────────────────────────
Key interface contracts
Input interface (command-line arguments):
--program_path: str   # path of the program to evaluate
--results_dir: str    # directory where results are saved
--aux_config: str     # (optional) auxiliary-evaluation configuration
Output interface (file system):
# metrics.json
{
    "combined_score": 2.635,            # main score (required)
    "public": {                         # public metrics (visible to the LLM)
        "centers_str": "...",
        "num_circles": 26,
        "aux_packing_efficiency": 0.842,
        "aux_gap_analysis": 0.756,
        ...
    },
    "private": {                        # private metrics (logged only)
        "reported_sum_of_radii": 2.635
    },
    "text_feedback": "..."              # (optional) text feedback
}
# correct.json
{
    "correct": true,
    "error": null
}
Database schema (Program table):
@dataclass
class Program:
    # Identity
    id: str
    code: str
    generation: int
    parent_id: Optional[str]
    # Evaluation results (written by the evaluation)
    combined_score: float
    public_metrics: Dict[str, Any]
    private_metrics: Dict[str, Any]
    text_feedback: str
    correct: bool
    # Auxiliary data
    embedding: List[float]
    metadata: Dict[str, Any]
    # Evolution relationships
    archive_inspiration_ids: List[str]
    top_k_inspiration_ids: List[str]
    children_count: int
Agent Transformation Plan
Core design philosophy
Agent vs. script, the key differences:
- Autonomous decisions: the agent chooses its analysis strategy based on context
- Dynamic tool use: the agent can call different tools and generate new code
- History awareness: the agent can query the database to understand the evolution history
- Meta-learning: the agent can improve its own evaluation strategy
Agent architecture design
┌────────────────────────────────────────────────────────────────
│ EvaluationAgent (main controller)
│
│   Core Components:
│    • LLM (decision maker)
│    • Tool Registry (the set of callable tools)
│    • Database Access (read historical data)
│    • Code Executor (safely run generated code)
│
│   Workflow:
│    1. Receive the evaluation request (program_path, results_dir)
│    2. Query the database for context
│    3. The LLM plans the evaluation strategy
│    4. Execute the evaluation steps (call tools / generate code)
│    5. Aggregate the results and generate feedback
│    6. (Optional) update database metadata
│    7. Save the standard output files
└────────────────────────────────────────────────────────────────
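The workflow above can be anchored by a thin top-level entry point. A minimal sketch of how EvaluationAgent.evaluate() might dispatch on the configured mode is shown below; the three mode functions are the ones defined under "Agent Decision Flow" later in this document, and the exact wiring is an assumption rather than settled API.
# Sketch only: a possible top-level dispatch inside EvaluationAgent.evaluate().
# The mode functions are defined under "Agent Decision Flow" below; this wiring
# is an assumption, not settled API.
class EvaluationAgent:  # only the dispatch logic is sketched here
    def evaluate(self, program_path: str, results_dir: str):
        if self.config.mode == "static":
            metrics, correct = static_evaluation(self, program_path, results_dir)
        elif self.config.mode == "adaptive":
            metrics, correct = adaptive_evaluation(
                self, program_path, results_dir, self.db_path
            )
        else:  # "exploratory"
            metrics, correct = exploratory_evaluation(
                self, program_path, results_dir, self.db_path
            )
        # Every mode ends by writing metrics.json / correct.json itself, so the
        # caller only needs the in-memory triple that run_shinka_eval also returns.
        return metrics, correct, None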
Tools available to the agent:
┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│ Ground Truth       │ │ Auxiliary          │ │ Dynamic Metric     │
│ Evaluation         │ │ Metrics            │ │ Generator          │
│ • run the program  │ │ • predefined       │ │ • LLM-generated    │
│ • validate the     │ │   metrics          │ │   code             │
│   constraints      │ │ • metric registry  │ │ • compile and run  │
│ • compute the      │ │                    │ │ • safe sandbox     │
│   main score       │ │                    │ │                    │
└────────────────────┘ └────────────────────┘ └────────────────────┘
┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│ Database Query     │ │ Visualization      │ │ Meta Analysis      │
│ • query history    │ │ • generate charts  │ │ • trend analysis   │
│ • statistics       │ │ • save the figures │ │ • strategy         │
│ • compare programs │ │                    │ │   recommendations  │
└────────────────────┘ └────────────────────┘ └────────────────────┘
Database access permissions:
┌────────────────────────────────────────────────────────
│ Database Interface (ProgramDatabase)
│
│   READ operations (available to the agent):
│    • get_all_programs()
│    • get_programs_by_generation(gen)
│    • get_top_programs(n, metric)
│    • get_best_program(metric)
│    • get_program(id)
│    • custom SQL queries (restricted)
│
│   WRITE operations (use with caution):
│    • only the metadata field may be written
│    • core fields such as combined_score and correct must not be modified
│    • extra analysis results may be added to metadata
└────────────────────────────────────────────────────────
Agent External Interface Design
1. Command-line interface (backward compatible)
# Basic interface (identical to the current one)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results
# Extended interface (new features)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results \
  --db_path path/to/evolution.sqlite \   # the agent may access the database
  --agent_mode adaptive \                # evaluation mode: static|adaptive|exploratory
  --enable_dynamic_metrics \             # allow generating new metrics
  --feedback_style detailed              # feedback style: minimal|normal|detailed
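A minimal sketch of how the agent_main.py entry point could map these flags onto AgentConfig (the config field names are taken from the Python API below; the argparse wiring itself is an illustrative assumption, not an existing implementation):
# agent_main.py -- sketch of the CLI entry point, assuming the AgentConfig
# fields shown in the Python API section; argument handling is illustrative only.
import argparse
from shinka.evaluation import EvaluationAgent, AgentConfig

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="static",
                        choices=["static", "adaptive", "exploratory"])
    parser.add_argument("--enable_dynamic_metrics", action="store_true")
    args = parser.parse_args()

    config = AgentConfig(
        mode=args.agent_mode,
        enable_dynamic_metrics=args.enable_dynamic_metrics,
        enable_database_read=args.db_path is not None,
    )
    agent = EvaluationAgent(config=config, db_path=args.db_path)
    # evaluate() writes metrics.json / correct.json into results_dir itself.
    agent.evaluate(program_path=args.program_path, results_dir=args.results_dir)

if __name__ == "__main__":
    main()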
2. Python API
from shinka.evaluation import EvaluationAgent, AgentConfig
# Configure the agent
agent_config = AgentConfig(
    # LLM configuration
    llm_model="native-gemini-2.5-pro",
    llm_temperature=0.7,
    # Evaluation mode
    mode="adaptive",                        # static | adaptive | exploratory
    # Tool access permissions
    enable_ground_truth=True,               # required
    enable_auxiliary_metrics=True,          # predefined auxiliary metrics
    enable_dynamic_metrics=True,            # the LLM may generate new metrics
    enable_database_read=True,              # read historical data
    enable_database_write_metadata=False,   # write metadata back to the DB
    # Safety configuration
    code_execution_timeout=30,              # timeout for executing generated code
    max_tool_calls=20,                      # maximum number of tool calls
    sandboxed_execution=True,               # sandboxed execution
    # Output configuration
    generate_text_feedback=True,
    save_detailed_analysis=True,
    visualization=True,
)
# Create the agent
agent = EvaluationAgent(
    config=agent_config,
    db_path="path/to/evolution.sqlite"      # optional
)
# Run the evaluation
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)
# The agent automatically saves the standard output files:
# - metrics.json
# - correct.json
# - auxiliary_analysis.json
# - (optional) agent_reasoning.json   # the agent's decision process
3. EvolutionRunner integration
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
from shinka.evaluation import EvaluationAgentConfig  # new
# Configure the job to use the agent evaluator
job_config = LocalJobConfig(
    eval_program_path="shinka/evaluation/agent_main.py",  # agent entry point
    extra_cmd_args={
        "agent_mode": "adaptive",
        "enable_dynamic_metrics": True,
        "db_path": "auto",  # pass the database path automatically
    }
)
# Database configuration
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)
# Evolution configuration
evo_config = EvolutionConfig(
    # ... other settings
    use_text_feedback=True,  # consume the feedback generated by the agent
)
# At run time, the agent automatically receives:
# 1. the current generation's program path
# 2. database access (via the --db_path argument)
# 3. information about historical programs
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
Agent Tool System Design
Tool interface specification
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class ToolResult:
    """Result of executing a tool."""
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0  # API cost

class Tool:
    """Base class for tools."""
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema

    def execute(self, **kwargs) -> ToolResult:
        """Execute the tool's logic."""
        raise NotImplementedError
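The architecture diagram refers to a Tool Registry that the spec above does not pin down. The following is a minimal sketch of one possible shape; the ToolRegistry name and its methods are assumptions for illustration, not an existing shinka API.
# Sketch of a possible tool registry; names and methods are illustrative only.
from typing import Any, Dict, List

class ToolRegistry:
    def __init__(self):
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]

    def describe_all(self) -> List[Dict[str, Any]]:
        # Serializes name/description/parameters so the schemas can be handed
        # to the LLM as the set of callable tools.
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
        ]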
Core tool list
# ============================================================================
# 1. GROUND TRUTH EVALUATION (required tools)
# ============================================================================
class RunProgramTool(Tool):
    """Run the program under evaluation and collect its raw results."""
    name = "run_program"
    description = "Execute the program and get raw results (centers, radii, score)"

    def execute(self, program_path: str, num_runs: int = 1) -> ToolResult:
        # Reuses the underlying run_shinka_eval logic
        # Returns: centers, radii, reported_score
        pass

class ValidateResultsTool(Tool):
    """Check whether the program output satisfies all constraints."""
    name = "validate_results"
    description = "Validate if results satisfy all constraints"

    def execute(self, centers, radii) -> ToolResult:
        # Calls adapted_validate_packing
        # Returns: is_valid, error_message
        pass
# ============================================================================
# 2. AUXILIARY METRICS (predefined analysis tools)
# ============================================================================
class ComputeMetricTool(Tool):
    """Compute a predefined auxiliary metric."""
    name = "compute_metric"
    description = "Compute a predefined auxiliary metric"
    parameters = {
        "metric_name": {
            "type": "string",
            "enum": ["packing_efficiency", "gap_analysis", "edge_utilization", ...]
        }
    }

    def execute(self, metric_name: str, centers, radii) -> ToolResult:
        # Looks up METRIC_REGISTRY.get(metric_name)
        pass

class ListMetricsTool(Tool):
    """List all available predefined metrics."""
    name = "list_metrics"

    def execute(self) -> ToolResult:
        return ToolResult(
            success=True,
            data=METRIC_REGISTRY.list_metrics()
        )
# ============================================================================
# 3. DATABASE ACCESS (history tools)
# ============================================================================
class QueryDatabaseTool(Tool):
    """Query the database for information about historical programs."""
    name = "query_database"
    description = "Query historical programs from database"
    parameters = {
        "query_type": {
            "type": "string",
            "enum": ["top_programs", "by_generation", "best_program", "all"]
        },
        "filters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "n": {"type": "integer"},
                "generation": {"type": "integer"}
            }
        }
    }

    def execute(self, query_type: str, filters: Dict) -> ToolResult:
        if query_type == "top_programs":
            programs = self.db.get_top_programs(
                n=filters.get("n", 10),
                metric=filters.get("metric", "combined_score")
            )
        elif query_type == "by_generation":
            programs = self.db.get_programs_by_generation(filters["generation"])
        # ...
        return ToolResult(
            success=True,
            data=[p.to_dict() for p in programs]
        )

class CompareWithHistoryTool(Tool):
    """Compare the current program against historical programs."""
    name = "compare_with_history"

    def execute(self, current_metrics: Dict, comparison_type: str) -> ToolResult:
        # comparison_type: "best" | "parent" | "generation_average"
        # Returns the comparison analysis
        pass
# ============================================================================
# 4. DYNAMIC METRIC GENERATION (LLM-generated metrics)
# ============================================================================
class GenerateMetricCodeTool(Tool):
    """Ask the LLM to generate code for a new evaluation metric."""
    name = "generate_metric_code"
    description = "Generate Python code for a new evaluation metric"
    parameters = {
        "metric_purpose": {"type": "string"},
        "inspiration_from": {"type": "string"}  # reference to existing metrics
    }

    def execute(self, metric_purpose: str, inspiration_from: str = None) -> ToolResult:
        # Ask the LLM to generate new metric code,
        # using the LLMGeneratedMetric framework
        prompt = f"""
Generate a Python function to compute a new auxiliary metric for circle packing.
Purpose: {metric_purpose}
Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Return MetricResult with name, value, interpretation, description, details
3. Use numpy for computations
4. Handle edge cases gracefully
Example structure:
```python
def my_metric(centers, radii):
    # Your analysis logic here
    score = ...
    return MetricResult(
        name="my_metric",
        value=float(score),
        interpretation="higher_better",
        description="What this metric measures",
        details={{"key": "value"}}
    )
```
"""
        llm_response = self.llm.query(prompt)
        code = extract_code_from_response(llm_response)
        return ToolResult(
            success=True,
            data={"code": code, "cost": llm_response.cost}
        )
class CompileAndTestMetricTool(Tool):
"""çŒè¯å¹¶æµè¯LLMçæçææ ä»£ç """
name = "compile_and_test_metric"
def execute(self, code: str, test_data: Dict) -> ToolResult:
metric = LLMGeneratedMetric(
name="llm_metric",
code=code,
description="LLM generated metric",
interpretation="higher_better"
)
if not metric.compile():
return ToolResult(success=False, error="Compilation failed")
# æµè¯æ§è¡
try:
result = metric.evaluate(
centers=test_data["centers"],
radii=test_data["radii"]
)
return ToolResult(success=True, data=result)
except Exception as e:
return ToolResult(success=False, error=str(e))
# ============================================================================
# 5. VISUALIZATION & ANALYSIS (analysis tools)
# ============================================================================
class VisualizeTool(Tool):
    """Generate visualizations."""
    name = "visualize"

    def execute(self, vis_type: str, data: Dict, output_path: str) -> ToolResult:
        # vis_type: "packing" | "metrics_trend" | "comparison"
        pass

class StatisticalAnalysisTool(Tool):
    """Statistical analysis tool."""
    name = "statistical_analysis"

    def execute(self, data: List[float], analysis_type: str) -> ToolResult:
        # analysis_type: "trend" | "distribution" | "correlation"
        pass
# ============================================================================
# 6. META OPERATIONS
# ============================================================================
class UpdateMetadataTool(Tool):
    """Update a program's metadata field."""
    name = "update_metadata"
    description = "Add analysis results to program metadata (write to DB)"

    def execute(self, program_id: str, metadata: Dict) -> ToolResult:
        # Only the metadata field may be written; the core evaluation
        # fields must never be modified.
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Write back to the database.
            # Note: ProgramDatabase needs a new update_program_metadata method for this.
        pass
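The tools above lean on METRIC_REGISTRY, MetricResult, and LLMGeneratedMetric from the existing auxiliary-metrics code. For readers without that code at hand, here is a rough sketch of the shape these tools assume; the names and signatures are approximations for illustration, not the actual implementation.
# Rough sketch of the auxiliary-metrics plumbing assumed by the tools above;
# field names and signatures are approximations, not the real implementation.
import types
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

@dataclass
class MetricResult:
    name: str
    value: float
    interpretation: str                      # e.g. "higher_better"
    description: str
    details: Dict[str, Any] = field(default_factory=dict)

class MetricRegistry:
    def __init__(self):
        self._metrics: Dict[str, Callable[..., MetricResult]] = {}

    def register(self, name: str, fn: Callable[..., MetricResult]) -> None:
        self._metrics[name] = fn

    def get(self, name: str) -> Callable[..., MetricResult]:
        return self._metrics[name]

    def list_metrics(self) -> List[str]:
        return sorted(self._metrics)

METRIC_REGISTRY = MetricRegistry()

class LLMGeneratedMetric:
    def __init__(self, name: str, code: str, description: str, interpretation: str):
        self.name, self.code = name, code
        self.description, self.interpretation = description, interpretation
        self._fn: Optional[Callable[..., MetricResult]] = None

    def compile(self) -> bool:
        # The real implementation would route this through SafeCodeExecutor.
        namespace: Dict[str, Any] = {}
        try:
            exec(self.code, namespace)
            fns = [v for v in namespace.values() if isinstance(v, types.FunctionType)]
            self._fn = fns[-1] if fns else None
            return self._fn is not None
        except Exception:
            return False

    def evaluate(self, centers, radii) -> MetricResult:
        assert self._fn is not None, "compile() must succeed before evaluate()"
        return self._fn(centers, radii)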
Agent Decision Flow
Mode 1: Static Mode (compatibility mode)
def static_evaluation(agent, program_path, results_dir):
    """
    Reproduces the behavior of the existing evaluation script exactly.
    """
    # 1. Run the program
    result = agent.tools["run_program"].execute(program_path)
    centers, radii, score = result.data
    # 2. Validate the results
    validation = agent.tools["validate_results"].execute(centers, radii)
    correct = validation.data["is_valid"]
    # 3. Compute the predefined auxiliary metrics
    auxiliary_results = {}
    for metric_name in agent.config.enabled_metrics:
        metric_result = agent.tools["compute_metric"].execute(
            metric_name, centers, radii
        )
        auxiliary_results[metric_name] = metric_result.data.value
    # 4. Generate the standard feedback
    feedback = generate_standard_feedback(auxiliary_results, score)
    # 5. Save the results
    metrics = {
        "combined_score": score,
        "public": {
            "centers_str": format_centers_string(centers),
            "num_circles": len(centers),
            **{f"aux_{k}": v for k, v in auxiliary_results.items()}
        },
        "private": {"reported_sum_of_radii": score},
        "text_feedback": feedback
    }
    save_metrics(results_dir, metrics, correct)
    return metrics, correct
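The mode functions rely on a save_metrics helper that is not spelled out above. A minimal sketch that honors the metrics.json / correct.json contract might look like this; the helper itself and its exact signature are assumptions, but the file contract matches the "Output interface" section.
# Sketch of the save_metrics helper assumed by the mode functions; the signature
# is illustrative, the file contract matches the interface section above.
import json
import os
from typing import Any, Dict, Optional

def save_metrics(results_dir: str, metrics: Dict[str, Any], correct: bool,
                 error: Optional[str] = None) -> None:
    os.makedirs(results_dir, exist_ok=True)
    with open(os.path.join(results_dir, "metrics.json"), "w") as f:
        json.dump(metrics, f, indent=2)
    with open(os.path.join(results_dir, "correct.json"), "w") as f:
        json.dump({"correct": bool(correct), "error": error}, f, indent=2)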
Mode 2: Adaptive Mode (intelligent mode)
def adaptive_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent decides its evaluation strategy based on context.
    """
    # 1. Gather context
    context = agent.gather_context(program_path, db_path)
    # 2. The LLM plans the evaluation strategy
    plan = agent.llm.plan_evaluation(context)
    # Example plan:
    # {
    #   "steps": [
    #     {"action": "run_program", "params": {...}},
    #     {"action": "query_database", "params": {"query_type": "best_program"}},
    #     {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
    #     {"action": "compare_with_history", "params": {"comparison_type": "best"}},
    #     {"action": "generate_feedback", "params": {...}}
    #   ]
    # }
    # 3. Execute the plan
    execution_log = []
    for step in plan["steps"]:
        tool = agent.tools[step["action"]]
        result = tool.execute(**step["params"])
        execution_log.append(result)
        # If a step fails, the LLM can adjust the strategy
        if not result.success:
            plan = agent.llm.replan(plan, execution_log, result.error)
    # 4. The LLM aggregates the results and generates the feedback
    final_metrics, correct, feedback = agent.llm.aggregate_results(execution_log, context)
    final_metrics["text_feedback"] = feedback
    # 5. Save the results (the file contract stays compatible)
    save_metrics(results_dir, final_metrics, correct)
    # 6. (Optional) save the agent's reasoning trace
    save_agent_reasoning(results_dir, plan, execution_log)
    return final_metrics, correct
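Adaptive mode presumes a gather_context step. One way it could assemble the planning context from the restricted database interface is sketched below; the structure of the returned dict and the field names are assumptions made for illustration.
# Sketch of EvaluationAgent.gather_context(); the returned dict's structure is
# an illustrative assumption.
def gather_context(self, program_path: str, db_path: Optional[str]) -> Dict[str, Any]:
    context: Dict[str, Any] = {"program_path": program_path}
    with open(program_path) as f:
        context["program_code"] = f.read()
    if db_path and self.db is not None:
        best = self.db.get_best_program(metric="combined_score")
        top = self.db.get_top_programs(n=5, metric="combined_score")
        context["best_score"] = best.combined_score if best else None
        context["recent_top"] = [
            {"id": p.id, "generation": p.generation, "score": p.combined_score}
            for p in top
        ]
    return context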
Mode 3: Exploratory Mode (exploration mode)
def exploratory_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent actively explores new evaluation methods.
    """
    # 1. Standard evaluation
    base_metrics, correct = adaptive_evaluation(agent, program_path, results_dir, db_path)
    # 2. Analyze the historical trend
    trend_analysis = agent.tools["statistical_analysis"].execute(
        data=get_historical_scores(agent.db),
        analysis_type="trend"
    )
    # 3. If an evaluation blind spot is detected, generate a new metric
    if agent.detect_evaluation_gap(trend_analysis):
        # Re-run (or reuse the cached run from step 1) to get the raw packing
        centers, radii, _ = agent.tools["run_program"].execute(program_path).data
        # The LLM generates the new metric code
        new_metric_code = agent.tools["generate_metric_code"].execute(
            metric_purpose="Identify patterns missed by existing metrics"
        )
        # Compile and test it
        test_result = agent.tools["compile_and_test_metric"].execute(
            code=new_metric_code.data["code"],
            test_data={"centers": centers, "radii": radii}
        )
        if test_result.success:
            # Register the new metric in the global registry
            register_new_metric(new_metric_code.data["code"])
            # Re-evaluate, now including the new metric
            extended_metrics = compute_with_new_metric(centers, radii)
            base_metrics["public"].update(extended_metrics)
    # 4. Save the extended results
    save_metrics(results_dir, base_metrics, correct)
    return base_metrics, correct
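detect_evaluation_gap is left abstract above. A simple plateau heuristic is one possible implementation; the data layout of the trend analysis, the window size, and the threshold are arbitrary illustrative choices.
# Sketch of a plateau-based detect_evaluation_gap(); window and threshold are
# arbitrary illustrative values, and the trend_analysis layout is assumed.
def detect_evaluation_gap(self, trend_analysis: ToolResult,
                          window: int = 10, min_rel_gain: float = 0.005) -> bool:
    scores = trend_analysis.data.get("scores", [])
    if len(scores) < window + 1:
        return False
    recent_best = max(scores[-window:])
    previous_best = max(scores[:-window])
    if previous_best <= 0:
        return False
    # Flag a gap when the recent window improves the best score by < 0.5%.
    return (recent_best - previous_best) / previous_best < min_rel_gain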
Security Design
Code execution sandbox
import numpy

class SafeCodeExecutor:
    """Restricted environment for executing generated code."""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.allowed_imports = {
            'numpy', 'scipy', 'math', 'statistics'
        }
        self.forbidden_operations = {
            '__import__', 'eval', 'exec', 'compile',
            'open', 'file', 'input', 'raw_input'
        }

    def execute(self, code: str, inputs: Dict) -> Any:
        """Execute code in a restricted namespace."""
        # 1. Static analysis check
        if self.has_forbidden_operations(code):
            raise SecurityError("Forbidden operations detected")
        # 2. Build a restricted namespace
        namespace = {
            'np': numpy,
            'MetricResult': MetricResult,
            # ... only the modules that are strictly needed
        }
        namespace.update(inputs)
        # 3. Run with a timeout
        with timeout(self.timeout):
            exec(code, namespace)
        return namespace
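SafeCodeExecutor uses a SecurityError and a timeout() context manager that are not defined above. A minimal Unix-only sketch based on signal.alarm could fill that gap; it only works in the main thread and with second-level granularity, so it is an illustrative assumption rather than the final sandbox design.
# Minimal helpers assumed by SafeCodeExecutor; signal.alarm is Unix-only and
# main-thread-only, so this is a sketch rather than a hardened sandbox.
import signal
from contextlib import contextmanager

class SecurityError(Exception):
    """Raised when generated code contains forbidden operations."""

@contextmanager
def timeout(seconds: int):
    def _raise(signum, frame):
        raise TimeoutError(f"code execution exceeded {seconds}s")
    old = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)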
Database access control
class RestrictedDatabaseAccess:
    """Restricted database access interface."""

    def __init__(self, db: ProgramDatabase):
        self.db = db
        self.read_only_methods = [
            'get_all_programs', 'get_programs_by_generation',
            'get_top_programs', 'get_best_program', 'get_program'
        ]
        self.write_allowed_fields = ['metadata']  # only metadata may be written

    def __getattr__(self, name):
        if name in self.read_only_methods:
            return getattr(self.db, name)
        else:
            raise PermissionError(f"Method {name} not allowed for agent")

    def update_metadata(self, program_id: str, metadata: Dict):
        """The only write operation that is allowed."""
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Requires the new ProgramDatabase.update_program_metadata method
            self.db.update_program_metadata(program_id, program.metadata)
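A short usage sketch of the wrapper (assuming a ProgramDatabase instance named db, opened elsewhere): reads pass through, metadata writes go through update_metadata, and every other method name is rejected.
# Usage sketch for RestrictedDatabaseAccess; `db` is an already-open
# ProgramDatabase instance.
restricted = RestrictedDatabaseAccess(db)
best = restricted.get_best_program(metric="combined_score")   # allowed (read)
restricted.update_metadata(best.id, {"agent_analysis": {"note": "reviewed"}})  # allowed
try:
    restricted.add_program(best)      # any other method name...
except PermissionError as exc:
    print(exc)                        # ...is rejected by __getattr__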
Data Flow Between the Agent and the Rest of the System
┌────────────────────────────────────────────────────────────
│ EvolutionRunner (main system)
│
│ [each generation]
│  ├─ Generates new code: gen_N/main.py
│  ├─ Invokes the agent evaluation:
│  │
│  │     EvaluationAgent (separate process)
│  │     Inputs:
│  │      • program_path
│  │      • results_dir
│  │      • db_path (optional)
│  │     Agent-internal flow:
│  │      1. Load the program
│  │      2. Run the evaluation
│  │      3. Query DB history       ◄── reads the database
│  │      4. LLM planning
│  │      5. Tool calls
│  │      6. Aggregate the results
│  │      7. Save the outputs       ──► (optionally) writes metadata
│  │     Output files:
│  │      • metrics.json
│  │      • correct.json
│  │      • agent_log.json
│  │
│  ├─ Reads the evaluation results:
│  │    • combined_score
│  │    • public_metrics (including the aux metrics)
│  │    • text_feedback
│  │
│  ├─ Writes to the database (ProgramDatabase):
│  │    • creates a new Program record
│  │    • stores all metrics
│  │    • updates the archive
│  │
│  └─ Selects the parent → next generation
└────────────────────────────────────────────────────────────
Database schema (shared state):
┌──────────────────────────────────────────────────────────
│ SQLite: evolution_db.sqlite
│
│   programs table
│    ├─ id (gen_N)
│    ├─ code
│    ├─ generation (N)
│    ├─ combined_score  ◄── written by EvolutionRunner
│    ├─ public_metrics  ◄── written by EvolutionRunner
│    ├─ text_feedback   ◄── written by EvolutionRunner
│    ├─ correct         ◄── written by EvolutionRunner
│    │
│    └─ metadata        ◄── the agent may write (optional)
│         {
│           "agent_analysis": {...},
│           "custom_metrics": {...},
│           "evaluation_reasoning": "..."
│         }
│
│   The agent can read all historical data, but may only
│   write the metadata field.
└──────────────────────────────────────────────────────────
Agent External Interface Summary
Required interfaces (backward compatible)
# 1. Command-line argument interface
--program_path: str   # required
--results_dir: str    # required
# 2. Output file interface (standard contract)
metrics.json: {
    "combined_score": float,   # required
    "public": dict,            # required
    "private": dict,           # optional
    "text_feedback": str       # optional (when use_text_feedback=True)
}
correct.json: {
    "correct": bool,           # required
    "error": str | null        # required
}
Extended interfaces (agent-specific)
# 1. Database access interface
--db_path: str        # optional; when provided, the agent can access historical data
# 2. Agent mode configuration
--agent_mode: str     # static | adaptive | exploratory
--enable_dynamic_metrics: bool
--max_tool_calls: int
# 3. Additional output files
agent_reasoning.json: {   # the agent's decision process (for debugging and analysis)
    "plan": [...],
    "execution_log": [...],
    "tool_costs": {...},
    "total_cost": float
}
auxiliary_analysis.json   # detailed auxiliary analysis (already exists)
visualizations/           # visualization files (optional)
├─ packing_viz.png
├─ metrics_trend.png
└─ comparison.png
Python API
# 1. Agent class interface
class EvaluationAgent:
    def __init__(
        self,
        config: AgentConfig,
        db_path: Optional[str] = None
    ):
        pass

    def evaluate(
        self,
        program_path: str,
        results_dir: str
    ) -> Tuple[Dict, bool, Optional[str]]:
        """
        Returns: (metrics, correct, error)
        Fully compatible with run_shinka_eval.
        """
        pass

# 2. Tool interface (used internally by the agent)
class Tool:
    def execute(self, **kwargs) -> ToolResult:
        pass

# 3. ProgramDatabase extension
class ProgramDatabase:
    # New method added for the agent
    def update_program_metadata(
        self,
        program_id: str,
        metadata: Dict
    ) -> bool:
        pass
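The ProgramDatabase extension above is only a stub. One possible SQLite-backed implementation is sketched below, assuming the programs table stores metadata as a JSON text column and the database object holds a sqlite3 connection in self.conn; both of those schema details are assumptions, not confirmed facts about the existing code.
# Sketch of ProgramDatabase.update_program_metadata(); assumes a `self.conn`
# sqlite3 connection and a JSON-encoded `metadata` column in the programs table.
import json
from typing import Any, Dict

def update_program_metadata(self, program_id: str, metadata: Dict[str, Any]) -> bool:
    cur = self.conn.execute(
        "SELECT metadata FROM programs WHERE id = ?", (program_id,)
    )
    row = cur.fetchone()
    if row is None:
        return False
    merged = json.loads(row[0]) if row[0] else {}
    merged.update(metadata)   # merge instead of overwrite; never touch other columns
    self.conn.execute(
        "UPDATE programs SET metadata = ? WHERE id = ?",
        (json.dumps(merged), program_id),
    )
    self.conn.commit()
    return True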
Implementation Roadmap
Phase 1: Basic agent framework (2-3 days)
1. Create the EvaluationAgent class skeleton
2. Implement the Tool base class and the tool registry
3. Wrap the existing evaluation code as tools
   - RunProgramTool
   - ValidateResultsTool
   - ComputeMetricTool
4. Implement static mode (fully reproduces the current behavior)
5. Unit tests
Phase 2: Database integration (1-2 days)
1. Create the RestrictedDatabaseAccess interface
2. Implement the database query tools
   - QueryDatabaseTool
   - CompareWithHistoryTool
3. Extend ProgramDatabase with update_program_metadata()
4. Integration tests
Phase 3: Adaptive Mode (3-4 days)
1. Implement the LLM planning logic
2. Context gathering (analysis of historical data)
3. Dynamic tool invocation
4. Result aggregation and feedback generation
5. End-to-end tests
Phase 4: Dynamic Metrics (2-3 days)
1. Implement GenerateMetricCodeTool
2. SafeCodeExecutor sandbox
3. Dynamic metric registration and validation
4. Exploratory mode implementation
5. Security tests
Phase 5: Visualization and analysis (1-2 days)
1. VisualizeTool
2. StatisticalAnalysisTool
3. Visualization of the agent's reasoning trace
Phase 6: Production readiness (2-3 days)
1. Performance optimization
2. Error handling and recovery
3. Logging and monitoring
4. Documentation
5. Integration into EvolutionRunner
Total: 11-17 days of development time
Usage Examples
Example 1: Static mode (fully compatible)
from shinka.evaluation import EvaluationAgent, AgentConfig

config = AgentConfig(mode="static")
agent = EvaluationAgent(config)
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)
# The output is identical to the existing evaluate_with_auxiliary.py
Example 2: Adaptive mode (intelligent evaluation)
config = AgentConfig(
    mode="adaptive",
    enable_database_read=True,
    llm_model="native-gemini-2.5-pro"
)
agent = EvaluationAgent(
    config=config,
    db_path="evolution_db.sqlite"
)
metrics, correct, error = agent.evaluate(
    program_path="gen_100/main.py",
    results_dir="gen_100/results"
)
# The agent will:
# 1. query the best programs of the previous 99 generations
# 2. analyze how the current program improves on that history
# 3. pick the most relevant auxiliary metrics
# 4. generate feedback tailored to this program
Example 3: Exploratory mode (automatic discovery of new metrics)
config = AgentConfig(
    mode="exploratory",
    enable_dynamic_metrics=True,
    enable_database_read=True
)
agent = EvaluationAgent(config, db_path="evolution_db.sqlite")
metrics, correct, error = agent.evaluate(
    program_path="gen_150/main.py",
    results_dir="gen_150/results"
)
# The agent might:
# 1. notice that the existing metrics have all plateaued
# 2. generate a new metric to detect a "corner circle size pattern"
# 3. validate the new metric's correlation with the main score
# 4. if it is useful, register it in the global registry for later use
Advantages and Impact
Improvements to the evolution system
- Smarter evaluation: the agent can adjust its evaluation strategy to the current stage of the evolution run
- Adaptive feedback: targeted suggestions for the specific problems of the current code
- Automatic discovery: new evaluation dimensions can be explored beyond what was designed by hand
- Interpretability: the agent's reasoning trace can be inspected, which simplifies debugging
Maintaining compatibility
- Interface compatibility: the existing input/output contract is honored exactly
- Incremental adoption: start in static mode and enable the advanced features step by step
- Controllable cost: the agent's compute budget is configurable
- Non-disruptive: the reproducibility of existing experiments is not affected
Summary
The agent's core external interfaces
Input interface:
├─ required: program_path, results_dir
└─ optional: db_path, agent_config
Output interface:
├─ required: metrics.json, correct.json
└─ optional: agent_reasoning.json, visualizations/
Database interface:
├─ READ: all historical program data
└─ WRITE: only the program.metadata field
Tool interface:
├─ Ground Truth: run and validate programs
├─ Auxiliary Metrics: predefined analysis metrics
├─ Database: query historical data
├─ Dynamic: generate new metrics
└─ Visualization: analysis and visualization
Key design principles
- Interface compatibility first: the agent must be a drop-in replacement for the existing evaluation script
- Safety: sandboxed code execution and restricted database permissions
- Extensibility: the tool system makes it easy to keep adding new capabilities
- Observability: the agent's decision process can be traced and debugged
- Controllable cost: configuration trades off intelligence against compute cost
Implementation feasibility
- Technically feasible: every component has a mature implementation approach
- Architecture-friendly: integrates cleanly with the existing system
- Incremental: can be implemented and deployed in phases
- Backward compatible: existing experiments are not broken
This agent turns evaluation from a fixed pipeline into an intelligent decision-making process, while staying fully compatible with the existing system.