"""
models.py
====================================

This file defines the core data structures ("contracts") used in the
PyDebug-Optimizer environment.

We use Pydantic (v2) for:
✅ Data validation (agent outputs are checked against the expected schema)
✅ Type safety (malformed fields are rejected when a model is constructed)
✅ Serialization (easy JSON conversion for OpenEnv)

🧠 MDP CONNECTION:
------------------
In Reinforcement Learning (RL), environments are modeled as a Markov Decision Process (MDP):

    (S, A, R, T)

Where:
- S = State (Observation)
- A = Action (Agent decision)
- R = Reward (Feedback signal)
- T = Transition (handled in env.py)

This file defines:
- Observation → State (S)
- Action → Action (A)
- Reward → Reward (R)

These models enforce STRUCTURE on how the agent interacts with the environment.
"""
from typing import Dict, Literal
from pydantic import BaseModel, Field
# ============================================================
# 🧩 OBSERVATION MODEL (STATE)
# ============================================================
class Observation(BaseModel):
    """
    Observation = STATE (S) in the Markov Decision Process.

    This represents what the agent "sees" at each step.

    Why Pydantic?
    -------------
    - Ensures every observation always has the required fields
    - Rejects missing or malformed data at construction time
    - Automatically validates types (all three fields must be strings)

    Components:
    -----------
    code_snippet:
        The buggy Python code the agent must analyze and fix.

    error_feedback:
        Runtime errors, stack traces, or hints from previous execution.
        Helps the agent reason about what went wrong.

    task_description:
        Natural language explanation of the task.
        Example:
            "Fix the off-by-one error in this loop"
    """

    code_snippet: str = Field(..., description="Buggy Python code")
    error_feedback: str = Field(..., description="Execution error or logs")
    task_description: str = Field(..., description="Description of the task")
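
"""
Usage sketch: one way an Observation could be built, serialized to JSON, and
parsed back, assuming Pydantic v2. The class is re-declared here only as a
stand-in so the snippet runs on its own, and the sample values are illustrative.

```python
from pydantic import BaseModel, Field, ValidationError


class Observation(BaseModel):
    # Stand-in copy of the model above so the snippet is self-contained.
    code_snippet: str = Field(..., description="Buggy Python code")
    error_feedback: str = Field(..., description="Execution error or logs")
    task_description: str = Field(..., description="Description of the task")


obs = Observation(
    code_snippet="for i in range(len(xs) + 1): print(xs[i])",
    error_feedback="IndexError: list index out of range",
    task_description="Fix the off-by-one error in this loop",
)

# Serialize to JSON and parse it back; the round trip preserves every field.
restored = Observation.model_validate_json(obs.model_dump_json())
round_trip_ok = restored == obs

# A payload that omits a required field is rejected with a ValidationError.
try:
    Observation(code_snippet="x = 1", error_feedback="")
    missing_fields = []
except ValidationError as exc:
    missing_fields = [err["loc"][0] for err in exc.errors()]
```

Here `missing_fields` names the field that was left out, which is the kind of
message an environment could log when it receives an incomplete state.
"""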
# ============================================================
# ⚙️ ACTION MODEL (AGENT DECISION)
# ============================================================
class Action(BaseModel):
    """
    Action = AGENT DECISION (A) in the MDP.

    This is the MOST IMPORTANT model in this project.
    It forces the agent to behave like a Senior AI Engineer by
    following a structured reasoning pipeline.

    Instead of just "fixing code", the agent must:
    1. Diagnose the problem
    2. Explain reasoning
    3. Fix the code
    4. Optimize performance

    Why this matters:
    -----------------
    - Encourages chain-of-thought reasoning
    - Makes evaluation interpretable
    - Prevents shallow guessing
    - Improves training signal for RL agents

    Fields:
    -------
    error_type:
        Classification of the bug.
        Restricted using Literal for strict validation.
        Allowed values:
        - "syntax"
        - "runtime"
        - "logical"

    error_justification:
        Explanation of WHY this error type was chosen.
        Example:
            "Missing colon after function definition causes SyntaxError"

    fixed_code:
        Corrected version of the buggy code.

    fix_justification:
        Explanation of how the fix resolves the issue.

    optimized_code:
        Improved version focusing on time complexity.
        Example:
            O(n^2) → O(n) using hash maps

    complexity_justification:
        Explanation of complexity improvement using Big-O notation.
    """

    error_type: Literal["syntax", "runtime", "logical"] = Field(
        ..., description="Type of error identified"
    )
    error_justification: str = Field(
        ..., description="Why this error type was chosen"
    )
    fixed_code: str = Field(
        ..., description="Corrected version of the code"
    )
    fix_justification: str = Field(
        ..., description="Explanation of the fix"
    )
    optimized_code: str = Field(
        ..., description="Optimized version of the code"
    )
    complexity_justification: str = Field(
        ..., description="Explanation of time complexity improvement"
    )
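
"""
Usage sketch: how raw agent output might be checked against the Action schema,
assuming Pydantic v2 and an agent that replies with a JSON string. The helper
`parse_action` is hypothetical and does not come from this project; the class
is re-declared as a stand-in so the snippet runs on its own.

```python
import json
from typing import Literal, Optional, Tuple

from pydantic import BaseModel, ValidationError


class Action(BaseModel):
    # Stand-in copy of the model above so the snippet is self-contained.
    error_type: Literal["syntax", "runtime", "logical"]
    error_justification: str
    fixed_code: str
    fix_justification: str
    optimized_code: str
    complexity_justification: str


def parse_action(raw: str) -> Tuple[Optional[Action], str]:
    # Hypothetical helper: returns (action, "") on success,
    # or (None, short error summary) when validation fails.
    try:
        return Action.model_validate_json(raw), ""
    except ValidationError as exc:
        first = exc.errors()[0]
        return None, f"{first['loc']}: {first['msg']}"


payload = {
    "error_type": "syntax",
    "error_justification": "Missing colon after function definition causes SyntaxError",
    "fixed_code": "def add(a, b): return a + b",
    "fix_justification": "Added the missing colon",
    "optimized_code": "def add(a, b): return a + b",
    "complexity_justification": "Already O(1); nothing to optimize",
}

# A well-formed payload parses into an Action instance.
good, _ = parse_action(json.dumps(payload))

# A value outside the Literal set is rejected instead of silently accepted.
bad, problem = parse_action(json.dumps({**payload, "error_type": "typo"}))
```

Returning the error summary rather than raising lets a caller decide how to
handle a malformed action, for example by assigning a low reward; that policy
is a design choice for the environment, not something this file prescribes.
"""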
# ============================================================
# 🎯 REWARD MODEL (FEEDBACK SIGNAL)
# ============================================================
class Reward(BaseModel):
    """
    Reward = FEEDBACK (R) in the MDP.

    This tells the agent how well it performed.

    Why structured reward?
    ----------------------
    Instead of reporting only a single number, we also track its components.
    This:
    - Can make training more stable
    - Helps debugging agent behavior
    - Enables detailed evaluation

    value:
        Final scalar reward in range [0.0, 1.0]

    component_scores:
        Breakdown of reward into parts:
        Example:
            {
                "identification": 0.2,
                "repair": 0.2,
                "correctness": 0.2,
                "optimization": 0.3
            }

    MDP Insight:
    ------------
    The agent's goal is to maximize expected cumulative reward:

        max E[ Σ R_t ]

    By shaping reward into components, we guide learning more effectively.
    """

    value: float = Field(
        ..., ge=0.0, le=1.0, description="Total reward (0.0 to 1.0)"
    )
    component_scores: Dict[str, float] = Field(
        default_factory=dict,
        description="Breakdown of reward components"
    )
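
"""
Usage sketch: one possible way to turn `component_scores` into the scalar
`value`. This file does not state how the two are related, so the rule below
(sum the parts, then clamp to [0.0, 1.0]) is purely an assumption, and
`aggregate_reward` is a hypothetical helper.

```python
import math
from typing import Dict


def aggregate_reward(component_scores: Dict[str, float]) -> float:
    # Assumed rule: sum the parts, then clamp to [0.0, 1.0] so the result
    # satisfies the ge/le bounds declared on Reward.value.
    total = math.fsum(component_scores.values())
    return min(1.0, max(0.0, total))


# The example breakdown from the Reward docstring.
scores = {
    "identification": 0.2,
    "repair": 0.2,
    "correctness": 0.2,
    "optimization": 0.3,
}
value = aggregate_reward(scores)
```

With the docstring's example breakdown the parts add up to roughly 0.9, so
under this assumed rule the clamp only matters when the components overshoot
1.0 or drop below 0.0.
"""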