omkarrr88 committed
Commit e2f8b29 · 0 Parent(s)
Version 1
.claude/plan/pytorch-debugger-mvp.md ADDED
@@ -0,0 +1,1647 @@
# Implementation Plan: PyTorch Training Run Debugger — OpenEnv Environment

**Generated:** 2026-03-28
**King File:** `ml-training-debugger-spec.md` — single source of truth for all conflicts
**Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core (installed in .venv)
**MVP Scope:** Tasks 1, 3, 5 + rule-based baseline + all required endpoints + Docker + HF Spaces

---

## Markdown Files Confirmed Read

| File | Lines | Role |
|------|-------|------|
| `ml-training-debugger-spec.md` | 1549 | **KING FILE** — final authority on all design decisions |
| `CLAUDE.md` | ~280 | Coding standards, non-negotiable rules, reward constants |
| `PRD.md` | ~368 | Product requirements, success metrics, timeline |
| `ROADMAP.md` | ~442 | Phased roadmap with acceptance criteria |

All four files read in full. The spec is the definitive authority.

---

## Complete Project Structure (Final State)

```
ML Debugger/                         # Project root
├── .claude/
│   └── plan/
│       └── pytorch-debugger-mvp.md  # This plan
├── .dockerignore
├── .gitignore
├── .python-version                  # "3.12"
├── CLAUDE.md                        # Already exists
├── Dockerfile
├── PRD.md                           # Already exists
├── README.md
├── ROADMAP.md                       # Already exists
├── baseline_heuristic.py            # Rule-based baseline (no API key)
├── baseline_inference.py            # LLM baseline (optional, requires OPENAI_API_KEY)
├── deploy.sh                        # One-command build+test+validate script
├── ml-training-debugger-spec.md     # Already exists (king file)
├── openenv.yaml
├── pyproject.toml
├── requirements.txt

├── ml_training_debugger/
│   ├── __init__.py
│   ├── models.py                    # All Pydantic models + RootCauseDiagnosis enum
│   ├── client.py                    # EnvClient extension with typed action/observation
│   ├── scenarios.py                 # ScenarioParams + sample_scenario()
│   ├── pytorch_engine.py            # SimpleCNN, fault injection, gradient/weight extraction
│   ├── simulation.py                # Parametric curve generation (torch.Tensor ops)
│   ├── code_templates.py            # Task 6: code snippets with bugs + validate_fix()
│   ├── reward_engine.py             # compute_reward() — all 7 components
│   └── graders.py                   # Per-task grader functions (0.0–1.0)

├── server/
│   ├── __init__.py
│   ├── environment.py               # MLTrainingEnvironment(Environment)
│   ├── app.py                       # create_app() + custom routes
│   └── dashboard.html               # Live diagnostic dashboard (Phase 3)

├── validation/                      # PyTorch validation suite (Phase 3)
│   ├── requirements.txt
│   ├── conftest.py
│   ├── validate_exploding_gradients.py
│   ├── validate_vanishing_gradients.py
│   ├── validate_data_leakage.py
│   ├── validate_overfitting.py
│   ├── validate_batchnorm_eval.py
│   ├── validate_code_bugs.py
│   └── reports/                     # Pre-computed fidelity plots

└── tests/
    ├── __init__.py
    ├── conftest.py                  # Shared fixtures
    ├── test_models.py
    ├── test_scenarios.py
    ├── test_pytorch_engine.py
    ├── test_simulation.py
    ├── test_code_templates.py
    ├── test_reward_engine.py
    ├── test_graders.py
    ├── test_episode_lifecycle.py
    ├── test_endpoints.py
    └── test_baseline_reproducibility.py
```

---

## Phase 0: Project Initialization & Validation Setup

### Goal
A running skeleton server that proves the toolchain works end-to-end. Zero business logic — just plumbing.

### Files to Create

**Step 0.1 — Project config files:**

1. **`.python-version`** — content: `3.12`

2. **`.gitignore`**:
```
.venv/
__pycache__/
*.pyc
*.pyo
.env
run*.json
.pytest_cache/
htmlcov/
*.egg-info/
dist/
build/
validation/reports/*.png
.mypy_cache/
```

3. **`.dockerignore`**:
```
.venv/
__pycache__/
.git/
.pytest_cache/
tests/
validation/
*.md
!README.md
.claude/
run*.json
htmlcov/
```

4. **`pyproject.toml`**:
```toml
[project]
name = "pytorch-training-debugger"
version = "1.0.0"
description = "OpenEnv RL environment for PyTorch training failure debugging"
requires-python = ">=3.12"
dependencies = [
    "torch",
    "openenv-core",
    "pydantic>=2.0",
    "fastapi",
    "uvicorn",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "pytest-cov",
    "pytest-asyncio",
    "black",
    "ruff",
    "isort",
    "httpx",
    "websockets",
]
llm = [
    "openai",
]

[tool.black]
line-length = 88

[tool.isort]
profile = "black"

[tool.ruff]
line-length = 88
target-version = "py312"

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
```

5. **`requirements.txt`** (for Docker — flat list, no dev deps):
```
torch
openenv-core
pydantic>=2.0
fastapi
uvicorn
openai
```

**Step 0.2 — Package stubs:**

6. **`ml_training_debugger/__init__.py`**:
```python
"""PyTorch Training Run Debugger — OpenEnv Environment."""

__version__ = "1.0.0"
```

7. **`ml_training_debugger/models.py`** — STUB with all Pydantic models:
```python
"""All Pydantic models, enums, and typed data structures.

No business logic. Pure data definitions.
"""

from __future__ import annotations

import enum
from typing import Literal, Optional

import torch  # noqa: F401  (spec rule: torch is imported in every core module)
from openenv.core.env_server.types import Action, Observation
from pydantic import BaseModel, Field


class RootCauseDiagnosis(str, enum.Enum):
    """Closed enumeration of ML failure root causes."""
    LR_TOO_HIGH = "lr_too_high"
    VANISHING_GRADIENTS = "vanishing_gradients"
    DATA_LEAKAGE = "data_leakage"
    OVERFITTING = "overfitting"
    BATCHNORM_EVAL_MODE = "batchnorm_eval_mode"
    CODE_BUG = "code_bug"


class TrainingConfig(BaseModel):
    """Typed hyperparameter configuration."""
    learning_rate: float = 0.001
    weight_decay: float = 0.0001
    batch_size: int = 64
    hidden_dim: int = 64
    num_layers: int = 3
    optimizer: str = "adam"
    dropout_rate: float = 0.0
    gradient_clip_norm: Optional[float] = None


class GradientStats(BaseModel):
    """Per-layer gradient information from real torch.autograd."""
    layer_name: str
    norm_history: list[float]
    mean_norm: float
    max_norm: float
    is_exploding: bool
    is_vanishing: bool


class ModelWeightStats(BaseModel):
    """Per-layer weight statistics from real state_dict()."""
    layer_name: str
    weight_norm: float
    weight_mean: float
    weight_std: float
    weight_min: float
    weight_max: float
    dead_neuron_pct: float = 0.0
    has_nan: bool = False
    has_inf: bool = False


class DataBatchStats(BaseModel):
    """Data batch inspection results."""
    label_distribution: dict[int, float]
    feature_mean: float
    feature_std: float
    null_count: int = 0
    class_overlap_score: float
    batch_size: int
    duplicate_ratio: float = 0.0


class CodeSnippet(BaseModel):
    """PyTorch code for Task 6 inspection."""
    code: str
    filename: str = "train.py"
    line_count: int
    imports: list[str]
    hint: Optional[str] = None


class EpisodeState(BaseModel):
    """Tracks agent history within an episode."""
    step_count: int = 0
    gradients_inspected: bool = False
    gradients_were_normal: bool = False
    data_inspected: bool = False
    model_modes_inspected: bool = False
    model_weights_inspected: bool = False
    code_inspected: bool = False
    fix_action_taken: bool = False
    restart_after_fix: bool = False
    diagnosis_submitted: bool = False
    actions_taken: list[str] = Field(default_factory=list)

    def compute_available_actions(self) -> list[str]:
        """Dynamically compute available actions based on current state."""
        actions = [
            "inspect_gradients",
            "inspect_data_batch",
            "inspect_model_modes",
            "inspect_model_weights",
            "inspect_code",
            "modify_config",
            "add_callback",
            "replace_optimizer",
            "patch_data_loader",
            "fix_model_mode",
        ]
        if self.code_inspected:
            actions.append("fix_code")
        if self.fix_action_taken:
            actions.append("restart_run")
        if self.restart_after_fix:
            actions.append("rollback_checkpoint")
        if not self.diagnosis_submitted:
            actions.append("mark_diagnosed")
        return actions


ACTION_TYPES = Literal[
    "inspect_gradients",
    "inspect_data_batch",
    "inspect_model_modes",
    "inspect_model_weights",
    "inspect_code",
    "modify_config",
    "add_callback",
    "replace_optimizer",
    "patch_data_loader",
    "fix_model_mode",
    "fix_code",
    "restart_run",
    "mark_diagnosed",
    "rollback_checkpoint",
]


class MLTrainingAction(Action):
    """What the agent can do — extends openenv Action."""
    action_type: str
    target: Optional[str] = None
    value: Optional[float | int | str] = None
    diagnosis: Optional[str] = None
    line: Optional[int] = None
    replacement: Optional[str] = None


class MLTrainingObservation(Observation):
    """Full observation — extends openenv Observation (has done, reward, metadata)."""
    run_id: str = ""
    framework: str = "pytorch"
    epoch: int = 20
    training_loss_history: list[float] = Field(default_factory=list)
    val_loss_history: list[float] = Field(default_factory=list)
    val_accuracy_history: list[float] = Field(default_factory=list)
    gradient_stats: list[GradientStats] = Field(default_factory=list)
    model_weight_stats: Optional[list[ModelWeightStats]] = None
    gpu_memory_used_gb: float = 6.2
    gpu_memory_total_gb: float = 16.0
    learning_rate: float = 0.001
    current_config: TrainingConfig = Field(default_factory=TrainingConfig)
    error_log: Optional[str] = None
    data_batch_stats: Optional[DataBatchStats] = None
    model_mode_info: Optional[dict[str, str]] = None
    code_snippet: Optional[CodeSnippet] = None
    available_actions: list[str] = Field(default_factory=list)
    episode_state: EpisodeState = Field(default_factory=EpisodeState)
    notes: Optional[str] = None
```

8. **`ml_training_debugger/client.py`** — STUB:
```python
"""Typed EnvClient for baseline scripts."""

from openenv.core.env_client import EnvClient

from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation


class MLTrainingEnvClient(EnvClient[MLTrainingAction, MLTrainingObservation, dict]):
    """Typed client for the PyTorch Training Debugger environment."""

    def _step_payload(self, action: MLTrainingAction) -> dict:
        return action.model_dump(exclude_none=True)

    def _parse_observation(self, data: dict) -> MLTrainingObservation:
        return MLTrainingObservation.model_validate(data)
```

9. **`server/__init__.py`** — empty file

10. **`server/environment.py`** — STUB:
```python
"""MLTrainingEnvironment — extends openenv Environment."""

from typing import Any, Optional

from openenv.core.env_server.interfaces import Environment

from ml_training_debugger.models import (
    EpisodeState,
    MLTrainingAction,
    MLTrainingObservation,
    TrainingConfig,
)


class MLTrainingEnvironment(
    Environment[MLTrainingAction, MLTrainingObservation, dict]
):
    """OpenEnv environment for PyTorch training run debugging."""

    SUPPORTS_CONCURRENT_SESSIONS = True

    def reset(
        self,
        seed: Optional[int] = None,
        episode_id: Optional[str] = None,
        **kwargs: Any,
    ) -> MLTrainingObservation:
        """Reset environment, return initial observation."""
        state = EpisodeState()
        obs = MLTrainingObservation(
            run_id=episode_id or "episode_001",
            training_loss_history=[2.3] * 20,
            val_loss_history=[2.3] * 20,
            val_accuracy_history=[0.1] * 20,
            current_config=TrainingConfig(),
            available_actions=state.compute_available_actions(),
            episode_state=state,
            done=False,
            reward=0.0,
        )
        return obs

    def step(
        self,
        action: MLTrainingAction,
        timeout_s: Optional[float] = None,
        **kwargs: Any,
    ) -> MLTrainingObservation:
        """Process one agent action."""
        state = EpisodeState()
        obs = MLTrainingObservation(
            run_id="episode_001",
            training_loss_history=[2.3] * 20,
            val_loss_history=[2.3] * 20,
            val_accuracy_history=[0.1] * 20,
            current_config=TrainingConfig(),
            available_actions=state.compute_available_actions(),
            episode_state=state,
            done=False,
            reward=-0.01,
        )
        return obs

    @property
    def state(self) -> dict:
        """Return current environment state."""
        return {"status": "active"}
```

11. **`server/app.py`** — STUB with all endpoints:
```python
"""FastAPI app — openenv create_app() + custom routes."""

import logging

from fastapi import FastAPI
from openenv.core.env_server.http_server import create_app

from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
from server.environment import MLTrainingEnvironment

logger = logging.getLogger(__name__)

# create_app takes the class (factory), not an instance
app: FastAPI = create_app(
    MLTrainingEnvironment,
    MLTrainingAction,
    MLTrainingObservation,
    env_name="pytorch_training_debugger",
    max_concurrent_envs=5,
)


@app.get("/health")
def health_check() -> dict:
    """Health check — required by hackathon auto-validator."""
    return {"status": "ready", "tasks": 3}


@app.get("/tasks")
def get_tasks() -> list[dict]:
    """Return task list with IDs, difficulties, and action schema."""
    schema = MLTrainingAction.model_json_schema()
    return [
        {"id": "task_001", "difficulty": "easy", "max_steps": 20, "action_schema": schema},
        {"id": "task_003", "difficulty": "medium", "max_steps": 25, "action_schema": schema},
        {"id": "task_005", "difficulty": "hard", "max_steps": 30, "action_schema": schema},
    ]


@app.post("/grader")
def post_grader() -> dict:
    """Return grader score for most recently completed episode."""
    return {"score": None, "error": "no_completed_episode"}


@app.post("/baseline")
async def post_baseline() -> dict:
    """Trigger baseline run, return scores."""
    return {"scores": {"task_001": 0.0, "task_003": 0.0, "task_005": 0.0}}
```

12. **`openenv.yaml`**:
```yaml
spec_version: 1
name: pytorch-training-debugger
type: space
runtime: fastapi
app: server.app:app
port: 7860

# Extended metadata
version: "1.0.0"
description: |
  PyTorch-native fault injection engine for training failure debugging.
  An AI agent investigates, diagnoses, fixes, and verifies broken
  training runs using real torch.nn.Module models, torch.autograd
  gradients, state_dict() weight inspection, and PyTorch code-level
  debugging.
framework: openenv
tags: [ml-debugging, pytorch, reinforcement-learning, root-cause-analysis, fault-injection]

observation_space:
  type: MLTrainingObservation
  description: "Training run snapshot with progressive reveal"

action_space:
  type: MLTrainingAction
  description: "Investigation, fix, and diagnosis actions with dynamic availability"

tasks:
  - id: task_001
    difficulty: easy
    max_steps: 20
  - id: task_003
    difficulty: medium
    max_steps: 25
  - id: task_005
    difficulty: hard
    max_steps: 30

reward:
  range: [-1.0, 1.0]
  shaped: true
  step_penalty: -0.01
  investigation_bonus: 0.05
  correct_diagnosis: 0.50
  terminal_convergence: 0.40

endpoints:
  websocket: "/ws"
  tasks: "GET /tasks"
  grader: "POST /grader"
  baseline: "POST /baseline"
  health: "GET /health"
```

13. **`Dockerfile`**:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install PyTorch CPU-only first (largest layer, cached)
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ml_training_debugger/ ml_training_debugger/
COPY server/ server/
COPY openenv.yaml .
COPY baseline_heuristic.py .

# Pre-computed validation reports are baked in during Phase 7 once they exist.
# (COPY is not a shell command, so `COPY ... 2>/dev/null || true` is invalid
# Dockerfile syntax; see the Phase 7 note. validation/ is also excluded by
# .dockerignore in this skeleton.)

EXPOSE 7860

CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
```

14. **`tests/__init__.py`** — empty file

15. **`tests/conftest.py`**:
```python
"""Shared test fixtures."""

import pytest

from ml_training_debugger.models import (
    EpisodeState,
    MLTrainingObservation,
    TrainingConfig,
)


@pytest.fixture
def fresh_episode_state() -> EpisodeState:
    return EpisodeState()


@pytest.fixture
def sample_config() -> TrainingConfig:
    return TrainingConfig(learning_rate=0.001)


@pytest.fixture
def sample_observation() -> MLTrainingObservation:
    state = EpisodeState()
    return MLTrainingObservation(
        run_id="test_episode",
        training_loss_history=[2.3 - i * 0.1 for i in range(20)],
        val_loss_history=[2.3 - i * 0.08 for i in range(20)],
        val_accuracy_history=[0.1 + i * 0.04 for i in range(20)],
        current_config=TrainingConfig(),
        available_actions=state.compute_available_actions(),
        episode_state=state,
        done=False,
        reward=0.0,
    )
```

16. **`tests/test_models.py`**:
```python
"""Test all Pydantic models instantiate and serialize correctly."""

import json

from ml_training_debugger.models import (
    CodeSnippet,
    DataBatchStats,
    EpisodeState,
    GradientStats,
    MLTrainingAction,
    MLTrainingObservation,
    ModelWeightStats,
    RootCauseDiagnosis,
    TrainingConfig,
)


class TestRootCauseDiagnosis:
    def test_all_six_values_exist(self):
        assert len(RootCauseDiagnosis) == 6

    def test_values_are_strings(self):
        for d in RootCauseDiagnosis:
            assert isinstance(d.value, str)


class TestTrainingConfig:
    def test_default_instantiation(self):
        config = TrainingConfig()
        assert config.learning_rate == 0.001

    def test_json_roundtrip(self):
        config = TrainingConfig(learning_rate=0.01)
        data = json.loads(config.model_dump_json())
        restored = TrainingConfig.model_validate(data)
        assert restored.learning_rate == 0.01


class TestEpisodeState:
    def test_fresh_state(self):
        state = EpisodeState()
        assert state.step_count == 0
        assert not state.gradients_inspected
        assert not state.diagnosis_submitted

    def test_available_actions_initial(self):
        state = EpisodeState()
        actions = state.compute_available_actions()
        assert "inspect_gradients" in actions
        assert "mark_diagnosed" in actions
        assert "fix_code" not in actions
        assert "restart_run" not in actions

    def test_fix_code_available_after_code_inspected(self):
        state = EpisodeState(code_inspected=True)
        actions = state.compute_available_actions()
        assert "fix_code" in actions

    def test_restart_run_available_after_fix(self):
        state = EpisodeState(fix_action_taken=True)
        actions = state.compute_available_actions()
        assert "restart_run" in actions

    def test_mark_diagnosed_disappears_after_submission(self):
        state = EpisodeState(diagnosis_submitted=True)
        actions = state.compute_available_actions()
        assert "mark_diagnosed" not in actions


class TestMLTrainingObservation:
    def test_extends_observation(self):
        from openenv.core.env_server.types import Observation
        assert issubclass(MLTrainingObservation, Observation)

    def test_has_done_and_reward(self):
        obs = MLTrainingObservation(done=True, reward=0.5)
        assert obs.done is True
        assert obs.reward == 0.5

    def test_json_serialization(self):
        obs = MLTrainingObservation(
            run_id="test",
            training_loss_history=[1.0, 2.0],
            val_accuracy_history=[0.5],
        )
        data = json.loads(obs.model_dump_json())
        assert data["run_id"] == "test"


class TestMLTrainingAction:
    def test_extends_action(self):
        from openenv.core.env_server.types import Action
        assert issubclass(MLTrainingAction, Action)

    def test_basic_action(self):
        action = MLTrainingAction(action_type="inspect_gradients")
        assert action.action_type == "inspect_gradients"

    def test_modify_config_action(self):
        action = MLTrainingAction(
            action_type="modify_config",
            target="learning_rate",
            value=0.001,
        )
        assert action.target == "learning_rate"

    def test_mark_diagnosed_action(self):
        action = MLTrainingAction(
            action_type="mark_diagnosed",
            diagnosis="lr_too_high",
        )
        assert action.diagnosis == "lr_too_high"

    def test_fix_code_action(self):
        action = MLTrainingAction(
            action_type="fix_code",
            line=13,
            replacement="loss = criterion(output, batch_y)",
        )
        assert action.line == 13
```

**Step 0.3 — Validation Commands:**

```bash
# In project root with venv activated
source .venv/bin/activate

# 1. Verify imports
python -c "from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation; print('models OK')"
python -c "from ml_training_debugger.client import MLTrainingEnvClient; print('client OK')"
python -c "from server.app import app; print('app OK')"

# 2. Run tests
pytest tests/test_models.py -v

# 3. Start server
uvicorn server.app:app --host 0.0.0.0 --port 7860 &
sleep 3
curl http://localhost:7860/health
curl http://localhost:7860/tasks
curl http://localhost:7860/docs
kill %1

# 4. Formatting
black ml_training_debugger/ server/ tests/ --check
ruff check ml_training_debugger/ server/ tests/
isort ml_training_debugger/ server/ tests/ --check --profile black
```

### Acceptance Criteria — Phase 0

- [ ] All Pydantic models instantiate without error and serialize to valid JSON
- [ ] `MLTrainingObservation` extends `Observation` (has `done`, `reward`, `metadata`)
- [ ] `MLTrainingAction` extends `Action` (has `metadata`)
- [ ] `EpisodeState.compute_available_actions()` returns correct dynamic action lists
- [ ] Server starts on port 7860 and responds to `/health` with `{"status": "ready", "tasks": 3}`
- [ ] `/tasks` returns 3 tasks with action schema
- [ ] `pytest tests/test_models.py` passes all tests
- [ ] `client.py` imports without error
- [ ] `black --check`, `ruff check`, `isort --check` all pass

---

## Phase 1: Core Data Models & Pydantic Types

### Goal
Finalize all model fields to match the spec exactly. No business logic yet — just data shapes.

### Files to Edit

**`ml_training_debugger/models.py`** — Already created in Phase 0. Verify:
- All fields match spec Section 10 exactly
- `GradientStats.is_exploding` threshold: `mean_norm > 10.0`
- `GradientStats.is_vanishing` threshold: `mean_norm < 1e-6` (both thresholds sketched after this list)
- `TrainingConfig` field names match `modify_config` target options
- `EpisodeState.compute_available_actions()` logic matches spec Section 10 dynamic rules

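The two thresholds reduce to a small constructor helper; a sketch (`build_gradient_stats` is a hypothetical name, and the real construction lives in `pytorch_engine.py`):

```python
from ml_training_debugger.models import GradientStats

EXPLODING_THRESHOLD = 10.0  # spec: is_exploding when mean_norm > 10.0
VANISHING_THRESHOLD = 1e-6  # spec: is_vanishing when mean_norm < 1e-6


def build_gradient_stats(layer_name: str, norm_history: list[float]) -> GradientStats:
    """Classify a layer's gradient norm history against the spec thresholds."""
    mean_norm = sum(norm_history) / len(norm_history)
    return GradientStats(
        layer_name=layer_name,
        norm_history=norm_history,
        mean_norm=mean_norm,
        max_norm=max(norm_history),
        is_exploding=mean_norm > EXPLODING_THRESHOLD,
        is_vanishing=mean_norm < VANISHING_THRESHOLD,
    )
```
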
### Tests (write BEFORE implementation — TDD)

All tests already written in `tests/test_models.py` from Phase 0. Extend with:

```python
class TestGradientStats:
    def test_exploding_threshold(self):
        stats = GradientStats(
            layer_name="fc", norm_history=[15.0], mean_norm=15.0, max_norm=15.0,
            is_exploding=True, is_vanishing=False,
        )
        assert stats.is_exploding is True

    def test_vanishing_threshold(self):
        stats = GradientStats(
            layer_name="conv1", norm_history=[1e-7], mean_norm=1e-7, max_norm=1e-7,
            is_exploding=False, is_vanishing=True,
        )
        assert stats.is_vanishing is True

    def test_normal_gradients(self):
        stats = GradientStats(
            layer_name="conv1", norm_history=[0.5], mean_norm=0.5, max_norm=0.5,
            is_exploding=False, is_vanishing=False,
        )
        assert not stats.is_exploding
        assert not stats.is_vanishing
```

### Acceptance Criteria — Phase 1

- [ ] Every field in every model matches the spec Section 10 types exactly
- [ ] No `Dict[str, Any]` in any public model (typed Pydantic everywhere)
- [ ] `import torch` appears in `models.py`
- [ ] All model tests pass

---

## Phase 2: PyTorch-Native Fault Injection Engine + Simulation

### Goal
Real PyTorch models with real gradients + parametric curve generators. This is the technical heart.

### Files to Create

**Step 2.1 — `ml_training_debugger/scenarios.py`** (~120 lines):

```python
"""ScenarioParams and scenario sampling."""

from __future__ import annotations

import dataclasses
from typing import Any, Optional

import torch

from ml_training_debugger.models import RootCauseDiagnosis


@dataclasses.dataclass(frozen=True)
class ScenarioParams:
    """Internal scenario parameters — not exposed to agent."""
    task_id: str
    root_cause: RootCauseDiagnosis
    seed: int
    learning_rate: float = 0.001
    weight_decay: float = 0.0001
    leakage_pct: float = 0.0
    depth_multiplier: float = 1.0
    divergence_epoch: int = 5
    red_herring_intensity: float = 1.0
    red_herring_spike_layer: str = "fc"
    bug_type: Optional[str] = None
    notes: Optional[str] = None
    error_log: Optional[str] = None
    gpu_memory_used_gb: float = 6.2
    max_steps: int = 20


def sample_scenario(task_id: str, seed: int) -> ScenarioParams:
    """Sample a ScenarioParams for the given task."""
    rng = torch.Generator()
    rng.manual_seed(seed)

    # Use torch for random selection
    def choose(options: list) -> Any:
        idx = int(torch.randint(0, len(options), (1,), generator=rng).item())
        return options[idx]

    if task_id == "task_001":
        lr = choose([0.05, 0.08, 0.10, 0.15, 0.30])
        return ScenarioParams(
            task_id=task_id,
            root_cause=RootCauseDiagnosis.LR_TOO_HIGH,
            seed=seed,
            learning_rate=lr,
            error_log=f"RuntimeError: Loss is NaN at epoch 12 (lr={lr})",
            max_steps=20,
        )

    elif task_id == "task_003":
        leakage = choose([0.12, 0.18, 0.22, 0.28])
        return ScenarioParams(
            task_id=task_id,
            root_cause=RootCauseDiagnosis.DATA_LEAKAGE,
            seed=seed,
            leakage_pct=leakage,
            notes=(
                "Model architecture upgraded from 2-layer to 4-layer CNN at "
                "epoch 2. Performance improvement may reflect increased model "
                "capacity."
            ),
            max_steps=25,
        )

    elif task_id == "task_005":
        intensity = torch.empty(1).uniform_(0.8, 2.5, generator=rng).item()
        spike_layer = choose(["fc", "conv1"])
        return ScenarioParams(
            task_id=task_id,
            root_cause=RootCauseDiagnosis.BATCHNORM_EVAL_MODE,
            seed=seed,
            red_herring_intensity=intensity,
            red_herring_spike_layer=spike_layer,
            gpu_memory_used_gb=14.56,  # 91% of 16GB
            error_log=(
                "Warning: GPU memory pressure detected, consider reducing "
                "batch size or enabling gradient checkpointing"
            ),
            max_steps=30,
        )

    raise ValueError(f"Unknown task_id: {task_id}")
```

**Step 2.2 — `ml_training_debugger/pytorch_engine.py`** (~250 lines):

Key functions:
- `SimpleCNN(torch.nn.Module)` — 3-layer CNN, ~50K params
- `create_model_and_inject_fault(scenario: ScenarioParams) -> tuple[torch.nn.Module, dict]`
- `extract_gradient_stats(model: torch.nn.Module) -> list[GradientStats]`
- `extract_weight_stats(model: torch.nn.Module) -> list[ModelWeightStats]`
- `extract_model_modes(model: torch.nn.Module) -> dict[str, str]`

Implementation notes (see the sketch after this list):
- `torch.manual_seed(scenario.seed)` at the start of every call
- For Task 1: set lr high, run 2 forward+backward passes → gradients explode
- For Task 3: normal model, no gradient anomaly
- For Task 5: call `model.eval()` before training → BatchNorm frozen
- All gradient stats come from real `param.grad` tensors
- All weight stats come from real `model.state_dict()`

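A sketch of the extraction path (a single backward pass shown for brevity; the real version accumulates `norm_history` across the two passes noted above):

```python
import torch

from ml_training_debugger.models import GradientStats


def extract_gradient_stats(model: torch.nn.Module) -> list[GradientStats]:
    """Read real gradient norms from param.grad after loss.backward()."""
    stats: list[GradientStats] = []
    for name, param in model.named_parameters():
        if param.grad is None:  # parameter untouched by the backward pass
            continue
        norm = torch.norm(param.grad).item()
        stats.append(
            GradientStats(
                layer_name=name,
                norm_history=[norm],       # real version: one entry per pass
                mean_norm=norm,
                max_norm=norm,
                is_exploding=norm > 10.0,  # spec Section 10 threshold
                is_vanishing=norm < 1e-6,  # spec Section 10 threshold
            )
        )
    return stats
```
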
**Step 2.3 — `ml_training_debugger/simulation.py`** (~180 lines):

Key functions:
- `gen_loss_history(scenario: ScenarioParams) -> list[float]` — all torch.Tensor ops
- `gen_val_accuracy_history(scenario: ScenarioParams) -> list[float]`
- `gen_val_loss_history(scenario: ScenarioParams) -> list[float]`

Per-task parametric curves from spec Section 6 (Task 1 sketched after this list):
- Task 1: `loss = torch.exp(torch.tensor(lr) * torch.arange(20))`
- Task 3: `val_acc = torch.sigmoid(torch.linspace(-3, 3, 20)) * (1 - leakage_pct)`
- Task 5: Normal loss + elevated variance, slow val_acc degradation

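For Task 1 the generator reduces to a few tensor ops; a sketch (Tasks 3 and 5 follow the same pattern with their own formulas):

```python
import torch

from ml_training_debugger.scenarios import ScenarioParams


def gen_loss_history(scenario: ScenarioParams, epochs: int = 20) -> list[float]:
    """Task 1 parametric curve: loss diverges exponentially with lr (spec Section 6)."""
    t = torch.arange(epochs, dtype=torch.float32)
    loss = torch.exp(torch.tensor(scenario.learning_rate) * t)
    return [float(v) for v in loss]  # plain floats for JSON serialization
```
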
### Tests to Create FIRST (TDD)

**`tests/test_scenarios.py`**:
- `sample_scenario("task_001", seed=42)` returns `root_cause == LR_TOO_HIGH`
- `sample_scenario("task_003", seed=42)` returns `root_cause == DATA_LEAKAGE`
- `sample_scenario("task_005", seed=42)` returns `root_cause == BATCHNORM_EVAL_MODE`
- Different seeds produce different parameters (but same root cause per task)
- Unknown task_id raises ValueError

**`tests/test_pytorch_engine.py`**:
- `SimpleCNN` is a real `torch.nn.Module` with ~50K params
- Task 1 fault injection: `is_exploding=True` on all layers
- Task 5 fault injection: `is_exploding=False` on all layers, `model.training==False`
- `extract_gradient_stats` returns `list[GradientStats]` with real float norms
- `extract_weight_stats` returns `list[ModelWeightStats]` from real state_dict
- `extract_model_modes` returns dict mapping layer names to "train"/"eval"
- **CRITICAL**: `import torch` in pytorch_engine.py, zero `import numpy`

**`tests/test_simulation.py`**:
- All outputs are `list[float]` of length 20
- Task 1 (exploding): loss diverges (last value >> first value)
- Task 3 (leakage): val_acc suspiciously high from early epochs
- Task 5 (batchnorm): slow val_acc degradation (~1-2% per epoch)
- All computation uses torch (no numpy)

### Acceptance Criteria — Phase 2

- [ ] `SimpleCNN` is a real `torch.nn.Module` with ~50K parameters
- [ ] `create_model_and_inject_fault` for Task 1 produces exploding gradients (`is_exploding=True` all layers)
- [ ] `create_model_and_inject_fault` for Task 5 produces `model.training==False` on all layers
- [ ] `extract_gradient_stats` returns real floats from `torch.norm(param.grad)`
- [ ] `extract_weight_stats` returns real floats from `state_dict()`
- [ ] Parametric curves produce 20-element lists with correct shapes per task
- [ ] `import torch` in `pytorch_engine.py` and `simulation.py` — zero `import numpy`
- [ ] `torch.manual_seed(seed)` ensures reproducibility
- [ ] All Phase 2 tests pass

---

## Phase 3: MVP Tasks (1, 3, 5) + Reward Engine + Graders

### Goal
All reward logic and graders implemented. The environment can score episodes.

### Files to Create

**Step 3.1 — `ml_training_debugger/reward_engine.py`** (~100 lines):

```python
def compute_reward(
    action: MLTrainingAction,
    episode_state: EpisodeState,
    scenario: ScenarioParams,
    is_valid_action: bool,
    is_correct_fix: bool | None = None,
    convergence_confirmed: bool = False,
) -> float:
    ...
```

All 7 components per spec Section 12:
1. Step penalty: -0.01 (flat, unconditional)
2. Investigation bonus: +0.05 (first-time per type)
3. Context-gated penalty: -0.20 (ONLY when `gradients_inspected AND gradients_were_normal`)
4. Invalid action: -0.05
5. Wrong code fix: -0.10
6. Correct diagnosis: +0.50 / Wrong diagnosis: -0.30
7. Terminal convergence: +0.40 (gated on `fix_action_taken AND restart_after_fix`)

Hard cap at [-1.0, 1.0]. A composition sketch follows.

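A sketch of how the seven components compose. The exact action set gated by component 3 and the award point for component 7 must be confirmed against spec Section 12; they are shown here for `add_callback` and `mark_diagnosed` respectively, which matches the tests below:

```python
from ml_training_debugger.models import EpisodeState, MLTrainingAction
from ml_training_debugger.scenarios import ScenarioParams

_INVESTIGATE_FLAGS = {
    "inspect_gradients": "gradients_inspected",
    "inspect_data_batch": "data_inspected",
    "inspect_model_modes": "model_modes_inspected",
    "inspect_model_weights": "model_weights_inspected",
    "inspect_code": "code_inspected",
}


def compute_reward(
    action: MLTrainingAction,
    episode_state: EpisodeState,
    scenario: ScenarioParams,
    is_valid_action: bool,
    is_correct_fix: bool | None = None,
    convergence_confirmed: bool = False,
) -> float:
    reward = -0.01  # 1. flat step penalty, never scaled by step_count
    if not is_valid_action:
        return max(reward - 0.05, -1.0)  # 4. invalid action

    # 2. investigation bonus, first time per type only (assumes this runs
    #    before the step handler flips the inspection flag)
    flag = _INVESTIGATE_FLAGS.get(action.action_type)
    if flag is not None and not getattr(episode_state, flag):
        reward += 0.05

    # 3. context-gated penalty: ONLY after gradients were inspected AND normal
    if (
        action.action_type == "add_callback"
        and episode_state.gradients_inspected
        and episode_state.gradients_were_normal
    ):
        reward -= 0.20

    if action.action_type == "fix_code" and is_correct_fix is False:
        reward -= 0.10  # 5. wrong code fix

    if action.action_type == "mark_diagnosed":
        correct = action.diagnosis == scenario.root_cause.value
        reward += 0.50 if correct else -0.30  # 6. diagnosis
        # 7. terminal convergence, gated on fix + restart
        if (
            convergence_confirmed
            and episode_state.fix_action_taken
            and episode_state.restart_after_fix
        ):
            reward += 0.40

    return max(min(reward, 1.0), -1.0)  # hard cap
```
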
**Step 3.2 — `ml_training_debugger/graders.py`** (~150 lines):

One function per task. Each returns float in [0.0, 1.0]:
- `grade_task_001(state: EpisodeState, scenario: ScenarioParams) -> float`
- `grade_task_003(state: EpisodeState, scenario: ScenarioParams) -> float`
- `grade_task_005(state: EpisodeState, scenario: ScenarioParams) -> float`

Grader scoring per spec Section 11:
- Task 1: inspect_gradients (+0.05), correct LR fix (+0.20), restart+converge (+0.35), correct diagnosis (+0.40) = 1.0
- Task 3: inspect_data (+0.05), patch_data_loader (+0.30), restart+converge (+0.30), correct diagnosis (+0.35) = 1.0
- Task 5: inspect_gradients (+0.05), inspect_model_modes (+0.05), fix_model_mode (+0.25), restart+converge (+0.30), correct diagnosis (+0.40) = 1.05 → capped at 1.0. Penalty: add_callback after normal gradients = -0.20.

**CRITICAL — Grader is NOT a sum of step rewards.** It evaluates EpisodeState holistically, as sketched below.

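A sketch of the Task 1 grader's shape. Note that `EpisodeState` as defined in Phase 0 records *that* a diagnosis was submitted but not *which* one, nor the fix value; the real graders need those recorded (flagged in comments):

```python
from ml_training_debugger.models import EpisodeState
from ml_training_debugger.scenarios import ScenarioParams


def grade_task_001(state: EpisodeState, scenario: ScenarioParams) -> float:
    """Holistic Task 1 grade (spec Section 11), not a sum of step rewards."""
    score = 0.0
    if state.gradients_inspected:
        score += 0.05  # inspected the evidence
    if state.fix_action_taken:
        score += 0.20  # real version: check the fix actually lowered the lr
    if state.restart_after_fix:
        score += 0.35  # real version: also require convergence on restart
    if state.diagnosis_submitted:
        # real version: compare the recorded diagnosis to
        # scenario.root_cause (LR_TOO_HIGH for this task)
        score += 0.40
    return min(score, 1.0)
```
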
### Tests to Create FIRST (TDD)

**`tests/test_reward_engine.py`** — THE MOST CRITICAL TEST FILE:

```python
import pytest

from ml_training_debugger.models import EpisodeState, MLTrainingAction
from ml_training_debugger.reward_engine import compute_reward
from ml_training_debugger.scenarios import sample_scenario

# Any scenario works here; the gated penalty depends only on EpisodeState.
scenario = sample_scenario("task_005", seed=42)


class TestContextGatedPenalty:
    """The project's primary innovation — must be exact."""

    def test_no_penalty_before_inspection(self):
        """add_callback at step 1 (no prior inspection) -> NO penalty."""
        state = EpisodeState()  # gradients_inspected=False
        action = MLTrainingAction(action_type="add_callback")
        reward = compute_reward(action, state, scenario, is_valid_action=True)
        # Should be just step penalty: -0.01
        assert reward == pytest.approx(-0.01)

    def test_penalty_after_normal_gradients(self):
        """inspect_gradients (normal) then add_callback -> -0.20 penalty."""
        state = EpisodeState(gradients_inspected=True, gradients_were_normal=True)
        action = MLTrainingAction(action_type="add_callback")
        reward = compute_reward(action, state, scenario, is_valid_action=True)
        # Step penalty + context-gated penalty: -0.01 + -0.20 = -0.21
        assert reward == pytest.approx(-0.21)

    def test_no_penalty_after_abnormal_gradients(self):
        """inspect_gradients (exploding) then add_callback -> no context penalty."""
        state = EpisodeState(gradients_inspected=True, gradients_were_normal=False)
        action = MLTrainingAction(action_type="add_callback")
        reward = compute_reward(action, state, scenario, is_valid_action=True)
        assert reward == pytest.approx(-0.01)
```

Also test:
- Step penalty is flat -0.01 (NOT multiplied by step_count)
- Investigation bonus +0.05 first-time only
- Investigation bonus NOT awarded on repeat
- Correct diagnosis: +0.50
- Wrong diagnosis: -0.30
- Terminal convergence: +0.40 when all gates met
- Invalid action: -0.05
- Wrong code fix: -0.10
- Reward capped at [-1.0, 1.0]

**`tests/test_graders.py`**:
- Each grader returns float in [0.0, 1.0]
- Perfect Task 1 path scores 1.0
- Wrong diagnosis on Task 1 scores < 0.5
- Task 5: agent that chases red herring scores 0.80-0.85
- Task 5: optimal path scores 1.0
- Grader is deterministic (same state → same score)

### Acceptance Criteria — Phase 3

- [ ] `compute_reward` implements all 7 components exactly per spec Section 12
- [ ] Context-gated penalty fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
- [ ] Context-gated penalty does NOT fire before `inspect_gradients` has been called
- [ ] Step penalty is flat -0.01 (never multiplied by step_count)
- [ ] All 3 graders return [0.0, 1.0] with meaningful variance
- [ ] Grader != reward function (separate modules, separate logic)
- [ ] All Phase 3 tests pass

---

## Phase 4: Environment Lifecycle, EpisodeState, and Action Handling

### Goal
Full `reset()` and `step()` implementations in `environment.py`. The environment is functionally complete.

### Files to Edit

**`server/environment.py`** — Full implementation:

`reset(task_id)`:
1. Parse `task_id` from `kwargs` (framework passes it via kwargs or episode_id)
2. Derive deterministic seed from task_id (see the sketch after this list)
3. Call `sample_scenario(task_id, seed)`
4. Call `torch.manual_seed(scenario.seed)`
5. Call `create_model_and_inject_fault(scenario)` → get real model
6. Generate parametric curves via `simulation.py`
7. Create fresh `EpisodeState`
8. Store `(scenario, model, state)` keyed by session/episode ID
9. Return `MLTrainingObservation` with populated loss/acc histories, config, error_log, available_actions — but empty gradient_stats, null data_batch_stats, null model_mode_info, null code_snippet

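Step 2 should not use Python's built-in `hash()`, which is salted per process and would break the bit-exact reproducibility Phase 6 demands. A sketch of a stable derivation (`seed_for_task` is a hypothetical name):

```python
import hashlib


def seed_for_task(task_id: str) -> int:
    """Deterministic 32-bit seed per task, stable across processes and runs."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")
```
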
`step(action)`:
1. Validate action (see spec Section 16 error handling matrix)
2. Increment `step_count`
3. Dispatch by `action.action_type`:
   - **`inspect_gradients`**: Extract real gradient stats, set `gradients_inspected=True`, compute `gradients_were_normal` (all layers `is_exploding==False`)
   - **`inspect_data_batch`**: Generate data batch stats, set `data_inspected=True`
   - **`inspect_model_modes`**: Extract model modes, set `model_modes_inspected=True`
   - **`inspect_model_weights`**: Extract real weight stats, set `model_weights_inspected=True`
   - **`inspect_code`**: Generate code snippet (if task supports it), set `code_inspected=True`
   - **`modify_config`**: Validate target/value, apply change, set `fix_action_taken=True`
   - **`add_callback`**: Apply callback, set `fix_action_taken=True`
   - **`replace_optimizer`**: Apply, set `fix_action_taken=True`
   - **`patch_data_loader`**: Apply, set `fix_action_taken=True`
   - **`fix_model_mode`**: Apply, set `fix_action_taken=True`
   - **`fix_code`**: Validate fix via `validate_fix()`, set `fix_action_taken=True`
   - **`restart_run`**: Requires `fix_action_taken`, set `restart_after_fix=True`, check convergence
   - **`mark_diagnosed`**: Set `diagnosis_submitted=True`, `done=True`
   - **`rollback_checkpoint`**: Requires `restart_after_fix`
4. Call `compute_reward(action, state, scenario, ...)`
5. Check step limit → set `done=True` if reached
6. Update `available_actions` via `state.compute_available_actions()`
7. Return `MLTrainingObservation` with all updated fields

**Session isolation** (handler sketched below):
- Store per-session state in `self._sessions: dict[str, SessionData]`
- Session ID comes from the framework (via `episode_id` or WebSocket session)
- Clean up on episode completion or disconnect

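A sketch of the dispatch branch that arms the context-gated penalty, assuming `SessionData` is the container bundling `(scenario, model, state)` described above:

```python
from ml_training_debugger.models import GradientStats
from ml_training_debugger.pytorch_engine import extract_gradient_stats


def handle_inspect_gradients(session: "SessionData") -> list[GradientStats]:
    """inspect_gradients branch of step(); a method in the real environment."""
    stats = extract_gradient_stats(session.model)  # real torch.autograd norms
    session.state.gradients_inspected = True
    # "Normal" means no layer is exploding; this flag is what later gates the
    # -0.20 red-herring penalty inside compute_reward().
    session.state.gradients_were_normal = all(not s.is_exploding for s in stats)
    return stats
```
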
### Error Handling (spec Section 16 — ALL cases)

| Error | Behavior | Reward |
|-------|----------|--------|
| Invalid action_type | Return obs unchanged + error note | -0.05 |
| Action not in available_actions | Return obs unchanged + error note | -0.05 |
| modify_config missing target/value | Return obs unchanged + error note | -0.05 |
| modify_config with unknown target | Return obs unchanged + error note | -0.05 |
| mark_diagnosed missing diagnosis | Return obs unchanged + error note | -0.05 |
| mark_diagnosed with invalid diagnosis | Return obs unchanged + error note | -0.05 |
| fix_code missing line/replacement | Return obs unchanged + error note | -0.05 |
| Action after done=True | Return final obs, no state change | 0.0 |
| Step limit reached | Set done=True, return obs | 0.0 |

**CRITICAL**: `step()` must NEVER raise an unhandled exception.

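One way to hold that guarantee in a single place is a boundary guard; a sketch (`_step_inner` and `_current_observation` are hypothetical internals):

```python
import logging

logger = logging.getLogger(__name__)


def step(self, action, timeout_s=None, **kwargs):
    """Outer guard: any unexpected failure becomes an error observation."""
    try:
        return self._step_inner(action)
    except Exception:  # noqa: BLE001  (deliberate catch-all at the API boundary)
        logger.exception("step() failed; returning error observation")
        obs = self._current_observation()  # last known-good observation
        obs.notes = "internal_error: action ignored"
        obs.reward = -0.05  # treated like an invalid action
        return obs
```
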
### Tests to Create FIRST (TDD)

**`tests/test_episode_lifecycle.py`**:
- Full reset→inspect→fix→restart→diagnose flow for Task 1
- Full flow for Task 3
- Full flow for Task 5
- `available_actions` updates correctly at each step
- `done=True` after `mark_diagnosed`
- Step limit triggers `done=True`
- Action after done returns final obs with no state change
- Invalid action returns -0.05 penalty
- `restart_run` not available before `fix_action_taken`
- `fix_code` not available before `code_inspected`
- Session isolation: two episodes don't interfere

### Acceptance Criteria — Phase 4

- [ ] `reset(task_id)` for tasks 001/003/005 returns valid `MLTrainingObservation` with correct initial state
- [ ] `step()` dispatches all 14 action types correctly
- [ ] Task 1: `inspect_gradients` → `is_exploding=True` all layers (real torch.autograd)
- [ ] Task 5: `inspect_gradients` → `is_exploding=False` all layers, `gradients_were_normal=True`
- [ ] Task 3: `inspect_data_batch` → `class_overlap_score > 0.5`
- [ ] Task 5: `inspect_model_modes` → all layers in "eval" mode
- [ ] All error conditions from spec Section 16 handled (never raises)
- [ ] Progressive information reveal works (gradient_stats empty until inspected)
- [ ] All Phase 4 tests pass

---

## Phase 5: Server (FastAPI + openenv-core) + All Required Endpoints

### Goal
Wire the real environment into the server. All hackathon-required endpoints return real data.

### Files to Edit

**`server/app.py`** — Full implementation:

```python
# Store reference to last completed episode for /grader
_last_completed: dict[str, dict] = {}  # session_id -> {score, task_id, steps}
_baseline_running: bool = False


@app.get("/health")
def health_check():
    return {"status": "ready", "tasks": 3}


@app.get("/tasks")
def get_tasks():
    schema = MLTrainingAction.model_json_schema()
    return [
        {"id": "task_001", "difficulty": "easy", "max_steps": 20, "action_schema": schema},
        {"id": "task_003", "difficulty": "medium", "max_steps": 25, "action_schema": schema},
        {"id": "task_005", "difficulty": "hard", "max_steps": 30, "action_schema": schema},
    ]


@app.post("/grader")
def post_grader(session_id: str | None = None):
    # Return score for most recently completed episode
    # Edge cases per spec Section 14
    ...


@app.post("/baseline")
async def post_baseline():
    # Run baseline_heuristic logic internally
    # Return {"scores": {"task_001": float, ...}}
    # Return 409 if already running
    ...
```

**Grader endpoint edge cases** (spec Section 14, sketched below):
- No episode completed → `{"score": null, "error": "no_completed_episode"}`
- Episode in progress → `{"score": null, "error": "episode_in_progress"}`
- Episode completed → `{"score": 0.85, "task_id": "task_003", "steps": 6}`
- Always HTTP 200 with JSON body

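A sketch of the edge-case handling, collapsing the session map above to a single record for clarity (`_last_record` and `_episode_in_progress` are assumed bookkeeping, updated from `reset()` and from `step()` when an episode finishes):

```python
_last_record: dict | None = None    # {"score": ..., "task_id": ..., "steps": ...}
_episode_in_progress: bool = False  # set in reset(), cleared when done=True


@app.post("/grader")
def post_grader() -> dict:
    """Spec Section 14: always HTTP 200 with a JSON body, never an exception."""
    if _episode_in_progress:
        return {"score": None, "error": "episode_in_progress"}
    if _last_record is None:
        return {"score": None, "error": "no_completed_episode"}
    return {
        "score": _last_record["score"],      # e.g. 0.85
        "task_id": _last_record["task_id"],  # e.g. "task_003"
        "steps": _last_record["steps"],      # e.g. 6
    }
```
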
### Tests to Create FIRST (TDD)

**`tests/test_endpoints.py`**:
- `GET /health` returns `{"status": "ready", "tasks": 3}` with 200
- `GET /tasks` returns 3 tasks with action schema
- `POST /grader` returns `{"score": null, "error": "no_completed_episode"}` initially
- `POST /baseline` returns scores for all tasks
- `POST /baseline` while running returns 409
- Integration: reset→step→grader returns valid score

### Acceptance Criteria — Phase 5

- [ ] `GET /health` returns `{"status": "ready", "tasks": 3}` (200)
- [ ] `GET /tasks` returns 3 tasks with IDs, difficulties, action schema
- [ ] `POST /grader` handles all edge cases per spec Section 14
- [ ] `POST /baseline` runs baseline and returns scores
- [ ] Framework auto-provides: `/reset`, `/step`, `/state`, `/ws`, `/schema`, `/docs`
- [ ] All Phase 5 tests pass

---

## Phase 6: Rule-Based Baseline + Reproducibility Guarantees

### Goal
Deterministic baseline that produces bit-exact identical scores on two runs.

### Files to Create

**`baseline_heuristic.py`** (~150 lines):

Decision tree from spec Section 17:
```
1. reset(task_id)
2. inspect_gradients
3. IF any layer is_exploding → modify_config(lr=0.001) → restart → diagnose lr_too_high
4. IF any layer is_vanishing → modify_config(lr=0.01) → restart → diagnose vanishing_gradients
5. inspect_data_batch
6. IF class_overlap_score > 0.5 → patch_data_loader → restart → diagnose data_leakage
7. IF val_loss diverging → modify_config(weight_decay=0.01) → restart → diagnose overfitting
8. inspect_model_modes → IF any eval → fix_model_mode → restart → diagnose batchnorm_eval_mode
9. inspect_code → attempt fix → restart → diagnose code_bug
10. FALLBACK: diagnose overfitting
```

Uses `MLTrainingEnvClient` or `GenericEnvClient` to connect via WebSocket (a sketch follows).

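A sketch of the per-task loop, assuming the typed client's `reset()`/`step()` call shapes from Phase 0 (only branch 3 of the tree is spelled out; the rest follow the same pattern):

```python
from ml_training_debugger.client import MLTrainingEnvClient
from ml_training_debugger.models import MLTrainingAction


def run_task(client: MLTrainingEnvClient, task_id: str) -> dict:
    obs = client.reset(task_id=task_id)                                   # 1.
    obs = client.step(MLTrainingAction(action_type="inspect_gradients"))  # 2.
    if any(g.is_exploding for g in obs.gradient_stats):                   # 3.
        client.step(MLTrainingAction(
            action_type="modify_config", target="learning_rate", value=0.001))
        client.step(MLTrainingAction(action_type="restart_run"))
        obs = client.step(MLTrainingAction(
            action_type="mark_diagnosed", diagnosis="lr_too_high"))
        return {"task_id": task_id, "reward": obs.reward, "done": obs.done}
    # ... branches 4-9 elided ...
    obs = client.step(MLTrainingAction(
        action_type="mark_diagnosed", diagnosis="overfitting"))           # 10.
    return {"task_id": task_id, "reward": obs.reward, "done": obs.done}
```
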
1311
+ **Reproducibility requirements:**
1312
+ - `torch.manual_seed(seed)` at every `reset()` with deterministic seed per task
1313
+ - No floating-point non-determinism in parametric curves
1314
+ - Heuristic is pure logic with no randomness
1315
+ - Two runs must produce identical JSON output
1316
+
1317
+ ### Tests to Create FIRST (TDD)
1318
+
1319
+ **`tests/test_baseline_reproducibility.py`**:
1320
+ - Run baseline twice → `diff run1.json run2.json` is empty
1321
+ - All scores in [0.0, 1.0]
1322
+ - Expected approximate scores: task_001 ~0.85, task_003 ~0.70, task_005 ~0.45
1323
+
1324
+ ### Acceptance Criteria — Phase 6
1325
+
1326
+ - [ ] `baseline_heuristic.py` runs all 3 MVP tasks without error
1327
+ - [ ] Two consecutive runs produce bit-exact identical JSON output
1328
+ - [ ] No API key required
1329
+ - [ ] All scores in [0.0, 1.0] with meaningful variance
1330
+ - [ ] Decision tree follows spec Section 17 exactly
1331
+
1332
+ ---
1333
+
## Phase 7: Docker, HF Spaces, Logging, Error Handling & Edge Cases

### Goal
Production-ready container that deploys cleanly.

### Files to Edit

**`Dockerfile`** — Finalize:
- Base: `python:3.12-slim`
- PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
- Target: <500MB
- `EXPOSE 7860`
- `CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]`

**Note on Dockerfile COPY**: `COPY` is not a shell command, so `COPY ... 2>/dev/null || true` is invalid syntax. Ensure all copied files exist before the build (and un-ignore `validation/reports/` in `.dockerignore`), or use a multi-stage approach.

**Logging** — Add to `server/app.py` and `server/environment.py` (a sketch follows):
- JSON structured logging to stdout
- Log every `reset()`, `step()`, episode completion, errors

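A minimal structured-logging setup along these lines (field names are illustrative):

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so stdout is machine-parseable."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# At call sites, e.g. in reset():
# logger.info("reset task_id=%s seed=%s", task_id, seed)
```
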
**WebSocket edge cases** (spec Section 16):
- Client disconnects mid-episode → retain state 60s
- Malformed JSON → return error, keep connection
- step() before reset() → return "no_active_episode" error
- reset() during active episode → terminate current, start new

### Acceptance Criteria — Phase 7

- [ ] `docker build -t pytorch-debugger .` succeeds
- [ ] Docker image <500MB
- [ ] `docker run -p 7860:7860 pytorch-debugger` starts and serves in <60s
- [ ] `curl http://localhost:7860/health` returns `{"status": "ready", "tasks": 3}`
- [ ] All WebSocket edge cases handled per spec Section 16
- [ ] Structured JSON logging on all significant events

---

## Phase 8: Full Testing Suite + Pre-Submission Smoke Tests

### Goal
Test coverage above 80%, with all edge cases covered.

### Files to Create/Extend

All test files listed above, plus:
- Fill coverage gaps identified by `pytest --cov`
- Add edge case tests for every error in spec Section 16
- Add test for `step()` after `done=True`
- Add test for step limit termination

### Commands

```bash
pytest tests/ -v --cov=ml_training_debugger --cov=server --cov-report=term-missing
```

### Acceptance Criteria — Phase 8

- [ ] `pytest --cov` shows >80% coverage on all modules
- [ ] Every error condition from spec Section 16 has a test
- [ ] Context-gated penalty tests pass (both paths)
- [ ] Dynamic available_actions tests pass
- [ ] All 3 graders tested with multiple scenarios
- [ ] Zero test failures

---

## Phase 9: Final Polish & Submission Readiness

### Goal
README complete, all endpoints verified, `openenv validate` passes, deploy to HF Spaces.

### Files to Create

**`README.md`** (~200 lines):
- Environment description and motivation
- Action/observation space definitions
- Task descriptions with difficulty
- Setup instructions
- Baseline scores table

**`deploy.sh`**:
```bash
#!/bin/bash
set -euo pipefail

echo "=== Building Docker image ==="
docker build -t pytorch-debugger .

echo "=== Starting container ==="
docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
sleep 10

echo "=== Health check ==="
curl -f http://localhost:7860/health || { echo "FAIL: health"; exit 1; }

echo "=== Tasks endpoint ==="
curl -f http://localhost:7860/tasks | python3 -m json.tool || { echo "FAIL: tasks"; exit 1; }

echo "=== Baseline reproducibility ==="
python3 baseline_heuristic.py > run1.json 2>/dev/null
python3 baseline_heuristic.py > run2.json 2>/dev/null
diff run1.json run2.json && echo "PASS: reproducible" || { echo "FAIL: non-reproducible"; exit 1; }

echo "=== Baseline via endpoint ==="
curl -f -X POST http://localhost:7860/baseline | python3 -m json.tool || { echo "FAIL: baseline endpoint"; exit 1; }

echo "=== Grader via endpoint ==="
curl -f -X POST http://localhost:7860/grader | python3 -m json.tool || { echo "FAIL: grader endpoint"; exit 1; }

echo "=== Tests ==="
pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing

echo "=== Cleanup ==="
docker stop smoke-test && docker rm smoke-test
rm -f run1.json run2.json

echo "=== ALL CHECKS PASSED ==="
```

### Acceptance Criteria — Phase 9

- [ ] `openenv validate` passes
- [ ] `deploy.sh` runs end-to-end with zero failures
- [ ] README is complete per hackathon requirements
- [ ] Docker image <500MB, starts <60s
- [ ] Baseline bit-exact reproducible
- [ ] 3+ tasks with graders returning [0.0, 1.0] with meaningful variance
- [ ] HF Space deployed, tagged `openenv`, responds to `reset()`
- [ ] All typed Pydantic models — no `Dict[str, Any]`
- [ ] `import torch` in every core module — zero numpy in core
- [ ] Context-gated penalty fires correctly and does not fire prematurely
- [ ] Test suite passes with >80% coverage

---

1469
+
1470
+ ## Technical Risk Mitigations
1471
+
1472
+ | Risk | Impact | Mitigation |
1473
+ |------|--------|------------|
1474
+ | **WebSocket + HTTP composition** | ~~High~~ RESOLVED | `create_app()` returns standard FastAPI. Custom routes add cleanly. Verified in Phase 0. |
1475
+ | **Docker image size** | Medium | `python:3.12-slim` + torch CPU-only (~150MB). Target <500MB. Test early in Phase 7. |
1476
+ | **Task 6 fix validation fragility** | Medium | Multi-strategy pipeline: normalize → tokenize → semantic patterns → AST fallback. Test 5+ whitespace variations. (Post-MVP Phase 2 stretch) |
1477
+ | **Red-herring penalty gating** | HIGH | `gradients_were_normal` set inside `inspect_gradients` handler when ALL layers have `is_exploding=False`. Threshold: `mean_norm > 10.0`. Test BOTH paths explicitly. |
1478
+ | **Session isolation** | Medium | `dict[str, SessionData]` keyed by session ID. Framework provides session management. |
1479
+ | **Baseline reproducibility** | HIGH | `torch.manual_seed(seed)` at every `reset()`. Seed derived deterministically from task_id. Heuristic is pure logic. Test with `diff run1.json run2.json`. |
1480
+ | **Dockerfile build time** | Low | No real training during build. Validation reports pre-computed locally. |
1481
+ | **openenv.yaml format** | Medium | Template uses `spec_version: 1`, `type: space`, `runtime: fastapi`, `app: server.app:app`. Extended fields (tasks, reward, etc.) are additive. Test with `openenv validate` early. |
1482
+ | **Port mismatch** | Low | Spec says 7860 (HF Spaces default). openenv template says 8000. Use 7860 everywhere. |
1483
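+ 
+ To make the baseline-reproducibility mitigation concrete, here is a minimal sketch of deterministic per-task seeding; the helper name and hash scheme are illustrative assumptions, not the spec'd implementation:
+ 
+ ```python
+ import hashlib
+ 
+ import torch
+ 
+ def seed_for_task(task_id: str) -> int:
+     # Derive a stable 31-bit seed from the opaque task ID (hypothetical helper).
+     digest = hashlib.sha256(task_id.encode("utf-8")).digest()
+     return int.from_bytes(digest[:4], "big") % (2**31)
+ 
+ def seed_on_reset(task_id: str) -> None:
+     # Same task_id -> same seed -> bit-exact curves and gradients at every reset().
+     torch.manual_seed(seed_for_task(task_id))
+ ```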
+ 
+ ---
+ 
+ ## Exact openenv.yaml (Final)
+ 
+ ```yaml
+ spec_version: 1
+ name: pytorch-training-debugger
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 7860
+ 
+ version: "1.0.0"
+ description: |
+   PyTorch-native fault injection engine for training failure debugging.
+   An AI agent investigates, diagnoses, fixes, and verifies broken
+   training runs using real torch.nn.Module models, torch.autograd
+   gradients, state_dict() weight inspection, and PyTorch code-level
+   debugging. 3 tasks across 3 difficulty tiers with context-gated
+   reward shaping.
+ framework: openenv
+ tags: [ml-debugging, pytorch, reinforcement-learning, root-cause-analysis, fault-injection, openenv]
+ 
+ observation_space:
+   type: MLTrainingObservation
+   description: "Training run snapshot with progressive reveal — gradients, weights, data stats, model modes revealed on inspection"
+ 
+ action_space:
+   type: MLTrainingAction
+   description: "Investigation, fix, and diagnosis actions with dynamic availability"
+ 
+ tasks:
+   - id: task_001
+     difficulty: easy
+     max_steps: 20
+   - id: task_003
+     difficulty: medium
+     max_steps: 25
+   - id: task_005
+     difficulty: hard
+     max_steps: 30
+ 
+ reward:
+   range: [-1.0, 1.0]
+   shaped: true
+   step_penalty: -0.01
+   investigation_bonus: 0.05
+   max_investigation_bonus: 0.25
+   correct_diagnosis: 0.50
+   terminal_convergence: 0.40
+ 
+ endpoints:
+   websocket: "/ws"
+   tasks: "GET /tasks"
+   grader: "POST /grader"
+   baseline: "POST /baseline"
+   health: "GET /health"
+ ```
+ 
+ ---
+ 
+ ## Exact Dockerfile (Final)
+ 
+ ```dockerfile
+ FROM python:3.12-slim
+ 
+ WORKDIR /app
+ 
+ # Install PyTorch CPU-only first (largest layer, cached)
+ RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
+ 
+ # Install remaining dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+ 
+ # Copy application code
+ COPY ml_training_debugger/ ml_training_debugger/
+ COPY server/ server/
+ COPY openenv.yaml .
+ COPY baseline_heuristic.py .
+ COPY README.md .
+ 
+ EXPOSE 7860
+ 
+ # python:3.12-slim ships without curl; probe the health endpoint with the stdlib
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+ 
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
+ ```
+ 
+ ---
+ 
+ ## Pre-Submission Smoke Test Sequence
+ 
+ ```bash
+ # 1. Clean build
+ docker build --no-cache -t pytorch-debugger .
+ 
+ # 2. Start container
+ docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
+ sleep 10
+ 
+ # 3. Health check
+ curl -f http://localhost:7860/health
+ 
+ # 4. Tasks endpoint
+ curl -f http://localhost:7860/tasks | python3 -m json.tool
+ 
+ # 5. Baseline reproducibility
+ python3 baseline_heuristic.py > run1.json 2>/dev/null
+ python3 baseline_heuristic.py > run2.json 2>/dev/null
+ diff run1.json run2.json && echo "PASS: reproducible" || echo "FAIL"
+ 
+ # 6. Baseline via endpoint
+ curl -f -X POST http://localhost:7860/baseline | python3 -m json.tool
+ 
+ # 7. Grader via endpoint
+ curl -f -X POST http://localhost:7860/grader | python3 -m json.tool
+ 
+ # 8. OpenEnv validation
+ openenv validate
+ 
+ # 9. Test suite
+ pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
+ 
+ # 10. Cleanup
+ docker stop smoke-test && docker rm smoke-test
+ rm -f run1.json run2.json
+ 
+ echo "=== All checks passed ==="
+ ```
+ 
+ ---
+ 
+ ## Post-MVP Stretch (Phase 2 from ROADMAP)
+ 
+ **Only after MVP is 100% deployed and passing all auto-validation:**
+ 
+ 1. **Task 6** (code debugging) — highest impact differentiator
+    - Create `ml_training_debugger/code_templates.py`
+    - 4 bug variants: eval_mode, detach_loss, zero_grad_missing, inplace_relu
+    - Multi-strategy fix validation: normalize → tokenize → semantic → AST
+    - Diagnosis is ALWAYS `code_bug` regardless of variant
+ 
+ 2. **Tasks 2 & 4** — fill out to 6 tasks
+    - Task 2: vanishing gradients (easy, mirror of Task 1)
+    - Task 4: overfitting (medium, train-val divergence)
+ 
+ 3. **Dashboard** — `server/dashboard.html`, Plotly.js via CDN
+ 
+ 4. **Validation Suite** — `validation/*.py`, R² > 0.85
+ 
+ 5. **LLM Baseline** — `baseline_inference.py`, GPT-4o
+ 
+ Update `openenv.yaml`, `/tasks`, and the `/health` task count as tasks are added.
+ 
+ ---
+ 
+ ## SESSION_ID
+ 
+ - CODEX_SESSION: N/A (codeagent-wrapper not available)
+ - GEMINI_SESSION: N/A (codeagent-wrapper not available)
+ 
+ Plan generated by Claude Opus 4.6 via deep analysis of all 4 project markdown files + openenv-core framework API inspection.
.coverage ADDED
Binary file (53.2 kB).

.dockerignore ADDED
@@ -0,0 +1,13 @@
+ .venv/
+ __pycache__/
+ .git/
+ .pytest_cache/
+ tests/
+ validation/
+ *.md
+ !README.md
+ .claude/
+ run*.json
+ htmlcov/
+ .mypy_cache/
+ .ruff_cache/
.gitignore ADDED
@@ -0,0 +1,14 @@
+ .venv/
+ __pycache__/
+ *.pyc
+ *.pyo
+ .env
+ run*.json
+ .pytest_cache/
+ htmlcov/
+ *.egg-info/
+ dist/
+ build/
+ validation/reports/*.png
+ .mypy_cache/
+ .ruff_cache/
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
CLAUDE.md ADDED
@@ -0,0 +1,186 @@
+ # CLAUDE.md — PyTorch Training Run Debugger
+ 
+ OpenEnv RL environment for the Meta PyTorch OpenEnv Hackathon x Scaler School of Technology.
+ An AI agent debugs broken PyTorch training runs by investigating gradients, weights, data, model modes, and source code to diagnose and fix real ML failure patterns.
+ 
+ **Spec:** `ml-training-debugger-spec.md` is the single source of truth. If this file and the spec conflict, the spec wins.
+ 
+ **Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
+ 
+ ---
+ 
+ ## Non-Negotiable Rules
+ 
+ ### MVP-First Execution
+ Ship Tasks 1, 3, 5 (easy/medium/hard) + rule-based baseline + Docker + HF deploy **before** touching anything else. A deployed MVP that passes auto-validation beats a half-finished 6-task environment. Priority order after MVP: Task 6 > Tasks 2 & 4 > dashboard > validation suite > LLM baseline.
+ 
+ ### Context-Gated Penalty Must Be Exact
+ The -0.20 penalty for `add_callback` fires **only when both** `gradients_inspected == True` AND `gradients_were_normal == True`. It must **never** fire before `inspect_gradients` has been called. This is the project's primary innovation. Get the gate conditions wrong and the differentiator is broken. Test both paths:
+ - `add_callback` at step 1 (no prior inspection) -> **no penalty**
+ - `inspect_gradients` (normal) then `add_callback` -> **-0.20 penalty**
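+ 
+ A minimal pytest sketch of both paths; the `make_env` fixture, the `action_type` field, and the exact reward composition are assumptions about the final API, not verified code:
+ 
+ ```python
+ import pytest
+ 
+ from ml_training_debugger.models import MLTrainingAction
+ 
+ def test_no_penalty_without_prior_inspection(make_env):
+     env = make_env("task_005")  # hypothetical fixture: fresh env, already reset
+     obs = env.step(MLTrainingAction(action_type="add_callback"))
+     # Gate is not armed: only the flat step penalty applies.
+     assert obs.reward == pytest.approx(-0.01)
+ 
+ def test_penalty_after_normal_gradients(make_env):
+     env = make_env("task_005")
+     env.step(MLTrainingAction(action_type="inspect_gradients"))  # arms the gate
+     obs = env.step(MLTrainingAction(action_type="add_callback"))
+     # -0.20 gated penalty plus -0.01 step penalty, assuming additive composition.
+     assert obs.reward == pytest.approx(-0.21)
+ ```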
+ 
+ ### Task 6 Diagnosis Is Always `code_bug`
+ Regardless of the specific bug variant (`eval_mode`, `detach_loss`, `zero_grad_missing`, `inplace_relu`), Task 6's correct diagnosis is **always** `code_bug`. Submitting `batchnorm_eval_mode` on Task 6's `eval_mode` variant is a wrong diagnosis (-0.30). The grader enforces this with a strict equality check.
+ 
+ ### PyTorch-Native Only — No NumPy
+ Every computation in core modules uses `torch.Tensor`, not `numpy.ndarray`. `import torch` must appear in `models.py`, `simulation.py`, `pytorch_engine.py`, `reward_engine.py`, and `graders.py`. This is a Meta PyTorch hackathon — judges will notice. The only exception is test utilities and the validation suite where `scipy`/`matplotlib` are acceptable.
+ 
+ ### Grader != Reward Function
+ These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically — it is **not** a sum of step rewards. Never conflate them.
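+ 
+ The separation is visible in the two signatures already named in `ROADMAP.md`; parameter types are taken from those module descriptions, everything else is a sketch:
+ 
+ ```python
+ # reward_engine.py: dense per-step signal for RL training.
+ def compute_reward(
+     action: MLTrainingAction, episode_state: EpisodeState, scenario: ScenarioParams
+ ) -> float: ...
+ 
+ # graders.py: holistic end-of-episode score, always in [0.0, 1.0].
+ def grade_task_001(state: EpisodeState, scenario: ScenarioParams) -> float: ...
+ ```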
+ 
+ ### Opaque Task IDs
+ Task IDs are `task_001` through `task_006`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
+ 
+ ---
+ 
+ ## Architecture Constraints
+ 
+ ### Framework Integration (Verified)
+ ```
+ openenv-core v0.2.2 → create_app() → returns standard FastAPI instance
+ ```
+ 
+ - `MLTrainingAction` extends `Action` from `openenv.core.env_server.types`
+ - `MLTrainingObservation` extends `Observation` from `openenv.core.env_server.types` (has built-in `done`, `reward`, `metadata`)
+ - `MLTrainingEnvironment` extends `Environment` from `openenv.core.env_server.interfaces` (must implement `reset()`, `step()`, `state` property)
+ - `MLTrainingEnvClient` in `client.py` extends `EnvClient` with typed `action_type` and `observation_type` — used by baseline scripts
+ - `create_app()` takes the **class** (factory), not an instance
+ - Custom routes (`/tasks`, `/grader`, `/baseline`, `/health`) are added directly to the returned FastAPI app via `@app.get()`/`@app.post()` decorators
+ - Framework auto-provides: `POST /reset`, `POST /step`, `GET /state`, `WS /ws`, `GET /schema`, `GET /docs`, `/mcp`
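+ 
+ A minimal sketch of that composition. The three-argument call and the route body mirror the bullets above; treat the exact `create_app` signature as an assumption to re-verify in Phase 0:
+ 
+ ```python
+ from openenv.core.env_server.http_server import create_app
+ 
+ from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
+ from server.environment import MLTrainingEnvironment
+ 
+ # Factory pattern: pass the Environment class, not an instance.
+ app = create_app(MLTrainingEnvironment, MLTrainingAction, MLTrainingObservation)
+ 
+ # Custom hackathon routes compose directly onto the returned FastAPI app.
+ @app.get("/health")
+ def health() -> dict[str, str | int]:
+     return {"status": "ready", "tasks": 3}
+ ```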
+ 
+ ### Key Constraints (see spec for full detail)
+ - **Real PyTorch models:** `pytorch_engine.py` instantiates `SimpleCNN` (~50K params) at every `reset()`, runs 1-2 real forward+backward passes. Gradient and weight stats come from real `torch.autograd` and `model.state_dict()`.
+ - **Typed Pydantic models everywhere:** No `Dict[str, Any]`. `available_actions` is dynamically computed from `EpisodeState`, never hardcoded.
+ - **Session isolation:** Each WebSocket client gets its own `EpisodeState` keyed by session ID. `SUPPORTS_CONCURRENT_SESSIONS = True`.
+ 
+ ---
+ 
+ ## Coding Standards
+ 
+ ### Formatting & Linting
+ - **black** for formatting (line length 88)
+ - **ruff** for linting
+ - **isort** for import ordering (profile=black)
+ - Run all three before every commit
+ 
+ ### Type Hints
+ Type annotations on **every** function signature and return type. No `Any` in public APIs. Use `Optional[X]` for nullable fields, `Literal[...]` for closed string unions, `list[X]` (lowercase) for Python 3.12+.
+ 
+ ### Testing
+ - **pytest** for all tests
+ - Every module in `ml_training_debugger/` has a corresponding `tests/test_*.py`
+ - Minimum test coverage: 80%
+ - Critical tests that must exist:
+   - `test_reward_engine.py`: context-gated penalty fires/doesn't fire under correct conditions
+   - `test_graders.py`: each grader returns 0.0-1.0, correct diagnosis scores high, wrong diagnosis scores low
+   - `test_pytorch_engine.py`: model instantiation, fault injection, gradient/weight extraction produces real tensors
+   - `test_code_templates.py`: all 4 bug variants generate valid code, fix validation accepts correct fixes and rejects wrong ones (including whitespace/comment variations)
+   - `test_episode_lifecycle.py`: full episode flow reset->inspect->fix->restart->diagnose produces expected state transitions
+ 
+ ### File Size Limits
+ - 400 lines typical, 800 max per file
+ - `models.py` may exceed 400 lines due to many Pydantic models — this is acceptable
+ - `pytorch_engine.py` must stay under 300 lines (isolate model definitions if needed)
+ 
+ ### Error Handling
+ `step()` must **never** raise an unhandled exception. All invalid actions return a valid observation with `-0.05` penalty and an error note. All edge cases (step after done, step before reset, malformed JSON) return structured error responses.
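+ 
+ One way to honor that contract is a catch-all dispatch wrapper. This is a sketch; `_dispatch`, `_observation`, and the metadata key are assumed internals:
+ 
+ ```python
+ def step(self, action: MLTrainingAction) -> MLTrainingObservation:
+     try:
+         return self._dispatch(action)
+     except Exception as exc:  # never let a bad action crash the episode
+         obs = self._observation()
+         obs.reward = -0.05
+         obs.metadata["error"] = f"invalid action: {exc}"
+         return obs
+ ```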
+ 
+ ---
+ 
+ ## Key Risks to Watch
+ 
+ ### Task 6 Code Fix Validation
+ LLM agents will submit fixes with trailing spaces, inline comments, or minor reformatting. Use the multi-strategy validation pipeline:
+ 1. Normalize whitespace + strip comments
+ 2. Token-stream comparison via `tokenize` module
+ 3. 2-3 semantic equivalence patterns per bug variant
+ 4. `ast.parse()` fallback to verify buggy pattern is absent
+ 
+ Test with intentionally messy fixes: `" loss = criterion(output, batch_y)  # fixed "` must pass.
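+ 
+ A sketch of the token-stream comparison (strategy 2, which also absorbs strategy 1's whitespace and comment normalization); stdlib only, helper names illustrative:
+ 
+ ```python
+ import io
+ import tokenize
+ 
+ def significant_tokens(code: str) -> list[tuple[int, str]]:
+     # Compare token streams, ignoring comments, whitespace, and line structure.
+     skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
+             tokenize.INDENT, tokenize.DEDENT}
+     tokens = tokenize.generate_tokens(io.StringIO(code).readline)
+     return [(tok.type, tok.string) for tok in tokens if tok.type not in skip]
+ 
+ def fix_matches(submitted: str, expected: str) -> bool:
+     try:
+         return significant_tokens(submitted) == significant_tokens(expected)
+     except tokenize.TokenError:
+         return False  # fall through to the semantic/AST strategies
+ ```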
+ 
+ ### Red-Herring Penalty Gating
+ The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
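+ 
+ Sketch of the flag-setting inside the handler. Thresholds come from the paragraph above; the handler and state field names are assumptions consistent with the rest of this file:
+ 
+ ```python
+ def handle_inspect_gradients(self, state: EpisodeState) -> list[GradientStats]:
+     stats = extract_gradient_stats(self.model)  # real torch.autograd stats
+     for layer in stats:
+         layer.is_exploding = layer.mean_norm > 10.0
+         layer.is_vanishing = layer.mean_norm < 1e-6
+     state.gradients_inspected = True
+     # The gate: "normal" means no layer is exploding, even if one spiked.
+     state.gradients_were_normal = all(not layer.is_exploding for layer in stats)
+     return stats
+ ```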
+ 
+ ### Docker Image Size
+ Target: <500MB. PyTorch CPU-only wheel is ~150MB. Use `python:3.12-slim` base. Install torch with `--index-url https://download.pytorch.org/whl/cpu`. Do NOT install CUDA. Pre-compute validation reports locally — do not run real training in Docker build.
+ 
+ ### Baseline Reproducibility
+ The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
+ - `torch.manual_seed(seed)` at every `reset()` with a deterministic seed per task
+ - No floating-point non-determinism in the parametric curve generators
+ - The heuristic decision tree is pure logic with no randomness
+ 
+ ### Auto-Validator Endpoints
+ These endpoints are checked programmatically. They must respond correctly or you are disqualified:
+ - `GET /health` -> `{"status": "ready", "tasks": N}` (200) — N is the number of active tasks (3 for MVP, 6 for full)
+ - `GET /tasks` -> list of tasks with IDs and action schema (200)
+ - `POST /grader` -> `{"score": float}` after a completed episode (200)
+ - `POST /baseline` -> scores for all tasks (200)
+ - `WS /ws` -> responds to `reset` message
+ 
+ ---
+ 
+ ## Reward Constants (Do Not Change)
+ 
+ See spec Section 12 for full rationale. Summary:
+ 
+ | Event | Value | Gate |
+ |---|---|---|
+ | Step penalty | -0.01 | Unconditional, flat (never multiply by step_count) |
+ | Investigation bonus | +0.05 | First-time only per inspection type |
+ | Context-gated penalty | -0.20 | `gradients_inspected AND gradients_were_normal` |
+ | Invalid action | -0.05 | Action not in `available_actions` |
+ | Wrong code fix | -0.10 | `fix_code` with wrong line/replacement |
+ | Correct diagnosis | +0.50 | `diagnosis == true_root_cause` |
+ | Wrong diagnosis | -0.30 | `diagnosis != true_root_cause` |
+ | Terminal convergence | +0.40 | `fix_action_taken AND restart_after_fix AND convergence` |
+ 
+ ---
+ 
+ ## Success Criteria — "Perfect" Submission
+ 
+ All of these must be true:
+ - [ ] `openenv validate` passes
+ - [ ] `docker build && docker run` starts server on port 7860 in <60s
+ - [ ] HF Space deploys, responds to `reset()`, tagged with `openenv`
+ - [ ] `baseline_heuristic.py` produces identical scores on two runs
+ - [ ] 3+ tasks with graders returning scores in [0.0, 1.0] with meaningful variance
+ - [ ] Hard task (Task 5 or 6) genuinely challenges frontier models (score < 0.7 for heuristic)
+ - [ ] Context-gated penalty fires correctly and does not fire prematurely
+ - [ ] All typed Pydantic models, no `Dict[str, Any]`
+ - [ ] `import torch` in every core module, zero numpy imports in core
+ - [ ] README documents: environment description, action/observation spaces, task descriptions with difficulty, setup instructions, baseline scores
+ - [ ] POST `/baseline`, POST `/grader`, GET `/tasks` all respond correctly
+ - [ ] Test suite passes with >80% coverage
+ 
+ ---
+ 
+ ## Commands
+ 
+ ```bash
+ # Development (from project root: ML Debugger/)
+ source .venv/bin/activate
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
+ 
+ # Tests
+ pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
+ 
+ # Formatting
+ black ml_training_debugger/ server/ tests/
+ ruff check ml_training_debugger/ server/ tests/ --fix
+ isort ml_training_debugger/ server/ tests/ --profile black
+ 
+ # Docker
+ docker build -t pytorch-debugger .
+ docker run -p 7860:7860 pytorch-debugger
+ 
+ # Smoke test
+ curl http://localhost:7860/health
+ curl http://localhost:7860/tasks
+ python baseline_heuristic.py > run1.json
+ python baseline_heuristic.py > run2.json
+ diff run1.json run2.json  # Must be empty
+ 
+ # OpenEnv validation
+ openenv validate
+ ```
Dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.12-slim
+ 
+ WORKDIR /app
+ 
+ # Install PyTorch CPU-only first (largest layer, cached)
+ RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
+ 
+ # Install remaining dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+ 
+ # Copy application code
+ COPY ml_training_debugger/ ml_training_debugger/
+ COPY server/ server/
+ COPY openenv.yaml .
+ COPY baseline_heuristic.py .
+ COPY README.md .
+ 
+ EXPOSE 7860
+ 
+ # python:3.12-slim ships without curl; probe the health endpoint with the stdlib
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+ 
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
PRD.md ADDED
@@ -0,0 +1,367 @@
+ # PRD — PyTorch Training Run Debugger
+ 
+ **Product:** OpenEnv RL environment for ML training failure diagnosis
+ **Hackathon:** Meta PyTorch OpenEnv Hackathon x Scaler School of Technology, Round 1
+ **Deadline:** April 8, 2026 (submission window opens March 28)
+ **Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
+ **Source of truth:** `ml-training-debugger-spec.md` for all implementation detail beyond this PRD
+ 
+ ---
+ 
+ ## 1. Overview
+ 
+ ### 1.1 What We Are Building
+ 
+ An OpenEnv-compliant reinforcement learning environment where an AI agent receives a snapshot of a broken PyTorch training run and must investigate, diagnose, fix, and verify the failure through a multi-step interactive process. The environment exposes real PyTorch model internals (gradients from `torch.autograd`, weights from `model.state_dict()`) and covers 6 failure scenarios across 3 difficulty tiers.
+ 
+ ### 1.2 Problem Being Solved
+ 
+ MLOps teams spend 15-25% of engineer time debugging silent training failures — runs that produce no error, no crash, just bad metrics. Each misdiagnosed restart wastes GPU compute at $2-8/hour/card. The diagnostic process is hard because:
+ 
+ - Multiple symptoms can point to multiple causes simultaneously
+ - Some bugs produce no error — just mysteriously bad performance
+ - Fixing the wrong thing wastes hours of compute and restarts
+ - Static analysis catches some bugs but cannot reason through ambiguous runtime signals
+ 
+ No existing OpenEnv environment covers this domain. The OpenEnv Hub currently contains a demo echo environment and a code execution environment. This fills a genuine gap.
+ 
+ ### 1.3 Why This Domain Wins
+ 
+ 1. **Strategic alignment** — PyTorch debugging for a Meta PyTorch hackathon. Judges from Meta and Hugging Face will see their own framework as the core subject matter.
+ 2. **Novel reward design** — Context-gated penalties that encode evidence-based reasoning into the reward signal. No existing OpenEnv environment attempts this.
+ 3. **Code-level debugging** — Task 6 requires the agent to read and fix actual PyTorch code. Directly addresses Meta's interest: can an AI agent debug PyTorch?
+ 4. **Ecosystem gap** — Zero competition in the OpenEnv ecosystem for ML training failure diagnosis.
+ 
+ ### 1.4 Key Differentiators
+ 
+ | Differentiator | What It Is | Why It Matters |
+ |---|---|---|
+ | Context-gated reward shaping | Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors | Encodes evidence-based decision making — a capability no other OpenEnv environment has |
+ | PyTorch-native internals | Real `torch.nn.Module` models, real `torch.autograd` gradients, real `state_dict()` snapshots | Every model-level observation is grounded in real PyTorch computation, not synthetic data |
+ | Code-level debugging (Task 6) | Agent reads PyTorch code, identifies buggy line, submits code fix | Tests code understanding, not just metric interpretation — aligned with Meta's core interest |
+ 
+ ---
+ 
+ ## 2. Target Users
+ 
+ ### 2.1 Primary: Hackathon Judges (Meta + Hugging Face Engineers)
+ 
+ **What they evaluate:**
+ - Real-world utility (30%) — Is this a genuine task? Would someone use this to train/evaluate agents?
+ - Task & grader quality (25%) — Well-defined tasks, accurate graders, meaningful difficulty progression?
+ - Environment design (20%) — Clean state management, sensible action/observation spaces, good reward shaping?
+ - Code quality & spec compliance (15%) — OpenEnv spec, clean structure, typed models, working Dockerfile?
+ - Creativity & novelty (10%) — Novel domain, interesting mechanics, original approach?
+ 
+ **What impresses them:**
+ - Real `import torch` in core modules (not numpy wrappers)
+ - A live dashboard where they can watch an agent investigate in real time
+ - Deterministic graders that produce different scores for different agent quality levels
+ - The context-gated penalty — nuanced reward design that goes beyond standard practice
+ 
+ **What disqualifies:**
+ - HF Space doesn't deploy or respond to `reset()`
+ - Plagiarized or trivially modified existing environments
+ - Graders that always return the same score
+ - No baseline inference script
+ - Dockerfile doesn't build
+ 
+ ### 2.2 Secondary: RL Researchers and Agent Developers
+ 
+ **What they need:**
+ - A challenging benchmark that differentiates heuristic agents from reasoning-capable ones
+ - Clear, typed action/observation schemas for agent integration
+ - Reproducible baseline scores for comparison
+ - Environments that produce meaningful reward signal across the full trajectory (not just sparse terminal reward)
+ 
+ ### 2.3 Tertiary: Auto-Validation System (Phase 1 Gate)
+ 
+ A non-human "user" that must pass before any human judge sees the submission:
+ - Pings HF Space URL — must return 200 and respond to `reset()`
+ - Validates `openenv.yaml`, typed models, `step()`/`reset()`/`state()` endpoints
+ - Runs `docker build` on submitted repo
+ - Runs baseline script twice — scores must be identical
+ - Enumerates tasks, runs each grader — scores must be in [0.0, 1.0]
+ 
+ ---
+ 
+ ## 3. Success Metrics
+ 
+ ### 3.1 Evaluation Criteria Targets
+ 
+ | Criterion | Weight | Target Score | How We Hit It |
+ |---|---|---|---|
+ | Real-world utility | 30% | 26-30 | ML debugging is a $B+ problem; every PyTorch team encounters these failures; fills a genuine OpenEnv gap |
+ | Task & grader quality | 25% | 21-25 | 6 tasks (3 MVP), 3 difficulty tiers, deterministic graders, hard tasks challenge frontier models |
+ | Environment design | 20% | 17-20 | Progressive reveal, context-gated penalties, dynamic `available_actions`, proper episode boundaries |
+ | Code quality & spec compliance | 15% | 13-15 | Full OpenEnv spec, typed Pydantic models, working Dockerfile + HF Space, two baselines |
+ | Creativity & novelty | 10% | 9-10 | Context-gated rewards, real PyTorch model internals, code fix task — all new to OpenEnv |
+ | **Total** | **100%** | **86-100** | |
+ 
+ ### 3.2 Quantitative Success Criteria
+ 
+ | Metric | Target | Measurement |
+ |---|---|---|
+ | Auto-validation | Pass all 5 gates | `openenv validate` + smoke test sequence |
+ | Grader score range | Meaningful variance per task | Heuristic baseline ~0.30-0.85 across tasks (not flat) |
+ | Heuristic-LLM gap | Measurable difference | LLM scores higher than heuristic on Tasks 5 and 6 |
+ | `reset()` latency | <200ms | Model instantiation + 2 forward passes + parametric curves |
+ | `step()` latency | <10ms | Action dispatch + reward computation + state update |
+ | Baseline reproducibility | Bit-exact across runs | `diff run1.json run2.json` produces no output |
+ | Docker image size | <500MB | PyTorch CPU-only + python:3.12-slim |
+ | Test coverage | >80% | `pytest --cov` |
+ 
+ ### 3.3 Qualitative Success Criteria
+ 
+ - A judge can open `/dashboard`, trigger a baseline run, and understand the agent's reasoning at a glance
+ - Task 5 (BatchNorm eval mode) visibly differentiates disciplined investigation from red-herring chasing
+ - Task 6 (code bug) produces a "wow" moment — an agent reading and fixing PyTorch code in front of Meta judges
+ - The context-gated penalty creates a story: "this agent gathered evidence and then ignored it"
+ 
+ ---
+ 
+ ## 4. Functional Requirements
+ 
+ > **Complete typed specifications for all data models, actions, observations, tasks, reward components, and error handling are in `ml-training-debugger-spec.md` Sections 10-16.** This section provides a product-level summary.
+ 
+ ### 4.1 Agent Interaction Loop
+ 
+ ```
+ reset(task_id) → initial observation (loss curves, config, error log — no gradients/weights/data/code)
+ 
+ step(action) → updated observation + reward + done flag (progressive reveal)
+ 
+ ... repeat ...
+ 
+ step(mark_diagnosed) → terminal observation, done=True, episode scored by grader
+ ```
+ 
+ ### 4.2 Observation Space Summary
+ 
+ The `MLTrainingObservation` extends `Observation` from openenv-core. Key design:
+ - **Always visible from reset:** loss/accuracy histories, config, error_log, GPU memory, episode state, available actions
+ - **Progressively revealed:** gradient stats (real torch.autograd), weight stats (real state_dict), data batch stats, model mode info, code snippets — each populated only after the corresponding `inspect_*` action
+ - All fields are typed Pydantic models with explicit types. See spec Section 10 for complete field definitions.
+ 
+ ### 4.3 Action Space Summary
+ 
+ The `MLTrainingAction` extends `Action` from openenv-core. 14 action types in 3 categories:
+ - **Investigation** (5): `inspect_gradients`, `inspect_data_batch`, `inspect_model_modes`, `inspect_model_weights`, `inspect_code`
+ - **Fix** (7): `modify_config`, `add_callback`, `replace_optimizer`, `patch_data_loader`, `fix_model_mode`, `fix_code`, `rollback_checkpoint`
+ - **Terminal** (2): `restart_run`, `mark_diagnosed`
+ 
+ Dynamic availability: `restart_run` requires `fix_action_taken`, `fix_code` requires `code_inspected`, `mark_diagnosed` disappears after submission. See spec Section 10 for complete action definitions and required fields.
+ 
+ ### 4.4 Diagnosis Enum (RootCauseDiagnosis)
+ 
+ Closed set of 6 values. Grader is a single equality check — no fuzzy matching.
+ 
+ | Value | Description |
+ |---|---|
+ | `lr_too_high` | Learning rate too large for the architecture |
+ | `vanishing_gradients` | LR too low or architecture too deep, gradients decay to near-zero |
+ | `data_leakage` | Validation samples appearing in training batches |
+ | `overfitting` | Model memorizing training data, failing to generalize |
+ | `batchnorm_eval_mode` | Model left in eval mode, BatchNorm using running statistics |
+ | `code_bug` | Bug in the PyTorch training code (Task 6 — always this, regardless of bug variant) |
+ 
+ ### 4.5 Reward Function Summary
+ 
+ Per-step signal. **Separate from the grader** (see 4.6). Range: [-1.0, 1.0] hard cap.
+ 
+ | Event | Reward | Gate Condition |
+ |---|---|---|
+ | Any step taken | -0.01 | Unconditional, flat constant (never multiplied by step_count) |
+ | First-time inspection (per type) | +0.05 | Not previously inspected for that type |
+ | `add_callback` after normal gradients | -0.20 | `gradients_inspected == True AND gradients_were_normal == True` |
+ | Invalid action | -0.05 | Action not in current `available_actions` |
+ | Wrong code fix | -0.10 | `fix_code` with incorrect line or replacement |
+ | Correct diagnosis | +0.50 | `diagnosis == true_root_cause` |
+ | Wrong diagnosis | -0.30 | `diagnosis != true_root_cause` |
+ | Convergence after fix+restart | +0.40 | `fix_action_taken AND restart_after_fix AND convergence_confirmed` |
+ 
+ See spec Section 12 for full design rationale.
+ 
+ ### 4.6 Grader Function
+ 
+ Returns a single normalized 0.0-1.0 score at episode end. Evaluates `EpisodeState` holistically — checks which key actions were taken, whether the correct fix was applied, whether the diagnosis is correct, and efficiency. **Not a sum of step rewards.** One grader function per task. All graders are deterministic.
+ 
+ Exposed via `POST /grader`. Returns score for the most recently completed episode.
+ 
+ ### 4.7 The Six Tasks
+ 
+ | Task | ID | Difficulty | Root Cause | Key Signal | Heuristic Score |
+ |---|---|---|---|---|---|
+ | Exploding Gradients | `task_001` | Easy | `lr_too_high` | All layers `is_exploding: True`, NaN in error_log | ~0.85 |
+ | Vanishing Gradients | `task_002` | Easy | `vanishing_gradients` | Deeper layers `is_vanishing: True`, flat loss | ~0.80 |
+ | Silent Data Leakage | `task_003` | Medium | `data_leakage` | High val accuracy from epoch 1, `class_overlap_score` 0.68-0.88 | ~0.70 |
+ | Overfitting | `task_004` | Medium | `overfitting` | Train-val divergence, loss→0.01 while val climbs | ~0.65 |
+ | BatchNorm Eval Mode | `task_005` | Hard | `batchnorm_eval_mode` | Slow val degradation + compound red herrings | ~0.45 |
+ | PyTorch Code Bug | `task_006` | Hard | `code_bug` (always) | Anomalous metrics, root cause only visible in code | ~0.30 |
+ 
+ **MVP tasks:** 1, 3, 5 (satisfies the 3-task minimum with easy→medium→hard range).
+ 
+ See spec Section 11 for complete task specifications including fault parameters, red herrings, solution paths, and grader breakdowns.
+ 
+ ### 4.8 Baseline Agents
+ 
+ **Rule-based baseline (submission default, `baseline_heuristic.py`):**
+ - Deterministic decision tree: inspect_gradients → check exploding/vanishing → inspect_data → check leakage → check overfitting → inspect_model_modes → inspect_code → fallback
+ - No API key required. Bit-exact reproducible.
+ - Used for Phase 1 auto-validation reproducibility checks.
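+ 
+ A condensed sketch of that decision tree. Observation fields follow the summaries above; the thresholds and the overfitting test are illustrative, not the shipped logic:
+ 
+ ```python
+ def choose_diagnosis(obs: MLTrainingObservation) -> str:
+     if any(g.is_exploding for g in obs.gradient_stats):
+         return "lr_too_high"
+     if any(g.is_vanishing for g in obs.gradient_stats):
+         return "vanishing_gradients"
+     data = obs.data_batch_stats
+     if data is not None and data.class_overlap_score > 0.5:
+         return "data_leakage"
+     if obs.val_loss_history[-1] > 1.5 * min(obs.val_loss_history):  # divergence
+         return "overfitting"
+     if obs.model_mode_info and obs.model_mode_info.get("model") == "eval":
+         return "batchnorm_eval_mode"
+     return "code_bug"  # fallback after inspect_code
+ ```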
+ 
+ **LLM baseline (optional, `baseline_inference.py`):**
+ - GPT-4o at temperature=0.0, seed=42
+ - Requires `OPENAI_API_KEY` environment variable
+ - Supplementary demonstration of heuristic vs. reasoning score gap
+ - Not used for Phase 1 reproducibility — scores reported only after empirical measurement
+ 
+ ### 4.9 Required Endpoints
+ 
+ | Endpoint | Method | Required By | Response |
+ |---|---|---|---|
+ | `/ws` | WebSocket | OpenEnv framework | Handles `reset`, `step`, `state` messages |
+ | `/tasks` | GET | Hackathon | Task list with IDs, difficulties, MLTrainingAction JSON schema |
+ | `/grader` | POST | Hackathon | `{"score": float, "task_id": str, "steps": int}` for last completed episode |
+ | `/baseline` | POST | Hackathon | Triggers baseline run, returns `{"scores": {"task_001": float, ...}}` |
+ | `/health` | GET | Hackathon | `{"status": "ready", "tasks": N}` — N is active task count |
+ | `/dashboard` | GET | Bonus | Live diagnostic dashboard (HTML/JS, Plotly.js via CDN) |
+ | `/validation-report` | GET | Bonus | Pre-computed PyTorch fidelity reports |
+ 
+ Framework auto-provides: `POST /reset`, `POST /step`, `GET /state`, `GET /schema`, `GET /docs`, `/mcp`.
+ 
+ ### 4.10 Error Handling
+ 
+ `step()` must never raise an unhandled exception. All invalid actions return a valid observation with -0.05 penalty and an error note. See spec Section 16 for the complete error handling matrix covering all edge cases (invalid actions, malformed JSON, step before reset, etc.).
+ 
+ ---
+ 
+ ## 5. Non-Functional Requirements
+ 
+ ### 5.1 OpenEnv Spec Compliance
+ 
+ | Requirement | Implementation |
+ |---|---|
+ | `openenv.yaml` present | Name, version, description, framework, tags, observation/action space, tasks with IDs+difficulties+max_steps, reward config, endpoints |
+ | Typed Pydantic models | `MLTrainingAction` extends `Action`, `MLTrainingObservation` extends `Observation`, all fields explicitly typed |
+ | `step()`/`reset()`/`state()` | Implemented in `MLTrainingEnvironment` extending `Environment` from `openenv.core.env_server.interfaces` |
+ | `openenv validate` passes | Tested before every submission |
+ 
+ ### 5.2 Framework Integration
+ 
+ | Requirement | Implementation |
+ |---|---|
+ | `openenv-core` v0.2.2 | `create_app()` returns standard FastAPI instance — **verified** |
+ | Custom routes compose | `/tasks`, `/grader`, `/baseline`, `/health` added via `@app.get()`/`@app.post()` on the returned FastAPI app |
+ | Framework-provided routes | `/reset`, `/step`, `/state`, `/ws`, `/schema`, `/docs`, `/mcp` — do not reimplement |
+ | Factory pattern | `create_app(MLTrainingEnvironment, ...)` takes the class, not an instance |
+ | Concurrent sessions | `SUPPORTS_CONCURRENT_SESSIONS = True`, session state keyed by session ID |
+ | Typed client | `client.py` extends `EnvClient` with typed action/observation — used by baseline scripts |
+ 
+ ### 5.3 Docker & Deployment
+ 
+ | Requirement | Target |
+ |---|---|
+ | Base image | `python:3.12-slim` |
+ | PyTorch | CPU-only wheel (`--index-url https://download.pytorch.org/whl/cpu`), ~150MB |
+ | Total image size | <500MB |
+ | Build time | <5 min (no real training during build; validation reports pre-computed) |
+ | HF Spaces | Tagged with `openenv`, port 7860 |
+ | Health check | `/health` returns `{"status": "ready", "tasks": N}` within 60s of container start |
+ 
+ ### 5.4 Reproducibility
+ 
+ | Requirement | Implementation |
+ |---|---|
+ | Deterministic episodes | `torch.manual_seed(seed)` at every `reset()`, seed derived deterministically from task ID |
+ | Baseline bit-exact | Rule-based baseline produces identical scores on two consecutive runs |
+ | Exploit resistance | Parameters randomized per `reset()` from defined ranges; opaque task IDs |
+ | Grader determinism | Same `EpisodeState` always produces same score |
+ 
+ ### 5.5 Performance
+ 
+ | Requirement | Target |
+ |---|---|
+ | `reset()` latency | <200ms (model instantiation + 2 forward passes + parametric curves) |
+ | `step()` latency | <10ms (action dispatch + reward + state update) |
+ | Memory | <512MB RSS (small CNN ~50K params, no GPU, no large datasets) |
+ 
+ ### 5.6 Code Quality
+ 
+ | Requirement | Standard |
+ |---|---|
+ | Formatting | black (line length 88) |
+ | Linting | ruff |
+ | Import ordering | isort (profile=black) |
+ | Type hints | Every function signature and return type |
+ | Tests | pytest, >80% coverage, every module has corresponding test file |
+ | PyTorch-native | All core computation uses `torch.Tensor`, zero numpy in core modules |
+ 
+ ---
+ 
+ ## 6. Prioritized Scope
+ 
+ ### Tier 1: MVP (Must Ship First)
+ 
+ **Deadline within deadline:** Deploy to HF Spaces by Day 6 (April 2). Everything after is additive.
+ 
+ | Deliverable | Description | DQ Risk if Missing |
+ |---|---|---|
+ | Task 1 (`task_001`) | Exploding gradients — easy | Yes (need 3+ tasks) |
+ | Task 3 (`task_003`) | Silent data leakage — medium | Yes (need 3+ tasks) |
+ | Task 5 (`task_005`) | BatchNorm eval mode — hard | Yes (need easy→hard range) |
+ | Context-gated penalty | -0.20 for `add_callback` after `gradients_were_normal` | No (but kills differentiation) |
+ | Rule-based baseline | `baseline_heuristic.py`, deterministic, no API key | Yes (baseline required) |
+ | Reward engine | All 7 reward components implemented exactly | Yes (reward logic required) |
+ | Graders (3) | One per MVP task, 0.0-1.0, deterministic | Yes (graders required) |
+ | `openenv.yaml` | Full metadata, 3+ tasks listed | Yes (spec compliance) |
+ | Required endpoints | `/tasks`, `/grader`, `/baseline`, `/health` | Yes (auto-validator checks) |
+ | Dockerfile | Builds and runs, port 7860 | Yes (auto-validator checks) |
+ | HF Space | Deployed, tagged `openenv`, responds to `reset()` | Yes (auto-validator pings) |
+ | README | Environment description, action/observation spaces, task descriptions, setup instructions, baseline scores | Yes (submission requirement) |
+ 
+ ### Tier 2: Strongest Differentiator (Add Immediately After MVP)
+ 
+ | Deliverable | Description | Why This Order |
+ |---|---|---|
+ | Task 6 (`task_006`) | PyTorch code bug — hard, code-level debugging | Single highest-impact feature for Meta judges |
+ | Code fix validation | Multi-strategy pipeline (tokenize, AST, semantic patterns) | Required for Task 6 to work with LLM agents |
+ | Grader for Task 6 | `code_bug` diagnosis, code fix scoring | Completes Task 6 |
+ 
+ ### Tier 3: Full Task Coverage (Time Permitting)
+ 
+ | Deliverable | Description |
+ |---|---|
+ | Task 2 (`task_002`) | Vanishing gradients — easy (similar to Task 1, fast to implement) |
+ | Task 4 (`task_004`) | Overfitting — medium (train-val divergence, regularization fix) |
+ | Graders for Tasks 2 & 4 | Same pattern as existing graders |
+ 
+ ### Tier 4: Polish & Extras (Only After Tiers 1-3 Complete)
+ 
+ | Deliverable | Description | Priority Within Tier |
+ |---|---|---|
+ | Live dashboard | HTML/JS at `/dashboard`, Plotly.js via CDN, 4-panel layout | 1st — transforms judging experience |
+ | PyTorch validation suite | 6 scripts proving parametric curves match real training, R² > 0.85 | 2nd — answers "how realistic?" |
+ | Validation report endpoint | `GET /validation-report` serving pre-computed fidelity plots | With validation suite |
+ | LLM baseline | `baseline_inference.py`, GPT-4o, measures heuristic-LLM gap | 3rd — supplementary demonstration |
+ 
+ ### Implementation Timeline (11 days: March 28 - April 8)
+ 
+ | Days | Focus | Exit Criteria |
+ |---|---|---|
+ | 1-2 | Skeleton server + Task 1 end-to-end | `reset()` → `step()` → `grader` works for one task, Docker builds |
+ | 3-5 | Tasks 3 & 5 + reward engine + baseline | All 3 MVP tasks pass grader, `baseline_heuristic.py` reproduces |
+ | 6 | **Deploy MVP to HF Spaces** | Auto-validation passes. This is the insurance policy. |
+ | 7-8 | Task 6 (code debugging) | Code fix validation works for all 4 bug variants |
+ | 9-10 | Tasks 2 & 4 + dashboard | Full 6-task environment, dashboard shows agent behavior |
+ | 11 | Polish, README, final smoke test | Submission-ready |
+ 
+ ### What We Will NOT Build (Explicit Exclusions)
+ 
+ - No game or toy environments
+ - No numpy in core modules (torch.Tensor only)
+ - No free-text diagnosis (closed enum only)
+ - No grader that sums step rewards (holistic evaluation only)
+ - No cumulative step penalty (flat -0.01 only, never -0.01 * step_count)
+ - No accommodation support or non-RL features
+ - No multi-GPU or CUDA dependencies (CPU-only PyTorch)
README.md ADDED
@@ -0,0 +1,149 @@
+ # PyTorch Training Run Debugger
+ 
+ **OpenEnv RL Environment** — Meta PyTorch OpenEnv Hackathon x Scaler School of Technology, Round 1
+ 
+ An AI agent debugs broken PyTorch training runs by investigating gradients, model weights, data pipelines, and source code to diagnose and fix real ML failure patterns.
+ 
+ ## What Is This?
+ 
+ This environment recreates the experience of an ML engineer facing a broken PyTorch training job. The agent receives a snapshot of a failing training run and must:
+ 
+ 1. **Investigate** — inspect gradients, data batches, model weights, model modes, and code
+ 2. **Diagnose** — identify the root cause from a closed set of known ML failures
+ 3. **Fix** — apply the correct intervention (reduce LR, patch data, fix model mode, etc.)
+ 4. **Verify** — restart training and confirm recovery before submitting diagnosis
+ 
+ ### Key Differentiators
+ 
+ - **PyTorch-native internals** — Real `torch.nn.Module` models (~50K params), real `torch.autograd` gradients, real `state_dict()` weight snapshots
+ - **Context-gated reward shaping** — Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors
+ - **Progressive information reveal** — Gradient stats, weight stats, data batch stats only populated after corresponding inspection actions
+ 
+ ## Environment Design
+ 
+ ### Observation Space (`MLTrainingObservation`)
+ 
+ | Field | Type | Visibility |
+ |-------|------|-----------|
+ | `training_loss_history` | `list[float]` (20 epochs) | Always |
+ | `val_accuracy_history` | `list[float]` (20 epochs) | Always |
+ | `val_loss_history` | `list[float]` (20 epochs) | Always |
+ | `current_config` | `TrainingConfig` | Always |
+ | `error_log` | `Optional[str]` | Always |
+ | `gradient_stats` | `list[GradientStats]` | After `inspect_gradients` |
+ | `model_weight_stats` | `Optional[list[ModelWeightStats]]` | After `inspect_model_weights` |
+ | `data_batch_stats` | `Optional[DataBatchStats]` | After `inspect_data_batch` |
+ | `model_mode_info` | `Optional[dict[str, str]]` | After `inspect_model_modes` |
+ | `code_snippet` | `Optional[CodeSnippet]` | After `inspect_code` |
+ | `available_actions` | `list[str]` | Always (dynamic) |
+ | `episode_state` | `EpisodeState` | Always |
+ 
+ ### Action Space (`MLTrainingAction`)
+ 
+ | Category | Actions |
+ |----------|---------|
+ | **Investigation** | `inspect_gradients`, `inspect_data_batch`, `inspect_model_modes`, `inspect_model_weights`, `inspect_code` |
+ | **Fix** | `modify_config`, `add_callback`, `replace_optimizer`, `patch_data_loader`, `fix_model_mode`, `fix_code`, `rollback_checkpoint` |
+ | **Terminal** | `restart_run`, `mark_diagnosed` |
+ 
+ Dynamic availability: `restart_run` requires a fix first; `fix_code` requires code inspection; `mark_diagnosed` disappears after submission.
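+ 
+ A sketch of how that availability could be computed from `EpisodeState` flags; the flag names match those used elsewhere in these docs, and the base list is an assumption:
+ 
+ ```python
+ def available_actions(state: EpisodeState) -> list[str]:
+     actions = [
+         "inspect_gradients", "inspect_data_batch", "inspect_model_modes",
+         "inspect_model_weights", "inspect_code",
+         "modify_config", "add_callback", "replace_optimizer",
+         "patch_data_loader", "fix_model_mode", "rollback_checkpoint",
+     ]
+     if state.code_inspected:
+         actions.append("fix_code")        # gated on a prior inspect_code
+     if state.fix_action_taken:
+         actions.append("restart_run")     # gated on a fix being applied
+     if not state.diagnosis_submitted:
+         actions.append("mark_diagnosed")  # disappears after submission
+     return actions
+ ```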
+ 
+ ### Diagnosis Enum
+ 
+ | Value | Description |
+ |-------|-------------|
+ | `lr_too_high` | Learning rate too large |
+ | `vanishing_gradients` | Gradients decay to near-zero |
+ | `data_leakage` | Validation samples in training |
+ | `overfitting` | Model memorizing, failing to generalize |
+ | `batchnorm_eval_mode` | Model in eval mode during training |
+ | `code_bug` | Bug in PyTorch training code |
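+ 
+ As a typed model this is a closed string enum. A minimal sketch consistent with the table; member names are illustrative, values are the spec'd strings:
+ 
+ ```python
+ from enum import Enum
+ 
+ class RootCauseDiagnosis(str, Enum):
+     LR_TOO_HIGH = "lr_too_high"
+     VANISHING_GRADIENTS = "vanishing_gradients"
+     DATA_LEAKAGE = "data_leakage"
+     OVERFITTING = "overfitting"
+     BATCHNORM_EVAL_MODE = "batchnorm_eval_mode"
+     CODE_BUG = "code_bug"
+ 
+ # The grader is a strict equality check against this enum: no fuzzy matching.
+ ```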
+ 
+ ### Reward Function
+ 
+ | Event | Reward | Gate |
+ |-------|--------|------|
+ | Any step | -0.01 | Flat, unconditional |
+ | First-time inspection | +0.05 | Per inspection type |
+ | `add_callback` after normal gradients | -0.20 | `gradients_inspected AND gradients_were_normal` |
+ | Invalid action | -0.05 | Action not in `available_actions` |
+ | Correct diagnosis | +0.50 | Equality check |
+ | Wrong diagnosis | -0.30 | Inequality check |
+ | Convergence after fix+restart | +0.40 | All gates met |
+ 
+ ## Tasks
+ 
+ | ID | Difficulty | Root Cause | Description |
+ |----|-----------|------------|-------------|
+ | `task_001` | Easy | `lr_too_high` | Exploding gradients — all layers show `is_exploding: True`, NaN in error log |
+ | `task_003` | Medium | `data_leakage` | Silent data leakage — suspiciously high val accuracy, `class_overlap_score > 0.5` |
+ | `task_005` | Hard | `batchnorm_eval_mode` | Model in eval mode with compound red herrings (FC gradient spike, GPU 91%, near-vanishing conv1) |
+ 
+ ## Baseline Scores
+ 
+ Rule-based heuristic baseline (deterministic, no API key):
+ 
+ | Task | Score |
+ |------|-------|
+ | `task_001` | 1.00 |
+ | `task_003` | 1.00 |
+ | `task_005` | 0.35 |
+ 
+ ## Setup
+ 
+ ### Local Development
+ 
+ ```bash
+ # Create virtual environment
+ python3 -m venv .venv
+ source .venv/bin/activate
+ 
+ # Install dependencies
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
+ pip install openenv-core pydantic fastapi uvicorn
+ 
+ # Install dev tools
+ pip install pytest pytest-cov black ruff isort
+ 
+ # Start server
+ uvicorn server.app:app --host 0.0.0.0 --port 7860
+ 
+ # Run tests
+ pytest tests/ -v --cov=ml_training_debugger
+ 
+ # Run baseline
+ python baseline_heuristic.py
+ ```
+ 
+ ### Docker
+ 
+ ```bash
+ docker build -t pytorch-debugger .
+ docker run -p 7860:7860 pytorch-debugger
+ curl http://localhost:7860/health
+ ```
+ 
+ ## API Endpoints
+ 
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/health` | GET | `{"status": "ready", "tasks": 3}` |
+ | `/tasks` | GET | Task list with action schema |
+ | `/grader` | POST | Grader score for last completed episode |
+ | `/baseline` | POST | Run baseline, return scores |
+ | `/ws` | WebSocket | Primary agent interface |
+ | `/reset` | POST | Reset environment (framework) |
+ | `/step` | POST | Execute action (framework) |
+ | `/state` | GET | Current state (framework) |
+ | `/schema` | GET | Action/observation schemas (framework) |
+ | `/docs` | GET | Swagger UI (framework) |
+ 
+ ## Architecture
+ 
+ - **Python 3.12** · PyTorch CPU-only · openenv-core
+ - Real `torch.nn.Module` models with real `torch.autograd` gradients
+ - Parametric curve generation for loss/accuracy histories (sub-ms latency)
+ - Typed Pydantic models everywhere — no `Dict[str, Any]`
+ - `import torch` in every core module — zero numpy in core
+ - Session isolation via per-session `EpisodeState`
+ - Deterministic reproducibility via `torch.manual_seed()`
ROADMAP.md ADDED
@@ -0,0 +1,441 @@
+ # ROADMAP — PyTorch Training Run Debugger
+ 
+ **Timeline:** March 28 - April 8, 2026 (11 days)
+ **Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
+ **Governing documents:** `ml-training-debugger-spec.md` (source of truth), `PRD.md` (requirements), `CLAUDE.md` (coding rules)
+ **Iron rule:** No phase begins until the previous phase's acceptance criteria are met. The single exception: Phase 0 and Phase 1 file creation can overlap on Day 1.
+ 
+ ---
+ 
+ ## Phase 0: Setup & Validation (Days 1-2)
+ 
+ **Goal:** A running skeleton server that proves the toolchain works end-to-end. Zero business logic — just plumbing.
+ 
+ ### 0.1 Files to Create
+ 
+ | File | Purpose | Lines (est.) |
+ |---|---|---|
+ | `ML Debugger/` (this directory) | Project root directory (git init here) | — |
+ | `pyproject.toml` | Project metadata, dependencies (torch CPU, openenv-core, pydantic>=2.0, fastapi, uvicorn, pytest, black, ruff, isort) | ~40 |
+ | `requirements.txt` | Flat dependency list mirroring pyproject.toml (Docker uses this). **Exclude openai** — deferred to Phase 3. | ~10 |
+ | `.python-version` | `3.12` | 1 |
+ | `openenv.yaml` | Full metadata — start with 3 MVP tasks (task_001, task_003, task_005), expand later | ~50 |
+ | `Dockerfile` | `python:3.12-slim`, torch CPU-only, openenv-core, app deps, port 7860 | ~15 |
+ | `.dockerignore` | Exclude `.venv/`, `__pycache__/`, `.git/`, `validation/reports/*.png` | ~10 |
+ | `.gitignore` | `.venv/`, `__pycache__/`, `*.pyc`, `.env`, `run*.json` | ~15 |
+ | `ml_training_debugger/__init__.py` | Package init, version string | ~3 |
+ | `ml_training_debugger/models.py` | **Stub only:** `RootCauseDiagnosis` enum, `EpisodeState`, `TrainingConfig`, `GradientStats`, `DataBatchStats`, `ModelWeightStats`, `CodeSnippet`, `MLTrainingObservation` (extends `Observation`), `MLTrainingAction` (extends `Action`). All fields typed, all values defaulted. | ~200 |
+ | `ml_training_debugger/client.py` | **Stub:** `MLTrainingEnvClient` extending `EnvClient` with `action_type = MLTrainingAction` and `observation_type = MLTrainingObservation`. Used by baseline scripts. | ~20 |
+ | `server/__init__.py` | Empty | 0 |
+ | `server/environment.py` | **Stub:** `MLTrainingEnvironment(Environment)` with `reset()` returning a hardcoded observation, `step()` echoing back, `state` property | ~50 |
+ | `server/app.py` | `create_app(MLTrainingEnvironment, MLTrainingAction, MLTrainingObservation)` + stub routes for `/tasks`, `/grader`, `/baseline`, `/health` | ~60 |
+ | `tests/__init__.py` | Empty | 0 |
+ | `tests/test_models.py` | Validate all Pydantic models instantiate, serialize to JSON, and round-trip | ~60 |
+ | `tests/conftest.py` | Shared fixtures: sample `EpisodeState`, sample `ScenarioParams`, sample observation | ~40 |
+ 
+ ### 0.2 Dependencies to Install
+ 
+ ```bash
+ # Create venv inside ML Debugger/ project root
+ python3 -m venv .venv && source .venv/bin/activate
+ 
+ # Core runtime
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
+ pip install openenv-core "pydantic>=2.0" fastapi uvicorn
45
+
46
+ # Dev tools
47
+ pip install pytest pytest-cov pytest-asyncio black ruff isort httpx websockets
48
+
49
+ # NOTE: openai is deferred to Phase 3 (LLM baseline). Do NOT install now.
50
+ ```
51
+
+ ### 0.3 Validation Steps (Must All Pass)
+
+ | # | Command | Expected Result |
+ |---|---|---|
+ | 1 | `python -c "import torch; print(torch.__version__)"` | Version string, no CUDA |
+ | 2 | `python -c "from openenv.core.env_server.http_server import create_app"` | No import error |
+ | 3 | `python -c "from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation"` | No import error |
+ | 4 | `python -c "from ml_training_debugger.client import MLTrainingEnvClient"` | No import error |
+ | 5 | `uvicorn server.app:app --host 0.0.0.0 --port 7860` | Server starts, no crash |
+ | 6 | `curl http://localhost:7860/health` | `{"status": "ready", "tasks": 3}` |
+ | 7 | `curl http://localhost:7860/tasks` | JSON with task list |
+ | 8 | `curl http://localhost:7860/docs` | Swagger UI loads |
+ | 9 | `pytest tests/test_models.py -v` | All pass |
+ | 10 | `docker build -t pytorch-debugger .` | Builds in <5min, image <500MB |
+ | 11 | `docker run -p 7860:7860 pytorch-debugger` then `curl /health` | Returns `{"status": "ready", "tasks": 3}` |
+ | 12 | `openenv validate` | Passes (or identify what needs fixing) |
+ | 13 | `black --check . && ruff check . && isort --check .` | Clean |
+
+ ### 0.4 Acceptance Criteria
+
+ - [ ] Skeleton server starts on port 7860 and responds to `/health`, `/tasks`, `/docs`, `/ws`
+ - [ ] `/health` returns `{"status": "ready", "tasks": 3}` (task count matches active tasks)
+ - [ ] All Pydantic models instantiate without error and serialize to valid JSON
+ - [ ] `client.py` imports without error
+ - [ ] Docker image builds under 500MB and container starts cleanly
+ - [ ] `openenv validate` passes or all failures are documented with a fix plan
+ - [ ] `pytest` runs with zero failures
+ - [ ] Git repo initialized, first commit made
+
+ ---
+
+ ## Phase 1: MVP — Tasks 1, 3, 5 + Core Engine (Days 2-6)
+
+ **Goal:** A fully functional 3-task environment that passes all auto-validation gates, deployed to HF Spaces. This is the survival milestone — everything after this is differentiation.
+
+ ### 1.1 Files to Create
+
+ | File | Purpose | Lines (est.) | Depends On |
+ |---|---|---|---|
+ | `ml_training_debugger/scenarios.py` | `ScenarioParams` dataclass, `sample_scenario(task_id, seed)` for tasks 001/003/005. Parameter ranges from spec Section 11. See the sketch after this table. | ~120 | `models.py` |
+ | `ml_training_debugger/pytorch_engine.py` | `SimpleCNN(torch.nn.Module)`, `inject_fault(model, scenario)`, `extract_gradient_stats(model)`, `extract_weight_stats(model)`. Real torch.autograd. | ~250 | `scenarios.py` |
+ | `ml_training_debugger/simulation.py` | `gen_loss_history(scenario)`, `gen_val_accuracy_history(scenario)`, `gen_val_loss_history(scenario)`. All `torch.Tensor` ops. Parametric curves per spec Section 6. | ~180 | `scenarios.py` |
+ | `ml_training_debugger/reward_engine.py` | `compute_reward(action, episode_state, scenario) -> float`. All 7 reward components per spec Section 12. Context-gated penalty logic. | ~100 | `models.py` |
+ | `ml_training_debugger/graders.py` | `grade_task_001(state, scenario)`, `grade_task_003(...)`, `grade_task_005(...)`. Each returns float in [0.0, 1.0]. Per spec Section 11 grader breakdowns. | ~150 | `models.py` |
+ | `baseline_heuristic.py` | Deterministic decision tree agent using `MLTrainingEnvClient`. Runs all MVP tasks, prints JSON scores. | ~150 | `client.py`, server running |
+ | `README.md` | Environment description, action/observation spaces, task descriptions with difficulty, setup instructions, baseline scores table | ~200 | Everything |
+
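+ A minimal sketch of the `sample_scenario` contract (the parameter range shown here is illustrative — the binding ranges live in spec Section 11):
+
+ ```python
+ from dataclasses import dataclass
+
+ import torch
+
+ from ml_training_debugger.models import RootCauseDiagnosis
+
+
+ @dataclass(frozen=True)
+ class ScenarioParams:
+     task_id: str
+     root_cause: RootCauseDiagnosis
+     learning_rate: float
+     weight_decay: float
+     seed: int
+
+
+ def sample_scenario(task_id: str, seed: int = 42) -> ScenarioParams:
+     """Deterministically sample scenario parameters for one task."""
+     gen = torch.Generator().manual_seed(seed)  # torch-native, bit-exact reproducible
+     if task_id == "task_001":
+         # lr_too_high: LR sampled well above a stable value (range is illustrative)
+         lr = float(torch.empty(1).uniform_(0.5, 5.0, generator=gen))
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.LR_TOO_HIGH,
+             learning_rate=lr,
+             weight_decay=1e-4,
+             seed=seed,
+         )
+     raise ValueError(f"No sampler for {task_id}")
+ ```
+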
99
+ ### 1.2 Files to Edit
+
+ | File | Changes | Why |
+ |---|---|---|
+ | `ml_training_debugger/models.py` | Finalize all field types, add `available_actions` computation logic to `EpisodeState`, add red herring fields (notes, gpu_memory) | Stubs from Phase 0 become real |
+ | `ml_training_debugger/client.py` | Wire typed client to connect via WebSocket or HTTP as needed by baseline | Stub becomes functional |
+ | `server/environment.py` | Full `reset()` and `step()` implementations. See spec Sections 9, 13 for lifecycle. | Stubs become real |
+ | `server/app.py` | Wire `/tasks`, `/grader`, `/baseline`, `/health` to return real data. `/health` returns `{"status": "ready", "tasks": 3}`. | Stubs become real |
+ | `openenv.yaml` | Finalize observation_space, action_space, reward section. Verify task IDs and max_steps per spec Section 14. | Was skeletal in Phase 0 |
+ | `Dockerfile` | Add `COPY` for all new source files. Verify build still works. | New files added |
+
+ ### 1.3 Tests to Create
+
+ | Test File | What It Covers | Critical Assertions |
+ |---|---|---|
+ | `tests/test_scenarios.py` | `sample_scenario()` for each MVP task | Returns correct root cause enum; params within defined ranges; different seeds produce different params |
+ | `tests/test_pytorch_engine.py` | Model instantiation, fault injection, gradient/weight extraction | `SimpleCNN` is a real `torch.nn.Module`; `extract_gradient_stats` returns `GradientStats` with real float norms; exploding fault produces `is_exploding=True`; batchnorm eval fault produces `model.training==False` |
+ | `tests/test_simulation.py` | Parametric curve generators | All outputs are `list[float]` of length 20; exploding LR produces diverging loss; leakage produces inflated val_acc; batchnorm produces slow val_acc degradation |
+ | `tests/test_reward_engine.py` | All 7 reward components | **Critical:** context-gated penalty fires when `gradients_inspected=True AND gradients_were_normal=True` then `add_callback`; does NOT fire when `add_callback` without prior inspection; step penalty is flat -0.01; investigation bonus is +0.05 first-time only (see the sketch after this table) |
+ | `tests/test_graders.py` | Graders for tasks 001, 003, 005 | Each returns float in [0.0, 1.0]; correct diagnosis + fix + restart = 1.0; wrong diagnosis < 0.5; partial completion scores between 0 and 1 |
+ | `tests/test_episode_lifecycle.py` | Full reset→inspect→fix→restart→diagnose flow | State transitions match spec Section 13; `available_actions` updates correctly; `done=True` after `mark_diagnosed`; step limit triggers `done=True` |
+
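+ A sketch of the critical reward test, assuming only the flat step penalty and the gate fire for an `add_callback` action in these states (thresholds are chosen to bracket the -0.20 penalty):
+
+ ```python
+ from ml_training_debugger.models import EpisodeState, MLTrainingAction
+ from ml_training_debugger.reward_engine import compute_reward
+ from ml_training_debugger.scenarios import sample_scenario
+
+
+ def test_context_gated_penalty_requires_prior_normal_gradients():
+     scenario = sample_scenario("task_005", seed=42)
+     action = MLTrainingAction(action_type="add_callback")
+
+     # Gate open: gradients inspected AND normal → -0.20 applies on top of -0.01
+     gated = EpisodeState(gradients_inspected=True, gradients_were_normal=True)
+     assert compute_reward(action, gated, scenario) <= -0.20
+
+     # Gate closed: same action, no prior inspection → only the flat -0.01
+     ungated = EpisodeState()
+     assert compute_reward(action, ungated, scenario) > -0.20
+ ```
+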
121
+ ### 1.4 Task-Specific Implementation
+
+ See spec Section 11 for complete task specifications. Key implementation notes per task:
+
+ **Task 1 (`task_001`, easy):** Unambiguous signal. LR from spec ranges → real gradients explode → `is_exploding=True` on all layers. Straightforward grader.
+
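+ A minimal sketch of the matching loss generator, assuming a simple exponential divergence (the real parametric shapes are defined in spec Section 6):
+
+ ```python
+ import torch
+
+
+ def gen_loss_history_exploding(scenario, epochs: int = 20) -> list[float]:
+     """Diverging training loss for lr_too_high — torch.Tensor ops only."""
+     torch.manual_seed(scenario.seed)
+     t = torch.arange(epochs, dtype=torch.float32)
+     # Start near ln(10) ≈ 2.3 (random 10-class cross-entropy), then blow up;
+     # scaling the growth rate with LR is an illustrative assumption.
+     rate = min(0.05 * scenario.learning_rate, 0.5)
+     curve = 2.3 * torch.exp(rate * t) + 0.05 * torch.randn(epochs)
+     return [float(v) for v in curve]
+ ```
+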
127
+ **Task 3 (`task_003`, medium):** Red herring note about an architecture upgrade. Data leakage is confirmed via `class_overlap_score`. The model itself is normal — no true gradient or weight anomaly — apart from a mild, red-herring gradient elevation on one layer (`is_exploding=False`).
+
+ **Task 5 (`task_005`, hard):** The differentiator task. `gradients_were_normal=True` is set inside the `inspect_gradients` handler because `is_exploding=False` on ALL layers (the FC spike keeps mean_norm < 10.0). The context-gated penalty fires when the agent then calls `add_callback` (see the sketch below). Red herrings: FC spike, GPU at 91%, conv1 near-vanishing, an error_log warning.
+
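+ A sketch of the gating logic in `reward_engine.py`, showing only the two components Task 5 exercises (the full `compute_reward` implements all 7 components from spec Section 12):
+
+ ```python
+ def compute_reward(action, state, scenario) -> float:
+     reward = -0.01  # flat step penalty — never multiplied by step_count
+
+     # Context-gated penalty: add_callback is only punished when the agent
+     # has already observed that gradients are normal
+     if (
+         action.action_type == "add_callback"
+         and state.gradients_inspected
+         and state.gradients_were_normal
+     ):
+         reward -= 0.20
+
+     return reward
+ ```
+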
131
+ ### 1.5 Endpoint Responses
+
+ **`GET /health`:** `{"status": "ready", "tasks": 3}` (200) — or `{"status": "initializing"}` (503) during startup.
+
+ **`GET /tasks`:** Task list with IDs, difficulties, max_steps, and the MLTrainingAction JSON schema.
+
+ **`POST /grader`:** `{"score": float, "task_id": str, "steps": int}` (200) — or `{"score": null, "error": "no_completed_episode"}` (200) if no episode has completed. See spec Section 14 for edge cases.
+
+ **`POST /baseline`:** Runs the baseline logic internally and returns `{"scores": {"task_001": float, "task_003": float, "task_005": float}}`. Returns 409 if a run is already in progress.
+
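+ A minimal sketch of the health and grader routes (`_last_completed_episode()` is a hypothetical accessor standing in for the real session lookup):
+
+ ```python
+ from fastapi import FastAPI
+
+ app = FastAPI()
+ ACTIVE_TASKS = ["task_001", "task_003", "task_005"]
+
+
+ @app.get("/health")
+ def health() -> dict:
+     return {"status": "ready", "tasks": len(ACTIVE_TASKS)}
+
+
+ @app.post("/grader")
+ def grader() -> dict:
+     episode = _last_completed_episode()  # hypothetical session accessor
+     if episode is None:
+         # Edge case stays HTTP 200 with a null score
+         return {"score": None, "error": "no_completed_episode"}
+     return {"score": episode.score, "task_id": episode.task_id, "steps": episode.steps}
+ ```
+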
141
+ ### 1.6 Baseline Heuristic Decision Tree
+
+ See spec Section 17 for the complete decision tree. Summary:
+ ```
+ 1. reset(task_id)
+ 2. inspect_gradients
+ 3. IF any layer is_exploding → fix LR → restart → diagnose lr_too_high
+ 4. IF any layer is_vanishing → fix LR → restart → diagnose vanishing_gradients
+ 5. inspect_data_batch
+ 6. IF class_overlap_score > 0.5 → patch_data_loader → restart → diagnose data_leakage
+ 7. IF val_loss diverging → modify weight_decay → restart → diagnose overfitting
+ 8. inspect_model_modes
+ 9. IF any layer in "eval" → fix_model_mode → restart → diagnose batchnorm_eval_mode
+ 10. inspect_code → attempt fix → restart → diagnose code_bug
+ 11. FALLBACK: diagnose overfitting
+ ```
+
+ ### 1.7 Deploy to HF Spaces
+
+ | Step | Action | Verification |
+ |---|---|---|
+ | 1 | Create HF Space (Docker type), tag with `openenv` | Space page shows openenv tag |
+ | 2 | Push Dockerfile + source to Space repo | Build triggers automatically |
+ | 3 | Wait for build to complete | Build log shows success |
+ | 4 | Test health endpoint | `curl https://<space-url>/health` returns `{"status": "ready", "tasks": 3}` |
+ | 5 | Test reset via WebSocket | `wscat -c wss://<space-url>/ws` then send `{"type": "reset", "task_id": "task_001"}` |
+ | 6 | Run `openenv validate` against deployed space | All checks pass |
+
+ ### 1.8 Acceptance Criteria
+
+ - [ ] `reset(task_id)` for tasks 001, 003, 005 returns valid `MLTrainingObservation` with correct initial state
+ - [ ] `step()` dispatches all 14 action types correctly (investigation, fix, terminal)
+ - [ ] `inspect_gradients` on Task 1 → `is_exploding=True` on all layers (real torch.autograd)
+ - [ ] `inspect_gradients` on Task 5 → `is_exploding=False` on all layers, `gradients_were_normal=True`
+ - [ ] `inspect_data_batch` on Task 3 → `class_overlap_score > 0.5`
+ - [ ] `inspect_model_modes` on Task 5 → all layers in "eval" mode
+ - [ ] Context-gated penalty: `inspect_gradients` (normal) then `add_callback` → reward includes -0.20
+ - [ ] Context-gated penalty: `add_callback` without prior inspection → NO -0.20 penalty
+ - [ ] Grader for Task 1: correct path scores 1.0, wrong diagnosis scores < 0.5
+ - [ ] Grader for Task 5: agent that chases the red herring scores 0.80-0.85 (penalty applied)
+ - [ ] `baseline_heuristic.py` runs twice → `diff run1.json run2.json` is empty
+ - [ ] `POST /baseline` returns scores for all 3 tasks, all in [0.0, 1.0]
+ - [ ] `POST /grader` returns score after completed episode
+ - [ ] `GET /tasks` returns 3 tasks with action schema
+ - [ ] `GET /health` returns `{"status": "ready", "tasks": 3}`
+ - [ ] Docker builds <500MB, starts <60s, serves on port 7860
+ - [ ] HF Space deployed, responds to `reset()`, tagged `openenv`
+ - [ ] `openenv validate` passes
+ - [ ] `pytest --cov` shows >80% coverage on all Phase 1 modules
+ - [ ] `import torch` in every core module; zero `import numpy` in core
+ - [ ] README has: description, action/observation spaces, 3 task descriptions, setup instructions, baseline scores
+
+ ---
+
195
+ ## Phase 2: Stretch — Tasks 2, 4, 6 + Code Debugging (Days 7-9)
+
+ **Goal:** Full 6-task environment with code-level debugging. Task 6 is the single highest-impact differentiator for Meta judges.
+
+ **Prerequisites:** Phase 1 acceptance criteria ALL met. HF Space deployed and passing auto-validation.
+
+ ### 2.1 Priority Order (Strict)
+
+ 1. **Task 6** first — it is the strongest differentiator and the hardest to implement
+ 2. **Task 2** second — structurally identical to Task 1 (vanishing vs. exploding), fastest to add
+ 3. **Task 4** third — medium-difficulty overfitting, similar pattern to existing tasks
+
+ ### 2.2 Files to Create
+
+ | File | Purpose | Lines (est.) | Depends On |
+ |---|---|---|---|
+ | `ml_training_debugger/code_templates.py` | 4 bug variant templates, `generate_code_snippet(bug_type, seed)`, `validate_fix(bug_type, line, replacement)` with multi-strategy pipeline per spec Section 22 | ~250 | `models.py` |
+ | `tests/test_code_templates.py` | All 4 variants generate valid code; fix validation accepts correct fixes; rejects wrong fixes; handles whitespace/comment variations | ~150 | `code_templates.py` |
+
+ ### 2.3 Files to Edit
+
+ | File | Changes | Complexity |
+ |---|---|---|
+ | `ml_training_debugger/scenarios.py` | Add `sample_scenario` cases for task_002, task_004, task_006. Task 006 includes `bug_type` field. | Low |
+ | `ml_training_debugger/pytorch_engine.py` | Add fault injection for vanishing gradients, overfitting, code bug variants. | Medium |
+ | `ml_training_debugger/simulation.py` | Add curve generators for vanishing (flat loss), overfitting (train-val divergence), code bug variants. | Medium |
+ | `ml_training_debugger/reward_engine.py` | Add wrong code fix penalty (-0.10). No other changes. | Low |
+ | `ml_training_debugger/graders.py` | Add `grade_task_002`, `grade_task_004`, `grade_task_006`. Task 006: diagnosis must be `code_bug` always. | Medium |
+ | `server/environment.py` | `step()` handlers for `inspect_code` and `fix_code`. Update `available_actions`. | Medium |
+ | `server/app.py` | Update `/tasks` to return 6 tasks. Update `/health` to return `"tasks": 6`. | Low |
+ | `openenv.yaml` | Add task_002, task_004, task_006. | Low |
+ | `baseline_heuristic.py` | Extend decision tree for vanishing, overfitting, code bug. | Medium |
+ | `README.md` | Add descriptions for Tasks 2, 4, 6. Update baseline scores. | Low |
+
+ ### 2.4 Task 6 Code Fix Validation
+
+ The `validate_fix()` pipeline is defined in spec Section 22 (Known Risks). Key layers:
+
+ 1. **Normalize:** strip whitespace + inline comments → compare against known correct strings
+ 2. **Tokenize:** Python `tokenize` module, filter noise tokens, compare streams
+ 3. **Semantic patterns:** 2-3 per variant (e.g. `"criterion("` present AND `".detach()"` absent)
+ 4. **AST fallback:** `ast.parse()` full code with replacement, verify buggy pattern absent
+
+ Test cases that MUST pass: correct fix, trailing whitespace, inline comments, different indentation.
+ Test cases that MUST fail: bug still present, `pass`, wrong line number. A usage-level check of the pipeline is sketched below.
+
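+ These calls match the `validate_fix` implementation in `code_templates.py` for the eval_mode variant:
+
+ ```python
+ from ml_training_debugger.code_templates import validate_fix
+
+ # Accepted: exact fix plus whitespace/comment variations
+ assert validate_fix("eval_mode", 5, "model.train()")
+ assert validate_fix("eval_mode", 5, "  model.train()   ")
+ assert validate_fix("eval_mode", 5, "model.train()  # was model.eval()")
+
+ # Rejected: non-fixes
+ assert not validate_fix("eval_mode", 5, "pass")
+ assert not validate_fix("eval_mode", 5, "model.eval()")
+ ```
+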
241
+ ### 2.5 Tests to Create/Extend
+
+ | Test File | New Coverage |
+ |---|---|
+ | `tests/test_code_templates.py` | **New file.** All 4 variants, validate_fix accepts/rejects correctly, 5+ whitespace/comment variations per variant |
+ | `tests/test_scenarios.py` | Extend: sample_scenario for task_002, 004, 006 |
+ | `tests/test_simulation.py` | Extend: vanishing flat loss, overfitting divergence, code bug symptoms |
+ | `tests/test_graders.py` | Extend: graders 002, 004, 006. Task 006: `code_bug` required; `batchnorm_eval_mode` on eval_mode variant = wrong |
+ | `tests/test_reward_engine.py` | Extend: wrong code fix penalty (-0.10) |
+ | `tests/test_episode_lifecycle.py` | Extend: `inspect_code` → `fix_code` available; `fix_code` before `inspect_code` → invalid |
+
+ ### 2.6 Acceptance Criteria
+
+ - [ ] All 6 tasks return valid observations from `reset()` and process all action types in `step()`
+ - [ ] Task 6: `inspect_code` returns `CodeSnippet` with real PyTorch code containing the sampled bug
+ - [ ] Task 6: `fix_code` correct → `fix_action_taken=True`, no penalty
+ - [ ] Task 6: `fix_code` wrong → -0.10 penalty
+ - [ ] Task 6: `mark_diagnosed(code_bug)` → correct (+0.50)
+ - [ ] Task 6: `mark_diagnosed(batchnorm_eval_mode)` on eval_mode variant → wrong (-0.30)
+ - [ ] `validate_fix` accepts 5+ whitespace/comment variations per variant
+ - [ ] `validate_fix` rejects all invalid fixes
+ - [ ] Graders for all 6 tasks return [0.0, 1.0] with meaningful variance
+ - [ ] `baseline_heuristic.py` handles all 6 tasks, still bit-exact reproducible
+ - [ ] `POST /baseline` returns scores for all 6 tasks
+ - [ ] `GET /tasks` returns 6 tasks
+ - [ ] `GET /health` returns `{"status": "ready", "tasks": 6}`
+ - [ ] All new tests pass; overall coverage >80%
+ - [ ] Updated openenv.yaml lists all 6 tasks
+ - [ ] HF Space redeployed with 6 tasks, auto-validation still passes
+
+ ---
+
+ ## Phase 3: Polish — Dashboard, Validation Suite, LLM Baseline (Days 10-11)
+
+ **Goal:** Transform a technically correct submission into a visually impressive, deeply validated, winning submission.
+
+ **Prerequisites:** Phase 2 acceptance criteria ALL met. 6-task environment deployed.
+
+ ### 3.1 Priority Order Within Phase 3
+
+ 1. **Dashboard** — transforms the judging experience (highest ROI for judges)
+ 2. **Full test suite + README polish** — ensures no auto-validation failure
+ 3. **Validation suite** — answers "how realistic are your curves?"
+ 4. **LLM baseline** — demonstrates the heuristic-reasoning gap (lowest priority)
+
+ ### 3.2 Files to Create
+
+ | File | Purpose | Lines (est.) | Priority |
+ |---|---|---|---|
+ | `server/dashboard.html` | Single-file SPA. 4 panels per spec Section 19. Plotly.js via CDN. | ~400 | 1st |
+ | `validation/requirements.txt` | `torch`, `matplotlib`, `scipy` | ~3 | 3rd |
+ | `validation/conftest.py` | Shared fixtures: CIFAR-10 subset loader, model definitions | ~50 | 3rd |
+ | `validation/validate_exploding_gradients.py` | Real training, compare to parametric curve, R² > 0.85 | ~80 | 3rd |
+ | `validation/validate_data_leakage.py` | Real training with leakage, compare | ~80 | 3rd |
+ | `validation/validate_batchnorm_eval.py` | Real training with `model.eval()`, compare | ~80 | 3rd |
+ | `validation/validate_vanishing_gradients.py` | Real gradient decay, compare | ~80 | 3rd |
+ | `validation/validate_overfitting.py` | Real train-val divergence, compare | ~80 | 3rd |
+ | `validation/validate_code_bugs.py` | Run 4 bug variants, confirm symptoms | ~80 | 3rd |
+ | `validation/reports/` | Pre-computed fidelity scores + comparison plots | — | 3rd |
+ | `baseline_inference.py` | LLM agent (GPT-4o, temp=0.0, seed=42). Runs all 6 tasks. **Now install openai.** See the sketch after this table. | ~200 | 4th |
+
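+ A minimal sketch of the deterministic LLM call for `baseline_inference.py` (the prompt wording is illustrative; the API-side `seed` gives best-effort determinism only):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()  # reads OPENAI_API_KEY from the environment
+
+
+ def choose_action(observation_json: str) -> str:
+     """Ask the model for the next action as a JSON string."""
+     resp = client.chat.completions.create(
+         model="gpt-4o",
+         temperature=0.0,  # greedy decoding
+         seed=42,          # best-effort reproducibility
+         messages=[
+             {
+                 "role": "system",
+                 "content": "You are debugging a PyTorch training run. Reply with exactly one JSON action.",
+             },
+             {"role": "user", "content": observation_json},
+         ],
+     )
+     return resp.choices[0].message.content
+ ```
+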
302
+ ### 3.3 Files to Edit
+
+ | File | Changes | Priority |
+ |---|---|---|
+ | `server/app.py` | Add `GET /dashboard` and `GET /validation-report` routes | 1st/3rd |
+ | `requirements.txt` | Add `openai` (only now, for the LLM baseline) | 4th |
+ | `Dockerfile` | `COPY validation/reports/` and `COPY server/dashboard.html` | 1st |
+ | `README.md` | Final polish: dashboard description, validation suite, measured baseline scores | 2nd |
+ | `openenv.yaml` | Add dashboard and validation-report to endpoints | 1st |
+
+ ### 3.4 Dashboard Panels
+
+ See spec Section 19 for the full specification. Summary:
+ 1. **Training Metrics** — Plotly.js line charts for loss/accuracy with restart markers
+ 2. **Gradient & Weight Heatmap** — color-coded per-layer grid (green/yellow/red/blue)
+ 3. **Action Timeline** — horizontal bars per step, color-coded by type, reward bars
+ 4. **Episode Summary** — task ID, state flags, available actions, grader score
+
+ Tech: single HTML file, Plotly.js CDN, native WebSocket, CSS Grid. Zero Docker bloat.
+
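+ Serving the panel is one route — a sketch, assuming the module-level `app` from `server/app.py`:
+
+ ```python
+ from pathlib import Path
+
+ from fastapi.responses import HTMLResponse
+
+ _DASHBOARD = Path(__file__).parent / "dashboard.html"
+
+
+ @app.get("/dashboard", response_class=HTMLResponse)
+ def dashboard() -> str:
+     # Single-file SPA; Plotly.js loads from a CDN, so the image stays slim
+     return _DASHBOARD.read_text()
+ ```
+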
322
+ ### 3.5 Validation Suite
+
+ Run locally (NOT in the Docker build). Each script: real training → capture metrics → compare to the parametric curve → assert R² > 0.85 → save plots. Pre-computed reports are committed to git and served via `/validation-report`. See spec Section 18.
+
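+ The fidelity check itself is small in torch — a sketch of the R² computation each script asserts against:
+
+ ```python
+ import torch
+
+
+ def r_squared(real: list[float], parametric: list[float]) -> float:
+     """Coefficient of determination between measured and parametric curves."""
+     y = torch.tensor(real, dtype=torch.float32)
+     y_hat = torch.tensor(parametric, dtype=torch.float32)
+     ss_res = torch.sum((y - y_hat) ** 2)
+     ss_tot = torch.sum((y - y.mean()) ** 2)
+     return float(1.0 - ss_res / ss_tot)
+
+
+ # each validation script ends with:
+ # assert r_squared(real_curve, parametric_curve) > 0.85
+ ```
+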
326
+ ### 3.6 Tests to Create/Extend
+
+ | Test File | Coverage |
+ |---|---|
+ | `tests/test_dashboard.py` | `GET /dashboard` returns 200 with HTML containing "Plotly" and "WebSocket" |
+ | `tests/test_endpoints.py` | Integration: full episode via HTTP (reset→step→grader), verify response schemas |
+ | `tests/test_baseline_reproducibility.py` | Run baseline twice, assert identical JSON (see the sketch below) |
+ | Existing test files | Fill coverage gaps to >80% on every module |
+
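+ A sketch of the reproducibility test, assuming the baseline is invoked as a subprocess exactly as in the smoke-test sequence later in this plan:
+
+ ```python
+ import json
+ import subprocess
+ import sys
+
+
+ def _run_baseline() -> dict:
+     out = subprocess.run(
+         [sys.executable, "baseline_heuristic.py"],
+         capture_output=True,
+         text=True,
+         check=True,
+     )
+     return json.loads(out.stdout)
+
+
+ def test_baseline_is_bit_exact_reproducible():
+     assert _run_baseline() == _run_baseline()
+ ```
+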
335
+ ### 3.7 Acceptance Criteria
+
+ - [ ] `GET /dashboard` serves HTML that renders in a browser with 4 panels
+ - [ ] Dashboard connects to WebSocket and updates in real time during a baseline run
+ - [ ] Validation suite passes all scripts with R² > 0.85 (run locally)
+ - [ ] Pre-computed validation reports exist in `validation/reports/`
+ - [ ] `GET /validation-report` serves fidelity data
+ - [ ] LLM baseline runs, scores higher than heuristic on Tasks 5 and 6 (if implemented)
+ - [ ] README is complete: all 6 tasks, both baselines, dashboard description, setup instructions
+ - [ ] `pytest --cov` shows >80% coverage across all modules
+ - [ ] Final `openenv validate` passes
+ - [ ] Final Docker build <500MB, starts <60s
+ - [ ] HF Space redeployed with dashboard + all features
+
+ ---
+
+ ## Pre-Submission Gate Checklist
+
+ **Every item must be checked before submitting. Failure on any starred (*) item = disqualification.**
+
+ ### Auto-Validation Gates (*)
+
+ - [ ] * **HF Space deploys** — `curl https://<space-url>/health` returns `{"status": "ready", "tasks": N}` with HTTP 200
+ - [ ] * **HF Space responds to reset** — WebSocket connection to `/ws`, send reset message, receive valid observation
+ - [ ] * **OpenEnv spec compliance** — `openenv validate` passes (openenv.yaml present, typed models, step/reset/state work)
+ - [ ] * **Dockerfile builds** — `docker build -t pytorch-debugger .` succeeds
+ - [ ] * **Docker runs** — `docker run -p 7860:7860 pytorch-debugger` starts and serves on port 7860
+ - [ ] * **Baseline reproduces** — `python baseline_heuristic.py > run1.json && python baseline_heuristic.py > run2.json && diff run1.json run2.json` produces no output
+ - [ ] * **3+ tasks with graders** — `GET /tasks` returns ≥3 tasks; `POST /grader` returns score in [0.0, 1.0] after each task completes
+ - [ ] * **Graders produce varying scores** — different agent behaviors produce different scores (not always the same value)
+
+ ### Required Endpoint Gates (*)
+
+ - [ ] * **`GET /tasks`** — returns JSON with task IDs, difficulties, action schema
+ - [ ] * **`POST /grader`** — returns `{"score": float}` after a completed episode
+ - [ ] * **`POST /baseline`** — triggers baseline, returns scores for all tasks
+ - [ ] * **`GET /health`** — returns `{"status": "ready", "tasks": N}`
+
+ ### Submission Artifacts (*)
+
+ - [ ] * **Public GitHub repo** — contains all code, README, requirements, openenv.yaml
+ - [ ] * **HF Spaces demo link** — deployed, tagged `openenv`, accessible
+ - [ ] * **README complete** — environment description, action/observation space definitions, task descriptions with difficulty, setup instructions, baseline scores
+
+ ### Quality Gates (Not DQ, but impact scoring)
+
+ - [ ] All typed Pydantic models — no `Dict[str, Any]`
+ - [ ] `import torch` in every core module — zero `import numpy` in core
+ - [ ] Context-gated penalty fires correctly (manually tested both paths)
+ - [ ] Task 5 red herrings present: FC spike, GPU 91%, conv1 near-vanishing, error_log warning
+ - [ ] Task 6 code fix validation handles whitespace and comment variations
+ - [ ] Task 6 diagnosis is always `code_bug` regardless of bug variant
+ - [ ] Grader and reward function are separate modules
+ - [ ] Step penalty is flat -0.01 (not multiplied by step_count)
+ - [ ] Episode state is isolated per WebSocket session
+ - [ ] Test suite passes with >80% coverage
+ - [ ] Code formatted with black, linted with ruff, imports sorted with isort
+
+ ### Final Smoke Test Sequence
+
+ Run this entire sequence the night before submission:
+
+ ```bash
+ # 1. Clean build
+ docker build --no-cache -t pytorch-debugger .
+ docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
+
+ # 2. Wait for startup
+ sleep 10
+ curl -f http://localhost:7860/health || echo "FAIL: health"
+
+ # 3. Tasks endpoint
+ curl -f http://localhost:7860/tasks | python -m json.tool || echo "FAIL: tasks"
+
+ # 4. Baseline reproducibility
+ python baseline_heuristic.py > run1.json 2>/dev/null
+ python baseline_heuristic.py > run2.json 2>/dev/null
+ diff run1.json run2.json && echo "PASS: reproducible" || echo "FAIL: non-reproducible"
+
+ # 5. Baseline via endpoint
+ curl -f -X POST http://localhost:7860/baseline | python -m json.tool || echo "FAIL: baseline endpoint"
+
+ # 6. Grader via endpoint (after baseline has completed episodes)
+ curl -f -X POST http://localhost:7860/grader | python -m json.tool || echo "FAIL: grader endpoint"
+
+ # 7. OpenEnv validation
+ openenv validate || echo "FAIL: openenv validate"
+
+ # 8. Test suite
+ pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
+
+ # 9. Cleanup
+ docker stop smoke-test && docker rm smoke-test
+
+ echo "=== Smoke test complete ==="
+ ```
+
+ ### If Something Fails at Submission Time
+
+ | Failure | Triage |
+ |---|---|
+ | HF Space won't deploy | Check Dockerfile CMD, port 7860, build logs. Redeploy. |
+ | Baseline non-reproducible | Check `torch.manual_seed()` in `reset()`. Check for `random` module usage. |
+ | Grader returns same score | Check that `sample_scenario` uses different seeds. Check grader logic has branching. |
+ | `openenv validate` fails | Read error message. Usually missing field in openenv.yaml or wrong model base class. |
+ | Docker image >500MB | Check `docker images` size. Remove unused deps. Ensure torch is CPU-only. |
+ | Test coverage <80% | Run `pytest --cov` with `--cov-report=html`. Find uncovered branches. Add targeted tests. |
baseline_heuristic.py ADDED
@@ -0,0 +1,186 @@
+ #!/usr/bin/env python3
+ """Rule-based heuristic baseline agent.
+
+ Deterministic decision tree — no API key required. Bit-exact reproducible.
+ Spec reference: Section 17.
+
+ Usage:
+     python baseline_heuristic.py [--url http://localhost:7860]
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+
+ from ml_training_debugger.models import MLTrainingAction
+ from server.environment import MLTrainingEnvironment
+
+ MVP_TASKS = ["task_001", "task_003", "task_005"]
+
+
+ def _final_score(env: MLTrainingEnvironment) -> float:
+     """Read the grader score recorded on the env session, defaulting to 0.0."""
+     session = env._get_session()
+     return session.last_score if session and session.last_score is not None else 0.0
+
+
+ def run_heuristic_episode(task_id: str, seed: int = 42) -> float:
+     """Run one heuristic baseline episode. Returns grader score."""
+     env = MLTrainingEnvironment()
+     obs = env.reset(seed=seed, episode_id=f"baseline_{task_id}", task_id=task_id)
+
+     # Step 1: inspect_gradients
+     obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+
+     if obs.gradient_stats:
+         # Check exploding
+         if any(g.is_exploding for g in obs.gradient_stats):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="learning_rate",
+                     value=0.001,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="lr_too_high",
+                 )
+             )
+             return _final_score(env)
+
+         # Check vanishing
+         if any(g.is_vanishing for g in obs.gradient_stats):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="learning_rate",
+                     value=0.01,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="vanishing_gradients",
+                 )
+             )
+             return _final_score(env)
+
+     # Step 2: inspect_data_batch
+     obs = env.step(MLTrainingAction(action_type="inspect_data_batch"))
+     if obs.data_batch_stats and obs.data_batch_stats.class_overlap_score > 0.5:
+         obs = env.step(MLTrainingAction(action_type="patch_data_loader"))
+         obs = env.step(MLTrainingAction(action_type="restart_run"))
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="data_leakage",
+             )
+         )
+         return _final_score(env)
+
+     # Check overfitting (val_loss diverging)
+     if obs.val_loss_history and len(obs.val_loss_history) >= 10:
+         early = sum(obs.val_loss_history[:5]) / 5
+         late = sum(obs.val_loss_history[-5:]) / 5
+         if (
+             late > early * 1.2
+             and obs.data_batch_stats
+             and obs.data_batch_stats.class_overlap_score < 0.1
+         ):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="weight_decay",
+                     value=0.01,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="overfitting",
+                 )
+             )
+             return _final_score(env)
+
+     # Step 3: inspect_model_modes
+     obs = env.step(MLTrainingAction(action_type="inspect_model_modes"))
+     if obs.model_mode_info:
+         has_eval = any(v == "eval" for v in obs.model_mode_info.values())
+         if has_eval:
+             obs = env.step(MLTrainingAction(action_type="fix_model_mode"))
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="batchnorm_eval_mode",
+                 )
+             )
+             return _final_score(env)
+
+     # Step 4: inspect_code
+     obs = env.step(MLTrainingAction(action_type="inspect_code"))
+     if obs.code_snippet:
+         code = obs.code_snippet.code
+         if "model.eval()" in code and "model.train()" not in code:
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="fix_code",
+                     line=5,
+                     replacement="model.train()",
+                 )
+             )
+         elif ".detach()" in code:
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="fix_code",
+                     line=13,  # 1-based line of `loss = ...detach()` in the detach_loss template
+                     replacement="        loss = criterion(output, batch_y)",
+                 )
+             )
+
+         if obs.episode_state.fix_action_taken:
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="code_bug",
+             )
+         )
+         return _final_score(env)
+
+     # Fallback
+     obs = env.step(
+         MLTrainingAction(
+             action_type="mark_diagnosed",
+             diagnosis="overfitting",
+         )
+     )
+     return _final_score(env)
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description="Rule-based baseline agent")
+     # --url is accepted for interface parity with baseline_inference.py;
+     # MVP episodes run in-process, so the value is currently unused.
+     parser.add_argument("--url", default="http://localhost:7860")
+     parser.parse_args()
+
+     scores: dict[str, float] = {}
+     for task_id in MVP_TASKS:
+         score = run_heuristic_episode(task_id)
+         scores[task_id] = round(score, 4)
+
+     print(json.dumps(scores, indent=2))
+
+
+ if __name__ == "__main__":
+     main()
deploy.sh ADDED
@@ -0,0 +1,52 @@
+ #!/bin/bash
+ set -euo pipefail
+
+ echo "=== PyTorch Training Run Debugger — Pre-Submission Smoke Test ==="
+ echo ""
+
+ # 1. Run tests
+ echo "=== 1. Running test suite ==="
+ source .venv/bin/activate
+ pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
+ echo ""
+
+ # 2. Code formatting check
+ echo "=== 2. Code formatting ==="
+ black --check ml_training_debugger/ server/ tests/ || { echo "Run: black ml_training_debugger/ server/ tests/"; exit 1; }
+ ruff check ml_training_debugger/ server/ tests/ || { echo "Run: ruff check --fix"; exit 1; }
+ isort --check ml_training_debugger/ server/ tests/ --profile black || { echo "Run: isort --profile black"; exit 1; }
+ echo "PASS: formatting OK"
+ echo ""
+
+ # 3. Baseline reproducibility
+ echo "=== 3. Baseline reproducibility ==="
+ python baseline_heuristic.py > /tmp/run1.json 2>/dev/null
+ python baseline_heuristic.py > /tmp/run2.json 2>/dev/null
+ diff /tmp/run1.json /tmp/run2.json && echo "PASS: bit-exact reproducible" || { echo "FAIL: non-reproducible"; exit 1; }
+ echo ""
+
+ # 4. Docker build
+ echo "=== 4. Docker build ==="
+ docker build -t pytorch-debugger .
+ IMAGE_SIZE=$(docker images pytorch-debugger --format "{{.Size}}")
+ echo "Image size: $IMAGE_SIZE"
+ echo ""
+
+ # 5. Docker run + health check
+ echo "=== 5. Docker run + endpoint checks ==="
+ docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
+ sleep 10
+
+ curl -f http://localhost:7860/health || { echo "FAIL: health"; docker stop smoke-test; docker rm smoke-test; exit 1; }
+ echo ""
+ curl -f http://localhost:7860/tasks || { echo "FAIL: tasks"; docker stop smoke-test; docker rm smoke-test; exit 1; }
+ echo ""
+ curl -f -X POST http://localhost:7860/grader || { echo "FAIL: grader"; docker stop smoke-test; docker rm smoke-test; exit 1; }
+ echo ""
+
+ # 6. Cleanup
+ docker stop smoke-test && docker rm smoke-test
+ rm -f /tmp/run1.json /tmp/run2.json
+
+ echo ""
+ echo "=== ALL CHECKS PASSED ==="
ml-training-debugger-spec.md ADDED
The diff for this file is too large to render. See raw diff
 
ml_training_debugger/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """PyTorch Training Run Debugger — OpenEnv Environment."""
+
+ __version__ = "1.0.0"
ml_training_debugger/client.py ADDED
@@ -0,0 +1,21 @@
+ """Typed EnvClient for baseline scripts.
+
+ Extends GenericEnvClient since we can't easily subclass the
+ abstract EnvClient without implementing all transport methods.
+ Used by baseline_heuristic.py.
+ """
+
+ from __future__ import annotations
+
+ from openenv.core.generic_client import GenericEnvClient
+
+
+ class MLTrainingEnvClient(GenericEnvClient):
+     """Typed client for the PyTorch Training Debugger environment.
+
+     Wraps GenericEnvClient for convenient use in baselines.
+     Actions are sent as dicts matching MLTrainingAction schema.
+     Observations are received as dicts matching MLTrainingObservation schema.
+     """
ml_training_debugger/code_templates.py ADDED
@@ -0,0 +1,248 @@
+ """PyTorch code snippet templates for Task 6 code-level debugging.
+
+ Each template is a real, syntactically valid Python/PyTorch training script
+ with one injected bug. Spec reference: Section 11 (Task 6), Section 22.
+ """
+
+ from __future__ import annotations
+
+ import ast
+ import io
+ import tokenize
+ from typing import Optional
+
+ import torch  # noqa: F401 — PyTorch-native project
+
+ # Bug variant templates: (buggy_code, correct_line_num, correct_replacement)
+ _TEMPLATES: dict[str, tuple[str, int, str]] = {
+     "eval_mode": (
+         """\
+ import torch
+ import torch.nn as nn
+
+ model = SimpleCNN()
+ model.eval()
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+ criterion = nn.CrossEntropyLoss()
+
+ for epoch in range(100):
+     for batch_x, batch_y in train_loader:
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()""",
+         5,  # 1-based line of the buggy `model.eval()` call
+         "model.train()",
+     ),
+     "detach_loss": (
+         """\
+ import torch
+ import torch.nn as nn
+
+ model = SimpleCNN()
+ model.train()
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+ criterion = nn.CrossEntropyLoss()
+
+ for epoch in range(100):
+     for batch_x, batch_y in train_loader:
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y).detach()
+         loss.backward()
+         optimizer.step()""",
+         13,  # 1-based line of the buggy `loss = ...detach()` statement
+         "        loss = criterion(output, batch_y)",
+     ),
+     "zero_grad_missing": (
+         """\
+ import torch
+ import torch.nn as nn
+
+ model = SimpleCNN()
+ model.train()
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+ criterion = nn.CrossEntropyLoss()
+
+ for epoch in range(100):
+     for batch_x, batch_y in train_loader:
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()""",
+         11,  # 1-based insertion point: before `output = model(batch_x)`
+         "        optimizer.zero_grad()",
+     ),
+     "inplace_relu": (
+         """\
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ model = SimpleCNN()
+ model.train()
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+ criterion = nn.CrossEntropyLoss()
+
+ for epoch in range(100):
+     for batch_x, batch_y in train_loader:
+         optimizer.zero_grad()
+         output = model(batch_x)
+         output = F.relu(output, inplace=True)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()""",
+         14,  # 1-based line of the buggy `F.relu(..., inplace=True)` call
+         "        output = F.relu(output)",
+     ),
+ }
+
+ # Semantic equivalence patterns per bug variant
+ _SEMANTIC_PATTERNS: dict[str, list[tuple[str, str]]] = {
+     "eval_mode": [
+         # (must_contain, must_not_contain)
+         ("model.train()", "model.eval()"),
+     ],
+     "detach_loss": [
+         ("criterion(", ".detach()"),
+     ],
+     "zero_grad_missing": [
+         ("zero_grad()", ""),  # just needs zero_grad present
+     ],
+     "inplace_relu": [
+         ("F.relu(", "inplace=True"),
+     ],
+ }
+
+
+ def generate_code_snippet(bug_type: str, seed: int = 42) -> dict:
+     """Generate a code snippet with the specified bug.
+
+     Returns dict with keys: code, filename, line_count, imports, hint.
+     """
+     if bug_type not in _TEMPLATES:
+         raise ValueError(f"Unknown bug_type: {bug_type}")
+
+     # `seed` is part of the spec'd signature; the templates are currently
+     # fixed per variant, so it has no effect yet.
+     code, _line, _replacement = _TEMPLATES[bug_type]
+     lines = code.strip().split("\n")
+     imports = [
+         line for line in lines if line.startswith("import ") or line.startswith("from ")
+     ]
+
+     hint: Optional[str] = None
+     if bug_type == "eval_mode":
+         hint = "Check the model mode before the training loop."
+     elif bug_type == "detach_loss":
+         hint = "Examine how the loss is computed and used."
+
+     return {
+         "code": code,
+         "filename": "train.py",
+         "line_count": len(lines),
+         "imports": imports,
+         "hint": hint,
+     }
+
+
+ def _normalize_code(s: str) -> str:
+     """Strip surrounding whitespace and inline comments for comparison."""
+     result_lines: list[str] = []
+     for line in s.strip().split("\n"):
+         # Drop an inline comment unless the '#' sits inside a string literal
+         in_single = in_double = False
+         cut = len(line)
+         for i, ch in enumerate(line):
+             if ch == "'" and not in_double:
+                 in_single = not in_single
+             elif ch == '"' and not in_single:
+                 in_double = not in_double
+             elif ch == "#" and not in_single and not in_double:
+                 cut = i
+                 break
+         result_lines.append(line[:cut].rstrip())
+     return "\n".join(result_lines)
+
+
+ def _tokenize_compare(original: str, replacement: str) -> bool:
+     """Compare token streams ignoring whitespace and comments."""
+
+     def get_tokens(code: str) -> list[tuple[int, str]]:
+         try:
+             tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))
+             # Filter out COMMENT, NL, NEWLINE, INDENT, DEDENT, ENCODING, ENDMARKER
+             skip = {
+                 tokenize.COMMENT,
+                 tokenize.NL,
+                 tokenize.NEWLINE,
+                 tokenize.INDENT,
+                 tokenize.DEDENT,
+                 tokenize.ENCODING,
+                 tokenize.ENDMARKER,
+             }
+             return [(t.type, t.string) for t in tokens if t.type not in skip]
+         except (tokenize.TokenError, IndentationError):
+             return []
+
+     return get_tokens(original) == get_tokens(replacement)
+
+
+ def validate_fix(bug_type: str, line: int, replacement: str) -> bool:
+     """Validate a code fix submission.
+
+     Multi-strategy pipeline per spec Section 22:
+     1. Normalize whitespace + strip comments
+     2. Token-stream comparison
+     3. Semantic equivalence patterns
+     4. AST fallback
+     """
+     if bug_type not in _TEMPLATES:
+         return False
+
+     code, correct_line, correct_replacement = _TEMPLATES[bug_type]
+     lines = code.strip().split("\n")
+
+     # Check the line number is in bounds and points at the buggy line
+     # (spec Section 22: a wrong line number must fail)
+     if line < 1 or line > len(lines) or line != correct_line:
+         return False
+
+     # For zero_grad_missing, the fix is inserting a line, not replacing
+     if bug_type == "zero_grad_missing":
+         # Accept if the replacement contains zero_grad
+         return "zero_grad" in _normalize_code(replacement)
+
+     # Strategy 1: Normalize and compare
+     norm_replacement = _normalize_code(replacement)
+     norm_correct = _normalize_code(correct_replacement)
+     if norm_replacement == norm_correct:
+         return True
+
+     # Strategy 2: Token-stream comparison
+     if _tokenize_compare(correct_replacement, replacement):
+         return True
+
+     # Strategy 3: Semantic equivalence patterns
+     patterns = _SEMANTIC_PATTERNS.get(bug_type, [])
+     for must_contain, must_not_contain in patterns:
+         if must_contain and must_contain in norm_replacement:
+             if not must_not_contain or must_not_contain not in norm_replacement:
+                 return True
+
+     # Strategy 4: AST fallback — verify buggy pattern absent
+     try:
+         # Replace the line in the full code and parse
+         new_lines = lines.copy()
+         new_lines[line - 1] = replacement.rstrip()
+         new_code = "\n".join(new_lines)
+         ast.parse(new_code)  # raises SyntaxError if the patched file is invalid
+
+         # Check that the buggy pattern is absent
+         if bug_type == "eval_mode" and "eval" not in replacement.lower():
+             if "train" in replacement.lower():
+                 return True
+         if bug_type == "detach_loss" and "detach" not in replacement.lower():
+             return True
+         if bug_type == "inplace_relu" and "inplace" not in replacement.lower():
+             if "relu" in replacement.lower():
+                 return True
+     except SyntaxError:
+         pass
+
+     return False
ml_training_debugger/graders.py ADDED
@@ -0,0 +1,207 @@
+ """Per-task grader functions — returns normalized 0.0-1.0 score at episode end.
+
+ Separate from reward_engine.py. Evaluates EpisodeState holistically.
+ NOT a sum of step rewards. Spec reference: Section 11 grader breakdowns.
+ """
+
+ from __future__ import annotations
+
+ import torch  # noqa: F401 — PyTorch-native project
+
+ from ml_training_debugger.models import EpisodeState
+ from ml_training_debugger.scenarios import ScenarioParams
+
+ FIX_ACTIONS = frozenset(
+     {
+         "modify_config",
+         "add_callback",
+         "replace_optimizer",
+         "patch_data_loader",
+         "fix_model_mode",
+         "fix_code",
+     }
+ )
+
+
+ def _has_action(state: EpisodeState, action_type: str) -> bool:
+     return action_type in state.actions_taken
+
+
+ def _correct_diagnosis(state: EpisodeState, scenario: ScenarioParams) -> bool:
+     if not state.diagnosis_submitted:
+         return False
+     # Find the diagnosis from actions_taken metadata:
+     # we store "mark_diagnosed:<diagnosis>" in actions_taken
+     for action_str in reversed(state.actions_taken):
+         if action_str.startswith("mark_diagnosed:"):
+             submitted = action_str.split(":", 1)[1]
+             return submitted == scenario.root_cause.value
+     return False
+
+
+ def _submitted_diagnosis(state: EpisodeState) -> str | None:
+     for action_str in reversed(state.actions_taken):
+         if action_str.startswith("mark_diagnosed:"):
+             return action_str.split(":", 1)[1]
+     return None
+
+
+ def grade_task_001(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 1 — Exploding Gradients (easy). Spec Section 11."""
+     score = 0.0
+
+     # +0.05 for inspect_gradients
+     if state.gradients_inspected:
+         score += 0.05
+
+     # +0.20 for correct fix (modify_config with LR reduction)
+     if _has_action(state, "modify_config"):
+         score += 0.20
+
+     # +0.35 for restart with convergence
+     if state.restart_after_fix:
+         score += 0.35
+
+     # +0.40 for correct diagnosis
+     if _correct_diagnosis(state, scenario):
+         score += 0.40
+
+     return min(1.0, max(0.0, score))
+
+
+ def grade_task_002(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 2 — Vanishing Gradients (easy). Spec Section 11."""
+     score = 0.0
+
+     if state.gradients_inspected:
+         score += 0.05
+     if _has_action(state, "modify_config"):
+         score += 0.20
+     if state.restart_after_fix:
+         score += 0.35
+     if _correct_diagnosis(state, scenario):
+         score += 0.40
+
+     return min(1.0, max(0.0, score))
+
+
+ def grade_task_003(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 3 — Silent Data Leakage (medium). Spec Section 11."""
+     score = 0.0
+
+     # +0.05 for inspect_data_batch
+     if state.data_inspected:
+         score += 0.05
+
+     # +0.30 for patch_data_loader
+     if _has_action(state, "patch_data_loader"):
+         score += 0.30
+
+     # +0.30 for restart with convergence (val accuracy normalizes)
+     if state.restart_after_fix:
+         score += 0.30
+
+     # +0.35 for correct diagnosis
+     if _correct_diagnosis(state, scenario):
+         score += 0.35
+
+     return min(1.0, max(0.0, score))
+
+
+ def grade_task_004(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 4 — Overfitting (medium). Spec Section 11."""
+     score = 0.0
+
+     if state.data_inspected:
+         score += 0.05
+     if _has_action(state, "modify_config") or _has_action(state, "add_callback"):
+         score += 0.25
+     if state.restart_after_fix:
+         score += 0.30
+     if _correct_diagnosis(state, scenario):
+         score += 0.40
+
+     return min(1.0, max(0.0, score))
+
+
+ def grade_task_005(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 5 — BatchNorm Eval Mode (hard). Spec Section 11.
+
+     Context-gated penalty: -0.20 if add_callback after gradients_were_normal.
+     """
+     score = 0.0
+
+     # +0.05 for inspect_gradients
+     if state.gradients_inspected:
+         score += 0.05
+
+     # +0.05 for inspect_model_modes — the revealing action
+     if state.model_modes_inspected:
+         score += 0.05
+
+     # -0.20 for add_callback after gradients_were_normal
+     if (
+         _has_action(state, "add_callback")
+         and state.gradients_inspected
+         and state.gradients_were_normal
+     ):
+         score -= 0.20
+
+     # +0.25 for fix_model_mode
+     if _has_action(state, "fix_model_mode"):
+         score += 0.25
+
+     # +0.30 for restart with convergence
+     if state.restart_after_fix:
+         score += 0.30
+
+     # +0.40 for correct diagnosis
+     if _correct_diagnosis(state, scenario):
+         score += 0.40
+
+     return min(1.0, max(0.0, score))
+
+
+ def grade_task_006(state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade Task 6 — PyTorch Code Bug (hard). Spec Section 11.
+
+     Diagnosis must ALWAYS be 'code_bug' regardless of bug variant.
+     """
+     score = 0.0
+
+     # +0.05 for inspect_code
+     if state.code_inspected:
+         score += 0.05
+
+     # +0.30 for correct code fix
+     if _has_action(state, "fix_code") and state.fix_action_taken:
+         score += 0.30
+
+     # +0.25 for restart with convergence
+     if state.restart_after_fix:
+         score += 0.25
+
+     # +0.40 for correct diagnosis (must be code_bug)
+     if _correct_diagnosis(state, scenario):
+         score += 0.40
+
+     return min(1.0, max(0.0, score))
+
+
+ # Registry mapping task IDs to grader functions
+ GRADERS = {
+     "task_001": grade_task_001,
+     "task_002": grade_task_002,
+     "task_003": grade_task_003,
+     "task_004": grade_task_004,
+     "task_005": grade_task_005,
+     "task_006": grade_task_006,
+ }
+
+
+ def grade_episode(task_id: str, state: EpisodeState, scenario: ScenarioParams) -> float:
+     """Grade a completed episode. Returns 0.0-1.0."""
+     grader = GRADERS.get(task_id)
+     if grader is None:
+         return 0.0
+     return grader(state, scenario)
ml_training_debugger/models.py ADDED
@@ -0,0 +1,195 @@
+ """All Pydantic models, enums, and typed data structures.
+
+ No business logic. Pure data definitions.
+ Spec reference: Section 10 — Data Models.
+ """
+
+ from __future__ import annotations
+
+ import enum
+ from typing import Optional, Union
+
+ import torch  # noqa: F401 — PyTorch-native project, required import
+ from openenv.core.env_server.types import Action, Observation
+ from pydantic import BaseModel, Field
+
+
+ class RootCauseDiagnosis(str, enum.Enum):
+     """Closed enumeration of ML failure root causes. Spec Section 10."""
+
+     LR_TOO_HIGH = "lr_too_high"
+     VANISHING_GRADIENTS = "vanishing_gradients"
+     DATA_LEAKAGE = "data_leakage"
+     OVERFITTING = "overfitting"
+     BATCHNORM_EVAL_MODE = "batchnorm_eval_mode"
+     CODE_BUG = "code_bug"
+
+
+ VALID_DIAGNOSES: set[str] = {d.value for d in RootCauseDiagnosis}
+
+
+ class TrainingConfig(BaseModel):
+     """Typed hyperparameter configuration. Spec Section 10."""
+
+     learning_rate: float = 0.001
+     weight_decay: float = 0.0001
+     batch_size: int = 64
+     hidden_dim: int = 64
+     num_layers: int = 3
+     optimizer: str = "adam"
+     dropout_rate: float = 0.0
+     gradient_clip_norm: Optional[float] = None
+
+
+ VALID_CONFIG_KEYS: set[str] = set(TrainingConfig.model_fields.keys())
+
+
+ class GradientStats(BaseModel):
+     """Per-layer gradient information from real torch.autograd. Spec Section 10."""
+
+     layer_name: str
+     norm_history: list[float]
+     mean_norm: float
+     max_norm: float
+     is_exploding: bool  # True when mean_norm > 10.0
+     is_vanishing: bool  # True when mean_norm < 1e-6
+
+
+ class ModelWeightStats(BaseModel):
+     """Per-layer weight statistics from real state_dict(). Spec Section 10."""
+
+     layer_name: str
+     weight_norm: float
+     weight_mean: float
+     weight_std: float
+     weight_min: float
+     weight_max: float
+     dead_neuron_pct: float = 0.0
+     has_nan: bool = False
+     has_inf: bool = False
+
+
+ class DataBatchStats(BaseModel):
+     """Data batch inspection results. Spec Section 10."""
+
+     label_distribution: dict[int, float]
+     feature_mean: float
+     feature_std: float
+     null_count: int = 0
+     class_overlap_score: float
+     batch_size: int
+     duplicate_ratio: float = 0.0
+
+
+ class CodeSnippet(BaseModel):
+     """PyTorch code for Task 6 inspection. Spec Section 10."""
+
+     code: str
+     filename: str = "train.py"
+     line_count: int
+     imports: list[str]
+     hint: Optional[str] = None
+
+
+ class EpisodeState(BaseModel):
+     """Tracks agent history within an episode. Spec Section 10."""
+
+     step_count: int = 0
+     gradients_inspected: bool = False
+     gradients_were_normal: bool = False
+     data_inspected: bool = False
+     model_modes_inspected: bool = False
+     model_weights_inspected: bool = False
+     code_inspected: bool = False
+     fix_action_taken: bool = False
+     restart_after_fix: bool = False
+     diagnosis_submitted: bool = False
+     actions_taken: list[str] = Field(default_factory=list)
+
+     def compute_available_actions(self) -> list[str]:
+         """Dynamically compute available actions based on current state.
+
+         Rules from spec Section 10 — Dynamic available_actions:
+         - restart_run: only after fix_action_taken
+         - rollback_checkpoint: only after restart_after_fix
+         - fix_code: only after code_inspected
+         - mark_diagnosed: disappears after diagnosis_submitted
+         """
+         actions: list[str] = [
+             "inspect_gradients",
+             "inspect_data_batch",
+             "inspect_model_modes",
+             "inspect_model_weights",
+             "inspect_code",
+             "modify_config",
+             "add_callback",
+             "replace_optimizer",
+             "patch_data_loader",
+             "fix_model_mode",
+         ]
+         if self.code_inspected:
+             actions.append("fix_code")
+         if self.fix_action_taken:
+             actions.append("restart_run")
+         if self.restart_after_fix:
+             actions.append("rollback_checkpoint")
+         if not self.diagnosis_submitted:
+             actions.append("mark_diagnosed")
+         return actions
+
+
+ ALL_ACTION_TYPES: set[str] = {
+     "inspect_gradients",
+     "inspect_data_batch",
+     "inspect_model_modes",
+     "inspect_model_weights",
+     "inspect_code",
+     "modify_config",
+     "add_callback",
+     "replace_optimizer",
+     "patch_data_loader",
+     "fix_model_mode",
+     "fix_code",
+     "restart_run",
+     "mark_diagnosed",
+     "rollback_checkpoint",
+ }
+
+
+ class MLTrainingAction(Action):
+     """What the agent can do — extends openenv Action. Spec Section 10."""
+
+     action_type: str
+     target: Optional[str] = None
+     value: Optional[Union[float, int, str]] = None
+     diagnosis: Optional[str] = None
+     line: Optional[int] = None
+     replacement: Optional[str] = None
+
+
+ class MLTrainingObservation(Observation):
+     """Full observation — extends openenv Observation.
+
+     Observation base has built-in: done (bool), reward (float|None), metadata (dict).
+     Spec Section 10.
+     """
+
+     run_id: str = ""
+     framework: str = "pytorch"
+     epoch: int = 20
+     training_loss_history: list[float] = Field(default_factory=list)
+     val_loss_history: list[float] = Field(default_factory=list)
+     val_accuracy_history: list[float] = Field(default_factory=list)
+     gradient_stats: list[GradientStats] = Field(default_factory=list)
+     model_weight_stats: Optional[list[ModelWeightStats]] = None
+     gpu_memory_used_gb: float = 6.2
+     gpu_memory_total_gb: float = 16.0
+     learning_rate: float = 0.001
+     current_config: TrainingConfig = Field(default_factory=TrainingConfig)
+     error_log: Optional[str] = None
+     data_batch_stats: Optional[DataBatchStats] = None
+     model_mode_info: Optional[dict[str, str]] = None
+     code_snippet: Optional[CodeSnippet] = None
+     available_actions: list[str] = Field(default_factory=list)
+     episode_state: EpisodeState = Field(default_factory=EpisodeState)
+     notes: Optional[str] = None
ml_training_debugger/pytorch_engine.py ADDED
@@ -0,0 +1,240 @@
+ """PyTorch-native fault injection engine.
+
+ Real torch.nn.Module models, real torch.autograd gradients,
+ real state_dict() weight snapshots. Zero numpy.
+ Spec reference: Sections 6, 9.
+ """
+
+ from __future__ import annotations
+
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+
+ from ml_training_debugger.models import GradientStats, ModelWeightStats
+ from ml_training_debugger.scenarios import ScenarioParams
+
+
+ class SimpleCNN(nn.Module):
+     """3-layer CNN for CIFAR-10 style classification. ~67K params.
+
+     Spec Section 9 — PyTorch Model Pool.
+     """
+
+     def __init__(self, num_layers: int = 3, hidden_dim: int = 64) -> None:
+         # NOTE: num_layers and hidden_dim are currently unused; the MVP
+         # architecture is fixed at 3 conv blocks.
+         super().__init__()
+         self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
+         self.bn1 = nn.BatchNorm2d(32)
+         self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
+         self.bn2 = nn.BatchNorm2d(64)
+         self.conv3 = nn.Conv2d(64, 64, 3, padding=1)
+         self.bn3 = nn.BatchNorm2d(64)
+         self.fc = nn.Linear(64 * 4 * 4, 10)
+         self.pool = nn.MaxPool2d(2, 2)
+         self.relu = nn.ReLU()
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         x = self.pool(self.relu(self.bn1(self.conv1(x))))
+         x = self.pool(self.relu(self.bn2(self.conv2(x))))
+         x = self.pool(self.relu(self.bn3(self.conv3(x))))
+         x = x.view(x.size(0), -1)
+         x = self.fc(x)
+         return x
+
+
+ def create_model_and_inject_fault(
+     scenario: ScenarioParams,
+ ) -> tuple[nn.Module, dict]:
+     """Instantiate a real PyTorch model and inject the specified fault.
+
+     Returns:
+         (model, info_dict) where info_dict contains computed artifacts.
+     """
+     torch.manual_seed(scenario.seed)
+
+     model = SimpleCNN()
+     criterion = nn.CrossEntropyLoss()
+     info: dict = {}
+
+     # Generate random batch (CIFAR-10 style: 3x32x32)
+     batch_x = torch.randn(8, 3, 32, 32)
+     batch_y = torch.randint(0, 10, (8,))
+
+     if scenario.root_cause.value == "lr_too_high":
+         # Exploding gradients: high LR with SGD → gradients explode on all layers
+         model.train()
+         optimizer = torch.optim.SGD(
+             model.parameters(), lr=scenario.learning_rate * 10.0
+         )
+         for _ in range(3):
+             optimizer.zero_grad()
+             output = model(batch_x)
+             loss = criterion(output, batch_y)
+             loss.backward()
+             optimizer.step()
+         # Run one final backward to capture extreme gradients
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+
+     elif scenario.root_cause.value == "vanishing_gradients":
+         # Tiny LR → gradients are extremely small
+         model.train()
+         optimizer = torch.optim.SGD(model.parameters(), lr=scenario.learning_rate)
+         for _ in range(2):
+             optimizer.zero_grad()
+             output = model(batch_x)
+             loss = criterion(output, batch_y)
+             loss.backward()
+             optimizer.step()
+
+     elif scenario.root_cause.value == "data_leakage":
+         # Normal model — no gradient anomaly
+         model.train()
+         optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()
+
+     elif scenario.root_cause.value == "overfitting":
+         # Normal model with the scenario-sampled (weak) weight decay
+         model.train()
+         optimizer = torch.optim.Adam(
+             model.parameters(),
+             lr=0.001,
+             weight_decay=scenario.weight_decay,
+         )
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()
+
+     elif scenario.root_cause.value == "batchnorm_eval_mode":
+         # model.eval() before training — the real bug
+         model.eval()
+         optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+         # Still run forward/backward to get gradient data
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()
+
+     elif scenario.root_cause.value == "code_bug":
+         # Normal training with the model bug injected in code only
+         model.train()
+         optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+         optimizer.zero_grad()
+         output = model(batch_x)
+         loss = criterion(output, batch_y)
+         loss.backward()
+         optimizer.step()
+
+     return model, info
+
+
+ def extract_gradient_stats(
+     model: nn.Module,
+     scenario: Optional[ScenarioParams] = None,
+ ) -> list[GradientStats]:
+     """Extract gradient statistics from real param.grad tensors.
+
+     For Task 5 (batchnorm_eval_mode), injects red-herring spike on
+     the configured layer.
+     """
+     stats: list[GradientStats] = []
+     named_layers = [
+         ("conv1", model.conv1),
+         ("conv2", model.conv2),
+         ("conv3", model.conv3),
+         ("fc", model.fc),
+     ]
+
+     for layer_name, layer in named_layers:
+         norms: list[float] = []
+         for param in layer.parameters():
+             if param.grad is not None:
+                 norm_val = torch.norm(param.grad).item()
+                 norms.append(norm_val)
+
+         if not norms:
+             norms = [0.0]
+
+         mean_norm = sum(norms) / len(norms)
+         max_norm = max(norms)
+
+         # Build norm_history (simulated last 5 values, based on current)
+         norm_history = [mean_norm * (0.9 + 0.2 * i / 4) for i in range(5)]
+
+         # Task 5 red herring: spike on configured layer
+         if scenario and scenario.root_cause.value == "batchnorm_eval_mode":
+             if layer_name == scenario.red_herring_spike_layer:
+                 spike = scenario.red_herring_intensity
+                 norm_history = [
+                     mean_norm,
+                     mean_norm,
+                     mean_norm * spike,
+                     mean_norm * spike * 1.2,
+                     mean_norm,
+                 ]
+                 mean_norm = sum(norm_history) / len(norm_history)
+                 max_norm = max(norm_history)
+
+             # Conv1 near-vanishing red herring
+             if layer_name == "conv1" and scenario.red_herring_spike_layer != "conv1":
+                 near_vanish = 0.0003
+                 norm_history = [near_vanish * (0.95 + 0.1 * i / 4) for i in range(5)]
+                 mean_norm = near_vanish
+                 max_norm = max(norm_history)
+
+         is_exploding = mean_norm > 10.0
+         is_vanishing = mean_norm < 1e-6
+
+         stats.append(
+             GradientStats(
+                 layer_name=layer_name,
+                 norm_history=norm_history,
+                 mean_norm=mean_norm,
+                 max_norm=max_norm,
+                 is_exploding=is_exploding,
+                 is_vanishing=is_vanishing,
+             )
+         )
+
+     return stats
+
+
+ def extract_weight_stats(model: nn.Module) -> list[ModelWeightStats]:
+     """Extract weight statistics from the real model's named_parameters()."""
+     stats: list[ModelWeightStats] = []
+     for name, param in model.named_parameters():
+         if "weight" not in name:
+             continue
+         stats.append(
+             ModelWeightStats(
+                 layer_name=name,
+                 weight_norm=torch.norm(param).item(),
+                 weight_mean=param.mean().item(),
+                 weight_std=param.std().item(),
+                 weight_min=param.min().item(),
+                 weight_max=param.max().item(),
+                 dead_neuron_pct=0.0,
+                 has_nan=bool(torch.isnan(param).any().item()),
+                 has_inf=bool(torch.isinf(param).any().item()),
+             )
+         )
+     return stats
+
+
+ def extract_model_modes(model: nn.Module) -> dict[str, str]:
+     """Extract training/eval mode for each named module."""
+     modes: dict[str, str] = {}
+     for name, module in model.named_modules():
+         if name == "":
+             continue
+         modes[name] = "train" if module.training else "eval"
+     return modes
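A usage sketch for the engine above (not part of the commit), assuming the `scenarios` module from this diff is importable:

```python
# Sketch: inject the Task 5 fault, then inspect modes and gradients.
from ml_training_debugger.pytorch_engine import (
    create_model_and_inject_fault,
    extract_gradient_stats,
    extract_model_modes,
)
from ml_training_debugger.scenarios import sample_scenario

scenario = sample_scenario("task_005", seed=42)  # batchnorm_eval_mode
model, _info = create_model_and_inject_fault(scenario)

# Every submodule reports "eval" here: the injected fault called model.eval().
print(extract_model_modes(model))

for g in extract_gradient_stats(model, scenario):
    print(g.layer_name, round(g.mean_norm, 6), g.is_exploding, g.is_vanishing)
```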
ml_training_debugger/reward_engine.py ADDED
@@ -0,0 +1,104 @@
+ """Reward function — all 7 components per spec Section 12.
+
+ Separate from graders.py. Returns a float per step for RL training signal.
+ Hard cap at [-1.0, 1.0].
+ """
+
+ from __future__ import annotations
+
+ import torch  # noqa: F401 — PyTorch-native project
+
+ from ml_training_debugger.models import EpisodeState, MLTrainingAction
+ from ml_training_debugger.scenarios import ScenarioParams
+
+ # Reward constants — do not change (CLAUDE.md)
+ STEP_PENALTY = -0.01
+ INVESTIGATION_BONUS = 0.05
+ CONTEXT_GATED_PENALTY = -0.20
+ INVALID_ACTION_PENALTY = -0.05
+ WRONG_CODE_FIX_PENALTY = -0.10
+ CORRECT_DIAGNOSIS_REWARD = 0.50
+ WRONG_DIAGNOSIS_PENALTY = -0.30
+ TERMINAL_CONVERGENCE_REWARD = 0.40
+
+ INVESTIGATION_ACTIONS = frozenset(
+     {
+         "inspect_gradients",
+         "inspect_data_batch",
+         "inspect_model_modes",
+         "inspect_model_weights",
+         "inspect_code",
+     }
+ )
+
+ _INSPECTION_STATE_MAP = {
+     "inspect_gradients": "gradients_inspected",
+     "inspect_data_batch": "data_inspected",
+     "inspect_model_modes": "model_modes_inspected",
+     "inspect_model_weights": "model_weights_inspected",
+     "inspect_code": "code_inspected",
+ }
+
+
+ def compute_reward(
+     action: MLTrainingAction,
+     state: EpisodeState,
+     scenario: ScenarioParams,
+     is_valid_action: bool = True,
+     is_correct_fix: bool | None = None,
+     convergence_confirmed: bool = False,
+ ) -> float:
+     """Compute reward for a single step.
+
+     Args:
+         action: The action taken.
+         state: Episode state BEFORE the action is applied.
+         scenario: Current scenario params.
+         is_valid_action: Whether the action is in available_actions.
+         is_correct_fix: For fix_code — True/False/None.
+         convergence_confirmed: Whether restart showed convergence.
+
+     Returns:
+         Reward float, capped at [-1.0, 1.0].
+     """
+     reward = 0.0
+
+     # Component 1: Flat step penalty (unconditional)
+     reward += STEP_PENALTY
+
+     # Component 4: Invalid action penalty
+     if not is_valid_action:
+         reward += INVALID_ACTION_PENALTY
+         return max(-1.0, min(1.0, reward))
+
+     action_type = action.action_type
+
+     # Component 2: Investigation bonus (first-time only)
+     if action_type in INVESTIGATION_ACTIONS:
+         state_field = _INSPECTION_STATE_MAP.get(action_type)
+         if state_field and not getattr(state, state_field):
+             reward += INVESTIGATION_BONUS
+
+     # Component 3: Context-gated red herring penalty
+     # Fires ONLY when gradients_inspected=True AND gradients_were_normal=True
+     if action_type == "add_callback":
+         if state.gradients_inspected and state.gradients_were_normal:
+             reward += CONTEXT_GATED_PENALTY
+
+     # Component 7: Wrong code fix penalty
+     if action_type == "fix_code" and is_correct_fix is False:
+         reward += WRONG_CODE_FIX_PENALTY
+
+     # Component 5: Diagnosis outcome
+     if action_type == "mark_diagnosed":
+         if action.diagnosis == scenario.root_cause.value:
+             reward += CORRECT_DIAGNOSIS_REWARD
+         else:
+             reward += WRONG_DIAGNOSIS_PENALTY
+
+     # Component 6: Terminal convergence reward
+     if action_type == "restart_run":
+         if state.fix_action_taken and convergence_confirmed:
+             reward += TERMINAL_CONVERGENCE_REWARD
+
+     return max(-1.0, min(1.0, reward))
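A worked example of the reward arithmetic (sketch, not part of the commit). It relies on task_001 always mapping to `lr_too_high`, which scenarios.py below guarantees:

```python
# Sketch: reward values for two common steps, using the constants above.
from ml_training_debugger.models import EpisodeState, MLTrainingAction
from ml_training_debugger.reward_engine import compute_reward
from ml_training_debugger.scenarios import sample_scenario

scenario = sample_scenario("task_001", seed=42)  # root cause: lr_too_high
state = EpisodeState()

# First-time investigation: -0.01 step penalty + 0.05 bonus = 0.04
r1 = compute_reward(MLTrainingAction(action_type="inspect_gradients"), state, scenario)
assert abs(r1 - 0.04) < 1e-9

# Correct diagnosis: -0.01 step penalty + 0.50 reward = 0.49
r2 = compute_reward(
    MLTrainingAction(action_type="mark_diagnosed", diagnosis="lr_too_high"),
    state,
    scenario,
)
assert abs(r2 - 0.49) < 1e-9
```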
ml_training_debugger/scenarios.py ADDED
@@ -0,0 +1,155 @@
+ """ScenarioParams and scenario sampling.
+
+ Internal scenario configuration — not exposed to the agent.
+ Spec reference: Sections 6, 10, 11.
+ """
+
+ from __future__ import annotations
+
+ import dataclasses
+ from typing import Optional
+
+ import torch
+
+ from ml_training_debugger.models import RootCauseDiagnosis
+
+
+ @dataclasses.dataclass(frozen=True)
+ class ScenarioParams:
+     """Internal scenario parameters created at reset() time."""
+
+     task_id: str
+     root_cause: RootCauseDiagnosis
+     seed: int
+     learning_rate: float = 0.001
+     weight_decay: float = 0.0001
+     leakage_pct: float = 0.0
+     depth_multiplier: float = 1.0
+     divergence_epoch: int = 5
+     red_herring_intensity: float = 1.0
+     red_herring_spike_layer: str = "fc"
+     bug_type: Optional[str] = None
+     notes: Optional[str] = None
+     error_log: Optional[str] = None
+     gpu_memory_used_gb: float = 6.2
+     max_steps: int = 20
+
+
+ def _task_seed(task_id: str, seed: int) -> int:
+     """Derive a deterministic seed from task_id and provided seed."""
+     task_num = int(task_id.split("_")[1])
+     return seed * 1000 + task_num
+
+
+ def _choose(options: list, rng: torch.Generator) -> object:
+     """Choose a random element from a list using torch RNG."""
+     idx = int(torch.randint(0, len(options), (1,), generator=rng).item())
+     return options[idx]
+
+
+ def sample_scenario(task_id: str, seed: int = 42) -> ScenarioParams:
+     """Sample a ScenarioParams for the given task.
+
+     Args:
+         task_id: One of task_001 through task_006.
+         seed: Base seed for reproducibility.
+
+     Returns:
+         ScenarioParams with randomized fault parameters.
+
+     Raises:
+         ValueError: If task_id is unknown.
+     """
+     effective_seed = _task_seed(task_id, seed)
+     rng = torch.Generator()
+     rng.manual_seed(effective_seed)
+
+     if task_id == "task_001":
+         lr = _choose([0.05, 0.08, 0.10, 0.15, 0.30], rng)
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.LR_TOO_HIGH,
+             seed=effective_seed,
+             learning_rate=float(lr),
+             error_log=f"RuntimeError: Loss is NaN at epoch 12 (lr={lr})",
+             max_steps=20,
+         )
+
+     if task_id == "task_002":
+         lr = _choose([1e-6, 5e-6, 1e-5], rng)
+         depth_mult = _choose([1.0, 1.5, 2.0], rng)
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.VANISHING_GRADIENTS,
+             seed=effective_seed,
+             learning_rate=float(lr),
+             depth_multiplier=float(depth_mult),
+             notes=(
+                 "Training resumed from a checkpoint saved at epoch 0 — "
+                 "early learning rate warmup may still be in effect."
+             ),
+             max_steps=20,
+         )
+
+     if task_id == "task_003":
+         leakage = _choose([0.12, 0.18, 0.22, 0.28], rng)
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.DATA_LEAKAGE,
+             seed=effective_seed,
+             leakage_pct=float(leakage),
+             notes=(
+                 "Model architecture upgraded from 2-layer to 4-layer CNN "
+                 "at epoch 2. Performance improvement may reflect increased "
+                 "model capacity."
+             ),
+             max_steps=25,
+         )
+
+     if task_id == "task_004":
+         wd = _choose([0.0, 0.0001, 0.001], rng)
+         div_epoch = _choose([5, 8, 12], rng)
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.OVERFITTING,
+             seed=effective_seed,
+             weight_decay=float(wd),
+             divergence_epoch=int(div_epoch),
+             notes=(
+                 "Dataset augmentation was disabled for this run to speed "
+                 "up training. Re-enabling may improve generalization."
+             ),
+             max_steps=25,
+         )
+
+     if task_id == "task_005":
+         intensity = torch.empty(1).uniform_(0.8, 2.5, generator=rng).item()
+         spike_layer = _choose(["fc", "conv1"], rng)
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.BATCHNORM_EVAL_MODE,
+             seed=effective_seed,
+             red_herring_intensity=float(intensity),
+             red_herring_spike_layer=str(spike_layer),
+             gpu_memory_used_gb=14.56,  # 91% of 16GB — red herring
+             error_log=(
+                 "Warning: GPU memory pressure detected, consider reducing "
+                 "batch size or enabling gradient checkpointing"
+             ),
+             max_steps=30,
+         )
+
+     if task_id == "task_006":
+         bug = _choose(
+             ["eval_mode", "detach_loss", "zero_grad_missing", "inplace_relu"], rng
+         )
+         return ScenarioParams(
+             task_id=task_id,
+             root_cause=RootCauseDiagnosis.CODE_BUG,
+             seed=effective_seed,
+             bug_type=str(bug),
+             notes="Try adjusting the learning rate schedule.",
+             max_steps=30,
+         )
+
+     raise ValueError(f"Unknown task_id: {task_id}")
ml_training_debugger/simulation.py ADDED
@@ -0,0 +1,225 @@
+ """Parametric curve generation using torch.Tensor operations.
+
+ All loss/accuracy histories are generated via parametric equations.
+ Zero numpy. Spec reference: Section 6.
+ """
+
+ from __future__ import annotations
+
+ import torch
+
+ from ml_training_debugger.scenarios import ScenarioParams
+
+ EPOCHS = 20
+
+
+ def gen_loss_history(scenario: ScenarioParams) -> list[float]:
+     """Generate training loss history (20 epochs) using torch ops."""
+     torch.manual_seed(scenario.seed)
+     t = torch.arange(EPOCHS, dtype=torch.float32)
+
+     root = scenario.root_cause.value
+
+     if root == "lr_too_high":
+         # Exponentially growing loss
+         lr_tensor = torch.tensor(scenario.learning_rate, dtype=torch.float32)
+         base = torch.exp(lr_tensor * t * 0.5)
+         loss = 2.3 * base
+         # Mark divergence from epoch 12 onward with inf (matching the
+         # "Loss is NaN at epoch 12" error log)
+         loss_list = loss.tolist()
+         for i in range(12, EPOCHS):
+             loss_list[i] = float("inf")
+         return loss_list
+
+     if root == "vanishing_gradients":
+         # Flat loss — barely decreases
+         noise = torch.randn(EPOCHS) * 0.02
+         loss = 2.3 - t * 0.002 + noise
+         return loss.clamp(min=0.01).tolist()
+
+     if root == "data_leakage":
+         # Normal-looking training loss
+         loss = 2.3 * torch.exp(-0.15 * t) + 0.05
+         noise = torch.randn(EPOCHS) * 0.02
+         return (loss + noise).clamp(min=0.01).tolist()
+
+     if root == "overfitting":
+         # Steadily decreasing to near-zero
+         loss = 2.3 * torch.exp(-0.25 * t) + 0.01
+         noise = torch.randn(EPOCHS) * 0.01
+         return (loss + noise).clamp(min=0.001).tolist()
+
+     if root == "batchnorm_eval_mode":
+         # Roughly normal with higher variance
+         base = 2.3 * torch.exp(-0.1 * t) + 0.3
+         noise = torch.randn(EPOCHS) * 0.15
+         return (base + noise).clamp(min=0.1).tolist()
+
+     if root == "code_bug":
+         # Varies by bug variant — generic anomalous
+         loss = 2.3 * torch.exp(-0.05 * t) + 0.5
+         noise = torch.randn(EPOCHS) * 0.1
+         return (loss + noise).clamp(min=0.1).tolist()
+
+     # Fallback
+     return (2.3 * torch.exp(-0.1 * t)).tolist()
+
+
+ def gen_val_accuracy_history(scenario: ScenarioParams) -> list[float]:
+     """Generate validation accuracy history (20 epochs) using torch ops."""
+     torch.manual_seed(scenario.seed + 1)
+     t = torch.arange(EPOCHS, dtype=torch.float32)
+
+     root = scenario.root_cause.value
+
+     if root == "lr_too_high":
+         # Collapses along with training loss
+         acc = torch.sigmoid(torch.linspace(0, -3, EPOCHS)) * 0.5
+         return acc.clamp(0.0, 1.0).tolist()
+
+     if root == "vanishing_gradients":
+         # Near random chance
+         noise = torch.randn(EPOCHS) * 0.02
+         acc = 0.10 + t * 0.001 + noise
+         return acc.clamp(0.0, 1.0).tolist()
+
+     if root == "data_leakage":
+         # Suspiciously high from epoch 1
+         leakage = torch.tensor(scenario.leakage_pct, dtype=torch.float32)
+         base = torch.sigmoid(torch.linspace(-3, 3, EPOCHS))
+         acc = base * (1.0 - leakage) + leakage * 0.95
+         acc = acc.clamp(0.0, 1.0)
+         # Inflate every epoch so accuracy is suspiciously high from epoch 1
+         acc_list = acc.tolist()
+         for i in range(EPOCHS):
+             acc_list[i] = max(acc_list[i], 0.82 * (1.0 + scenario.leakage_pct))
+         return [min(v, 0.99) for v in acc_list]
+
+     if root == "overfitting":
+         # Rises then falls — classic divergence
+         div = scenario.divergence_epoch
+         acc_list: list[float] = []
+         for i in range(EPOCHS):
+             if i < div:
+                 val = 0.10 + (0.75 - 0.10) * (i / max(div, 1))
+             else:
+                 decline = (i - div) * 0.02
+                 val = 0.75 - decline
+             acc_list.append(max(0.0, min(1.0, val)))
+         return acc_list
+
+     if root == "batchnorm_eval_mode":
+         # Slow degradation ~1-2% per epoch
+         start = 0.76
+         noise = torch.randn(EPOCHS) * 0.01
+         acc = torch.tensor(
+             [start - 0.015 * i for i in range(EPOCHS)], dtype=torch.float32
+         )
+         acc = acc + noise
+         return acc.clamp(0.0, 1.0).tolist()
+
+     if root == "code_bug":
+         # Anomalous — depends on variant but generally poor
+         noise = torch.randn(EPOCHS) * 0.03
+         acc = 0.10 + t * 0.005 + noise
+         return acc.clamp(0.0, 1.0).tolist()
+
+     # Fallback
+     return (torch.sigmoid(torch.linspace(-3, 3, EPOCHS)) * 0.9).tolist()
+
+
+ def gen_val_loss_history(scenario: ScenarioParams) -> list[float]:
+     """Generate validation loss history (20 epochs) using torch ops."""
+     torch.manual_seed(scenario.seed + 2)
+     t = torch.arange(EPOCHS, dtype=torch.float32)
+
+     root = scenario.root_cause.value
+
+     if root == "lr_too_high":
+         # Mirrors training loss divergence
+         lr_tensor = torch.tensor(scenario.learning_rate, dtype=torch.float32)
+         loss = 2.3 * torch.exp(lr_tensor * t * 0.5)
+         loss_list = loss.tolist()
+         for i in range(12, EPOCHS):
+             loss_list[i] = float("inf")
+         return loss_list
+
+     if root == "vanishing_gradients":
+         noise = torch.randn(EPOCHS) * 0.02
+         loss = 2.3 - t * 0.001 + noise
+         return loss.clamp(min=0.01).tolist()
+
+     if root == "data_leakage":
+         # Low val loss (because leaking train data into val)
+         base = 2.3 * torch.exp(-0.2 * t) + 0.03
+         noise = torch.randn(EPOCHS) * 0.02
+         return (base + noise).clamp(min=0.01).tolist()
+
+     if root == "overfitting":
+         # Initially decreases, then diverges upward
+         div = scenario.divergence_epoch
+         loss_list: list[float] = []
+         for i in range(EPOCHS):
+             if i < div:
+                 val = 2.3 * (1.0 - 0.8 * i / max(div, 1))
+             else:
+                 val = 0.46 + 0.1 * (i - div)
+             loss_list.append(max(0.01, val))
+         return loss_list
+
+     if root == "batchnorm_eval_mode":
+         # Slightly increasing
+         base = 1.5 + t * 0.03
+         noise = torch.randn(EPOCHS) * 0.1
+         return (base + noise).clamp(min=0.1).tolist()
+
+     if root == "code_bug":
+         loss = 2.3 * torch.exp(-0.03 * t) + 0.8
+         noise = torch.randn(EPOCHS) * 0.1
+         return (loss + noise).clamp(min=0.1).tolist()
+
+     # Fallback
+     return (2.3 * torch.exp(-0.1 * t) + 0.1).tolist()
+
+
+ def gen_data_batch_stats(scenario: ScenarioParams) -> dict:
+     """Generate data batch statistics for the scenario."""
+     torch.manual_seed(scenario.seed + 3)
+
+     root = scenario.root_cause.value
+
+     if root == "data_leakage":
+         overlap = 0.5 + scenario.leakage_pct * 1.5  # 0.68-0.92 for sampled leakage
+         overlap = min(overlap, 0.92)
+         return {
+             "label_distribution": {i: 0.1 for i in range(10)},
+             "feature_mean": 0.45 + torch.randn(1).item() * 0.05,
+             "feature_std": 0.22 + torch.randn(1).item() * 0.02,
+             "null_count": 0,
+             "class_overlap_score": overlap,
+             "batch_size": 64,
+             "duplicate_ratio": scenario.leakage_pct,
+         }
+
+     if root == "overfitting":
+         return {
+             "label_distribution": {i: 0.1 for i in range(10)},
+             "feature_mean": 0.48 + torch.randn(1).item() * 0.03,
+             "feature_std": 0.25 + torch.randn(1).item() * 0.02,
+             "null_count": 0,
+             "class_overlap_score": 0.0,
+             "batch_size": 64,
+             "duplicate_ratio": 0.0,
+         }
+
+     # Default: normal data
+     return {
+         "label_distribution": {i: 0.1 for i in range(10)},
+         "feature_mean": 0.47 + torch.randn(1).item() * 0.03,
+         "feature_std": 0.24 + torch.randn(1).item() * 0.02,
+         "null_count": 0,
+         "class_overlap_score": 0.0 + torch.randn(1).abs().item() * 0.05,
+         "batch_size": 64,
+         "duplicate_ratio": 0.0,
+     }
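A shape-and-signature check for the curves above (sketch, not part of the commit). The overfitting assertions hold for any sampled `divergence_epoch` given the piecewise formula:

```python
# Sketch: every generator returns a plain list of EPOCHS floats, and the
# overfitting val-loss curve bottoms out at the divergence epoch.
from ml_training_debugger.scenarios import sample_scenario
from ml_training_debugger.simulation import EPOCHS, gen_val_loss_history

scenario = sample_scenario("task_004", seed=42)  # overfitting
val_loss = gen_val_loss_history(scenario)
assert len(val_loss) == EPOCHS == 20

div = scenario.divergence_epoch
assert val_loss[div] < val_loss[div - 1]  # still falling into the divergence epoch
assert val_loss[-1] > val_loss[div]       # climbing afterwards
```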
openenv.yaml ADDED
@@ -0,0 +1,58 @@
+ spec_version: 1
+ name: pytorch-training-debugger
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 7860
+
+ version: "1.0.0"
+ description: |
+   PyTorch-native fault injection engine for training failure debugging.
+   An AI agent investigates, diagnoses, fixes, and verifies broken
+   training runs using real torch.nn.Module models, torch.autograd
+   gradients, state_dict() weight inspection, and PyTorch code-level
+   debugging. 3 tasks across 3 difficulty tiers with context-gated
+   reward shaping.
+ framework: openenv
+ tags:
+   - ml-debugging
+   - pytorch
+   - reinforcement-learning
+   - root-cause-analysis
+   - fault-injection
+   - openenv
+
+ observation_space:
+   type: MLTrainingObservation
+   description: "Training run snapshot with progressive reveal — gradients, weights, data stats, model modes revealed on inspection"
+
+ action_space:
+   type: MLTrainingAction
+   description: "Investigation, fix, and diagnosis actions with dynamic availability"
+
+ tasks:
+   - id: task_001
+     difficulty: easy
+     max_steps: 20
+   - id: task_003
+     difficulty: medium
+     max_steps: 25
+   - id: task_005
+     difficulty: hard
+     max_steps: 30
+
+ reward:
+   range: [-1.0, 1.0]
+   shaped: true
+   step_penalty: -0.01
+   investigation_bonus: 0.05
+   max_investigation_bonus: 0.25
+   correct_diagnosis: 0.50
+   terminal_convergence: 0.40
+
+ endpoints:
+   websocket: "/ws"
+   tasks: "GET /tasks"
+   grader: "POST /grader"
+   baseline: "POST /baseline"
+   health: "GET /health"
pyproject.toml ADDED
@@ -0,0 +1,41 @@
+ [project]
+ name = "pytorch-training-debugger"
+ version = "1.0.0"
+ description = "OpenEnv RL environment for PyTorch training failure debugging"
+ requires-python = ">=3.12"
+ dependencies = [
+     "torch",
+     "openenv-core",
+     "pydantic>=2.0",
+     "fastapi",
+     "uvicorn",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest",
+     "pytest-cov",
+     "pytest-asyncio",
+     "black",
+     "ruff",
+     "isort",
+     "httpx",
+     "websockets",
+ ]
+ llm = [
+     "openai",
+ ]
+
+ [tool.black]
+ line-length = 88
+
+ [tool.isort]
+ profile = "black"
+
+ [tool.ruff]
+ line-length = 88
+ target-version = "py312"
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+ asyncio_mode = "auto"
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ torch
+ openenv-core
+ pydantic>=2.0
+ fastapi
+ uvicorn
+ openai
server/__init__.py ADDED
File without changes
server/_baseline_results.py ADDED
@@ -0,0 +1,27 @@
+ """Shared state for grader results across endpoints."""
+
+ from __future__ import annotations
+
+ from typing import Optional
+
+ # Store last completed episode results
+ _last_results: dict[str, dict] = {}
+
+
+ def store_grader_result(
+     session_id: str, score: float, task_id: str, steps: int
+ ) -> None:
+     """Store a grader result for retrieval."""
+     _last_results[session_id] = {
+         "score": round(score, 4),
+         "task_id": task_id,
+         "steps": steps,
+     }
+     _last_results["_latest"] = _last_results[session_id]
+
+
+ def get_last_grader_result(session_id: Optional[str] = None) -> dict | None:
+     """Get grader result for a session, or the most recent one."""
+     if session_id:
+         return _last_results.get(session_id)
+     return _last_results.get("_latest")
server/app.py ADDED
@@ -0,0 +1,287 @@
+ """FastAPI app — openenv create_app() + custom hackathon routes.
+
+ Spec reference: Sections 9, 14.
+ """
+
+ from __future__ import annotations
+
+ import asyncio
+ import logging
+ from typing import Optional
+
+ from fastapi import FastAPI
+ from fastapi.responses import JSONResponse
+ from openenv.core.env_server.http_server import create_app
+
+ from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
+ from server.environment import MLTrainingEnvironment
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}',
+ )
+ logger = logging.getLogger(__name__)
+
+ # MVP task list
+ MVP_TASKS = [
+     {"id": "task_001", "difficulty": "easy", "max_steps": 20},
+     {"id": "task_003", "difficulty": "medium", "max_steps": 25},
+     {"id": "task_005", "difficulty": "hard", "max_steps": 30},
+ ]
+
+ # create_app takes the class (factory), not an instance
+ app: FastAPI = create_app(
+     MLTrainingEnvironment,
+     MLTrainingAction,
+     MLTrainingObservation,
+     env_name="pytorch_training_debugger",
+     max_concurrent_envs=5,
+ )
+
+ # Override framework's /health route with our custom version
+ # Remove the framework's health route first
+ app.routes[:] = [
+     r for r in app.routes if not (hasattr(r, "path") and r.path == "/health")
+ ]
+
+ # Track baseline state
+ _baseline_lock = asyncio.Lock()
+ _baseline_running = False
+
+
+ @app.get("/health")
+ def health_check() -> dict:
+     """Health check — required by hackathon auto-validator."""
+     return {"status": "ready", "tasks": len(MVP_TASKS)}
+
+
+ @app.get("/tasks")
+ def get_tasks() -> list[dict]:
+     """Return task list with IDs, difficulties, and action schema."""
+     schema = MLTrainingAction.model_json_schema()
+     return [{**task, "action_schema": schema} for task in MVP_TASKS]
+
+
+ @app.post("/grader")
+ def post_grader(session_id: Optional[str] = None) -> dict:
+     """Return grader score for most recently completed episode.
+
+     Edge cases per spec Section 14:
+     - No episode completed → {"score": null, "error": "no_completed_episode"}
+     - Episode in progress → {"score": null, "error": "episode_in_progress"}
+     - Episode completed → {"score": float, "task_id": str, "steps": int}
+     """
+     # The framework manages environment instances internally,
+     # so we use the internal baseline results for the /grader endpoint
+     from server._baseline_results import get_last_grader_result
+
+     result = get_last_grader_result(session_id)
+     if result is None:
+         return {"score": None, "error": "no_completed_episode"}
+     return result
+
+
+ @app.post("/baseline", response_model=None)
+ async def post_baseline():
+     """Trigger baseline run, return scores for all tasks.
+
+     Returns 409 if already running.
+     """
+     global _baseline_running
+
+     if _baseline_running:
+         return JSONResponse(
+             status_code=409,
+             content={"error": "baseline_in_progress"},
+         )
+
+     _baseline_running = True
+     try:
+         scores = await _run_baseline()
+         return {"scores": scores}
+     finally:
+         _baseline_running = False
+
+
+ async def _run_baseline() -> dict[str, float]:
+     """Run the rule-based baseline internally."""
+
+     scores: dict[str, float] = {}
+
+     for task_info in MVP_TASKS:
+         task_id = task_info["id"]
+         env = MLTrainingEnvironment()
+         obs = env.reset(seed=42, episode_id=f"baseline_{task_id}", task_id=task_id)
+
+         # Run heuristic decision tree
+         score = _run_heuristic_episode(env, obs, task_id)
+         scores[task_id] = round(score, 4)
+
+     return scores
+
+
+ def _run_heuristic_episode(
+     env: MLTrainingEnvironment,
+     obs: MLTrainingObservation,
+     task_id: str,
+ ) -> float:
+     """Run one heuristic baseline episode. Returns grader score."""
+     # Step 1: inspect_gradients
+     obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+
+     if obs.gradient_stats:
+         # Check for exploding gradients
+         if any(g.is_exploding for g in obs.gradient_stats):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="learning_rate",
+                     value=0.001,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="lr_too_high",
+                 )
+             )
+             session = env._get_session()
+             if session and session.last_score is not None:
+                 return session.last_score
+             return 0.0
+
+         # Check for vanishing gradients
+         if any(g.is_vanishing for g in obs.gradient_stats):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="learning_rate",
+                     value=0.01,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="vanishing_gradients",
+                 )
+             )
+             session = env._get_session()
+             if session and session.last_score is not None:
+                 return session.last_score
+             return 0.0
+
+     # Step 2: inspect_data_batch
+     obs = env.step(MLTrainingAction(action_type="inspect_data_batch"))
+     if obs.data_batch_stats and obs.data_batch_stats.class_overlap_score > 0.5:
+         obs = env.step(MLTrainingAction(action_type="patch_data_loader"))
+         obs = env.step(MLTrainingAction(action_type="restart_run"))
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="data_leakage",
+             )
+         )
+         session = env._get_session()
+         if session and session.last_score is not None:
+             return session.last_score
+         return 0.0
+
+     # Check for overfitting (val_loss diverging)
+     if obs.val_loss_history and len(obs.val_loss_history) >= 10:
+         early = sum(obs.val_loss_history[:5]) / 5
+         late = sum(obs.val_loss_history[-5:]) / 5
+         if (
+             late > early * 1.2
+             and obs.data_batch_stats
+             and obs.data_batch_stats.class_overlap_score < 0.1
+         ):
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="modify_config",
+                     target="weight_decay",
+                     value=0.01,
+                 )
+             )
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="overfitting",
+                 )
+             )
+             session = env._get_session()
+             if session and session.last_score is not None:
+                 return session.last_score
+             return 0.0
+
+     # Step 3: inspect_model_modes
+     obs = env.step(MLTrainingAction(action_type="inspect_model_modes"))
+     if obs.model_mode_info:
+         has_eval = any(v == "eval" for v in obs.model_mode_info.values())
+         if has_eval:
+             obs = env.step(MLTrainingAction(action_type="fix_model_mode"))
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="mark_diagnosed",
+                     diagnosis="batchnorm_eval_mode",
+                 )
+             )
+             session = env._get_session()
+             if session and session.last_score is not None:
+                 return session.last_score
+             return 0.0
+
+     # Step 4: inspect_code (for Task 6)
+     obs = env.step(MLTrainingAction(action_type="inspect_code"))
+     if obs.code_snippet:
+         # Simple pattern matching for known bugs
+         code = obs.code_snippet.code
+         if "model.eval()" in code and "model.train()" not in code:
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="fix_code",
+                     line=5,
+                     replacement="model.train()",
+                 )
+             )
+         elif ".detach()" in code:
+             obs = env.step(
+                 MLTrainingAction(
+                     action_type="fix_code",
+                     line=14,
+                     replacement="    loss = criterion(output, batch_y)",
+                 )
+             )
+         else:
+             # Can't reliably fix — just diagnose
+             pass
+
+         if obs.episode_state.fix_action_taken:
+             obs = env.step(MLTrainingAction(action_type="restart_run"))
+
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="code_bug",
+             )
+         )
+         session = env._get_session()
+         if session and session.last_score is not None:
+             return session.last_score
+         return 0.0
+
+     # Fallback
+     obs = env.step(
+         MLTrainingAction(
+             action_type="mark_diagnosed",
+             diagnosis="overfitting",
+         )
+     )
+     session = env._get_session()
+     if session and session.last_score is not None:
+         return session.last_score
+     return 0.0
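An in-process sketch exercising the custom routes (not part of the commit). It assumes `create_app` imports cleanly in the test process and uses FastAPI's `TestClient`, which is backed by the `httpx` dev dependency:

```python
# Sketch: drive the custom routes without starting uvicorn.
from fastapi.testclient import TestClient

from server.app import app

client = TestClient(app)
assert client.get("/health").json() == {"status": "ready", "tasks": 3}

tasks = client.get("/tasks").json()
assert {t["id"] for t in tasks} == {"task_001", "task_003", "task_005"}

# Before any episode has finished, /grader reports the documented edge case.
assert client.post("/grader").json()["error"] == "no_completed_episode"
```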
server/environment.py ADDED
@@ -0,0 +1,516 @@
+ """MLTrainingEnvironment — extends openenv Environment.
+
+ Full implementation of reset() and step() with session isolation,
+ progressive information reveal, and comprehensive error handling.
+ step() NEVER raises an unhandled exception.
+ Spec reference: Sections 9, 13, 16.
+ """
+
+ from __future__ import annotations
+
+ import dataclasses
+ import logging
+ import uuid
+ from typing import Any, Optional
+
+ import torch
+ from openenv.core.env_server.interfaces import Environment
+
+ from ml_training_debugger.code_templates import (
+     generate_code_snippet,
+     validate_fix,
+ )
+ from ml_training_debugger.graders import grade_episode
+ from ml_training_debugger.models import (
+     ALL_ACTION_TYPES,
+     VALID_CONFIG_KEYS,
+     VALID_DIAGNOSES,
+     CodeSnippet,
+     DataBatchStats,
+     EpisodeState,
+     MLTrainingAction,
+     MLTrainingObservation,
+     TrainingConfig,
+ )
+ from ml_training_debugger.pytorch_engine import (
+     create_model_and_inject_fault,
+     extract_gradient_stats,
+     extract_model_modes,
+     extract_weight_stats,
+ )
+ from ml_training_debugger.reward_engine import compute_reward
+ from ml_training_debugger.scenarios import ScenarioParams, sample_scenario
+ from ml_training_debugger.simulation import (
+     gen_data_batch_stats,
+     gen_loss_history,
+     gen_val_accuracy_history,
+     gen_val_loss_history,
+ )
+
+ logger = logging.getLogger(__name__)
+
+
+ @dataclasses.dataclass
+ class SessionData:
+     """Per-session episode data."""
+
+     scenario: ScenarioParams
+     model: torch.nn.Module
+     state: EpisodeState
+     config: TrainingConfig
+     gradient_stats: list[Any]
+     weight_stats: list[Any] | None
+     model_modes: dict[str, str] | None
+     data_batch_stats_raw: dict | None
+     code_snippet_raw: dict | None
+     loss_history: list[float]
+     val_acc_history: list[float]
+     val_loss_history: list[float]
+     done: bool
+     last_score: float | None
+     convergence_after_fix: bool
+
+
+ class MLTrainingEnvironment(Environment[MLTrainingAction, MLTrainingObservation, dict]):
+     """OpenEnv environment for PyTorch training run debugging.
+
+     Spec Section 9 — Architecture.
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     def __init__(self, **kwargs: Any) -> None:
+         super().__init__(**kwargs)
+         self._sessions: dict[str, SessionData] = {}
+         self._last_completed: dict[str, dict] = {}
+         self._current_session_id: str = ""
+
+     def _get_session(self, episode_id: str | None = None) -> SessionData | None:
+         sid = episode_id or self._current_session_id
+         return self._sessions.get(sid)
+
+     def _build_observation(
+         self, session: SessionData, reward: float = 0.0
+     ) -> MLTrainingObservation:
+         """Build observation from session data."""
+         state = session.state
+
+         gradient_stats_models = []
+         if state.gradients_inspected and session.gradient_stats:
+             gradient_stats_models = session.gradient_stats
+
+         weight_stats_models = None
+         if state.model_weights_inspected and session.weight_stats is not None:
+             weight_stats_models = session.weight_stats
+
+         data_batch = None
+         if state.data_inspected and session.data_batch_stats_raw is not None:
+             data_batch = DataBatchStats(**session.data_batch_stats_raw)
+
+         model_modes = None
+         if state.model_modes_inspected and session.model_modes is not None:
+             model_modes = session.model_modes
+
+         code_snippet = None
+         if state.code_inspected and session.code_snippet_raw is not None:
+             code_snippet = CodeSnippet(**session.code_snippet_raw)
+
+         return MLTrainingObservation(
+             run_id=self._current_session_id,
+             framework="pytorch",
+             epoch=20,
+             training_loss_history=session.loss_history,
+             val_loss_history=session.val_loss_history,
+             val_accuracy_history=session.val_acc_history,
+             gradient_stats=gradient_stats_models,
+             model_weight_stats=weight_stats_models,
+             gpu_memory_used_gb=session.scenario.gpu_memory_used_gb,
+             gpu_memory_total_gb=16.0,
+             learning_rate=session.config.learning_rate,
+             current_config=session.config,
+             error_log=session.scenario.error_log,
+             data_batch_stats=data_batch,
+             model_mode_info=model_modes,
+             code_snippet=code_snippet,
+             available_actions=state.compute_available_actions(),
+             episode_state=state,
+             notes=session.scenario.notes,
+             done=session.done,
+             reward=reward,
+         )
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs: Any,
+     ) -> MLTrainingObservation:
+         """Reset environment for a new episode. Spec Section 13."""
+         # Determine task_id — passed via kwargs or defaults to task_001
+         task_id = kwargs.get("task_id", "task_001")
+
+         # If called with episode_id that has an active session, terminate it
+         session_id = episode_id or str(uuid.uuid4())
+         if session_id in self._sessions:
+             old = self._sessions[session_id]
+             if not old.done:
+                 score = grade_episode(old.scenario.task_id, old.state, old.scenario)
+                 self._last_completed[session_id] = {
+                     "score": score,
+                     "task_id": old.scenario.task_id,
+                     "steps": old.state.step_count,
+                 }
+
+         self._current_session_id = session_id
+
+         # Derive deterministic seed
+         base_seed = seed if seed is not None else 42
+         scenario = sample_scenario(task_id, base_seed)
+
+         # Set torch seed for reproducibility
+         torch.manual_seed(scenario.seed)
+
+         # Create real PyTorch model with fault injection
+         model, info = create_model_and_inject_fault(scenario)
+
+         # Generate parametric curves
+         loss_history = gen_loss_history(scenario)
+         val_acc_history = gen_val_accuracy_history(scenario)
+         val_loss_history = gen_val_loss_history(scenario)
+
+         # Pre-generate data batch stats
+         data_batch_raw = gen_data_batch_stats(scenario)
+
+         # Pre-generate code snippet (for Task 6)
+         code_snippet_raw = None
+         if scenario.bug_type is not None:
+             code_snippet_raw = generate_code_snippet(scenario.bug_type, scenario.seed)
+
+         # Build initial config from scenario
+         config = TrainingConfig(
+             learning_rate=scenario.learning_rate,
+             weight_decay=scenario.weight_decay,
+         )
+
+         # Create fresh episode state
+         state = EpisodeState()
+
+         session = SessionData(
+             scenario=scenario,
+             model=model,
+             state=state,
+             config=config,
+             gradient_stats=[],
+             weight_stats=None,
+             model_modes=None,
+             data_batch_stats_raw=data_batch_raw,
+             code_snippet_raw=code_snippet_raw,
+             loss_history=loss_history,
+             val_acc_history=val_acc_history,
+             val_loss_history=val_loss_history,
+             done=False,
+             last_score=None,
+             convergence_after_fix=False,
+         )
+
+         self._sessions[session_id] = session
+
+         logger.info(
+             "reset",
+             extra={
+                 "session_id": session_id,
+                 "task_id": task_id,
+                 "scenario_seed": scenario.seed,
+             },
+         )
+
+         return self._build_observation(session)
+
+     def step(
+         self,
+         action: MLTrainingAction,
+         timeout_s: Optional[float] = None,
+         **kwargs: Any,
+     ) -> MLTrainingObservation:
+         """Process one agent action. NEVER raises. Spec Sections 13, 16."""
+         session = self._get_session()
+
+         # No active episode
+         if session is None:
+             return MLTrainingObservation(
+                 done=True,
+                 reward=0.0,
+                 error_log="Error: no active episode. Call reset(task_id) first.",
+             )
+
+         # Episode already done
+         if session.done:
+             return self._build_observation(session, reward=0.0)
+
+         state = session.state
+         scenario = session.scenario
+         action_type = action.action_type
+
+         # Increment step count
+         state.step_count += 1
+
+         # Validate action_type is a known type
+         if action_type not in ALL_ACTION_TYPES:
+             reward = compute_reward(action, state, scenario, is_valid_action=False)
+             state.actions_taken.append(f"invalid:{action_type}")
+             obs = self._build_observation(session, reward=reward)
+             obs.error_log = (
+                 f"Invalid action_type: {action_type}. "
+                 f"Valid types: {sorted(ALL_ACTION_TYPES)}"
+             )
+             return obs
+
+         # Check if action is in available_actions
+         available = state.compute_available_actions()
+         if action_type not in available:
+             reward = compute_reward(action, state, scenario, is_valid_action=False)
+             state.actions_taken.append(f"unavailable:{action_type}")
+             obs = self._build_observation(session, reward=reward)
+             obs.error_log = (
+                 f"Action '{action_type}' not available. Available: {available}"
+             )
+             return obs
+
+         # Validate required fields for specific actions
+         error = self._validate_action_fields(action)
+         if error is not None:
+             reward = compute_reward(action, state, scenario, is_valid_action=False)
+             state.actions_taken.append(f"malformed:{action_type}")
+             obs = self._build_observation(session, reward=reward)
+             obs.error_log = error
+             return obs
+
+         # Dispatch action
+         is_correct_fix: bool | None = None
+         convergence = False
+
+         try:
+             is_correct_fix, convergence = self._dispatch_action(action, session)
+         except Exception as exc:
+             logger.error(
+                 "step_error",
+                 extra={
+                     "session_id": self._current_session_id,
+                     "action": action_type,
+                     "error": str(exc),
+                 },
+                 exc_info=True,
+             )
+             reward = compute_reward(action, state, scenario, is_valid_action=False)
+             obs = self._build_observation(session, reward=reward)
+             obs.error_log = f"Internal error processing {action_type}: {exc}"
+             return obs
+
+         # Record action
+         if action_type == "mark_diagnosed" and action.diagnosis:
+             state.actions_taken.append(f"mark_diagnosed:{action.diagnosis}")
+         else:
+             state.actions_taken.append(action_type)
+
+         # Compute reward
+         reward = compute_reward(
+             action,
+             state,
+             scenario,
+             is_valid_action=True,
+             is_correct_fix=is_correct_fix,
+             convergence_confirmed=convergence,
+         )
+
+         # Check step limit
+         if state.step_count >= scenario.max_steps and not session.done:
+             session.done = True
+
+         # Check done
+         if session.done:
+             score = grade_episode(scenario.task_id, state, scenario)
+             session.last_score = score
+             self._last_completed[self._current_session_id] = {
+                 "score": score,
+                 "task_id": scenario.task_id,
+                 "steps": state.step_count,
+             }
+             logger.info(
+                 "episode_completed",
+                 extra={
+                     "session_id": self._current_session_id,
+                     "task_id": scenario.task_id,
+                     "steps": state.step_count,
+                     "score": score,
+                 },
+             )
+
+         logger.info(
+             "step",
+             extra={
+                 "session_id": self._current_session_id,
+                 "step_count": state.step_count,
+                 "action_type": action_type,
+                 "reward": reward,
+             },
+         )
+
+         return self._build_observation(session, reward=reward)
+
+     def _validate_action_fields(self, action: MLTrainingAction) -> str | None:
+         """Validate required fields for specific actions. Return error or None."""
+         if action.action_type == "modify_config":
+             if action.target is None or action.value is None:
+                 return "modify_config requires 'target' and 'value' fields"
+             if action.target not in VALID_CONFIG_KEYS:
+                 return (
+                     f"Unknown config key: {action.target}. "
+                     f"Valid: {sorted(VALID_CONFIG_KEYS)}"
+                 )
+
+         if action.action_type == "mark_diagnosed":
+             if action.diagnosis is None:
+                 return "mark_diagnosed requires 'diagnosis' field"
+             if action.diagnosis not in VALID_DIAGNOSES:
+                 return (
+                     f"Invalid diagnosis: {action.diagnosis}. "
+                     f"Valid: {sorted(VALID_DIAGNOSES)}"
+                 )
+
+         if action.action_type == "fix_code":
+             if action.line is None or action.replacement is None:
+                 return "fix_code requires 'line' and 'replacement' fields"
+
+         return None
+
+     def _dispatch_action(
+         self, action: MLTrainingAction, session: SessionData
+     ) -> tuple[bool | None, bool]:
+         """Dispatch action to handler. Returns (is_correct_fix, convergence)."""
+         state = session.state
+         scenario = session.scenario
+         is_correct_fix: bool | None = None
+         convergence = False
+
+         at = action.action_type
+
+         if at == "inspect_gradients":
+             if not state.gradients_inspected:
+                 stats = extract_gradient_stats(session.model, scenario)
+                 session.gradient_stats = stats
+                 state.gradients_inspected = True
+                 # Set gradients_were_normal: True if ALL layers is_exploding=False
+                 state.gradients_were_normal = all(not s.is_exploding for s in stats)
+
+         elif at == "inspect_data_batch":
+             state.data_inspected = True
+
+         elif at == "inspect_model_modes":
+             if not state.model_modes_inspected:
+                 modes = extract_model_modes(session.model)
+                 session.model_modes = modes
+                 state.model_modes_inspected = True
+
+         elif at == "inspect_model_weights":
+             if not state.model_weights_inspected:
+                 stats = extract_weight_stats(session.model)
+                 session.weight_stats = stats
+                 state.model_weights_inspected = True
+
+         elif at == "inspect_code":
+             state.code_inspected = True
+
+         elif at == "modify_config":
+             if action.target and action.value is not None:
+                 setattr(session.config, action.target, action.value)
+             state.fix_action_taken = True
+
+         elif at == "add_callback":
+             state.fix_action_taken = True
+
+         elif at == "replace_optimizer":
+             state.fix_action_taken = True
+
+         elif at == "patch_data_loader":
+             state.fix_action_taken = True
+
+         elif at == "fix_model_mode":
+             state.fix_action_taken = True
+
+         elif at == "fix_code":
+             state.fix_action_taken = True
+             if scenario.bug_type and action.line and action.replacement:
+                 is_correct_fix = validate_fix(
+                     scenario.bug_type, action.line, action.replacement
+                 )
+             else:
+                 is_correct_fix = False
+
+         elif at == "restart_run":
+             state.restart_after_fix = True
+             # Check convergence — did the fix address the root cause?
+             convergence = self._check_convergence(session)
+             session.convergence_after_fix = convergence
+
+         elif at == "mark_diagnosed":
+             state.diagnosis_submitted = True
+             session.done = True
+
+         elif at == "rollback_checkpoint":
+             pass  # No-op for now
+
+         return is_correct_fix, convergence
+
+     def _check_convergence(self, session: SessionData) -> bool:
+         """Check if the applied fix would resolve the root cause."""
+         scenario = session.scenario
+         state = session.state
+         root = scenario.root_cause.value
+
+         if root == "lr_too_high":
+             return (
+                 "modify_config" in state.actions_taken
+                 and session.config.learning_rate <= 0.001
+             )
+
+         if root == "vanishing_gradients":
+             return (
+                 "modify_config" in state.actions_taken
+                 and session.config.learning_rate >= 0.001
+             )
+
+         if root == "data_leakage":
+             return "patch_data_loader" in state.actions_taken
+
+         if root == "overfitting":
+             return (
+                 "modify_config" in state.actions_taken
+                 or "add_callback" in state.actions_taken
+             )
+
+         if root == "batchnorm_eval_mode":
+             return "fix_model_mode" in state.actions_taken
+
+         if root == "code_bug":
+             return "fix_code" in state.actions_taken and state.fix_action_taken
+
+         return False
+
+     @property
+     def state(self) -> dict:
+         """Return current environment state."""
+         session = self._get_session()
+         if session is None:
+             return {"status": "no_active_episode"}
+         return {
+             "status": "active",
+             "task_id": session.scenario.task_id,
+             "step_count": session.state.step_count,
+             "done": session.done,
+         }
+
+     def get_last_completed(self, session_id: str | None = None) -> dict | None:
+         """Get last completed episode data for grader endpoint."""
+         if session_id:
+             return self._last_completed.get(session_id)
+         # Return most recent
+         if self._last_completed:
+             return list(self._last_completed.values())[-1]
+         return None
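An end-to-end episode sketch (not part of the commit) that drives the environment directly, mirroring what the HTTP layer does per session. The gradient check is guarded with an `if` rather than asserted, since extreme learning rates can push norms to NaN/inf where `is_exploding` may not fire:

```python
# Sketch: one debugging episode against the environment above.
from ml_training_debugger.models import MLTrainingAction
from server.environment import MLTrainingEnvironment

env = MLTrainingEnvironment()
obs = env.reset(seed=42, episode_id="demo", task_id="task_001")

obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
if any(g.is_exploding for g in obs.gradient_stats):
    obs = env.step(
        MLTrainingAction(
            action_type="modify_config", target="learning_rate", value=0.001
        )
    )
    obs = env.step(MLTrainingAction(action_type="restart_run"))

obs = env.step(
    MLTrainingAction(action_type="mark_diagnosed", diagnosis="lr_too_high")
)
assert obs.done
print(env.state)  # includes task_id, step_count, and done=True
```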
tests/__init__.py ADDED
File without changes
tests/conftest.py ADDED
@@ -0,0 +1,36 @@
+ """Shared test fixtures."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.models import (
+     EpisodeState,
+     TrainingConfig,
+ )
+ from ml_training_debugger.scenarios import ScenarioParams, sample_scenario
+
+
+ @pytest.fixture
+ def fresh_state() -> EpisodeState:
+     return EpisodeState()
+
+
+ @pytest.fixture
+ def sample_config() -> TrainingConfig:
+     return TrainingConfig(learning_rate=0.001)
+
+
+ @pytest.fixture
+ def task_001_scenario() -> ScenarioParams:
+     return sample_scenario("task_001", seed=42)
+
+
+ @pytest.fixture
+ def task_003_scenario() -> ScenarioParams:
+     return sample_scenario("task_003", seed=42)
+
+
+ @pytest.fixture
+ def task_005_scenario() -> ScenarioParams:
+     return sample_scenario("task_005", seed=42)
tests/test_code_templates.py ADDED
@@ -0,0 +1,65 @@
+ """Test code bug generation and fix validation."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.code_templates import generate_code_snippet, validate_fix
+
+
+ class TestGenerateCodeSnippet:
+     def test_eval_mode(self):
+         snippet = generate_code_snippet("eval_mode")
+         assert "model.eval()" in snippet["code"]
+         assert snippet["filename"] == "train.py"
+         assert snippet["line_count"] > 0
+         assert len(snippet["imports"]) > 0
+
+     def test_detach_loss(self):
+         snippet = generate_code_snippet("detach_loss")
+         assert ".detach()" in snippet["code"]
+
+     def test_zero_grad_missing(self):
+         snippet = generate_code_snippet("zero_grad_missing")
+         assert "zero_grad" not in snippet["code"]
+
+     def test_inplace_relu(self):
+         snippet = generate_code_snippet("inplace_relu")
+         assert "inplace=True" in snippet["code"]
+
+     def test_unknown_bug_raises(self):
+         with pytest.raises(ValueError):
+             generate_code_snippet("nonexistent_bug")
+
+
+ class TestValidateFix:
+     def test_eval_mode_correct_fix(self):
+         assert validate_fix("eval_mode", 5, "model.train()")
+
+     def test_eval_mode_with_whitespace(self):
+         assert validate_fix("eval_mode", 5, " model.train() ")
+
+     def test_eval_mode_wrong_fix(self):
+         assert not validate_fix("eval_mode", 5, "pass")
+
+     def test_detach_loss_correct_fix(self):
+         assert validate_fix(
+             "detach_loss", 14, " loss = criterion(output, batch_y)"
+         )
+
+     def test_detach_loss_with_trailing_spaces(self):
+         assert validate_fix(
+             "detach_loss", 14, " loss = criterion(output, batch_y) "
+         )
+
+     def test_zero_grad_correct_fix(self):
+         assert validate_fix("zero_grad_missing", 11, " optimizer.zero_grad()")
+
+     def test_inplace_relu_correct_fix(self):
+         assert validate_fix("inplace_relu", 15, " output = F.relu(output)")
+
+     def test_wrong_line_number(self):
+         assert not validate_fix("eval_mode", 999, "model.train()")
+
+     def test_unknown_bug_type(self):
+         assert not validate_fix("nonexistent", 1, "pass")
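The assertions above pin down `validate_fix`'s contract: the line number must match exactly, the replacement is compared whitespace-insensitively, and unknown bug types never validate. A hypothetical implementation consistent with these tests (table values copied from the assertions; the real module may differ):

```python
# Hypothetical sketch of validate_fix, inferred from the tests above.
_EXPECTED_FIXES = {
    "eval_mode": (5, "model.train()"),
    "detach_loss": (14, "loss = criterion(output, batch_y)"),
    "zero_grad_missing": (11, "optimizer.zero_grad()"),
    "inplace_relu": (15, "output = F.relu(output)"),
}

def validate_fix_sketch(bug_type: str, line: int, replacement: str) -> bool:
    expected = _EXPECTED_FIXES.get(bug_type)
    if expected is None:
        return False  # unknown bug types never validate
    expected_line, expected_code = expected
    return line == expected_line and replacement.strip() == expected_code
```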
tests/test_episode_lifecycle.py ADDED
@@ -0,0 +1,220 @@
+ """Test full episode lifecycle — reset, step, state transitions."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.models import MLTrainingAction
+ from server.environment import MLTrainingEnvironment
+
+
+ @pytest.fixture
+ def env():
+     return MLTrainingEnvironment()
+
+
+ class TestReset:
+     def test_reset_returns_valid_observation(self, env):
+         obs = env.reset(seed=42, episode_id="test", task_id="task_001")
+         assert obs.run_id == "test"
+         assert obs.framework == "pytorch"
+         assert len(obs.training_loss_history) == 20
+         assert len(obs.val_accuracy_history) == 20
+         assert obs.done is False
+
+     def test_reset_initial_state(self, env):
+         obs = env.reset(seed=42, episode_id="test", task_id="task_001")
+         assert obs.episode_state.step_count == 0
+         assert not obs.episode_state.gradients_inspected
+         assert not obs.episode_state.diagnosis_submitted
+
+     def test_reset_progressive_reveal(self, env):
+         obs = env.reset(seed=42, episode_id="test", task_id="task_001")
+         assert obs.gradient_stats == []
+         assert obs.model_weight_stats is None
+         assert obs.data_batch_stats is None
+         assert obs.model_mode_info is None
+         assert obs.code_snippet is None
+
+     def test_reset_available_actions(self, env):
+         obs = env.reset(seed=42, episode_id="test", task_id="task_001")
+         assert "inspect_gradients" in obs.available_actions
+         assert "mark_diagnosed" in obs.available_actions
+         assert "fix_code" not in obs.available_actions
+         assert "restart_run" not in obs.available_actions
+
+
+ class TestStepInspections:
+     def test_inspect_gradients_populates_stats(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+         assert len(obs.gradient_stats) > 0
+         assert obs.episode_state.gradients_inspected
+
+     def test_inspect_data_batch(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_003")
+         obs = env.step(MLTrainingAction(action_type="inspect_data_batch"))
+         assert obs.data_batch_stats is not None
+         assert obs.episode_state.data_inspected
+
+     def test_inspect_model_modes(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_005")
+         obs = env.step(MLTrainingAction(action_type="inspect_model_modes"))
+         assert obs.model_mode_info is not None
+         assert obs.episode_state.model_modes_inspected
+
+     def test_inspect_model_weights(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(MLTrainingAction(action_type="inspect_model_weights"))
+         assert obs.model_weight_stats is not None
+         assert obs.episode_state.model_weights_inspected
+
+
+ class TestStepFixActions:
+     def test_modify_config(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="modify_config",
+                 target="learning_rate",
+                 value=0.001,
+             )
+         )
+         assert obs.episode_state.fix_action_taken
+         assert "restart_run" in obs.available_actions
+
+     def test_restart_run_after_fix(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         env.step(
+             MLTrainingAction(
+                 action_type="modify_config",
+                 target="learning_rate",
+                 value=0.001,
+             )
+         )
+         obs = env.step(MLTrainingAction(action_type="restart_run"))
+         assert obs.episode_state.restart_after_fix
+
+
+ class TestStepDiagnosis:
+     def test_mark_diagnosed_ends_episode(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="lr_too_high",
+             )
+         )
+         assert obs.done is True
+         assert obs.episode_state.diagnosis_submitted
+
+     def test_step_after_done(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="lr_too_high",
+             )
+         )
+         obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+         assert obs.done is True
+         assert obs.reward == 0.0
+
+
+ class TestErrorHandling:
+     def test_invalid_action_type(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(MLTrainingAction(action_type="nonexistent_action"))
+         assert obs.reward == pytest.approx(-0.01 - 0.05)
+         assert obs.error_log is not None
+
+     def test_action_not_in_available(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         # fix_code requires code_inspected=True
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="fix_code",
+                 line=1,
+                 replacement="pass",
+             )
+         )
+         assert obs.reward < 0
+
+     def test_modify_config_missing_target(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(MLTrainingAction(action_type="modify_config"))
+         assert "target" in obs.error_log.lower() or "value" in obs.error_log.lower()
+
+     def test_mark_diagnosed_missing_diagnosis(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(MLTrainingAction(action_type="mark_diagnosed"))
+         assert "diagnosis" in obs.error_log.lower()
+
+     def test_mark_diagnosed_invalid_diagnosis(self, env):
+         env.reset(seed=42, episode_id="test", task_id="task_001")
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="not_a_real_diagnosis",
+             )
+         )
+         assert "invalid" in obs.error_log.lower()
+
+     def test_step_before_reset(self, env):
+         obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+         assert obs.done is True
+
+
+ class TestFullEpisodeFlow:
+     def test_task_001_full_flow(self, env):
+         """Full optimal flow for Task 1."""
+         obs = env.reset(seed=42, episode_id="test", task_id="task_001")
+         assert not obs.done
+
+         obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+         assert obs.episode_state.gradients_inspected
+         assert any(g.is_exploding for g in obs.gradient_stats)
+
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="modify_config",
+                 target="learning_rate",
+                 value=0.001,
+             )
+         )
+         assert obs.episode_state.fix_action_taken
+
+         obs = env.step(MLTrainingAction(action_type="restart_run"))
+         assert obs.episode_state.restart_after_fix
+
+         obs = env.step(
+             MLTrainingAction(
+                 action_type="mark_diagnosed",
+                 diagnosis="lr_too_high",
+             )
+         )
+         assert obs.done
+         assert obs.reward > 0
+
+     def test_task_005_context_gated_penalty(self, env):
+         """Task 5: inspect gradients (normal) → add_callback → penalty fires."""
+         obs = env.reset(seed=42, episode_id="test", task_id="task_005")
+
+         obs = env.step(MLTrainingAction(action_type="inspect_gradients"))
+         assert obs.episode_state.gradients_inspected
+         assert obs.episode_state.gradients_were_normal
+         # All layers report is_exploding=False
+         for g in obs.gradient_stats:
+             assert not g.is_exploding
+
+         # Now add_callback should trigger the context-gated penalty
+         obs = env.step(MLTrainingAction(action_type="add_callback"))
+         assert obs.reward == pytest.approx(-0.01 - 0.20)
+
+     def test_task_003_data_leakage(self, env):
+         """Task 3: data inspection reveals leakage."""
+         obs = env.reset(seed=42, episode_id="test", task_id="task_003")
+
+         obs = env.step(MLTrainingAction(action_type="inspect_data_batch"))
+         assert obs.data_batch_stats is not None
+         assert obs.data_batch_stats.class_overlap_score > 0.5
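For readers who want to drive the environment outside pytest, the optimal task_001 trajectory from `test_task_001_full_flow` translates directly into a standalone script (assuming the package and the `server` module are importable in-process):

```python
from ml_training_debugger.models import MLTrainingAction
from server.environment import MLTrainingEnvironment

env = MLTrainingEnvironment()
env.reset(seed=42, episode_id="demo", task_id="task_001")
for action in [
    MLTrainingAction(action_type="inspect_gradients"),
    MLTrainingAction(action_type="modify_config",
                     target="learning_rate", value=0.001),
    MLTrainingAction(action_type="restart_run"),
    MLTrainingAction(action_type="mark_diagnosed", diagnosis="lr_too_high"),
]:
    obs = env.step(action)
print(obs.done, obs.reward)  # expect: True and a positive reward
```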
tests/test_graders.py ADDED
@@ -0,0 +1,168 @@
+ """Test grader functions — each returns 0.0-1.0."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.graders import (
+     grade_episode,
+     grade_task_001,
+     grade_task_003,
+     grade_task_005,
+ )
+ from ml_training_debugger.models import EpisodeState
+ from ml_training_debugger.scenarios import sample_scenario
+
+
+ @pytest.fixture
+ def scenario_001():
+     return sample_scenario("task_001", seed=42)
+
+
+ @pytest.fixture
+ def scenario_003():
+     return sample_scenario("task_003", seed=42)
+
+
+ @pytest.fixture
+ def scenario_005():
+     return sample_scenario("task_005", seed=42)
+
+
+ class TestGradeTask001:
+     def test_perfect_score(self, scenario_001):
+         state = EpisodeState(
+             gradients_inspected=True,
+             fix_action_taken=True,
+             restart_after_fix=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_gradients",
+                 "modify_config",
+                 "restart_run",
+                 "mark_diagnosed:lr_too_high",
+             ],
+         )
+         score = grade_task_001(state, scenario_001)
+         assert score == 1.0
+
+     def test_wrong_diagnosis(self, scenario_001):
+         state = EpisodeState(
+             gradients_inspected=True,
+             fix_action_taken=True,
+             restart_after_fix=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_gradients",
+                 "modify_config",
+                 "restart_run",
+                 "mark_diagnosed:data_leakage",
+             ],
+         )
+         score = grade_task_001(state, scenario_001)
+         assert score < 0.7  # Missing the diagnosis credit
+
+     def test_no_investigation(self, scenario_001):
+         state = EpisodeState(
+             diagnosis_submitted=True,
+             actions_taken=["mark_diagnosed:lr_too_high"],
+         )
+         score = grade_task_001(state, scenario_001)
+         assert 0.0 < score < 1.0
+
+     def test_score_in_range(self, scenario_001):
+         state = EpisodeState()
+         score = grade_task_001(state, scenario_001)
+         assert 0.0 <= score <= 1.0
+
+
+ class TestGradeTask003:
+     def test_perfect_score(self, scenario_003):
+         state = EpisodeState(
+             data_inspected=True,
+             fix_action_taken=True,
+             restart_after_fix=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_data_batch",
+                 "patch_data_loader",
+                 "restart_run",
+                 "mark_diagnosed:data_leakage",
+             ],
+         )
+         score = grade_task_003(state, scenario_003)
+         assert score == pytest.approx(1.0)
+
+     def test_wrong_diagnosis(self, scenario_003):
+         state = EpisodeState(
+             data_inspected=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_data_batch",
+                 "mark_diagnosed:overfitting",
+             ],
+         )
+         score = grade_task_003(state, scenario_003)
+         assert score < 0.5
+
+
+ class TestGradeTask005:
+     def test_perfect_score(self, scenario_005):
+         state = EpisodeState(
+             gradients_inspected=True,
+             gradients_were_normal=True,
+             model_modes_inspected=True,
+             fix_action_taken=True,
+             restart_after_fix=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_gradients",
+                 "inspect_model_modes",
+                 "fix_model_mode",
+                 "restart_run",
+                 "mark_diagnosed:batchnorm_eval_mode",
+             ],
+         )
+         score = grade_task_005(state, scenario_005)
+         assert score == 1.0
+
+     def test_red_herring_chaser(self, scenario_005):
+         """An agent that chases the gradient red herring scores below perfect
+         (the -0.20 add_callback penalty puts it in roughly 0.70-0.90)."""
+         state = EpisodeState(
+             gradients_inspected=True,
+             gradients_were_normal=True,
+             model_modes_inspected=True,
+             fix_action_taken=True,
+             restart_after_fix=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_gradients",
+                 "add_callback",  # Wrong: chases the red herring
+                 "inspect_model_modes",
+                 "fix_model_mode",
+                 "restart_run",
+                 "mark_diagnosed:batchnorm_eval_mode",
+             ],
+         )
+         score = grade_task_005(state, scenario_005)
+         # -0.20 penalty for add_callback after normal gradients
+         assert 0.7 <= score <= 0.90
+
+
+ class TestGradeEpisode:
+     def test_dispatch_to_correct_grader(self, scenario_001):
+         state = EpisodeState(
+             gradients_inspected=True,
+             diagnosis_submitted=True,
+             actions_taken=[
+                 "inspect_gradients",
+                 "mark_diagnosed:lr_too_high",
+             ],
+         )
+         score = grade_episode("task_001", state, scenario_001)
+         assert 0.0 <= score <= 1.0
+
+     def test_unknown_task_returns_zero(self, scenario_001):
+         state = EpisodeState()
+         score = grade_episode("task_999", state, scenario_001)
+         assert score == 0.0
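`TestGradeEpisode` pins down the dispatch contract: route by `task_id`, return 0.0 for unknown tasks. A sketch consistent with that contract, using the grader functions imported at the top of this test file (the real `grade_episode` presumably covers all six tasks, not just the three tested here):

```python
# Hypothetical dispatch table, consistent with TestGradeEpisode above.
_GRADERS = {
    "task_001": grade_task_001,
    "task_003": grade_task_003,
    "task_005": grade_task_005,
}

def grade_episode_sketch(task_id, state, scenario) -> float:
    grader = _GRADERS.get(task_id)
    return grader(state, scenario) if grader else 0.0  # unknown task -> 0.0
```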
tests/test_models.py ADDED
@@ -0,0 +1,168 @@
+ """Test all Pydantic models instantiate and serialize correctly."""
+
+ from __future__ import annotations
+
+ import json
+
+ from openenv.core.env_server.types import Action, Observation
+
+ from ml_training_debugger.models import (
+     EpisodeState,
+     GradientStats,
+     MLTrainingAction,
+     MLTrainingObservation,
+     RootCauseDiagnosis,
+     TrainingConfig,
+ )
+
+
+ class TestRootCauseDiagnosis:
+     def test_all_six_values_exist(self):
+         assert len(RootCauseDiagnosis) == 6
+
+     def test_values_are_strings(self):
+         for d in RootCauseDiagnosis:
+             assert isinstance(d.value, str)
+
+     def test_specific_values(self):
+         assert RootCauseDiagnosis.LR_TOO_HIGH.value == "lr_too_high"
+         assert RootCauseDiagnosis.CODE_BUG.value == "code_bug"
+
+
+ class TestTrainingConfig:
+     def test_default_instantiation(self):
+         config = TrainingConfig()
+         assert config.learning_rate == 0.001
+         assert config.gradient_clip_norm is None
+
+     def test_json_roundtrip(self):
+         config = TrainingConfig(learning_rate=0.01, weight_decay=0.1)
+         data = json.loads(config.model_dump_json())
+         restored = TrainingConfig.model_validate(data)
+         assert restored.learning_rate == 0.01
+         assert restored.weight_decay == 0.1
+
+
+ class TestGradientStats:
+     def test_exploding(self):
+         stats = GradientStats(
+             layer_name="fc",
+             norm_history=[15.0],
+             mean_norm=15.0,
+             max_norm=15.0,
+             is_exploding=True,
+             is_vanishing=False,
+         )
+         assert stats.is_exploding
+
+     def test_vanishing(self):
+         stats = GradientStats(
+             layer_name="conv1",
+             norm_history=[1e-7],
+             mean_norm=1e-7,
+             max_norm=1e-7,
+             is_exploding=False,
+             is_vanishing=True,
+         )
+         assert stats.is_vanishing
+
+     def test_normal(self):
+         stats = GradientStats(
+             layer_name="conv1",
+             norm_history=[0.5],
+             mean_norm=0.5,
+             max_norm=0.5,
+             is_exploding=False,
+             is_vanishing=False,
+         )
+         assert not stats.is_exploding
+         assert not stats.is_vanishing
+
+
+ class TestEpisodeState:
+     def test_fresh_state(self):
+         state = EpisodeState()
+         assert state.step_count == 0
+         assert not state.gradients_inspected
+         assert not state.diagnosis_submitted
+
+     def test_available_actions_initial(self):
+         state = EpisodeState()
+         actions = state.compute_available_actions()
+         assert "inspect_gradients" in actions
+         assert "mark_diagnosed" in actions
+         assert "fix_code" not in actions
+         assert "restart_run" not in actions
+         assert "rollback_checkpoint" not in actions
+
+     def test_fix_code_available_after_code_inspected(self):
+         state = EpisodeState(code_inspected=True)
+         actions = state.compute_available_actions()
+         assert "fix_code" in actions
+
+     def test_restart_run_available_after_fix(self):
+         state = EpisodeState(fix_action_taken=True)
+         actions = state.compute_available_actions()
+         assert "restart_run" in actions
+
+     def test_rollback_available_after_restart(self):
+         state = EpisodeState(restart_after_fix=True)
+         actions = state.compute_available_actions()
+         assert "rollback_checkpoint" in actions
+
+     def test_mark_diagnosed_disappears_after_submission(self):
+         state = EpisodeState(diagnosis_submitted=True)
+         actions = state.compute_available_actions()
+         assert "mark_diagnosed" not in actions
+
+
+ class TestMLTrainingObservation:
+     def test_extends_observation(self):
+         assert issubclass(MLTrainingObservation, Observation)
+
+     def test_has_done_and_reward(self):
+         obs = MLTrainingObservation(done=True, reward=0.5)
+         assert obs.done is True
+         assert obs.reward == 0.5
+
+     def test_json_serialization(self):
+         obs = MLTrainingObservation(
+             run_id="test",
+             training_loss_history=[1.0, 2.0],
+             val_accuracy_history=[0.5],
+         )
+         data = json.loads(obs.model_dump_json())
+         assert data["run_id"] == "test"
+         assert data["framework"] == "pytorch"
+
+
+ class TestMLTrainingAction:
+     def test_extends_action(self):
+         assert issubclass(MLTrainingAction, Action)
+
+     def test_basic_action(self):
+         action = MLTrainingAction(action_type="inspect_gradients")
+         assert action.action_type == "inspect_gradients"
+
+     def test_modify_config_action(self):
+         action = MLTrainingAction(
+             action_type="modify_config",
+             target="learning_rate",
+             value=0.001,
+         )
+         assert action.target == "learning_rate"
+
+     def test_mark_diagnosed_action(self):
+         action = MLTrainingAction(
+             action_type="mark_diagnosed",
+             diagnosis="lr_too_high",
+         )
+         assert action.diagnosis == "lr_too_high"
+
+     def test_fix_code_action(self):
+         action = MLTrainingAction(
+             action_type="fix_code",
+             line=13,
+             replacement="loss = criterion(output, batch_y)",
+         )
+         assert action.line == 13
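`TestEpisodeState` fixes the gating behavior of `compute_available_actions`: inspections and `mark_diagnosed` start available, while fix, restart, and rollback unlock progressively. A sketch of that logic taking an `EpisodeState`-like object (the always-available list is an assumption; only the gates are asserted above):

```python
def compute_available_actions_sketch(state) -> list[str]:
    # Base set assumed always available; tests only pin the inspect actions.
    actions = [
        "inspect_gradients", "inspect_data_batch", "inspect_model_modes",
        "inspect_model_weights", "inspect_code",
    ]
    if state.code_inspected:
        actions.append("fix_code")         # unlocked by inspect_code
    if state.fix_action_taken:
        actions.append("restart_run")      # unlocked by any fix action
    if state.restart_after_fix:
        actions.append("rollback_checkpoint")
    if not state.diagnosis_submitted:
        actions.append("mark_diagnosed")   # one-shot: gone after submission
    return actions
```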
tests/test_pytorch_engine.py ADDED
@@ -0,0 +1,93 @@
+ """Test real PyTorch model instantiation and fault injection."""
+
+ from __future__ import annotations
+
+ import torch
+ import torch.nn as nn
+
+ from ml_training_debugger.pytorch_engine import (
+     SimpleCNN,
+     create_model_and_inject_fault,
+     extract_gradient_stats,
+     extract_model_modes,
+     extract_weight_stats,
+ )
+ from ml_training_debugger.scenarios import sample_scenario
+
+
+ class TestSimpleCNN:
+     def test_is_nn_module(self):
+         model = SimpleCNN()
+         assert isinstance(model, nn.Module)
+
+     def test_param_count(self):
+         model = SimpleCNN()
+         count = sum(p.numel() for p in model.parameters())
+         assert 30_000 < count < 100_000  # ~50K params
+
+     def test_forward_pass(self):
+         model = SimpleCNN()
+         x = torch.randn(2, 3, 32, 32)
+         out = model(x)
+         assert out.shape == (2, 10)
+
+
+ class TestFaultInjection:
+     def test_task_001_exploding_gradients(self):
+         scenario = sample_scenario("task_001", seed=42)
+         model, info = create_model_and_inject_fault(scenario)
+         stats = extract_gradient_stats(model, scenario)
+         assert len(stats) > 0
+         # At least some layers should have elevated gradients
+         any_high = any(s.mean_norm > 1.0 for s in stats)
+         assert any_high
+
+     def test_task_005_eval_mode(self):
+         scenario = sample_scenario("task_005", seed=42)
+         model, info = create_model_and_inject_fault(scenario)
+         assert not model.training  # model.eval() was called
+
+     def test_task_005_gradients_not_exploding(self):
+         scenario = sample_scenario("task_005", seed=42)
+         model, info = create_model_and_inject_fault(scenario)
+         stats = extract_gradient_stats(model, scenario)
+         # ALL layers must have is_exploding=False
+         for s in stats:
+             assert not s.is_exploding, f"Layer {s.layer_name} should not be exploding"
+
+
+ class TestExtractGradientStats:
+     def test_returns_gradient_stats(self):
+         scenario = sample_scenario("task_001", seed=42)
+         model, _ = create_model_and_inject_fault(scenario)
+         stats = extract_gradient_stats(model, scenario)
+         assert len(stats) == 4  # conv1, conv2, conv3, fc
+         for s in stats:
+             assert isinstance(s.mean_norm, float)
+             assert isinstance(s.norm_history, list)
+             assert len(s.norm_history) == 5
+
+
+ class TestExtractWeightStats:
+     def test_returns_weight_stats(self):
+         scenario = sample_scenario("task_001", seed=42)
+         model, _ = create_model_and_inject_fault(scenario)
+         stats = extract_weight_stats(model)
+         assert len(stats) > 0
+         for s in stats:
+             assert isinstance(s.weight_norm, float)
+             assert isinstance(s.has_nan, bool)
+
+
+ class TestExtractModelModes:
+     def test_train_mode(self):
+         model = SimpleCNN()
+         model.train()
+         modes = extract_model_modes(model)
+         assert all(v == "train" for v in modes.values())
+
+     def test_eval_mode(self):
+         model = SimpleCNN()
+         model.eval()
+         modes = extract_model_modes(model)
+         assert all(v == "eval" for v in modes.values())
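`extract_gradient_stats` is repo-specific, but the underlying PyTorch pattern is standard: after a backward pass, read `p.grad` for each named parameter. A generic sketch of that pattern (not the repo's implementation):

```python
import torch

def layer_grad_norms(model: torch.nn.Module) -> dict[str, float]:
    # L2 norm of each parameter's gradient; yields an empty dict
    # if called before any backward() has populated .grad.
    return {
        name: float(param.grad.norm())
        for name, param in model.named_parameters()
        if param.grad is not None
    }
```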
tests/test_reward_engine.py ADDED
@@ -0,0 +1,176 @@
+ """Test reward engine — all 7 components. THE MOST CRITICAL TEST FILE."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.models import EpisodeState, MLTrainingAction
+ from ml_training_debugger.reward_engine import (
+     CONTEXT_GATED_PENALTY,
+     CORRECT_DIAGNOSIS_REWARD,
+     INVALID_ACTION_PENALTY,
+     INVESTIGATION_BONUS,
+     STEP_PENALTY,
+     TERMINAL_CONVERGENCE_REWARD,
+     WRONG_CODE_FIX_PENALTY,
+     WRONG_DIAGNOSIS_PENALTY,
+     compute_reward,
+ )
+ from ml_training_debugger.scenarios import sample_scenario
+
+
+ @pytest.fixture
+ def scenario():
+     return sample_scenario("task_001", seed=42)
+
+
+ @pytest.fixture
+ def scenario_005():
+     return sample_scenario("task_005", seed=42)
+
+
+ class TestStepPenalty:
+     def test_flat_step_penalty(self, scenario):
+         state = EpisodeState()
+         action = MLTrainingAction(action_type="add_callback")
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY)
+
+     def test_step_penalty_not_multiplied_by_step_count(self, scenario):
+         state = EpisodeState(step_count=30)
+         action = MLTrainingAction(action_type="add_callback")
+         reward = compute_reward(action, state, scenario)
+         # Must be flat -0.01, NOT -0.01 * 30
+         assert reward == pytest.approx(-0.01)
+
+
+ class TestInvestigationBonus:
+     def test_first_time_bonus(self, scenario):
+         state = EpisodeState(gradients_inspected=False)
+         action = MLTrainingAction(action_type="inspect_gradients")
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY + INVESTIGATION_BONUS)
+
+     def test_no_bonus_on_repeat(self, scenario):
+         state = EpisodeState(gradients_inspected=True)
+         action = MLTrainingAction(action_type="inspect_gradients")
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY)
+
+     def test_each_inspection_type_gives_bonus(self, scenario):
+         for action_type, field in [
+             ("inspect_gradients", "gradients_inspected"),
+             ("inspect_data_batch", "data_inspected"),
+             ("inspect_model_modes", "model_modes_inspected"),
+             ("inspect_model_weights", "model_weights_inspected"),
+             ("inspect_code", "code_inspected"),
+         ]:
+             state = EpisodeState(**{field: False})
+             action = MLTrainingAction(action_type=action_type)
+             reward = compute_reward(action, state, scenario)
+             assert reward == pytest.approx(
+                 STEP_PENALTY + INVESTIGATION_BONUS
+             ), f"Failed for {action_type}"
+
+
+ class TestContextGatedPenalty:
+     """The project's primary innovation — must be exact."""
+
+     def test_no_penalty_before_inspection(self, scenario_005):
+         """add_callback at step 1 (no prior inspection) -> NO penalty."""
+         state = EpisodeState()
+         action = MLTrainingAction(action_type="add_callback")
+         reward = compute_reward(action, state, scenario_005)
+         assert reward == pytest.approx(STEP_PENALTY)
+
+     def test_penalty_after_normal_gradients(self, scenario_005):
+         """inspect_gradients (normal) then add_callback -> -0.20 penalty."""
+         state = EpisodeState(gradients_inspected=True, gradients_were_normal=True)
+         action = MLTrainingAction(action_type="add_callback")
+         reward = compute_reward(action, state, scenario_005)
+         assert reward == pytest.approx(STEP_PENALTY + CONTEXT_GATED_PENALTY)
+
+     def test_no_penalty_after_abnormal_gradients(self, scenario):
+         """inspect_gradients (exploding) then add_callback -> no context penalty."""
+         state = EpisodeState(gradients_inspected=True, gradients_were_normal=False)
+         action = MLTrainingAction(action_type="add_callback")
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY)
+
+     def test_penalty_only_for_add_callback(self, scenario_005):
+         """Other fix actions don't trigger the context-gated penalty."""
+         state = EpisodeState(gradients_inspected=True, gradients_were_normal=True)
+         for action_type in ["modify_config", "fix_model_mode", "patch_data_loader"]:
+             action = MLTrainingAction(
+                 action_type=action_type, target="learning_rate", value=0.001
+             )
+             reward = compute_reward(action, state, scenario_005)
+             assert reward == pytest.approx(
+                 STEP_PENALTY
+             ), f"Unexpected penalty for {action_type}"
+
+
+ class TestDiagnosisReward:
+     def test_correct_diagnosis(self, scenario):
+         state = EpisodeState()
+         action = MLTrainingAction(action_type="mark_diagnosed", diagnosis="lr_too_high")
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY + CORRECT_DIAGNOSIS_REWARD)
+
+     def test_wrong_diagnosis(self, scenario):
+         state = EpisodeState()
+         action = MLTrainingAction(
+             action_type="mark_diagnosed", diagnosis="data_leakage"
+         )
+         reward = compute_reward(action, state, scenario)
+         assert reward == pytest.approx(STEP_PENALTY + WRONG_DIAGNOSIS_PENALTY)
+
+
+ class TestTerminalConvergence:
+     def test_convergence_after_fix_and_restart(self, scenario):
+         state = EpisodeState(fix_action_taken=True)
+         action = MLTrainingAction(action_type="restart_run")
+         reward = compute_reward(action, state, scenario, convergence_confirmed=True)
+         assert reward == pytest.approx(STEP_PENALTY + TERMINAL_CONVERGENCE_REWARD)
+
+     def test_no_convergence_without_fix(self, scenario):
+         state = EpisodeState(fix_action_taken=False)
+         action = MLTrainingAction(action_type="restart_run")
+         reward = compute_reward(action, state, scenario, convergence_confirmed=True)
+         # fix_action_taken is False, so no convergence reward
+         assert reward == pytest.approx(STEP_PENALTY)
+
+
+ class TestInvalidAction:
+     def test_invalid_action_penalty(self, scenario):
+         state = EpisodeState()
+         action = MLTrainingAction(action_type="restart_run")
+         reward = compute_reward(action, state, scenario, is_valid_action=False)
+         assert reward == pytest.approx(STEP_PENALTY + INVALID_ACTION_PENALTY)
+
+
+ class TestWrongCodeFix:
+     def test_wrong_code_fix_penalty(self, scenario):
+         state = EpisodeState(code_inspected=True)
+         action = MLTrainingAction(action_type="fix_code", line=1, replacement="pass")
+         reward = compute_reward(action, state, scenario, is_correct_fix=False)
+         assert reward == pytest.approx(STEP_PENALTY + WRONG_CODE_FIX_PENALTY)
+
+
+ class TestRewardCap:
+     def test_reward_capped_at_one(self, scenario):
+         # Theoretical max would exceed 1.0 in some scenarios
+         reward = compute_reward(
+             MLTrainingAction(action_type="mark_diagnosed", diagnosis="lr_too_high"),
+             EpisodeState(),
+             scenario,
+         )
+         assert reward <= 1.0
+
+     def test_reward_capped_at_negative_one(self, scenario):
+         reward = compute_reward(
+             MLTrainingAction(action_type="mark_diagnosed", diagnosis="wrong"),
+             EpisodeState(),
+             scenario,
+         )
+         assert reward >= -1.0
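Taken together, these tests determine the shape of `compute_reward` fairly tightly. A minimal standalone sketch of that shape: the -0.01, -0.05, and -0.20 magnitudes come from the assertions; the bonus/reward values are placeholders, since the real constants live in `ml_training_debugger.reward_engine` and are not asserted numerically above.

```python
STEP = -0.01           # flat per step, never multiplied by step_count
INVALID = -0.05        # from test_invalid_action_type
CONTEXT_GATED = -0.20  # from test_penalty_after_normal_gradients
INVESTIGATION = 0.05   # placeholder value
DIAGNOSIS_OK = 0.50    # placeholder value
DIAGNOSIS_BAD = -0.30  # placeholder value

def reward_sketch(action_type, gradients_were_normal, first_inspection,
                  diagnosis_correct=None, is_valid=True) -> float:
    r = STEP
    if not is_valid:
        r += INVALID
    if action_type.startswith("inspect_") and first_inspection:
        r += INVESTIGATION  # paid once per inspection type
    if action_type == "add_callback" and gradients_were_normal:
        r += CONTEXT_GATED  # fires only after a normal gradient inspection
    if diagnosis_correct is True:
        r += DIAGNOSIS_OK
    elif diagnosis_correct is False:
        r += DIAGNOSIS_BAD
    return max(-1.0, min(1.0, r))  # clamp to [-1, 1] per TestRewardCap
```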
tests/test_scenarios.py ADDED
@@ -0,0 +1,51 @@
+ """Test scenario sampling."""
+
+ from __future__ import annotations
+
+ import pytest
+
+ from ml_training_debugger.models import RootCauseDiagnosis
+ from ml_training_debugger.scenarios import sample_scenario
+
+
+ class TestSampleScenario:
+     def test_task_001_root_cause(self):
+         s = sample_scenario("task_001", seed=42)
+         assert s.root_cause == RootCauseDiagnosis.LR_TOO_HIGH
+         assert s.learning_rate >= 0.05
+
+     def test_task_003_root_cause(self):
+         s = sample_scenario("task_003", seed=42)
+         assert s.root_cause == RootCauseDiagnosis.DATA_LEAKAGE
+         assert 0.10 <= s.leakage_pct <= 0.30
+
+     def test_task_005_root_cause(self):
+         s = sample_scenario("task_005", seed=42)
+         assert s.root_cause == RootCauseDiagnosis.BATCHNORM_EVAL_MODE
+         assert 0.8 <= s.red_herring_intensity <= 2.5
+
+     def test_different_seeds_produce_different_params(self):
+         s1 = sample_scenario("task_001", seed=42)
+         s2 = sample_scenario("task_001", seed=99)
+         # Same root cause, but possibly a different learning rate
+         assert s1.root_cause == s2.root_cause
+
+     def test_same_seed_same_params(self):
+         s1 = sample_scenario("task_001", seed=42)
+         s2 = sample_scenario("task_001", seed=42)
+         assert s1.learning_rate == s2.learning_rate
+         assert s1.seed == s2.seed
+
+     def test_unknown_task_raises(self):
+         with pytest.raises(ValueError, match="Unknown task_id"):
+             sample_scenario("task_999", seed=42)
+
+     def test_task_005_has_error_log(self):
+         s = sample_scenario("task_005", seed=42)
+         assert s.error_log is not None
+         assert "GPU memory" in s.error_log
+
+     def test_task_003_has_notes(self):
+         s = sample_scenario("task_003", seed=42)
+         assert s.notes is not None
+         assert "architecture" in s.notes.lower()
tests/test_simulation.py ADDED
@@ -0,0 +1,72 @@
+ """Test parametric curve generators."""
+
+ from __future__ import annotations
+
+ from ml_training_debugger.scenarios import sample_scenario
+ from ml_training_debugger.simulation import (
+     gen_data_batch_stats,
+     gen_loss_history,
+     gen_val_accuracy_history,
+     gen_val_loss_history,
+ )
+
+
+ class TestGenLossHistory:
+     def test_returns_20_floats(self):
+         s = sample_scenario("task_001", seed=42)
+         hist = gen_loss_history(s)
+         assert len(hist) == 20
+         assert all(isinstance(v, float) for v in hist)
+
+     def test_task_001_diverges(self):
+         s = sample_scenario("task_001", seed=42)
+         hist = gen_loss_history(s)
+         assert hist[-1] == float("inf")  # NaN/inf after epoch 12
+
+     def test_task_003_normal(self):
+         s = sample_scenario("task_003", seed=42)
+         hist = gen_loss_history(s)
+         assert hist[0] > hist[-1]  # Loss decreases
+
+     def test_task_005_higher_variance(self):
+         s = sample_scenario("task_005", seed=42)
+         hist = gen_loss_history(s)
+         assert len(hist) == 20
+
+
+ class TestGenValAccuracy:
+     def test_returns_20_floats(self):
+         s = sample_scenario("task_001", seed=42)
+         hist = gen_val_accuracy_history(s)
+         assert len(hist) == 20
+         assert all(isinstance(v, float) for v in hist)
+
+     def test_task_003_suspiciously_high(self):
+         s = sample_scenario("task_003", seed=42)
+         hist = gen_val_accuracy_history(s)
+         assert hist[1] > 0.80  # Suspiciously high from early epochs
+
+     def test_task_005_degrades(self):
+         s = sample_scenario("task_005", seed=42)
+         hist = gen_val_accuracy_history(s)
+         assert hist[0] > hist[-1]  # Degrades over time
+
+
+ class TestGenValLoss:
+     def test_returns_20_floats(self):
+         s = sample_scenario("task_001", seed=42)
+         hist = gen_val_loss_history(s)
+         assert len(hist) == 20
+
+
+ class TestGenDataBatchStats:
+     def test_leakage_high_overlap(self):
+         s = sample_scenario("task_003", seed=42)
+         stats = gen_data_batch_stats(s)
+         assert stats["class_overlap_score"] > 0.5
+         assert stats["duplicate_ratio"] > 0.0
+
+     def test_normal_low_overlap(self):
+         s = sample_scenario("task_001", seed=42)
+         stats = gen_data_batch_stats(s)
+         assert stats["class_overlap_score"] < 0.3
tests/test_simulation_extended.py ADDED
@@ -0,0 +1,81 @@
+ """Extended simulation tests for coverage gaps."""
+
+ from __future__ import annotations
+
+ from ml_training_debugger.scenarios import sample_scenario
+ from ml_training_debugger.simulation import (
+     gen_data_batch_stats,
+     gen_loss_history,
+     gen_val_accuracy_history,
+     gen_val_loss_history,
+ )
+
+
+ class TestVanishingGradients:
+     def test_loss_barely_decreases(self):
+         s = sample_scenario("task_002", seed=42)
+         hist = gen_loss_history(s)
+         assert len(hist) == 20
+         assert abs(hist[0] - hist[-1]) < 0.5
+
+     def test_val_acc_near_random(self):
+         s = sample_scenario("task_002", seed=42)
+         hist = gen_val_accuracy_history(s)
+         assert all(v < 0.3 for v in hist)
+
+     def test_val_loss_flat(self):
+         s = sample_scenario("task_002", seed=42)
+         hist = gen_val_loss_history(s)
+         assert len(hist) == 20
+
+
+ class TestOverfitting:
+     def test_loss_decreases_to_near_zero(self):
+         s = sample_scenario("task_004", seed=42)
+         hist = gen_loss_history(s)
+         assert hist[-1] < 0.5
+
+     def test_val_acc_diverges(self):
+         s = sample_scenario("task_004", seed=42)
+         hist = gen_val_accuracy_history(s)
+         # Should rise then fall
+         mid = hist[len(hist) // 2]
+         assert mid > hist[-1] or mid > 0.3
+
+     def test_val_loss_diverges(self):
+         s = sample_scenario("task_004", seed=42)
+         hist = gen_val_loss_history(s)
+         assert len(hist) == 20
+         # Overfitting: val loss should increase in the latter half
+         mid_val = hist[s.divergence_epoch] if s.divergence_epoch < 20 else hist[10]
+         assert mid_val > 0  # Val loss is positive
+
+     def test_data_batch_stats_clean(self):
+         s = sample_scenario("task_004", seed=42)
+         stats = gen_data_batch_stats(s)
+         assert stats["class_overlap_score"] == 0.0
+         assert stats["duplicate_ratio"] == 0.0
+
+
+ class TestCodeBug:
+     def test_loss_history(self):
+         s = sample_scenario("task_006", seed=42)
+         hist = gen_loss_history(s)
+         assert len(hist) == 20
+
+     def test_val_acc_poor(self):
+         s = sample_scenario("task_006", seed=42)
+         hist = gen_val_accuracy_history(s)
+         assert len(hist) == 20
+
+     def test_val_loss(self):
+         s = sample_scenario("task_006", seed=42)
+         hist = gen_val_loss_history(s)
+         assert len(hist) == 20
+
+
+ class TestBatchNormEval:
+     def test_val_loss_increases(self):
+         s = sample_scenario("task_005", seed=42)
+         hist = gen_val_loss_history(s)
+         assert len(hist) == 20
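For context on the task_005 root cause these curves simulate: a BatchNorm layer stuck in eval mode normalizes with its stale running statistics instead of the current batch statistics, which is what degrades validation. A standalone PyTorch illustration of the difference (generic demo, not repo code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 5 + 3  # batch far from the layer's running stats

bn.train()
y_train = bn(x)   # normalized with this batch's statistics (std near 1)
bn.eval()
y_eval = bn(x)    # normalized with running statistics (still near init)

print(y_train.std().item(), y_eval.std().item())  # eval output stays badly scaled
```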