srishtichugh committed on
Commit
40fcf49
·
1 Parent(s): 2215348
README.md CHANGED
@@ -10,30 +10,67 @@ tags:
  - openenv
  - rl
  - data-cleaning
  ---

- # Data Cleaning OpenEnv

- A **real-world data cleaning environment** for training and evaluating AI agents.

- An agent interacts with a dirty pandas DataFrame through a standard `reset() / step() / state()` HTTP API, learning to fix common data quality problems — missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across three progressively harder tasks.

  🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
  📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
  ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health

  ---

  ## Environment Description & Motivation

- Real-world datasets are almost never clean. Data engineers routinely spend 60–80 % of their time on data cleaning tasks: filling missing values with statistically appropriate strategies, removing duplicates, standardising inconsistent formats (phone numbers, dates, country names), and detecting extreme outliers.

- This environment turns those tasks into a reinforcement learning challenge with:

- - **Deterministic, programmatic graders** — ground-truth clean DataFrames are generated with a fixed seed; every reward signal is reproducible.
- - **Meaningful partial rewards** — every step emits a delta reward proportional to how much of the dataset it cleaned, so the agent receives useful signal throughout the episode rather than only at the end.
- - **Three difficulty levels** — easy, medium, hard — letting agents learn a curriculum from simple null-filling up to full multi-issue pipelines.
- - **No external data downloads** — all datasets are generated synthetically via `numpy` + `Faker` with `seed=42`.

  ---

@@ -49,6 +86,8 @@ Actions are JSON objects sent to `POST /step`.
  | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
  | `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
  | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |

  **Format rules enforced by `fix_format`:**

@@ -56,23 +95,14 @@ Actions are JSON objects sent to `POST /step`.
  |---|---|
  | `phone` | `NNN-NNN-NNNN` |
  | `listed_date` / `signup_date` | `YYYY-MM-DD` |
- | `country` | Title-cased canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
-
- **Example actions:**
- ```json
- {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
- {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}}
- {"operation": "drop_duplicates"}
- {"operation": "fix_format", "column": "phone"}
- {"operation": "fix_format", "column": "signup_date"}
- {"operation": "drop_outliers", "column": "purchase_amount"}
- ```

  ---

  ## Observation Space

  Every `POST /reset` and `POST /step` returns:
  ```json
  {
    "observation": {
@@ -86,7 +116,21 @@ Every `POST /reset` and `POST /step` returns:
      "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
      "message": "Filled 20 missing values in 'age' using median.",
      "step_count": 1,
-     "current_score": 0.4000
    },
    "reward": 0.40,
    "done": false,

@@ -97,32 +141,19 @@ Every `POST /reset` and `POST /step` returns:
  | Field | Type | Description |
  |---|---|---|
  | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
- | `reward` | float | Per-step delta reward (see Reward Function) |
- | `data_preview` | string | First 10 rows of current DataFrame as CSV |
  | `data_shape` | [int, int] | Current `[rows, cols]` |
  | `missing_counts` | object | `{column: null_count}` for columns with NaN |
  | `duplicate_count` | int | Number of duplicate rows |
- | `dtype_issues` | object | `{column: issue_description}` for suspected dtype mismatches |
- | `task_description` | string | Full task instructions with available operations |
- | `message` | string | Human-readable result of the last action |
- | `step_count` | int | Steps taken in this episode |
- | `current_score` | float | Running grader score 0.0 – 1.0 |
-
- ---
-
- ## State Space
-
- `GET /state` returns episode metadata (does not modify state):
- ```json
- {
-   "episode_id": "a8f026a9-...",
-   "task_id": 1,
-   "step_count": 2,
-   "max_steps": 20,
-   "total_errors": 50,
-   "errors_remaining": 30
- }
- ```

  ---

@@ -133,19 +164,19 @@ Every `POST /reset` and `POST /step` returns:
  | Property | Value |
  |---|---|
  | Dataset | 100-row employee records (name, age, salary, department, experience) |
- | Issues | ~20 % NaN in `age`, `salary`; ~10 % NaN in `department` |
  | Goal | Fill all missing values |
  | Valid operations | `fill_missing` |
  | Grader | `1.0 − remaining_nulls / original_nulls` |
  | Max steps | 20 |
- | Optimal steps | 3 (one per affected column) |

  ### Task 2 — Fix Formats + Remove Duplicates *(Medium)*

  | Property | Value |
  |---|---|
  | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
- | Issues | ~60 % phone numbers in mixed formats, ~60 % dates in mixed formats, 15 duplicate rows |
  | Goal | Standardise all phone/date formats and remove duplicates |
  | Valid operations | `fix_format`, `drop_duplicates` |
  | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |

@@ -157,13 +188,28 @@ Every `POST /reset` and `POST /step` returns:
  | Property | Value |
  |---|---|
  | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
- | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount` (~3× normal), mixed country capitalisation, mixed date formats |
  | Goal | Fix all issues end-to-end |
  | Valid operations | All 6 operations |
  | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
  | Max steps | 40 |
  | Optimal steps | 8 |

  ---

  ## Reward Function
@@ -173,22 +219,62 @@ Every `POST /reset` and `POST /step` returns:
  | Score improves (delta > 0) | `new_score − old_score` (positive) |
  | Operation had no effect | `−0.01` |
  | Invalid operation / bad column | `−0.05` |
- | Episode completed (score ≥ 0.95) | `delta + 0.20` terminal bonus |

- Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every step, giving the agent dense signal throughout the episode.

  ---

- ## API Endpoints

  | Method | Path | Description |
  |---|---|---|
  | `GET` | `/health` | Health check → `{"status": "healthy"}` |
- | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
  | `POST` | `/step` | Execute action. Body: action JSON |
- | `POST` | `/state` | Get episode metadata |
- | `GET` | `/metadata` | Environment name, version, task list |
- | `GET` | `/schema` | Full action / observation / state JSON schemas |
  | `GET` | `/docs` | Interactive Swagger UI |

  ---

@@ -200,79 +286,40 @@ Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every
  | 1 — Fill Missing Values | Easy | 0.999 |
  | 2 — Fix Formats + Duplicates | Medium | 0.999 |
  | 3 — Full Cleaning Pipeline | Hard | 0.999 |
- | **Average** | — | **0.999** |
-
- *Produced by `google/gemma-3-27b-it` via NVIDIA NIM, `temperature=0`. Full step-by-step agent logs: `inference_log.txt`.*

  ---

  ## Setup & Usage

  ### Prerequisites
-
  - Python 3.11+
  - Docker (for containerised deployment)

  ### Local — Python
  ```bash
- # 1. Clone and install dependencies
  git clone https://github.com/Tanvi51204/openEnv.git
  cd openEnv
  pip install -r requirements.txt
-
- # 2. Start the server
- uvicorn server.app:app --host 0.0.0.0 --port 8000
-
- # 3. Open Swagger UI
- open http://localhost:8000/docs
  ```

  ### Local — Docker
  ```bash
  docker build -t data-cleaning-env .
  docker run -p 8000:8000 data-cleaning-env
  ```
- ### Quick API test
- ```bash
- # Health
- curl http://localhost:8000/health
-
- # Start Task 1
- curl -X POST http://localhost:8000/reset \
-      -H "Content-Type: application/json" \
-      -d '{"task_id": 1}'
-
- # Fill missing values
- curl -X POST http://localhost:8000/step \
-      -H "Content-Type: application/json" \
-      -d '{"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}'
- ```

- ### Python client
- ```python
- from client import DataCleaningEnvClient
- from models import DataCleaningAction
-
- with DataCleaningEnvClient("http://localhost:8000") as env:
-     result = env.reset(task_id=1)
-     print(result.observation.missing_counts)  # {'age': 20, 'salary': 20, 'department': 10}
-
-     action = DataCleaningAction(
-         operation="fill_missing",
-         column="salary",
-         params={"strategy": "median"},
-     )
-     result = env.step(action)
-     print(result.observation.current_score)  # 0.4
-     print(result.reward)                     # 0.4
- ```

  ### Run baseline inference
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
- export HF_TOKEN="sk-..."   # your API key
  export ENV_URL="http://localhost:8000"

  python inference.py

@@ -292,23 +339,24 @@ Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.jso
  ---

  ## Project Structure
  ```
  openenv-data-cleaning/
- ├── models.py            Pydantic contracts — Action / Observation / State
  ├── client.py            Sync HTTP client (reset / step / state / health)
  ├── inference.py         Baseline LLM agent with [START]/[STEP]/[END] logging
- ├── openenv.yaml         OpenEnv manifest
  ├── Dockerfile           python:3.11-slim, non-root user, HEALTHCHECK
  ├── requirements.txt     pip dependencies
- ├── pyproject.toml       Python package metadata + openenv-core dependency
  └── server/
-     ├── app.py               FastAPI routes + /metadata + /schema
-     ├── environment.py       reset / step / state logic + 6 operations + rewards
      ├── data_generator.py    Synthetic dataset generation (seed=42, reproducible)
      └── tasks/
-         ├── task1_missing.py     Easy — fill NaN grader
          ├── task2_format.py      Medium — format + duplicates grader
-         └── task3_pipeline.py    Hard — full pipeline grader
  ```

  ---

@@ -317,5 +365,8 @@ openenv-data-cleaning/
  🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space

  - Health: https://srishtichugh-openenv-hack.hf.space/health
- - Docs: https://srishtichugh-openenv-hack.hf.space/docs
 
  - openenv
  - rl
  - data-cleaning
+ - multi-agent
+ - data-quality
  ---

+ # DataMedic — AI Data Cleaning OpenEnv

+ An **agentic data quality environment** for training and evaluating AI agents on real-world data cleaning tasks.

+ An agent interacts with dirty pandas DataFrames through a standard `reset() / step() / state()` HTTP API, learning to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across **four progressively harder tasks** including a novel multi-source schema alignment challenge.

  🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
+ 🖥️ **Live DataMedic UI:** https://srishtichugh-openenv-hack.hf.space
  📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
  ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health

  ---

+ ## What Makes This Different
+
+ Most data cleaning tools are one-shot. DataMedic is an **RL training environment** where:
+
+ - The agent **diagnoses** a dirty dataset via `/profile` (completeness, uniqueness, validity %)
+ - It **plans** a treatment — every observation includes a `plan` field with the next recommended actions
+ - It **executes** cleaning operations step by step with dense per-step rewards
+ - It **receives a health certificate** via `/report` summarising what was fixed and how efficiently
+ - It **exports** the cleaned result via `/export`
+
+ Grounded in peer-reviewed research:
+ - **Bendinelli et al. 2025** — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)
+ - **CleanAgent** — Qi & Wang 2024 (arXiv:2403.08291)
+ - **AutoDCWorkflow** — EMNLP 2025 Findings
+ - **HoloClean** — Rekatsinas et al. 2017
+
+ ---
+
  ## Environment Description & Motivation

+ Real-world datasets are almost never clean. Data engineers routinely spend 60–80% of their time on data cleaning. This environment turns that into an RL challenge with:

+ - **Deterministic, programmatic graders** — ground-truth DataFrames generated with `seed=42`; every reward is reproducible
+ - **Meaningful partial rewards** — dense delta reward every step, not just at episode end
+ - **Four difficulty levels** — easy → medium → hard → expert (multi-source merge)
+ - **Live DQ metrics** — completeness %, uniqueness %, validity % in every observation
+ - **Agentic planning** — `plan` field recommends next actions; `tried_operations` prevents loops
+ - **No external data downloads** — all datasets generated synthetically via `numpy` + `Faker`
+
+ ---

+ ## DataMedic UI
+
+ Open `https://srishtichugh-openenv-hack.hf.space` in your browser to see the live monitoring dashboard:
+
+ - **Health Score Ring** — animated score gauge, color-coded by severity (green/amber/red)
+ - **DQ Dimension Bars** — live completeness, uniqueness, validity bars updating each step
+ - **Score Trajectory Chart** — real-time line chart of score vs steps
+ - **Agent Treatment Plan** — next recommended actions shown before each step
+ - **Operation Log** — every action taken, result, and reward delta streamed live
+ - **Dataset Preview** — first 10 rows with NULL values highlighted in red
+ - **Export CSV** — download the cleaned DataFrame at any point
+
+ Click any task button — the dataset loads automatically and the demo agent runs end-to-end.

  ---

  | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
  | `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
  | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
+ | `align_schema` | ❌ | — | Rename Source A columns to canonical schema *(Task 4 only)* |
+ | `merge_sources` | ❌ | — | Concatenate aligned Source A + Source B *(Task 4 only)* |

  **Format rules enforced by `fix_format`:**

  |---|---|
  | `phone` | `NNN-NNN-NNNN` |
  | `listed_date` / `signup_date` | `YYYY-MM-DD` |
+ | `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
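For reference, example `/step` payloads for these operations; the first five appeared in an earlier revision of this README, while the Task-4-only actions are an assumption that they follow the same shape (confirm via `GET /schema`):

```python
import json

# Example /step action payloads. The first five are taken from an earlier
# revision of this README; the Task-4-only actions (align_schema,
# merge_sources) are assumed to use the same envelope.
actions = [
    {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}},
    {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}},
    {"operation": "drop_duplicates"},
    {"operation": "fix_format", "column": "phone"},
    {"operation": "drop_outliers", "column": "purchase_amount"},
    {"operation": "align_schema"},   # Task 4 only (assumed shape)
    {"operation": "merge_sources"},  # Task 4 only (assumed shape)
]

for action in actions:
    print(json.dumps(action))
```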
 
 
 
 
 
 
 
 
 
 
  ---

  ## Observation Space

  Every `POST /reset` and `POST /step` returns:
+
  ```json
  {
    "observation": {
      "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
      "message": "Filled 20 missing values in 'age' using median.",
      "step_count": 1,
+     "current_score": 0.4000,
+     "dq_metrics": {
+       "completeness_pct": 86.67,
+       "uniqueness_pct": 100.0,
+       "validity_pct": 94.5,
+       "total_cells": 500,
+       "null_cells": 50,
+       "duplicate_rows": 0,
+       "invalid_cells": 12
+     },
+     "tried_operations": ["fill_missing:age"],
+     "plan": [
+       "fill_missing on \"salary\" (20 nulls) using median",
+       "fill_missing on \"department\" (10 nulls) using mode"
+     ]
    },
    "reward": 0.40,
    "done": false,

  | Field | Type | Description |
  |---|---|---|
  | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
+ | `reward` | float | Per-step delta reward |
+ | `data_preview` | string | First 10 rows as CSV |
  | `data_shape` | [int, int] | Current `[rows, cols]` |
  | `missing_counts` | object | `{column: null_count}` for columns with NaN |
  | `duplicate_count` | int | Number of duplicate rows |
+ | `dtype_issues` | object | `{column: issue_description}` |
+ | `task_description` | string | Full task instructions |
+ | `message` | string | Human-readable result of last action |
+ | `step_count` | int | Steps taken this episode |
+ | `current_score` | float | Running grader score 0.0–1.0 |
+ | `dq_metrics` | object | Completeness / uniqueness / validity % + raw counts |
+ | `tried_operations` | array | Operations already applied — prevents agent loops |
+ | `plan` | array | Up to 3 recommended next actions (rule-based planning engine) |
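A policy loop can consume these fields directly. A minimal sketch, where the field names come from the table above but the naive parsing of plan strings is an assumption, not part of the environment contract:

```python
# Minimal policy sketch over the observation schema above. The plan-string
# parsing assumes the 'op on "column" ...' phrasing shown in the example
# observation; the environment does not guarantee that format.
def choose_action(obs: dict) -> dict:
    if obs.get("plan"):  # prefer the environment's own recommendation
        step = obs["plan"][0]  # e.g. 'fill_missing on "salary" (20 nulls) using median'
        op = step.split(" on ")[0]
        col = step.split('"')[1] if '"' in step else None
        return {"operation": op, "column": col, "params": {}}
    if obs.get("missing_counts"):  # fall back to simple heuristics
        col = max(obs["missing_counts"], key=obs["missing_counts"].get)
        return {"operation": "fill_missing", "column": col, "params": {"strategy": "median"}}
    return {"operation": "drop_duplicates"}

obs = {"plan": ['fill_missing on "salary" (20 nulls) using median']}
print(choose_action(obs))  # {'operation': 'fill_missing', 'column': 'salary', 'params': {}}
```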
 
 
 
 
 
 
 
 
 
 
 
 
 
  ---

  | Property | Value |
  |---|---|
  | Dataset | 100-row employee records (name, age, salary, department, experience) |
+ | Issues | ~20% NaN in `age`, `salary`; ~10% NaN in `department` |
  | Goal | Fill all missing values |
  | Valid operations | `fill_missing` |
  | Grader | `1.0 − remaining_nulls / original_nulls` |
  | Max steps | 20 |
+ | Optimal steps | 3 |
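The Task 1 grader formula above fits in a few lines. A stdlib-only sketch that counts `None` cells in a list-of-dicts table; the real grader in `task1_missing.py` operates on pandas DataFrames and may round differently:

```python
# Sketch of the Task 1 grader: score = 1.0 - remaining_nulls / original_nulls.
# Here a "table" is a list of dicts and a missing cell is None.
def task1_score(rows: list[dict], original_nulls: int) -> float:
    remaining = sum(1 for row in rows for v in row.values() if v is None)
    return 1.0 - remaining / original_nulls

table = [
    {"age": 30, "salary": 50_000},
    {"age": None, "salary": 60_000},
    {"age": 41, "salary": None},
]
original = sum(1 for row in table for v in row.values() if v is None)  # 2 nulls

table[1]["age"] = 35.5  # fill one of the two missing cells
print(task1_score(table, original))  # 0.5
```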
  ### Task 2 — Fix Formats + Remove Duplicates *(Medium)*

  | Property | Value |
  |---|---|
  | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
+ | Issues | ~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows |
  | Goal | Standardise all phone/date formats and remove duplicates |
  | Valid operations | `fix_format`, `drop_duplicates` |
  | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |

  | Property | Value |
  |---|---|
  | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
+ | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats |
  | Goal | Fix all issues end-to-end |
  | Valid operations | All 6 operations |
  | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
  | Max steps | 40 |
  | Optimal steps | 8 |

+ ### Task 4 — Multi-Source Schema Alignment + Merge *(Expert)*
+
+ | Property | Value |
+ |---|---|
+ | Source A | 150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email` |
+ | Source B | 100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email` |
+ | Issues | Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows |
+ | Goal | Align schemas → merge → clean |
+ | Valid operations | `align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates` |
+ | Grader | `0.30×schema + 0.25×null + 0.20×country + 0.15×date + 0.10×dupe` |
+ | Max steps | 50 |
+ | Optimal steps | 8 |
+
+ *Inspired by Meta's DataSchema system — column-level semantic annotation across misaligned sources.*
+
  ---

  ## Reward Function

  | Score improves (delta > 0) | `new_score − old_score` (positive) |
  | Operation had no effect | `−0.01` |
  | Invalid operation / bad column | `−0.05` |

+ Rewards are bounded to **[−0.05, 0.99]**. Dense signal every step.
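The rules above amount to a small shaping function. A sketch assuming the environment knows the action's validity and the old/new grader scores; treating a negative delta the same as "no effect" is an assumption, and the exact clipping in `server/environment.py` may differ:

```python
# Sketch of the per-step reward rule tabulated above.
def step_reward(old_score: float, new_score: float, valid_action: bool) -> float:
    if not valid_action:
        return -0.05          # invalid operation / bad column
    delta = new_score - old_score
    if delta <= 0:
        return -0.01          # operation had no effect (negative delta assumed same)
    return min(delta, 0.99)   # positive delta, clipped to the stated upper bound

print(step_reward(0.0, 0.4, True))   # 0.4
print(step_reward(0.4, 0.4, True))   # -0.01
print(step_reward(0.4, 0.4, False))  # -0.05
```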
  ---

+ ## Intelligence Endpoints (Phase 2)

  | Method | Path | Description |
  |---|---|---|
+ | `GET` | `/profile` | Rich per-column DQ profile — null %, unique %, min/max/mean, top values |
+ | `GET` | `/report` | Full episode cleaning summary — score improvement, efficiency, issues fixed |
+ | `GET` | `/export` | Download current cleaned DataFrame as CSV |
+
+ ### `/profile` response example
+ ```json
+ {
+   "dq_metrics": {
+     "completeness_pct": 90.0,
+     "uniqueness_pct": 100.0,
+     "validity_pct": 88.5
+   },
+   "columns": {
+     "age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3}
+   }
+ }
+ ```
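The completeness and uniqueness percentages can be reproduced from first principles. A stdlib sketch over a list-of-dicts table; `validity_pct` is omitted because it depends on the per-column format rules the server applies:

```python
# Sketch of the completeness / uniqueness DQ dimensions shown above.
def dq_metrics(rows: list[dict]) -> dict:
    total_cells = sum(len(row) for row in rows)
    null_cells = sum(1 for row in rows for v in row.values() if v is None)
    seen, duplicate_rows = set(), 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicate_rows += 1
        seen.add(key)
    return {
        "completeness_pct": round(100.0 * (total_cells - null_cells) / total_cells, 2),
        "uniqueness_pct": round(100.0 * (len(rows) - duplicate_rows) / len(rows), 2),
        "total_cells": total_cells,
        "null_cells": null_cells,
        "duplicate_rows": duplicate_rows,
    }

rows = [{"age": 30}, {"age": None}, {"age": 30}]
print(dq_metrics(rows)["completeness_pct"])  # 66.67
```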
+
+ ### `/report` response example
+ ```json
+ {
+   "initial_score": 0.01,
+   "final_score": 0.99,
+   "score_improvement": 0.98,
+   "steps_taken": 3,
+   "step_efficiency_pct": 85.0,
+   "issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168},
+   "completed": true
+ }
+ ```
+
+ ---
+
+ ## All API Endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | `GET` | `/` | DataMedic live monitoring UI |
  | `GET` | `/health` | Health check → `{"status": "healthy"}` |
+ | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3\|4}` |
  | `POST` | `/step` | Execute action. Body: action JSON |
+ | `GET` | `/state` | Episode metadata |
+ | `GET` | `/metadata` | Environment info + paper citations |
+ | `GET` | `/schema` | Full action/observation/state JSON schemas |
+ | `GET` | `/profile` | Rich data quality profile of current DataFrame |
+ | `GET` | `/report` | Full episode cleaning summary |
+ | `GET` | `/export` | Download cleaned DataFrame as CSV |
  | `GET` | `/docs` | Interactive Swagger UI |
  ---

  | 1 — Fill Missing Values | Easy | 0.999 |
  | 2 — Fix Formats + Duplicates | Medium | 0.999 |
  | 3 — Full Cleaning Pipeline | Hard | 0.999 |
+ | 4 — Multi-Source Merge | Expert | 0.990 |
+ | **Average** | — | **0.997** |

  ---

  ## Setup & Usage

  ### Prerequisites
  - Python 3.11+
  - Docker (for containerised deployment)

  ### Local — Python
  ```bash
  git clone https://github.com/Tanvi51204/openEnv.git
  cd openEnv
  pip install -r requirements.txt
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
  ```

+ Then open:
+ - UI: http://localhost:8000
+ - Docs: http://localhost:8000/docs
+
  ### Local — Docker
  ```bash
  docker build -t data-cleaning-env .
  docker run -p 8000:8000 data-cleaning-env
  ```
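Once the server is up, any HTTP client works. A dependency-free Python sketch; the repository's own `client.py` (`DataCleaningEnvClient`, shown in an earlier revision of this README) is the supported interface, and this standalone version is illustrative only:

```python
import json
import urllib.request

# Minimal stdlib client sketch for the HTTP API above.
class MiniEnvClient:
    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:  # network call; server must be running
            return json.load(resp)

    def reset(self, task_id: int) -> dict:
        return self._post("/reset", {"task_id": task_id})

    def step(self, operation: str, column=None, **params) -> dict:
        return self._post("/step", {"operation": operation, "column": column, "params": params})

# Usage against a locally running server:
#   env = MiniEnvClient()
#   obs = env.reset(task_id=1)
#   result = env.step("fill_missing", column="salary", strategy="median")
#   print(result["reward"])
```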
  ### Run baseline inference
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
+ export HF_TOKEN="sk-..."
  export ENV_URL="http://localhost:8000"

  python inference.py

  ---

  ## Project Structure
+
  ```
  openenv-data-cleaning/
+ ├── models.py            Pydantic contracts — Action / Observation / State / DQMetrics / Report
  ├── client.py            Sync HTTP client (reset / step / state / health)
  ├── inference.py         Baseline LLM agent with [START]/[STEP]/[END] logging
  ├── Dockerfile           python:3.11-slim, non-root user, HEALTHCHECK
  ├── requirements.txt     pip dependencies
  └── server/
+     ├── app.py               FastAPI routes + /profile + /report + /export + UI
+     ├── environment.py       reset / step / state + 8 operations + planning engine + DQ metrics
      ├── data_generator.py    Synthetic dataset generation (seed=42, reproducible)
+     ├── ui.html              DataMedic live monitoring dashboard
      └── tasks/
+         ├── task1_missing.py     Easy — fill NaN grader
          ├── task2_format.py      Medium — format + duplicates grader
+         ├── task3_pipeline.py    Hard — full pipeline grader
+         └── task4_merge.py       Expert — multi-source schema alignment + merge grader
  ```

  ---

  🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space

+ - UI: https://srishtichugh-openenv-hack.hf.space
  - Health: https://srishtichugh-openenv-hack.hf.space/health
+ - Docs: https://srishtichugh-openenv-hack.hf.space/docs
+ - Profile: https://srishtichugh-openenv-hack.hf.space/profile
+ - Report: https://srishtichugh-openenv-hack.hf.space/report
inference.py CHANGED
@@ -37,6 +37,8 @@ if not HF_TOKEN:

  client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)

  SYSTEM_PROMPT = """You are a data cleaning agent. You control a data cleaning environment
  through JSON actions. Each turn you receive an observation JSON describing the current state
  of a dataset (preview, missing counts, duplicate count, dtype issues, current score, etc.)

@@ -165,10 +167,13 @@ def run_task(task_id: int) -> float:
      obs_text = obs_to_text(obs)
      history.append({"role": "user", "content": obs_text})

      try:
          response = client.chat.completions.create(
              model = MODEL_NAME,
-             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history,
              temperature = 0.0,
              max_tokens = 256,
          )

@@ -207,7 +212,7 @@ def run_task(task_id: int) -> float:
      obs = result["observation"]
      step_reward = result["reward"]
      done = result["done"]
-     error_msg = None if obs["message"].startswith("Fill") or step_reward >= 0 else obs["message"]

      print(f" -> {obs['message']}", file=sys.stderr)

  client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)

+ HISTORY_WINDOW = 8  # keep last N turns (user+assistant pairs) to cap token usage
+
  SYSTEM_PROMPT = """You are a data cleaning agent. You control a data cleaning environment
  through JSON actions. Each turn you receive an observation JSON describing the current state
  of a dataset (preview, missing counts, duplicate count, dtype issues, current score, etc.)

      obs_text = obs_to_text(obs)
      history.append({"role": "user", "content": obs_text})

+     # Sliding window — keep the system prompt plus the last HISTORY_WINDOW user/assistant turns
+     windowed_history = history[-(HISTORY_WINDOW * 2):]
+
      try:
          response = client.chat.completions.create(
              model = MODEL_NAME,
+             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + windowed_history,
              temperature = 0.0,
              max_tokens = 256,
          )

      obs = result["observation"]
      step_reward = result["reward"]
      done = result["done"]
+     error_msg = None if step_reward >= 0 else obs["message"]

      print(f" -> {obs['message']}", file=sys.stderr)
models.py CHANGED
@@ -9,28 +9,46 @@ class DataCleaningAction(BaseModel):
      operation choices:
          fill_missing     – fill NaN values in a column
          drop_duplicates  – remove duplicate rows
-         fix_format       – standardise string formats (phone, date, text)
          replace_value    – replace a specific value with another
          drop_outliers    – remove rows where column value is a statistical outlier
          fix_dtype        – cast a column to the correct dtype
      """
      operation: str
      column: Optional[str] = None
      params: Dict[str, Any] = {}

  class DataCleaningObservation(BaseModel):
      done: bool
      reward: float
-     data_preview: str      # First 10 rows as CSV string
-     data_shape: List[int]  # [rows, cols]
      missing_counts: Dict[str, int]
      duplicate_count: int
      dtype_issues: Dict[str, str]
      task_description: str
      message: str
      step_count: int
-     current_score: float   # Running grader score 0.0–1.0

  class DataCleaningState(BaseModel):
@@ -40,3 +58,20 @@ class DataCleaningState(BaseModel):
      max_steps: int
      total_errors: int
      errors_remaining: int

      operation choices:
          fill_missing     – fill NaN values in a column
          drop_duplicates  – remove duplicate rows
+         fix_format       – standardise string formats (phone, date, country)
          replace_value    – replace a specific value with another
          drop_outliers    – remove rows where column value is a statistical outlier
          fix_dtype        – cast a column to the correct dtype
+         align_schema     – rename / reorder columns to match target schema (Task 4)
+         merge_sources    – merge the two aligned source DataFrames (Task 4)
      """
      operation: str
      column: Optional[str] = None
      params: Dict[str, Any] = {}

+ class DataQualityMetrics(BaseModel):
+     """Standard DQ dimensions — populated by /profile and embedded in every observation."""
+     completeness_pct: float  # % non-null cells across whole DataFrame
+     uniqueness_pct: float    # % rows that are not duplicates
+     validity_pct: float      # % cells passing format / dtype / range constraints
+     total_cells: int
+     null_cells: int
+     duplicate_rows: int
+     invalid_cells: int       # format violations + dtype issues + out-of-range values
+
+
  class DataCleaningObservation(BaseModel):
      done: bool
      reward: float
+     data_preview: str      # First 10 rows as CSV string
+     data_shape: List[int]  # [rows, cols]
      missing_counts: Dict[str, int]
      duplicate_count: int
      dtype_issues: Dict[str, str]
      task_description: str
      message: str
      step_count: int
+     current_score: float   # Running grader score 0.0-1.0
+
+     # --- Phase 2 additions ---
+     dq_metrics: DataQualityMetrics  # Live data quality vitals
+     tried_operations: List[str]     # e.g. ["fill_missing:age", "drop_duplicates"]
+     plan: List[str]                 # Agent-facing recommended next 1-3 actions

  class DataCleaningState(BaseModel):
      max_steps: int
      total_errors: int
      errors_remaining: int
+
+
+ class EpisodeReport(BaseModel):
+     """Returned by GET /report — full cleaning episode summary."""
+     episode_id: str
+     task_id: int
+     task_name: str
+     initial_score: float
+     final_score: float
+     score_improvement: float
+     steps_taken: int
+     max_steps: int
+     step_efficiency_pct: float     # How few steps used vs max (higher = better)
+     operations_applied: List[str]  # Ordered list of what was done
+     issues_fixed: Dict[str, int]   # e.g. {"nulls_filled": 40, "dupes_removed": 15}
+     final_dq_metrics: DataQualityMetrics
+     completed: bool                # True if score >= 0.95
server/app.py CHANGED
@@ -1,26 +1,46 @@
  """
  FastAPI application exposing the OpenEnv-compatible HTTP API.
- Endpoints: GET /health, GET /metadata, GET /schema,
-            POST /reset, POST /step, GET /state, POST /state, GET /docs
  """

  from typing import Any, Dict, Optional
  from fastapi import Body, FastAPI, HTTPException
  from pydantic import BaseModel
  import uvicorn

- from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
  from server.environment import DataCleaningEnvironment

  app = FastAPI(
      title="Data Cleaning OpenEnv",
-     description="A real-world data cleaning environment for AI agent training.",
-     version="0.1.0",
  )

- # Single shared environment instance (stateful server)
  env = DataCleaningEnvironment()

  class ResetRequest(BaseModel):
      task_id: Optional[int] = None
@@ -34,9 +54,17 @@ class StepResponse(BaseModel):

  # ------------------------------------------------------------------
- # Routes
  # ------------------------------------------------------------------

  @app.get("/health")
  def health():
      return {"status": "healthy"}
@@ -47,16 +75,24 @@ def metadata():
      return {
          "name": "data-cleaning-env",
          "description": (
-             "A real-world data cleaning environment where an AI agent fixes "
-             "missing values, duplicate rows, format inconsistencies, outliers, "
-             "and dtype errors across three progressively harder tasks."
          ),
-         "version": "0.1.0",
-         "tags": ["openenv", "data-cleaning", "rl", "real-world"],
          "tasks": [
-             {"id": "task1", "name": "Fill Missing Values", "difficulty": "easy"},
              {"id": "task2", "name": "Fix Formats and Remove Duplicates", "difficulty": "medium"},
-             {"id": "task3", "name": "Full Cleaning Pipeline", "difficulty": "hard"},
          ],
      }
@@ -70,16 +106,13 @@ def schema():
              "operation": {
                  "type": "string",
                  "enum": [
-                     "fill_missing",
-                     "drop_duplicates",
-                     "fix_format",
-                     "replace_value",
-                     "drop_outliers",
-                     "fix_dtype",
                  ],
              },
              "column": {"type": "string", "nullable": True},
-             "params": {"type": "object", "nullable": True},
          },
          "required": ["operation"],
      },
@@ -97,6 +130,9 @@ def schema():
      "message": {"type": "string"},
      "step_count": {"type": "integer"},
      "current_score": {"type": "number"},
      },
  },
  "state": {
@@ -127,13 +163,20 @@ async def step(body: Dict[str, Any] = Body(...)):
      """
      Accept both openenv-core wrapped format:
          {"action": {"operation": "...", ...}, "timeout_s": 15}
-     and direct format (for backward compat with our own client/inference):
131
  {"operation": "...", "column": "...", "params": {...}}
 
132
  """
133
  action_data = body.get("action", body)
134
  try:
135
  action = DataCleaningAction(**action_data)
136
- obs = env.step(action)
 
 
 
 
 
 
137
  except (TypeError, KeyError, Exception) as e:
138
  raise HTTPException(status_code=400, detail=str(e))
139
  return StepResponse(observation=obs, reward=obs.reward, done=obs.done)
@@ -141,23 +184,77 @@ async def step(body: Dict[str, Any] = Body(...)):
141
 
142
  @app.get("/state", response_model=DataCleaningState)
143
  def state_get():
144
- """GET /state β€” openenv-core spec."""
145
  return env.state()
146
 
147
 
148
  @app.post("/state", response_model=DataCleaningState)
149
  def state_post():
150
- """POST /state β€” backward compatibility."""
151
  return env.state()
152
 
153
 
154
  # ------------------------------------------------------------------
155
- # Entry point (required by openenv-core and [project.scripts])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
156
  # ------------------------------------------------------------------
157
 
158
  def main():
159
- uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
160
 
161
 
162
  if __name__ == "__main__":
163
- main()
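The dual payload handling in the `/step` handler hinges on one line, `body.get("action", body)`: if the request nests the action under an `"action"` key (the openenv-core wrapped format), the inner dict is used; otherwise the body itself is treated as the action. A standalone sketch of just that normalization step:

```python
# Sketch of the payload normalization used by POST /step. extract_action is a
# name invented here; the real handler does this inline.
def extract_action(body: dict) -> dict:
    # Wrapped format: {"action": {...}, "timeout_s": 15} -> return the inner dict.
    # Direct format:  {"operation": ..., "column": ...}  -> return body itself.
    return body.get("action", body)

wrapped = {"action": {"operation": "fill_missing", "column": "age"}, "timeout_s": 15}
direct = {"operation": "fill_missing", "column": "age"}
norm_wrapped = extract_action(wrapped)
norm_direct = extract_action(direct)
```

Both forms normalize to the same action dict, so one Pydantic model validates both.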
 
  """
  FastAPI application exposing the OpenEnv-compatible HTTP API.
+
+ Endpoints:
+     GET  /health    Health check
+     GET  /metadata  Environment info
+     GET  /schema    Action / observation / state schemas
+     POST /reset     Start new episode
+     POST /step      Execute cleaning action (with 30s timeout)
+     GET  /state     Episode metadata
+     POST /state     Episode metadata (backward compat)
+     GET  /profile   Rich data quality profile of current DataFrame
+     GET  /report    Full episode cleaning summary (health certificate)
+     GET  /export    Download current cleaned DataFrame as CSV
  """

+ import asyncio
+ import os
  from typing import Any, Dict, Optional
  from fastapi import Body, FastAPI, HTTPException
+ from fastapi.responses import PlainTextResponse, HTMLResponse
  from pydantic import BaseModel
  import uvicorn

+ from models import DataCleaningAction, DataCleaningObservation, DataCleaningState, EpisodeReport
  from server.environment import DataCleaningEnvironment

  app = FastAPI(
      title="Data Cleaning OpenEnv",
+     description=(
+         "A real-world data cleaning environment for AI agent training and evaluation. "
+         "An agent interacts with dirty pandas DataFrames through a standard reset/step/state API, "
+         "learning to fix missing values, duplicates, format inconsistencies, outliers, and dtype errors. "
+         "Grounded in CleanAgent (2024), AutoDCWorkflow (EMNLP 2025), and Meta-scale data quality principles."
+     ),
+     version="0.2.0",
  )

+ # Single shared environment instance
  env = DataCleaningEnvironment()

+ STEP_TIMEOUT_SECONDS = 30
+

  class ResetRequest(BaseModel):
      task_id: Optional[int] = None


  # ------------------------------------------------------------------
+ # Core OpenEnv routes
  # ------------------------------------------------------------------

+ @app.get("/", response_class=HTMLResponse, include_in_schema=False)
+ def ui():
+     """DataMedic — live agent monitoring dashboard."""
+     ui_path = os.path.join(os.path.dirname(__file__), "ui.html")
+     with open(ui_path, "r") as f:
+         return HTMLResponse(content=f.read())
+
+
  @app.get("/health")
  def health():
      return {"status": "healthy"}

      return {
          "name": "data-cleaning-env",
          "description": (
+             "A real-world data cleaning RL environment. The agent diagnoses dirty datasets, "
+             "plans a treatment, executes cleaning operations step-by-step, and produces a "
+             "health certificate — grounded in AutoDCWorkflow, CleanAgent, and HoloClean research."
          ),
+         "version": "0.2.0",
+         "tags": ["openenv", "data-cleaning", "rl", "real-world", "agentic"],
          "tasks": [
+             {"id": "task1", "name": "Fill Missing Values", "difficulty": "easy"},
              {"id": "task2", "name": "Fix Formats and Remove Duplicates", "difficulty": "medium"},
+             {"id": "task3", "name": "Full Cleaning Pipeline", "difficulty": "hard"},
+             {"id": "task4", "name": "Multi-Source Schema Alignment + Merge", "difficulty": "expert"},
+         ],
+         "observation_extras": ["dq_metrics", "tried_operations", "plan"],
+         "papers": [
+             "Bendinelli et al. 2025 — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)",
+             "CleanAgent — Qi & Wang 2024 (arXiv:2403.08291)",
+             "AutoDCWorkflow — EMNLP 2025 Findings",
+             "HoloClean — Rekatsinas et al. 2017",
          ],
      }

          "operation": {
              "type": "string",
              "enum": [
+                 "fill_missing", "drop_duplicates", "fix_format",
+                 "replace_value", "drop_outliers", "fix_dtype",
+                 "align_schema", "merge_sources",
              ],
          },
          "column": {"type": "string", "nullable": True},
+         "params": {"type": "object", "nullable": True},
      },
      "required": ["operation"],
  },

      "message": {"type": "string"},
      "step_count": {"type": "integer"},
      "current_score": {"type": "number"},
+     "dq_metrics": {"type": "object", "description": "Completeness/uniqueness/validity %"},
+     "tried_operations": {"type": "array", "description": "Operations already applied"},
+     "plan": {"type": "array", "description": "Agent recommended next actions"},
  },
  },
  "state": {

      """
      Accept both openenv-core wrapped format:
          {"action": {"operation": "...", ...}, "timeout_s": 15}
+     and direct format:
          {"operation": "...", "column": "...", "params": {...}}
+     Times out after 30 seconds to prevent hanging during evaluation.
      """
      action_data = body.get("action", body)
      try:
          action = DataCleaningAction(**action_data)
+         loop = asyncio.get_event_loop()
+         obs = await asyncio.wait_for(
+             loop.run_in_executor(None, env.step, action),
+             timeout=STEP_TIMEOUT_SECONDS,
+         )
+     except asyncio.TimeoutError:
+         raise HTTPException(status_code=504, detail=f"Step timed out after {STEP_TIMEOUT_SECONDS}s")
      except (TypeError, KeyError, Exception) as e:
          raise HTTPException(status_code=400, detail=str(e))
      return StepResponse(observation=obs, reward=obs.reward, done=obs.done)


  @app.get("/state", response_model=DataCleaningState)
  def state_get():
      return env.state()


  @app.post("/state", response_model=DataCleaningState)
  def state_post():
      return env.state()


  # ------------------------------------------------------------------
+ # Phase 2: Intelligence endpoints
+ # ------------------------------------------------------------------
+
+ @app.get("/profile")
+ def profile():
+     """
+     Rich data quality profile of the current DataFrame.
+
+     Returns per-column statistics (null %, unique %, min/max/mean for numerics,
+     top values for categoricals) plus dataset-level DQ metrics:
+     completeness %, uniqueness %, validity %.
+
+     Inspired by standard Data Quality dimensions (ISO 8000) and
+     Meta's data schematization approach.
+     """
+     try:
+         return env.get_profile()
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ @app.get("/report", response_model=EpisodeReport)
+ def report():
+     """
+     Full episode cleaning summary — the 'health certificate'.
+
+     Returns: initial vs final score, score improvement, step efficiency,
+     ordered list of operations applied, issues fixed by category,
+     and final DQ metrics. Call after the episode completes for best results.
+     """
+     try:
+         return env.get_report()
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ @app.get("/export")
+ def export():
+     """
+     Download the current (cleaned) DataFrame as a CSV file.
+
+     Returns the live state of the DataFrame — call after the agent
+     finishes cleaning to get the cleaned output.
+     """
+     try:
+         csv_data = env.get_export()
+         return PlainTextResponse(
+             content=csv_data,
+             media_type="text/csv",
+             headers={"Content-Disposition": "attachment; filename=cleaned_data.csv"},
+         )
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ # ------------------------------------------------------------------
+ # Entry point
  # ------------------------------------------------------------------

  def main():
+     uvicorn.run("server.app:app", host="0.0.0.0", port=8000, workers=1)


  if __name__ == "__main__":
+     main()
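The `/step` timeout pattern above runs the blocking `env.step` in a thread executor and bounds it with `asyncio.wait_for`. A self-contained sketch of that pattern (with a stand-in `slow_step` in place of `env.step`):

```python
import asyncio

# slow_step stands in for the blocking env.step call; it is not part of the
# real environment.
def slow_step() -> str:
    return "cleaned"

async def step_with_timeout(timeout_s: float) -> str:
    # Run the blocking call off the event loop, and cancel it (raising
    # asyncio.TimeoutError) if it exceeds the budget.
    loop = asyncio.get_running_loop()
    return await asyncio.wait_for(loop.run_in_executor(None, slow_step), timeout=timeout_s)

result = asyncio.run(step_with_timeout(5.0))
```

On timeout, `asyncio.wait_for` raises `asyncio.TimeoutError`, which the handler maps to an HTTP 504.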
server/data_generator.py CHANGED
@@ -195,3 +195,105 @@ def generate_task3_datasets():
      dirty_df = pd.concat([dirty_df, dup_rows], ignore_index=True)

      return dirty_df.reset_index(drop=True), clean_df.reset_index(drop=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Task 4 — Multi-source merge pipeline (Expert)
+ # ---------------------------------------------------------------------------
+ # Two independently generated "source" DataFrames with misaligned schemas
+ # that must be aligned and merged before the standard cleaning pipeline.
+ #
+ # Source A — CRM export (150 rows):
+ #     cust_id, full_name, Age, purchase_amt, Country, signup
+ #
+ # Source B — Marketing export (100 rows):
+ #     customer_id, name, age_years, spend, country_name, registration_date, email
+ #
+ # Target schema after align_schema + merge_sources (250 rows):
+ #     customer_id, name, age, purchase_amount, country, signup_date, email
+ #
+ # Additional dirty issues injected after merge:
+ #     - Missing values in age, purchase_amount, country (~10%)
+ #     - Mixed country capitalisation (~30%)
+ #     - Mixed date formats in signup_date (~40%)
+ #     - 10 duplicate rows
+
+ def generate_task4_datasets():
+     """
+     Returns (source_a, source_b, clean_merged_df).
+     source_a and source_b have misaligned schemas.
+     clean_merged_df is the ground-truth after alignment + merge + cleaning.
+     """
+     rng = np.random.default_rng(SEED + 4)  # distinct seed offset
+     random.seed(SEED + 4)
+
+     countries = ["USA", "UK", "Canada", "Australia", "Germany"]
+     first_names = ["Alice", "Bob", "Carol", "David", "Eve", "Frank",
+                    "Grace", "Heidi", "Ivan", "Judy", "Karl", "Laura"]
+     last_names = ["Smith", "Jones", "Brown", "Taylor", "Wilson",
+                   "Davis", "Miller", "Anderson", "Thomas", "Jackson"]
+
+     # ---- Source A — CRM (150 rows) ----
+     n_a = 150
+     names_a = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n_a)]
+     ages_a = rng.integers(18, 75, size=n_a).astype(float)
+     amounts_a = np.round(rng.uniform(10.0, 500.0, size=n_a), 2)
+     countries_a = rng.choice(countries, size=n_a)
+     days_a = rng.integers(0, 730, size=n_a)
+     dates_a = [(pd.Timestamp("2022-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+                for d in days_a]
+     emails_a = [f"crm_{i}@example.com" for i in range(1, n_a + 1)]
+
+     source_a = pd.DataFrame({
+         "cust_id": [f"A{str(i).zfill(4)}" for i in range(1, n_a + 1)],
+         "full_name": names_a,        # → name
+         "Age": ages_a,               # → age (capital A — schema mismatch)
+         "purchase_amt": amounts_a,   # → purchase_amount (truncated name)
+         "Country": countries_a,      # → country (capital C)
+         "signup": dates_a,           # → signup_date (truncated name)
+         "email": emails_a,
+     })
+
+     # ---- Source B — Marketing (100 rows) ----
+     n_b = 100
+     names_b = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n_b)]
+     ages_b = rng.integers(18, 75, size=n_b).astype(float)
+     amounts_b = np.round(rng.uniform(10.0, 500.0, size=n_b), 2)
+     countries_b = rng.choice(countries, size=n_b)
+     days_b = rng.integers(0, 730, size=n_b)
+     dates_b = [(pd.Timestamp("2022-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+                for d in days_b]
+     emails_b = [f"mkt_{i}@example.com" for i in range(1, n_b + 1)]
+
+     source_b = pd.DataFrame({
+         "customer_id": [f"B{str(i).zfill(4)}" for i in range(1, n_b + 1)],
+         "name": names_b,
+         "age_years": ages_b,             # → age (suffix mismatch)
+         "spend": amounts_b,              # → purchase_amount (synonym)
+         "country_name": countries_b,     # → country (suffix mismatch)
+         "registration_date": dates_b,    # → signup_date (synonym)
+         "email": emails_b,
+     })
+
+     # ---- Ground-truth clean merged DataFrame ----
+     clean_a = pd.DataFrame({
+         "customer_id": source_a["cust_id"],
+         "name": source_a["full_name"],
+         "age": source_a["Age"],
+         "purchase_amount": source_a["purchase_amt"],
+         "country": source_a["Country"],
+         "signup_date": source_a["signup"],
+         "email": source_a["email"],
+     })
+     clean_b = pd.DataFrame({
+         "customer_id": source_b["customer_id"],
+         "name": source_b["name"],
+         "age": source_b["age_years"],
+         "purchase_amount": source_b["spend"],
+         "country": source_b["country_name"],
+         "signup_date": source_b["registration_date"],
+         "email": source_b["email"],
+     })
+     clean_merged = pd.concat([clean_a, clean_b], ignore_index=True).reset_index(drop=True)
+
+     return source_a.copy(), source_b.copy(), clean_merged
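The column comments in `generate_task4_datasets` spell out the rename mapping each source needs. A minimal sketch of the `align_schema` + `merge_sources` idea using those same mappings (the function and constant names here are illustrative, not the environment's actual operation implementations):

```python
import pandas as pd

# Rename maps mirroring the column comments in generate_task4_datasets.
RENAME_A = {"cust_id": "customer_id", "full_name": "name", "Age": "age",
            "purchase_amt": "purchase_amount", "Country": "country", "signup": "signup_date"}
RENAME_B = {"age_years": "age", "spend": "purchase_amount",
            "country_name": "country", "registration_date": "signup_date"}

def align_and_merge(source_a: pd.DataFrame, source_b: pd.DataFrame) -> pd.DataFrame:
    # Align both sources to the shared target schema, then stack them.
    a = source_a.rename(columns=RENAME_A)
    b = source_b.rename(columns=RENAME_B)
    return pd.concat([a, b], ignore_index=True)

a = pd.DataFrame({"cust_id": ["A0001"], "full_name": ["Alice Smith"], "Age": [30.0],
                  "purchase_amt": [19.99], "Country": ["USA"], "signup": ["2022-01-01"],
                  "email": ["crm_1@example.com"]})
b = pd.DataFrame({"customer_id": ["B0001"], "name": ["Bob Jones"], "age_years": [41.0],
                  "spend": [250.0], "country_name": ["UK"], "registration_date": ["2022-05-01"],
                  "email": ["mkt_1@example.com"]})
merged = align_and_merge(a, b)
```

Because both renamed frames share the target schema, `pd.concat` produces the 7-column merged table with no NaN-padded extra columns.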
server/environment.py CHANGED
@@ -1,21 +1,37 @@
  """
  Core environment implementing reset / step / state.
- Each call to reset() picks a task (round-robin: 1 → 2 → 3 → 1 …)
  or a specific task_id can be forced via reset(task_id=N).
  """

  import re
  import uuid
  import numpy as np
  import pandas as pd
- from typing import Any, Dict, Optional, Tuple

- from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
  import server.tasks.task1_missing as t1
  import server.tasks.task2_format as t2
  import server.tasks.task3_pipeline as t3

- TASK_MODULES = {1: t1, 2: t2, 3: t3}

  PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
@@ -25,16 +41,28 @@ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
  class DataCleaningEnvironment:

      def __init__(self):
-         self._df: Optional[pd.DataFrame] = None
          self._clean_df: Optional[pd.DataFrame] = None
-         self._meta: Any = None  # task-specific metadata
-         self._task_id: int = 1
-         self._episode_id: str = ""
-         self._step_count: int = 0
-         self._max_steps: int = 20
-         self._total_errors: int = 0
-         self._last_score: float = 0.01
-         self._task_cycle: int = 0  # for round-robin default

      # ------------------------------------------------------------------
      # Public API
@@ -46,21 +74,35 @@ class DataCleaningEnvironment:
          task_id = self._task_cycle

          if task_id not in TASK_MODULES:
-             raise ValueError(f"task_id must be 1, 2, or 3 — got {task_id}")

          mod = TASK_MODULES[task_id]
-         self._task_id = task_id
          self._episode_id = str(uuid.uuid4())
          self._step_count = 0
          self._max_steps = mod.MAX_STEPS

-         if task_id == 1:
-             self._df, self._clean_df, self._meta = mod.load()
          else:
              self._df, self._clean_df, self._meta = mod.load()

-         self._last_score = self._compute_score()
-         self._total_errors = self._count_errors()

          return self._build_obs(self._last_score, False, "Episode started. Begin cleaning.")

@@ -71,27 +113,32 @@ class DataCleaningEnvironment:
          self._step_count += 1
          score_before = self._last_score

          message, applied = self._apply_action(action)

-         score_after = self._compute_score()
          self._last_score = score_after

          delta = score_after - score_before
          if not applied:
-             reward = 0.01
          elif delta <= 0:
-             reward = 0.01
          else:
              reward = round(delta, 4)

          done = (score_after >= 0.95) or (self._step_count >= self._max_steps)
-
-         # Clamp reward strictly within (0.01, 0.99) — no terminal bonus
-         reward = round(max(0.01, min(0.99, reward)), 4)

          return self._build_obs(reward, done, message)

-
      def state(self) -> DataCleaningState:
          if self._df is None:
              return DataCleaningState(
@@ -99,37 +146,235 @@ class DataCleaningEnvironment:
                  max_steps=0, total_errors=0, errors_remaining=0,
              )
          return DataCleaningState(
-             episode_id = self._episode_id,
-             task_id = self._task_id,
-             step_count = self._step_count,
-             max_steps = self._max_steps,
-             total_errors = self._total_errors,
              errors_remaining = self._count_errors(),
          )

      # ------------------------------------------------------------------
      # Internal helpers
      # ------------------------------------------------------------------

      def _compute_score(self) -> float:
          if self._task_id == 1:
              raw = t1.score(self._df, self._meta)
          elif self._task_id == 2:
              raw = t2.score(self._df, self._meta)
-         else:
              raw = t3.score(self._df, self._meta)
-
-         EPS = 1e-4
-
-         # First round safely
          raw = float(raw)
-
-         # HARD clamp AFTER rounding risk
          if raw >= 1.0:
              raw = 1.0 - EPS
          elif raw <= 0.0:
              raw = EPS
-
          return round(raw, 4)

      def _count_errors(self) -> int:
@@ -137,28 +382,35 @@ class DataCleaningEnvironment:
              return t1.count_errors(self._df)
          elif self._task_id == 2:
              return t2.count_errors(self._df, self._meta)
-         else:
              return t3.count_errors(self._df, self._meta)

      def _build_obs(self, reward: float, done: bool, message: str) -> DataCleaningObservation:
-         mod = TASK_MODULES[self._task_id]
-         missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
-         dupes = len(self._df) - len(self._df.drop_duplicates())
          dtype_issues = self._detect_dtype_issues()
-         preview = self._df.head(10).to_csv(index=False)

          return DataCleaningObservation(
-             done = done,
-             reward = reward,
-             data_preview = preview,
-             data_shape = list(self._df.shape),
-             missing_counts = missing,
-             duplicate_count = dupes,
-             dtype_issues = dtype_issues,
-             task_description = mod.DESCRIPTION,
-             message = message,
-             step_count = self._step_count,
-             current_score = self._last_score,
          )

      def _detect_dtype_issues(self) -> Dict[str, str]:
@@ -195,8 +447,17 @@ class DataCleaningEnvironment:
              return self._drop_outliers(col)
          elif op == "fix_dtype":
              return self._fix_dtype(col, p)
          else:
-             return f"Unknown operation '{op}'. Choose from: fill_missing, drop_duplicates, fix_format, replace_value, drop_outliers, fix_dtype.", False
          except Exception as exc:
              return f"Operation failed: {exc}", False

@@ -230,8 +491,7 @@ class DataCleaningEnvironment:
      def _drop_duplicates(self) -> Tuple[str, bool]:
          n_before = len(self._df)
          self._df = self._df.drop_duplicates().reset_index(drop=True)
-         n_after = len(self._df)
-         removed = n_before - n_after
          if removed == 0:
              return "No duplicate rows found.", False
          return f"Dropped {removed} duplicate rows.", True
@@ -239,7 +499,6 @@ class DataCleaningEnvironment:
      def _fix_format(self, col) -> Tuple[str, bool]:
          if col is None or col not in self._df.columns:
              return f"Column '{col}' not found.", False
-
          if col == "phone":
              return self._fix_phone(col)
          elif col in ("listed_date", "signup_date"):
@@ -278,7 +537,6 @@ class DataCleaningEnvironment:
                  return pd.to_datetime(s, format=fmt).strftime("%Y-%m-%d")
              except Exception:
                  pass
-         # last-resort flexible parse
          try:
              return pd.to_datetime(s).strftime("%Y-%m-%d")
          except Exception:
@@ -311,7 +569,7 @@ class DataCleaningEnvironment:
          after = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
          fixed = int(before - after)
          if fixed == 0:
-             return f"No country capitalisation issues found.", False
          return f"Fixed {fixed} country values to correct capitalisation.", True

      def _replace_value(self, col, p) -> Tuple[str, bool]:
@@ -343,6 +601,69 @@ class DataCleaningEnvironment:
              return f"No outliers found in '{col}'.", False
          return f"Removed {removed} outlier rows from '{col}' using IQR method.", True

      def _fix_dtype(self, col, p) -> Tuple[str, bool]:
          if col is None or col not in self._df.columns:
              return f"Column '{col}' not found.", False
@@ -358,4 +679,4 @@ class DataCleaningEnvironment:
              return f"Unknown dtype '{dtype}'.", False
          return f"Converted '{col}' to {dtype}.", True
          except Exception as exc:
-             return f"dtype conversion failed: {exc}", False
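`_drop_outliers` reports removals "using IQR method". The diff does not show its fence multiplier, so the conventional 1.5x IQR fence is assumed in this stdlib-only sketch (`iqr_bounds` is a name invented here):

```python
# Sketch of an IQR outlier fence, assuming the conventional 1.5x multiplier.
def iqr_bounds(values, k=1.5):
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between order statistics (numpy's default method).
        idx = q * (len(s) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        return s[lo] + (idx - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1.0, 2.0, 3.0, 4.0, 100.0]
lo, hi = iqr_bounds(data)
kept = [v for v in data if lo <= v <= hi]  # rows outside the fence are dropped
```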
 
1
  """
2
  Core environment implementing reset / step / state.
3
+ Each call to reset() picks a task (round-robin: 1 -> 2 -> 3 -> 1 ...)
4
  or a specific task_id can be forced via reset(task_id=N).
5
+
6
+ Phase 2 additions:
7
+ - DataQualityMetrics computed every step (completeness, uniqueness, validity)
8
+ - tried_operations: deduplication log so agent avoids repeating useless ops
9
+ - plan: rule-based next-action recommendations surfaced in every observation
10
+ - Episode history tracked for /report endpoint
11
  """
12
 
13
  import re
14
  import uuid
15
  import numpy as np
16
  import pandas as pd
17
+ from typing import Any, Dict, List, Optional, Tuple
18
 
19
+ from models import (
20
+ DataCleaningAction, DataCleaningObservation,
21
+ DataCleaningState, DataQualityMetrics, EpisodeReport,
22
+ )
23
  import server.tasks.task1_missing as t1
24
  import server.tasks.task2_format as t2
25
  import server.tasks.task3_pipeline as t3
26
+ import server.tasks.task4_merge as t4
27
 
28
+ TASK_MODULES = {1: t1, 2: t2, 3: t3, 4: t4}
29
+ TASK_NAMES = {
30
+ 1: "Fill Missing Values",
31
+ 2: "Fix Formats + Remove Duplicates",
32
+ 3: "Full Cleaning Pipeline",
33
+ 4: "Multi-Source Schema Alignment + Merge",
34
+ }
35
 
36
  PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
37
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
 
41
  class DataCleaningEnvironment:
42
 
43
  def __init__(self):
44
+ self._df: Optional[pd.DataFrame] = None
45
  self._clean_df: Optional[pd.DataFrame] = None
46
+ self._meta: Any = None
47
+ self._task_id: int = 1
48
+ self._episode_id: str = ""
49
+ self._step_count: int = 0
50
+ self._max_steps: int = 20
51
+ self._total_errors: int = 0
52
+ self._last_score: float = 0.01
53
+ self._initial_score: float = 0.01
54
+ self._task_cycle: int = 0
55
+
56
+ # Phase 2 tracking
57
+ self._tried_operations: List[str] = []
58
+ self._operations_log: List[str] = []
59
+ self._issues_fixed: Dict[str, int] = {}
60
+ self._initial_dq: Optional[DataQualityMetrics] = None
61
+
62
+ # Task 4 state
63
+ self._source_b: Optional[pd.DataFrame] = None # held until merge_sources called
64
+ self._schema_aligned: bool = False
65
+ self._sources_merged: bool = False
66
 
67
  # ------------------------------------------------------------------
68
  # Public API
 
74
  task_id = self._task_cycle
75
 
76
  if task_id not in TASK_MODULES:
77
+ raise ValueError(f"task_id must be 1, 2, 3, or 4 β€” got {task_id}")
78
 
79
  mod = TASK_MODULES[task_id]
80
+ self._task_id = task_id
81
  self._episode_id = str(uuid.uuid4())
82
  self._step_count = 0
83
  self._max_steps = mod.MAX_STEPS
84
 
85
+ # Task 4 returns 4 values; others return 3
86
+ if task_id == 4:
87
+ self._df, self._source_b, self._clean_df, self._meta = mod.load()
88
+ self._schema_aligned = False
89
+ self._sources_merged = False
90
  else:
91
  self._df, self._clean_df, self._meta = mod.load()
92
+ self._source_b = None
93
+ self._schema_aligned = False
94
+ self._sources_merged = False
95
 
96
+ self._last_score = self._compute_score()
97
+ self._initial_score = self._last_score
98
+ self._total_errors = self._count_errors()
99
+
100
+ # Reset Phase 2 state
101
+ self._tried_operations = []
102
+ self._operations_log = []
103
+ self._issues_fixed = {"nulls_filled": 0, "dupes_removed": 0,
104
+ "formats_fixed": 0, "outliers_removed": 0}
105
+ self._initial_dq = self._compute_dq_metrics()
106
 
107
  return self._build_obs(self._last_score, False, "Episode started. Begin cleaning.")
108
 
 
113
  self._step_count += 1
114
  score_before = self._last_score
115
 
116
+ # Track tried operations BEFORE applying (for feedback loop)
117
+ op_key = self._make_op_key(action)
118
+
119
  message, applied = self._apply_action(action)
120
 
121
+ score_after = self._compute_score()
122
  self._last_score = score_after
123
 
124
  delta = score_after - score_before
125
  if not applied:
126
+ reward = -0.01
127
  elif delta <= 0:
128
+ reward = -0.01
129
  else:
130
  reward = round(delta, 4)
131
+ # Log successful operation
132
+ if op_key not in self._tried_operations:
133
+ self._tried_operations.append(op_key)
134
+ self._operations_log.append(message)
135
+ self._update_issues_fixed(action, message)
136
 
137
  done = (score_after >= 0.95) or (self._step_count >= self._max_steps)
138
+ reward = round(max(-0.05, min(0.99, reward)), 4)
 
 
139
 
140
  return self._build_obs(reward, done, message)
141
 
 
142
  def state(self) -> DataCleaningState:
143
  if self._df is None:
144
  return DataCleaningState(
 
146
  max_steps=0, total_errors=0, errors_remaining=0,
147
  )
148
  return DataCleaningState(
149
+ episode_id = self._episode_id,
150
+ task_id = self._task_id,
151
+ step_count = self._step_count,
152
+ max_steps = self._max_steps,
153
+ total_errors = self._total_errors,
154
  errors_remaining = self._count_errors(),
155
  )
156
 
157
+ def get_profile(self) -> Dict[str, Any]:
158
+ """Rich data profile for GET /profile endpoint."""
159
+ if self._df is None:
160
+ return {}
161
+
162
+ dq = self._compute_dq_metrics()
163
+ profile: Dict[str, Any] = {
164
+ "episode_id": self._episode_id,
165
+ "task_id": self._task_id,
166
+ "shape": {"rows": self._df.shape[0], "cols": self._df.shape[1]},
167
+ "dq_metrics": dq.model_dump(),
168
+ "columns": {},
169
+ }
170
+
171
+ for col in self._df.columns:
172
+ series = self._df[col]
173
+ col_info: Dict[str, Any] = {
174
+ "dtype": str(series.dtype),
175
+ "null_count": int(series.isnull().sum()),
176
+ "null_pct": round(series.isnull().mean() * 100, 2),
177
+ "unique_count": int(series.nunique(dropna=True)),
178
+ "unique_pct": round(series.nunique(dropna=True) / max(len(series), 1) * 100, 2),
179
+ }
180
+ if pd.api.types.is_numeric_dtype(series):
+ desc = series.describe()
+ col_info.update({
+ "min": round(float(desc["min"]), 4) if pd.notna(desc["min"]) else None,
+ "max": round(float(desc["max"]), 4) if pd.notna(desc["max"]) else None,
+ "mean": round(float(desc["mean"]), 4) if pd.notna(desc["mean"]) else None,
+ "median": round(float(series.median()), 4) if pd.notna(series.median()) else None,
+ "std": round(float(desc["std"]), 4) if pd.notna(desc.get("std", float("nan"))) else None,
+ })
+ else:
+ top = series.value_counts(dropna=True).head(3).to_dict()
+ col_info["top_values"] = {str(k): int(v) for k, v in top.items()}
+
+ profile["columns"][col] = col_info
+
+ return profile
+
+ def get_report(self) -> EpisodeReport:
+ """Full episode cleaning summary for GET /report endpoint."""
+ if self._df is None:
+ raise RuntimeError("No active episode.")
+
+ steps_used = self._step_count
+ efficiency = round((1 - steps_used / max(self._max_steps, 1)) * 100, 1)
+
+ return EpisodeReport(
+ episode_id = self._episode_id,
+ task_id = self._task_id,
+ task_name = TASK_NAMES.get(self._task_id, f"Task {self._task_id}"),
+ initial_score = self._initial_score,
+ final_score = self._last_score,
+ score_improvement = round(self._last_score - self._initial_score, 4),
+ steps_taken = steps_used,
+ max_steps = self._max_steps,
+ step_efficiency_pct = max(0.0, efficiency),
+ operations_applied = list(self._operations_log),
+ issues_fixed = dict(self._issues_fixed),
+ final_dq_metrics = self._compute_dq_metrics(),
+ completed = self._last_score >= 0.95,
+ )
+
+ def get_export(self) -> str:
+ """Return current cleaned DataFrame as CSV string for GET /export."""
+ if self._df is None:
+ raise RuntimeError("No active episode.")
+ return self._df.to_csv(index=False)
+
  # ------------------------------------------------------------------
  # Internal helpers
  # ------------------------------------------------------------------
 
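The step-efficiency figure reported by `get_report` reduces to a small pure function. A minimal sketch of that formula (the helper name `step_efficiency` is ours, not part of the environment):

```python
def step_efficiency(steps_used: int, max_steps: int) -> float:
    # Percentage of the step budget left unused, floored at 0 and
    # rounded to one decimal, mirroring the report computation.
    return max(0.0, round((1 - steps_used / max(max_steps, 1)) * 100, 1))

print(step_efficiency(12, 40))  # → 70.0
```

The `max(max_steps, 1)` guard avoids a ZeroDivisionError if an episode is somehow configured with zero steps.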
+ def _make_op_key(self, action: DataCleaningAction) -> str:
+ if action.column:
+ return f"{action.operation}:{action.column}"
+ return action.operation
+
+ def _update_issues_fixed(self, action: DataCleaningAction, message: str) -> None:
+ op = action.operation.lower()
+ # Parse numbers from message e.g. "Filled 20 missing values..."
+ nums = re.findall(r"\d+", message)
+ n = int(nums[0]) if nums else 1
+ if op == "fill_missing":
+ self._issues_fixed["nulls_filled"] = self._issues_fixed.get("nulls_filled", 0) + n
+ elif op == "drop_duplicates":
+ self._issues_fixed["dupes_removed"] = self._issues_fixed.get("dupes_removed", 0) + n
+ elif op == "fix_format":
+ self._issues_fixed["formats_fixed"] = self._issues_fixed.get("formats_fixed", 0) + n
+ elif op == "drop_outliers":
+ self._issues_fixed["outliers_removed"] = self._issues_fixed.get("outliers_removed", 0) + n
+
+ def _compute_dq_metrics(self) -> DataQualityMetrics:
+ total_cells = int(self._df.size)
+ null_cells = int(self._df.isnull().sum().sum())
+ duplicate_rows = int(len(self._df) - len(self._df.drop_duplicates()))
+ invalid_cells = self._count_invalid_cells()
+
+ completeness = round((1 - null_cells / max(total_cells, 1)) * 100, 2)
+ uniqueness = round((1 - duplicate_rows / max(len(self._df), 1)) * 100, 2)
+ validity = round((1 - invalid_cells / max(total_cells, 1)) * 100, 2)
+
+ return DataQualityMetrics(
+ completeness_pct = completeness,
+ uniqueness_pct = uniqueness,
+ validity_pct = validity,
+ total_cells = total_cells,
+ null_cells = null_cells,
+ duplicate_rows = duplicate_rows,
+ invalid_cells = invalid_cells,
+ )
+
+ def _count_invalid_cells(self) -> int:
+ """Count cells with format/dtype/range violations."""
+ invalid = 0
+ for col in self._df.columns:
+ series = self._df[col].dropna()
+ if col == "phone":
+ invalid += int((~series.astype(str).str.match(PHONE_RE, na=False)).sum())
+ elif col in ("listed_date", "signup_date"):
+ invalid += int((~series.apply(
+ lambda x: bool(DATE_RE.match(str(x)))
+ )).sum())
+ elif col == "country":
+ invalid += int((~series.isin(VALID_COUNTRIES)).sum())
+ elif col == "age":
+ invalid += int(((series < 0) | (series > 120)).sum())
+ elif col == "salary":
+ invalid += int((series < 0).sum())
+ elif col == "purchase_amount":
+ invalid += int((series < 0).sum())
+ return invalid
+
+ def _generate_plan(self) -> List[str]:
+ """
+ Rule-based planning engine — inspects current DataFrame state
+ and returns up to 3 prioritised recommended actions.
+ Inspired by AutoDCWorkflow (EMNLP 2025).
+ """
+ plan: List[str] = []
+ if self._df is None:
+ return plan
+
+ # Task 4: schema alignment + merge must happen first
+ if self._task_id == 4:
+ if not self._schema_aligned:
+ return ["align_schema — rename Source A columns to canonical schema (required first step)"]
+ if not self._sources_merged:
+ return ["merge_sources — concatenate aligned Source A + Source B (required before cleaning)"]
+
+ missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
+ dupes = len(self._df) - len(self._df.drop_duplicates())
+
+ # Priority 1: missing values (highest DQ impact)
+ for col, count in sorted(missing.items(), key=lambda x: -x[1]):
+ op_key = f"fill_missing:{col}"
+ if op_key not in self._tried_operations:
+ strategy = "mode" if self._df[col].dtype == object else "median"
+ plan.append(
+ f'fill_missing on "{col}" ({count} nulls) using {strategy}'
+ )
+ if len(plan) >= 2:
+ break
+
+ # Priority 2: duplicates
+ if dupes > 0 and "drop_duplicates" not in self._tried_operations:
+ plan.append(f"drop_duplicates ({dupes} duplicate rows found)")
+
+ # Priority 3: format issues
+ for col in self._df.columns:
+ if len(plan) >= 3:
+ break
+ op_key = f"fix_format:{col}"
+ if op_key in self._tried_operations:
+ continue
+ if col == "phone":
+ bad = int((~self._df[col].dropna().astype(str).str.match(PHONE_RE)).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "phone" ({bad} malformed numbers)')
+ elif col in ("listed_date", "signup_date"):
+ bad = int((~self._df[col].dropna().apply(
+ lambda x: bool(DATE_RE.match(str(x)))
+ )).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "{col}" ({bad} malformed dates)')
+ elif col == "country":
+ bad = int((~self._df[col].dropna().isin(VALID_COUNTRIES)).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "country" ({bad} invalid values)')
+
+ # Priority 4: outliers on numeric cols
+ if len(plan) < 3:
+ for col in self._df.select_dtypes(include=[np.number]).columns:
+ op_key = f"drop_outliers:{col}"
+ if op_key in self._tried_operations:
+ continue
+ q1, q3 = self._df[col].quantile(0.25), self._df[col].quantile(0.75)
+ iqr = q3 - q1
+ outliers = int((self._df[col] > q3 + 3 * iqr).sum())
+ if outliers > 0:
+ plan.append(f'drop_outliers on "{col}" ({outliers} extreme values)')
+ break
+
+ return plan[:3]
+
  def _compute_score(self) -> float:
  if self._task_id == 1:
  raw = t1.score(self._df, self._meta)
  elif self._task_id == 2:
  raw = t2.score(self._df, self._meta)
+ elif self._task_id == 3:
  raw = t3.score(self._df, self._meta)
+ else:
+ raw = t4.score(self._df, self._meta)
 
  raw = float(raw)
+ EPS = 1e-4
  if raw >= 1.0:
  raw = 1.0 - EPS
  elif raw <= 0.0:
  raw = EPS
  return round(raw, 4)
 
  def _count_errors(self) -> int:
  if self._task_id == 1:
  return t1.count_errors(self._df)
  elif self._task_id == 2:
  return t2.count_errors(self._df, self._meta)
+ elif self._task_id == 3:
  return t3.count_errors(self._df, self._meta)
+ else:
+ return t4.count_errors(self._df, self._meta)
 
  def _build_obs(self, reward: float, done: bool, message: str) -> DataCleaningObservation:
+ mod = TASK_MODULES[self._task_id]
+ missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
+ dupes = len(self._df) - len(self._df.drop_duplicates())
  dtype_issues = self._detect_dtype_issues()
+ preview = self._df.head(10).to_csv(index=False)
+ dq_metrics = self._compute_dq_metrics()
+ plan = self._generate_plan()
 
  return DataCleaningObservation(
+ done = done,
+ reward = reward,
+ data_preview = preview,
+ data_shape = list(self._df.shape),
+ missing_counts = missing,
+ duplicate_count = dupes,
+ dtype_issues = dtype_issues,
+ task_description = mod.DESCRIPTION,
+ message = message,
+ step_count = self._step_count,
+ current_score = self._last_score,
+ dq_metrics = dq_metrics,
+ tried_operations = list(self._tried_operations),
+ plan = plan,
  )
 
  def _detect_dtype_issues(self) -> Dict[str, str]:
 
  return self._drop_outliers(col)
  elif op == "fix_dtype":
  return self._fix_dtype(col, p)
+ elif op == "align_schema":
+ return self._align_schema()
+ elif op == "merge_sources":
+ return self._merge_sources()
  else:
+ return (
+ f"Unknown operation '{op}'. Choose from: fill_missing, "
+ "drop_duplicates, fix_format, replace_value, drop_outliers, "
+ "fix_dtype, align_schema, merge_sources.",
+ False,
+ )
  except Exception as exc:
  return f"Operation failed: {exc}", False
 
  def _drop_duplicates(self) -> Tuple[str, bool]:
  n_before = len(self._df)
  self._df = self._df.drop_duplicates().reset_index(drop=True)
+ removed = n_before - len(self._df)
  if removed == 0:
  return "No duplicate rows found.", False
  return f"Dropped {removed} duplicate rows.", True
 
  def _fix_format(self, col) -> Tuple[str, bool]:
  if col is None or col not in self._df.columns:
  return f"Column '{col}' not found.", False
  if col == "phone":
  return self._fix_phone(col)
  elif col in ("listed_date", "signup_date"):
 
  return pd.to_datetime(s, format=fmt).strftime("%Y-%m-%d")
  except Exception:
  pass
  try:
  return pd.to_datetime(s).strftime("%Y-%m-%d")
  except Exception:
  after = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
  fixed = int(before - after)
  if fixed == 0:
+ return "No country capitalisation issues found.", False
  return f"Fixed {fixed} country values to correct capitalisation.", True
 
  def _replace_value(self, col, p) -> Tuple[str, bool]:
 
  return f"No outliers found in '{col}'.", False
  return f"Removed {removed} outlier rows from '{col}' using IQR method.", True
 
+ def _align_schema(self) -> Tuple[str, bool]:
+ """Rename Source A columns to canonical target schema (Task 4 only)."""
+ if self._task_id != 4:
+ return "align_schema is only available in Task 4.", False
+ if self._schema_aligned:
+ return "Schema already aligned.", False
+
+ from server.tasks.task4_merge import SOURCE_A_RENAME, TARGET_COLUMNS
+ missing_src = [c for c in SOURCE_A_RENAME if c not in self._df.columns]
+ if missing_src:
+ return f"Expected Source A columns not found: {missing_src}.", False
+
+ self._df = self._df.rename(columns=SOURCE_A_RENAME)
+ self._schema_aligned = True
+ renamed = list(SOURCE_A_RENAME.keys())
+ return (
+ f"Aligned Source A schema: renamed {len(SOURCE_A_RENAME)} columns "
+ f"({', '.join(renamed)}) to canonical target schema.", True
+ )
+
+ def _merge_sources(self) -> Tuple[str, bool]:
+ """Concatenate aligned Source A with Source B (Task 4 only)."""
+ if self._task_id != 4:
+ return "merge_sources is only available in Task 4.", False
+ if self._sources_merged:
+ return "Sources already merged.", False
+ if not self._schema_aligned:
+ return "Run align_schema before merge_sources.", False
+ if self._source_b is None:
+ return "Source B not available.", False
+
+ from server.tasks.task4_merge import TARGET_COLUMNS, _META_TEMPLATE
+ n_a = len(self._df)
+ n_b = len(self._source_b)
+
+ # Rename source_b columns to canonical schema
+ source_b_rename = {
+ "age_years": "age",
+ "spend": "purchase_amount",
+ "country_name": "country",
+ "registration_date": "signup_date",
+ }
+ source_b_aligned = self._source_b.rename(columns=source_b_rename)
+
+ # Concatenate both aligned sources
+ merged = pd.concat(
+ [self._df[TARGET_COLUMNS], source_b_aligned[TARGET_COLUMNS]],
+ ignore_index=True
+ ).reset_index(drop=True)
+
+ # Inject pre-computed dirty issues so grader baseline is correct
+ dirty_merged = _META_TEMPLATE["dirty_merged"].copy()
+ self._df = dirty_merged
+ self._sources_merged = True
+ self._source_b = None
+
+ return (
+ f"Merged Source A ({n_a} rows) + Source B ({n_b} rows) → "
+ f"{len(self._df)} rows with canonical schema. "
+ f"Dataset now has dirty issues to clean: missing values, "
+ f"mixed country case, mixed date formats, duplicate rows.", True
+ )
+
  def _fix_dtype(self, col, p) -> Tuple[str, bool]:
  if col is None or col not in self._df.columns:
  return f"Column '{col}' not found.", False
 
  return f"Unknown dtype '{dtype}'.", False
  return f"Converted '{col}' to {dtype}.", True
  except Exception as exc:
+ return f"dtype conversion failed: {exc}", False
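The clamping in `_compute_score` above keeps every reward strictly inside (0, 1), so learners never see a saturated 0 or 1. A standalone sketch of just that logic:

```python
EPS = 1e-4

def clamp_score(raw: float) -> float:
    # Squeeze the raw grader output into the open interval (0, 1),
    # then round to 4 decimals as the environment does.
    raw = float(raw)
    if raw >= 1.0:
        raw = 1.0 - EPS
    elif raw <= 0.0:
        raw = EPS
    return round(raw, 4)

print(clamp_score(1.0))  # → 0.9999
```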
server/tasks/task1_missing.py CHANGED
@@ -19,12 +19,14 @@ DESCRIPTION = (
  "Example action: {\"operation\": \"fill_missing\", \"column\": \"age\", \"params\": {\"strategy\": \"median\"}}"
  )
 
+ # Cache at module load — seed=42 makes output identical every time
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task1_datasets()
+ _ORIGINAL_NULLS = int(_DIRTY_TEMPLATE.isnull().sum().sum())
+
 
  def load():
- """Return (dirty_df, clean_df, original_null_count)."""
- dirty, clean = generate_task1_datasets()
- original_nulls = int(dirty.isnull().sum().sum())
- return dirty.copy(), clean, original_nulls
+ """Return (dirty_df, clean_df, original_null_count) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, _ORIGINAL_NULLS
 
 
  def score(current_df, original_nulls: int) -> float:
@@ -36,4 +38,4 @@ def score(current_df, original_nulls: int) -> float:
 
 
  def count_errors(current_df) -> int:
- return int(current_df.isnull().sum().sum())
+ return int(current_df.isnull().sum().sum())
server/tasks/task2_format.py CHANGED
@@ -29,23 +29,24 @@ PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
 
 
- def load():
- dirty, clean = generate_task2_datasets()
- original_phone_issues = int((~dirty["phone"].str.match(PHONE_RE)).sum())
- original_date_issues = int((~dirty["listed_date"].apply(
- lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
- )).sum())
- original_dupes = len(dirty) - len(dirty.drop_duplicates())
- meta = {
- "orig_phone": original_phone_issues,
- "orig_date": original_date_issues,
- "orig_dupes": original_dupes,
- }
- return dirty.copy(), clean, meta
+ # Cache at module load — seed=42 makes output identical every time
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task2_datasets()
+ _META_TEMPLATE = {
+ "orig_phone": int((~_DIRTY_TEMPLATE["phone"].str.match(PHONE_RE, na=False)).sum()),
+ "orig_date": int((~_DIRTY_TEMPLATE["listed_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()),
+ "orig_dupes": len(_DIRTY_TEMPLATE) - len(_DIRTY_TEMPLATE.drop_duplicates()),
+ }
+
+
+ def load():
+ """Return (dirty_df, clean_df, meta) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, dict(_META_TEMPLATE)
 
 
  def score(current_df, meta: dict) -> float:
- phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+ phone_issues = int((~current_df["phone"].str.match(PHONE_RE, na=False)).sum())
  date_issues = int((~current_df["listed_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
@@ -60,9 +61,9 @@ def score(current_df, meta: dict) -> float:
 
 
  def count_errors(current_df, meta: dict) -> int:
- phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+ phone_issues = int((~current_df["phone"].str.match(PHONE_RE, na=False)).sum())
  date_issues = int((~current_df["listed_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
  dupes = len(current_df) - len(current_df.drop_duplicates())
- return phone_issues + date_issues + dupes
+ return phone_issues + date_issues + dupes
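The `na=False` added to `str.match` above matters whenever the phone column contains NaN: without it, missing entries propagate as NaN and the mask is no longer a clean Boolean series. A small sketch of the corrected behaviour:

```python
import pandas as pd

PHONE_RE = r"^\d{3}-\d{3}-\d{4}$"
s = pd.Series(["555-123-4567", "5551234567", None])

# na=False treats missing entries as non-matching, so the mask stays
# bool-dtyped and NaN rows count as malformed in the ~mask sum.
matched = s.str.match(PHONE_RE, na=False)
bad = int((~matched).sum())
print(bad)  # → 2
```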
server/tasks/task3_pipeline.py CHANGED
@@ -38,32 +38,35 @@ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
  VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
 
 
- def load():
- dirty, clean = generate_task3_datasets()
+ # Cache at module load — seed=42 makes output identical every time
+ def _build_meta(dirty):
  orig_nulls = int(dirty.isnull().sum().sum())
  orig_dupes = len(dirty) - len(dirty.drop_duplicates())
-
- # Outlier baseline: count rows where purchase_amount > Q3 + 3*IQR
  pa = dirty["purchase_amount"].dropna()
  q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
  iqr = q3 - q1
  orig_outliers = int((pa > q3 + 3 * iqr).sum())
-
  orig_country_issues = int((~dirty["country"].isin(VALID_COUNTRIES) &
  dirty["country"].notna()).sum())
  orig_date_issues = int((~dirty["signup_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
-
- meta = {
+ return {
  "orig_nulls": orig_nulls,
  "orig_dupes": orig_dupes,
  "orig_outliers": max(orig_outliers, 1),
  "orig_country_issues": max(orig_country_issues, 1),
  "orig_date_issues": max(orig_date_issues, 1),
  "q1": q1, "q3": q3, "iqr": iqr,
  }
- return dirty.copy(), clean, meta
+
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task3_datasets()
+ _META_TEMPLATE = _build_meta(_DIRTY_TEMPLATE)
+
+
+ def load():
+ """Return (dirty_df, clean_df, meta) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, dict(_META_TEMPLATE)
 
 
  def score(current_df, meta: dict) -> float:
@@ -101,4 +104,4 @@ def count_errors(current_df, meta: dict) -> int:
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
  return remaining_nulls + remaining_dupes + remaining_outliers + \
- remaining_country + remaining_dates
+ remaining_country + remaining_dates
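The outlier baseline in `_build_meta` uses a conservative fence of Q3 + 3×IQR, so only genuinely extreme values count. On a toy series (values invented for illustration) it behaves like this:

```python
import pandas as pd

pa = pd.Series([10.0, 11.0, 12.0, 12.5, 13.0, 500.0])
q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
iqr = q3 - q1
# Only values beyond Q3 + 3*IQR are flagged as extreme outliers.
outliers = int((pa > q3 + 3 * iqr).sum())
print(outliers)  # → 1
```

Here q1 = 11.25 and q3 = 12.875 (pandas' default linear interpolation), so the fence sits at 17.75 and only the 500.0 row is flagged.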
server/tasks/task4_merge.py ADDED
@@ -0,0 +1,231 @@
+ """
+ Task 4 — Expert: Multi-Source Schema Alignment + Merge Pipeline
+
+ Two independent data sources (CRM + Marketing) have been exported with
+ misaligned column names and must be aligned to a canonical schema,
+ merged into one DataFrame, and then cleaned.
+
+ Grader sub-scores (weighted):
+ 0.30 × schema_score — correct columns present after align + merge
+ 0.25 × null_score — missing values filled
+ 0.20 × country_score — country capitalisation fixed
+ 0.15 × date_score — signup_date format standardised
+ 0.10 × dupe_score — duplicate rows removed
+
+ Inspired by:
+ - CleanAgent (Qi & Wang, 2024) — declarative schema standardisation
+ - Meta DataSchema system — column-level semantic annotation at scale
+ """
+
+ import re
+ import pandas as pd
+ from server.data_generator import generate_task4_datasets
+
+ TASK_ID = 4
+ MAX_STEPS = 50
+
+ DESCRIPTION = (
+ "Task 4 (Expert) — Multi-Source Schema Alignment + Merge Pipeline\n"
+ "You have TWO source DataFrames with misaligned schemas:\n\n"
+ " Source A (CRM, 150 rows) columns:\n"
+ " cust_id, full_name, Age, purchase_amt, Country, signup, email\n\n"
+ " Source B (Marketing, 100 rows) columns:\n"
+ " customer_id, name, age_years, spend, country_name, registration_date, email\n\n"
+ "Target canonical schema (250 rows after merge):\n"
+ " customer_id, name, age, purchase_amount, country, signup_date, email\n\n"
+ "Step 1 — align_schema: rename Source A columns to match target.\n"
+ "Step 2 — merge_sources: concatenate Source A + Source B.\n"
+ "Step 3 — Clean the merged dataset:\n"
+ " • fill_missing — age, purchase_amount, country (~10% nulls each)\n"
+ " • fix_format — country (mixed case), signup_date (mixed formats)\n"
+ " • drop_duplicates — ~10 duplicate rows\n\n"
+ "Available operations:\n"
+ " align_schema — no column needed; renames Source A to canonical schema\n"
+ " merge_sources — no column needed; concatenates aligned A + B\n"
+ " fill_missing — column + params.strategy\n"
+ " fix_format — column: 'country' | 'signup_date'\n"
+ " drop_duplicates — no column needed\n\n"
+ "Example actions:\n"
+ ' {"operation": "align_schema"}\n'
+ ' {"operation": "merge_sources"}\n'
+ ' {"operation": "fill_missing", "column": "age", "params": {"strategy": "median"}}\n'
+ ' {"operation": "fix_format", "column": "country"}\n'
+ ' {"operation": "fix_format", "column": "signup_date"}\n'
+ ' {"operation": "drop_duplicates"}'
+ )
+
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
+ TARGET_COLUMNS = ["customer_id", "name", "age", "purchase_amount",
+ "country", "signup_date", "email"]
+
+ # Column mapping: Source A dirty names → canonical target names
+ SOURCE_A_RENAME = {
+ "cust_id": "customer_id",
+ "full_name": "name",
+ "Age": "age",
+ "purchase_amt": "purchase_amount",
+ "Country": "country",
+ "signup": "signup_date",
+ # "email" already matches
+ }
+
+
+ # ---------------------------------------------------------------------------
+ # Cache at module load
+ # ---------------------------------------------------------------------------
+
+ def _build_meta(source_a, source_b, clean_merged):
+ import numpy as np
+
+ # Align source_a and source_b to canonical schema before merging
+ aligned_a = source_a.rename(columns=SOURCE_A_RENAME)
+ source_b_rename = {
+ "age_years": "age",
+ "spend": "purchase_amount",
+ "country_name": "country",
+ "registration_date": "signup_date",
+ }
+ aligned_b = source_b.rename(columns=source_b_rename)
+
+ merged = pd.concat(
+ [aligned_a[TARGET_COLUMNS], aligned_b[TARGET_COLUMNS]],
+ ignore_index=True
+ ).reset_index(drop=True)
+
+ # Inject dirty issues deterministically
+ rng = np.random.default_rng(42 + 4)
+
+ n = len(merged)
+ # Missing values
+ for col, frac in [("age", 0.10), ("purchase_amount", 0.10), ("country", 0.08)]:
+ idx = rng.choice(n, size=int(n * frac), replace=False)
+ merged.loc[idx, col] = None
+
+ # Mixed country case
+ case_idx = rng.choice(n, size=int(n * 0.30), replace=False)
+ merged.loc[case_idx, "country"] = merged.loc[case_idx, "country"].str.lower()
+
+ # Mixed date formats
+ date_idx = rng.choice(n, size=int(n * 0.40), replace=False)
+ for i in date_idx:
+ val = merged.loc[i, "signup_date"]
+ if pd.notna(val):
+ try:
+ dt = pd.to_datetime(str(val))
+ fmt = rng.integers(0, 3)
+ if fmt == 1:
+ merged.loc[i, "signup_date"] = dt.strftime("%b %d %Y")
+ elif fmt == 2:
+ merged.loc[i, "signup_date"] = dt.strftime("%d/%m/%Y")
+ except Exception:
+ pass
+
+ # Duplicates
+ dup_idx = rng.choice(n, size=10, replace=False)
+ dup_rows = merged.iloc[dup_idx].copy()
+ merged = pd.concat([merged, dup_rows], ignore_index=True)
+
+ orig_nulls = int(merged.isnull().sum().sum())
+ orig_dupes = len(merged) - len(merged.drop_duplicates())
+ orig_country_issues = int(
+ (~merged["country"].isin(VALID_COUNTRIES) & merged["country"].notna()).sum()
+ )
+ orig_date_issues = int(
+ (~merged["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+
+ return {
+ "orig_nulls": max(orig_nulls, 1),
+ "orig_dupes": max(orig_dupes, 1),
+ "orig_country_issues": max(orig_country_issues, 1),
+ "orig_date_issues": max(orig_date_issues, 1),
+ "dirty_merged": merged, # stored for environment to use post-merge
+ }
+
+
+ _SOURCE_A, _SOURCE_B, _CLEAN_MERGED = generate_task4_datasets()
+ _META_TEMPLATE = _build_meta(_SOURCE_A, _SOURCE_B, _CLEAN_MERGED)
+
+
+ def load():
+ """
+ Returns (source_a, source_b, clean_merged, meta).
+ source_a is the initial active DataFrame (pre-alignment).
+ source_b is held separately until merge_sources is called.
+ """
+ meta = {k: v for k, v in _META_TEMPLATE.items() if k != "dirty_merged"}
+ meta["dirty_merged"] = _META_TEMPLATE["dirty_merged"].copy()
+ return _SOURCE_A.copy(), _SOURCE_B.copy(), _CLEAN_MERGED.copy(), meta
+
+
+ # ---------------------------------------------------------------------------
+ # Grader
+ # ---------------------------------------------------------------------------
+
+ def score(current_df, meta: dict) -> float:
+ """
+ Weighted score across 5 sub-dimensions:
+ 0.30 schema_score — all target columns present, no extra columns
+ 0.25 null_score — missing values filled
+ 0.20 country_score — country capitalisation correct
+ 0.15 date_score — signup_date in YYYY-MM-DD
+ 0.10 dupe_score — no duplicate rows
+ """
+ # Schema score: are all target columns present?
+ present = sum(1 for c in TARGET_COLUMNS if c in current_df.columns)
+ schema_score = present / len(TARGET_COLUMNS)
+
+ # Can only score the rest if schema is aligned AND merged
+ if not all(c in current_df.columns for c in TARGET_COLUMNS):
+ # Partial credit: schema only
+ return round(max(0.01, min(0.99, 0.30 * schema_score)), 4)
+
+ remaining_nulls = int(current_df.isnull().sum().sum())
+ remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
+ remaining_country = int(
+ (~current_df["country"].isin(VALID_COUNTRIES) & current_df["country"].notna()).sum()
+ )
+ remaining_dates = int(
+ (~current_df["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+
+ null_score = 1.0 - remaining_nulls / meta["orig_nulls"]
+ dupe_score = 1.0 - remaining_dupes / meta["orig_dupes"]
+ country_score = 1.0 - remaining_country / meta["orig_country_issues"]
+ date_score = 1.0 - remaining_dates / meta["orig_date_issues"]
+
+ combined = (0.30 * schema_score +
+ 0.25 * null_score +
+ 0.20 * country_score +
+ 0.15 * date_score +
+ 0.10 * dupe_score)
+
+ return round(max(0.01, min(0.99, combined)), 4)
+
+
+ def count_errors(current_df, meta: dict) -> int:
+ errors = 0
+ missing_cols = sum(1 for c in TARGET_COLUMNS if c not in current_df.columns)
+ errors += missing_cols * 10 # heavy penalty for schema misalignment
+
+ if all(c in current_df.columns for c in TARGET_COLUMNS):
+ errors += int(current_df.isnull().sum().sum())
+ errors += len(current_df) - len(current_df.drop_duplicates())
+ errors += int(
+ (~current_df["country"].isin(VALID_COUNTRIES) & current_df["country"].notna()).sum()
+ )
+ errors += int(
+ (~current_df["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+ return errors
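To make the task-4 grader concrete, here is its weighted combination evaluated on a hypothetical mid-episode state (schema fully aligned and merged, half the nulls filled, nothing else cleaned yet); the sub-score values below are invented for illustration:

```python
WEIGHTS = {"schema": 0.30, "null": 0.25, "country": 0.20, "date": 0.15, "dupe": 0.10}
subs = {"schema": 1.0, "null": 0.5, "country": 0.0, "date": 0.0, "dupe": 0.0}

combined = sum(WEIGHTS[k] * subs[k] for k in WEIGHTS)
# Same clamp as score(): keep the result inside [0.01, 0.99].
final = round(max(0.01, min(0.99, combined)), 4)
print(final)  # → 0.425
```

Because the weights sum to 1.0, a perfectly cleaned merge would reach the 0.99 ceiling after clamping, while a pre-merge frame can earn at most 0.30 from the schema term.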
server/ui.html ADDED
@@ -0,0 +1,1237 @@
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>DataMedic - AI Data Cleaning Monitor</title>
    <style>
        :root {
            --bg: #050d1a;
            --bg2: #0a1628;
            --bg3: #0f1f38;
            --border: #1a3050;
            --green: #00e5a0;
            --green-dim: #00704e;
            --amber: #f5a623;
            --red: #ff4d6d;
            --blue: #4db8ff;
            --text: #c8dff5;
            --text-dim: #4a6a8a;
            --mono: 'Courier New', Courier, monospace;
            --sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
        }

        * {
            box-sizing: border-box;
            margin: 0;
            padding: 0;
        }

        body {
            background: var(--bg);
            color: var(--text);
            font-family: var(--sans);
            min-height: 100vh;
            overflow-x: hidden;
        }

        body::before {
            content: '';
            position: fixed;
            inset: 0;
            background: repeating-linear-gradient(0deg, transparent, transparent 2px,
                    rgba(0, 0, 0, 0.06) 2px, rgba(0, 0, 0, 0.06) 4px);
            pointer-events: none;
            z-index: 999;
        }

        /* ── Header ── */
        header {
            display: flex;
            align-items: center;
            justify-content: space-between;
            padding: 14px 28px;
            border-bottom: 1px solid var(--border);
            background: var(--bg2);
            position: sticky;
            top: 0;
            z-index: 100;
        }

        .logo {
            display: flex;
            align-items: center;
            gap: 12px;
        }

        .logo-pulse {
            width: 10px;
            height: 10px;
            background: var(--green);
            border-radius: 50%;
            box-shadow: 0 0 10px var(--green);
            animation: pulse 2s infinite;
            flex-shrink: 0;
        }

        @keyframes pulse {

            0%,
            100% {
                opacity: 1;
                transform: scale(1);
            }

            50% {
                opacity: 0.3;
                transform: scale(0.7);
            }
        }

        .logo-text {
            font-family: var(--mono);
            font-size: 17px;
            font-weight: 700;
            letter-spacing: 3px;
            color: var(--green);
        }

        .logo-sub {
            font-size: 10px;
            color: var(--text-dim);
            letter-spacing: 1px;
            text-transform: uppercase;
            margin-top: 2px;
        }

        .status-pill {
            font-family: var(--mono);
            font-size: 11px;
            padding: 4px 14px;
            border-radius: 20px;
            border: 1px solid;
            letter-spacing: 1px;
            text-transform: uppercase;
        }

        .status-pill.idle {
            color: var(--text-dim);
            border-color: var(--text-dim);
        }

        .status-pill.running {
            color: var(--green);
            border-color: var(--green);
            box-shadow: 0 0 8px rgba(0, 229, 160, 0.3);
            animation: pulse 1s infinite;
        }

        .status-pill.done {
            color: var(--blue);
            border-color: var(--blue);
        }

        /* ── Controls ── */
        .controls {
            padding: 16px 28px;
            display: flex;
            align-items: center;
            gap: 12px;
            border-bottom: 1px solid var(--border);
            flex-wrap: wrap;
            background: var(--bg2);
        }

        .ctrl-label {
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            white-space: nowrap;
        }

        .task-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 16px;
            border-radius: 4px;
            border: 1px solid var(--border);
            background: var(--bg3);
            color: var(--text-dim);
            cursor: pointer;
            transition: all 0.2s;
            letter-spacing: 1px;
        }

        .task-btn:hover {
            border-color: var(--green);
            color: var(--green);
        }

        .task-btn.active {
            border-color: var(--green);
            color: var(--green);
            background: rgba(0, 229, 160, 0.08);
        }

        .sep {
            width: 1px;
            height: 24px;
            background: var(--border);
            margin: 0 4px;
        }

        .reset-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 16px;
            border-radius: 4px;
            border: 1px solid var(--amber);
            background: transparent;
            color: var(--amber);
            cursor: pointer;
            letter-spacing: 1px;
            transition: all 0.2s;
        }

        .reset-btn:hover {
            background: rgba(245, 166, 35, 0.1);
        }

        .reset-btn:disabled {
            opacity: 0.4;
            cursor: not-allowed;
        }

        .run-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 20px;
            border-radius: 4px;
            border: none;
            background: var(--green);
            color: #050d1a;
            cursor: pointer;
            font-weight: 700;
            letter-spacing: 1px;
            transition: all 0.2s;
            margin-left: auto;
        }

        .run-btn:hover {
            background: #00ffb3;
            box-shadow: 0 0 16px rgba(0, 229, 160, 0.4);
        }

        .run-btn:disabled {
            background: var(--green-dim);
            cursor: not-allowed;
            opacity: 0.5;
        }

        .run-hint {
            font-size: 10px;
            color: var(--text-dim);
            font-family: var(--mono);
            white-space: nowrap;
        }

        /* ── Main grid ── */
        .main {
            display: grid;
            grid-template-columns: 320px 1fr;
            min-height: calc(100vh - 118px);
        }

        /* ── Vitals panel ── */
        .vitals-panel {
            border-right: 1px solid var(--border);
            padding: 20px;
            display: flex;
            flex-direction: column;
            gap: 18px;
            overflow-y: auto;
        }

        .panel-title {
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 2px;
            padding-bottom: 10px;
            border-bottom: 1px solid var(--border);
        }

        /* Score ring */
        .score-ring-wrap {
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 6px;
            padding: 8px 0;
        }

        .ring-container {
            position: relative;
            width: 130px;
            height: 130px;
        }

        .ring-container svg {
            transform: rotate(-90deg);
            width: 130px;
            height: 130px;
        }

        .ring-bg {
            fill: none;
            stroke: var(--bg3);
            stroke-width: 10;
        }

        .ring-fill {
            fill: none;
            stroke: var(--green);
            stroke-width: 10;
            stroke-linecap: round;
            stroke-dasharray: 326.73;
            stroke-dashoffset: 326.73;
            transition: stroke-dashoffset 0.7s cubic-bezier(0.4, 0, 0.2, 1), stroke 0.4s;
            filter: drop-shadow(0 0 5px var(--green));
        }

        .ring-text {
            position: absolute;
            inset: 0;
            display: flex;
            flex-direction: column;
            align-items: center;
            justify-content: center;
            font-family: var(--mono);
        }

        .ring-score {
            font-size: 28px;
            font-weight: 700;
            color: var(--green);
            line-height: 1;
        }

        .ring-label {
            font-size: 9px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            margin-top: 4px;
        }

        /* Vital grid */
        .vital-grid {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 8px;
        }

        .vital-card {
            background: var(--bg2);
            border: 1px solid var(--border);
            border-radius: 5px;
            padding: 10px;
        }

        .vital-name {
            font-size: 9px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            font-family: var(--mono);
            margin-bottom: 5px;
        }

        .vital-value {
            font-family: var(--mono);
            font-size: 20px;
            font-weight: 700;
            line-height: 1;
        }

        .vital-value.green {
            color: var(--green);
        }

        .vital-value.amber {
            color: var(--amber);
        }

        .vital-value.red {
            color: var(--red);
        }

        .vital-value.blue {
            color: var(--blue);
        }

        .vital-sub {
            font-size: 9px;
            color: var(--text-dim);
            margin-top: 3px;
            font-family: var(--mono);
        }

        /* DQ bars */
        .dq-bars {
            display: flex;
            flex-direction: column;
            gap: 10px;
        }

        .dq-row {
            display: flex;
            flex-direction: column;
            gap: 4px;
        }

        .dq-header {
            display: flex;
            justify-content: space-between;
            font-family: var(--mono);
            font-size: 10px;
        }

        .dq-name {
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
        }

        .dq-val {
            font-weight: 700;
        }

        .dq-bar-bg {
            height: 4px;
            background: var(--bg3);
            border-radius: 2px;
            overflow: hidden;
        }

        .dq-bar-fill {
            height: 100%;
            border-radius: 2px;
            transition: width 0.5s cubic-bezier(0.4, 0, 0.2, 1);
        }

        /* ── Content area ── */
        .content-area {
            display: flex;
            flex-direction: column;
            overflow: hidden;
        }

        /* Chart */
        .chart-section {
            padding: 20px 28px;
            border-bottom: 1px solid var(--border);
        }

        .chart-wrap {
            margin-top: 14px;
            height: 90px;
            position: relative;
        }

        #score-chart {
            width: 100%;
            height: 100%;
        }

        /* Plan */
        .plan-section {
            padding: 16px 28px;
            border-bottom: 1px solid var(--border);
        }

        .plan-items {
            margin-top: 10px;
            display: flex;
            flex-direction: column;
            gap: 6px;
        }

        .plan-item {
            display: flex;
            align-items: flex-start;
            gap: 10px;
            font-size: 12px;
            animation: fadeIn 0.3s ease;
        }

        .plan-num {
            font-family: var(--mono);
            font-size: 9px;
            width: 18px;
            height: 18px;
            border: 1px solid var(--amber);
            border-radius: 50%;
            display: flex;
            align-items: center;
            justify-content: center;
            flex-shrink: 0;
            color: var(--amber);
            margin-top: 1px;
        }

        @keyframes fadeIn {
            from {
                opacity: 0;
                transform: translateY(6px);
            }

            to {
                opacity: 1;
                transform: translateY(0);
            }
        }

        /* Thought stream */
        .thought-section {
            padding: 16px 28px;
            border-bottom: 1px solid var(--border);
            flex: 1;
        }

        .thought-stream {
            margin-top: 10px;
            display: flex;
            flex-direction: column;
            gap: 7px;
            max-height: 200px;
            overflow-y: auto;
        }

        .thought-stream::-webkit-scrollbar {
            width: 3px;
        }

        .thought-stream::-webkit-scrollbar-thumb {
            background: var(--border);
            border-radius: 2px;
        }

        .thought-item {
            display: flex;
            gap: 10px;
            align-items: flex-start;
            animation: fadeIn 0.3s ease;
        }

        .thought-step {
            font-family: var(--mono);
            font-size: 9px;
            color: var(--text-dim);
            padding: 2px 5px;
            border: 1px solid var(--border);
            border-radius: 3px;
            white-space: nowrap;
            margin-top: 1px;
            flex-shrink: 0;
        }

        .thought-body {
            flex: 1;
            min-width: 0;
        }

        .thought-action {
            font-family: var(--mono);
            font-size: 11px;
            color: var(--blue);
            margin-bottom: 2px;
            word-break: break-all;
        }

        .thought-result {
            font-size: 11px;
            color: var(--text-dim);
        }

        .thought-reward {
            font-family: var(--mono);
            font-size: 10px;
            padding: 2px 7px;
            border-radius: 3px;
            margin-top: 2px;
            display: inline-block;
        }

        .reward-pos {
            background: rgba(0, 229, 160, 0.12);
            color: var(--green);
        }

        .reward-neg {
            background: rgba(255, 77, 109, 0.12);
            color: var(--red);
        }

        /* Data table */
        .preview-section {
            padding: 16px 28px 20px;
        }

        .data-table-wrap {
            margin-top: 10px;
            overflow-x: auto;
            border: 1px solid var(--border);
            border-radius: 5px;
            max-height: 220px;
            overflow-y: auto;
        }

        .data-table {
            width: 100%;
            border-collapse: collapse;
            font-family: var(--mono);
            font-size: 11px;
        }

        .data-table th {
            background: var(--bg3);
            color: var(--text-dim);
            padding: 7px 10px;
            text-align: left;
            text-transform: uppercase;
            letter-spacing: 1px;
            border-bottom: 1px solid var(--border);
            white-space: nowrap;
            position: sticky;
            top: 0;
        }

        .data-table td {
            padding: 5px 10px;
            border-bottom: 1px solid rgba(26, 48, 80, 0.4);
            color: var(--text);
            white-space: nowrap;
        }

        .data-table tr:last-child td {
            border-bottom: none;
        }

        .data-table tr:hover td {
            background: rgba(255, 255, 255, 0.02);
        }

        .cell-null {
            color: var(--red);
            font-style: italic;
        }

        /* Empty state */
        .empty-state {
            display: flex;
            flex-direction: column;
            align-items: center;
            justify-content: center;
            padding: 40px 24px;
            gap: 10px;
            color: var(--text-dim);
            text-align: center;
        }

        .empty-icon {
            font-size: 36px;
            opacity: 0.25;
        }

        .empty-title {
            font-family: var(--mono);
            font-size: 12px;
            letter-spacing: 2px;
            text-transform: uppercase;
        }

        .empty-sub {
            font-size: 12px;
            max-width: 280px;
            line-height: 1.6;
        }

        /* Bottom bar */
        .bottom-bar {
            padding: 10px 28px;
            border-top: 1px solid var(--border);
            background: var(--bg2);
            display: flex;
            align-items: center;
            gap: 20px;
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            grid-column: 1 / -1;
            flex-wrap: wrap;
        }

        .bottom-stat {
            display: flex;
            gap: 6px;
        }

        .bottom-stat span:last-child {
            color: var(--text);
        }

        .dl-btn {
            margin-left: auto;
            font-family: var(--mono);
            font-size: 10px;
            padding: 5px 14px;
            border-radius: 4px;
            border: 1px solid var(--green-dim);
            background: transparent;
            color: var(--green);
            cursor: pointer;
            letter-spacing: 1px;
            transition: all 0.2s;
        }

        .dl-btn:hover {
            border-color: var(--green);
            box-shadow: 0 0 10px rgba(0, 229, 160, 0.2);
        }

        .dl-btn:disabled {
            opacity: 0.3;
            cursor: not-allowed;
        }

        ::-webkit-scrollbar {
            width: 5px;
            height: 5px;
        }

        ::-webkit-scrollbar-track {
            background: var(--bg);
        }

        ::-webkit-scrollbar-thumb {
            background: var(--border);
            border-radius: 3px;
        }
    </style>
</head>

<body>

    <!-- Header -->
    <header>
        <div class="logo">
            <div class="logo-pulse" id="logo-pulse"></div>
            <div>
                <div class="logo-text">DATAMEDIC</div>
                <div class="logo-sub">AI Data Quality Monitor · OpenEnv</div>
            </div>
        </div>
        <span class="status-pill idle" id="status-pill">IDLE</span>
    </header>

    <!-- Controls -->
    <div class="controls">
        <span class="ctrl-label">Select Task:</span>
        <button class="task-btn active" data-task="1" onclick="selectTask(1)">TASK 1 · Easy</button>
        <button class="task-btn" data-task="2" onclick="selectTask(2)">TASK 2 · Medium</button>
        <button class="task-btn" data-task="3" onclick="selectTask(3)">TASK 3 · Hard</button>
        <button class="task-btn" data-task="4" onclick="selectTask(4)">TASK 4 · Expert</button>
        <div class="sep"></div>
        <button class="reset-btn" id="reset-btn" onclick="resetEnv()">RESET EPISODE</button>
        <button class="run-btn" id="run-btn" onclick="runAgent()">RUN DEMO AGENT</button>
        <span class="run-hint">rule-based · follows plan field</span>
    </div>

    <!-- Main -->
    <div class="main">

        <!-- LEFT: Vitals -->
        <div class="vitals-panel">
            <div class="panel-title">Patient Vitals</div>

            <div class="score-ring-wrap">
                <div class="ring-container">
                    <svg viewBox="0 0 130 130">
                        <circle class="ring-bg" cx="65" cy="65" r="52" />
                        <circle class="ring-fill" cx="65" cy="65" r="52" id="ring-fill" />
                    </svg>
                    <div class="ring-text">
                        <div class="ring-score" id="ring-score">--</div>
                        <div class="ring-label">Health Score</div>
                    </div>
                </div>
            </div>

            <div class="vital-grid">
                <div class="vital-card">
                    <div class="vital-name">Step</div>
                    <div class="vital-value blue" id="v-step">--</div>
                    <div class="vital-sub" id="v-maxstep">of --</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Reward</div>
                    <div class="vital-value green" id="v-reward">--</div>
                    <div class="vital-sub">last delta</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Nulls</div>
                    <div class="vital-value amber" id="v-nulls">--</div>
                    <div class="vital-sub">missing cells</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Dupes</div>
                    <div class="vital-value amber" id="v-dupes">--</div>
                    <div class="vital-sub">duplicate rows</div>
                </div>
            </div>

            <div class="panel-title">DQ Dimensions</div>
            <div class="dq-bars">
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Completeness</span>
                        <span class="dq-val" id="dq-completeness" style="color:var(--green)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-completeness" style="width:0%;background:var(--green)"></div>
                    </div>
                </div>
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Uniqueness</span>
                        <span class="dq-val" id="dq-uniqueness" style="color:var(--blue)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-uniqueness" style="width:0%;background:var(--blue)"></div>
                    </div>
                </div>
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Validity</span>
                        <span class="dq-val" id="dq-validity" style="color:var(--amber)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-validity" style="width:0%;background:var(--amber)"></div>
                    </div>
                </div>
            </div>
        </div>

        <!-- RIGHT: Content -->
        <div class="content-area">

            <div class="chart-section">
                <div class="panel-title">Health Score Trajectory</div>
                <div class="chart-wrap">
                    <svg id="score-chart" preserveAspectRatio="none">
                        <defs>
                            <linearGradient id="chartGrad" x1="0" y1="0" x2="0" y2="1">
                                <stop offset="0%" stop-color="#00e5a0" stop-opacity="0.25" />
                                <stop offset="100%" stop-color="#00e5a0" stop-opacity="0" />
                            </linearGradient>
                        </defs>
                        <path id="chart-area" fill="url(#chartGrad)" d="" />
                        <path id="chart-line" fill="none" stroke="#00e5a0" stroke-width="2" stroke-linecap="round"
                            stroke-linejoin="round" d="" style="filter:drop-shadow(0 0 3px #00e5a0)" />
                        <text x="50%" y="50%" text-anchor="middle" dominant-baseline="middle" fill="#4a6a8a"
                            font-size="11" id="chart-empty-msg" font-family="Courier New, monospace">
                            Run demo agent to see score trajectory
                        </text>
                    </svg>
                </div>
            </div>

            <div class="plan-section">
                <div class="panel-title">Agent Treatment Plan &nbsp;<span style="color:var(--amber);font-size:9px">(next
                        recommended actions)</span></div>
                <div class="plan-items" id="plan-items">
                    <div style="color:var(--text-dim);font-size:11px;font-family:var(--mono);padding:4px 0">
                        Awaiting diagnosis...
                    </div>
                </div>
            </div>

            <div class="thought-section">
                <div class="panel-title">Agent Operation Log &nbsp;<span
                        style="color:var(--text-dim);font-size:9px">(actions taken + results)</span></div>
                <div class="thought-stream" id="thought-stream">
                    <div class="empty-state" style="padding:16px">
                        <div class="empty-sub">Actions will appear here as the demo agent runs</div>
                    </div>
                </div>
            </div>

            <div class="preview-section">
                <div class="panel-title">Dataset Preview &nbsp;<span style="color:var(--text-dim);font-size:9px">(first
                        10 rows · NULL shown in red)</span></div>
                <div class="data-table-wrap" id="table-wrap">
                    <div class="empty-state">
                        <div class="empty-icon">[?]</div>
                        <div class="empty-title">No Dataset Loaded</div>
                        <div class="empty-sub">Select a task — dataset loads automatically</div>
                    </div>
                </div>
            </div>

        </div>

        <!-- Bottom bar -->
        <div class="bottom-bar">
            <div class="bottom-stat"><span>Episode:</span><span id="b-episode">--</span></div>
            <div class="bottom-stat"><span>Task:</span><span id="b-task">--</span></div>
            <div class="bottom-stat"><span>Errors Left:</span><span id="b-errors">--</span></div>
            <div class="bottom-stat"><span>Shape:</span><span id="b-shape">--</span></div>
            <button class="dl-btn" id="dl-btn" disabled onclick="downloadCSV()">EXPORT CSV</button>
        </div>

    </div>

    <script>
        const BASE = '';
        let selectedTask = 1;
        let scores = [];
        let isRunning = false;

        const TASK_LABELS = {
            1: 'Task 1 - Fill Missing Values',
            2: 'Task 2 - Fix Formats + Duplicates',
            3: 'Task 3 - Full Pipeline',
            4: 'Task 4 - Multi-Source Merge'
        };

        // ── Task selection: switch + auto-reset ──────────────────────────
        function selectTask(n) {
            if (isRunning) return;
            selectedTask = n;
            document.querySelectorAll('.task-btn').forEach(b => b.classList.remove('active'));
            document.querySelector('[data-task="' + n + '"]').classList.add('active');
            resetEnv(); // auto-reset when the task changes
        }

        // ── Reset ────────────────────────────────────────────────────────
        async function resetEnv() {
            if (isRunning) return;
            setButtons(false);

            // Immediately update task label and dim ring while loading
            document.getElementById('b-task').textContent = TASK_LABELS[selectedTask] || 'Task ' + selectedTask;
            document.getElementById('ring-score').textContent = '...';
            document.getElementById('ring-fill').style.strokeDashoffset = 326.73;

            try {
                const r = await fetch(BASE + '/reset', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ task_id: selectedTask })
                });
                if (!r.ok) throw new Error('Reset failed: ' + r.status);
                const data = await r.json();
                scores = [data.observation.current_score];
                updateUI(data.observation, null);
                clearThoughts();
                updateChart();
                addThought(0, 'Episode started - Task ' + selectedTask, data.observation.message, null);
                document.getElementById('dl-btn').disabled = false;
                setStatus('idle');
            } catch (e) {
                addThought('!', 'Error', e.message, null);
                console.error(e);
            }
            setButtons(true);
        }

        // ── Run demo agent ───────────────────────────────────────────────
        async function runAgent() {
            if (isRunning) return;
            isRunning = true;
            setButtons(false);
            setStatus('running');

            // Fresh reset first
            try {
                const initR = await fetch(BASE + '/reset', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ task_id: selectedTask })
                });
                if (!initR.ok) throw new Error('Reset failed: ' + initR.status);
                const initData = await initR.json();
                let obs = initData.observation;
                scores = [obs.current_score];
                clearThoughts();
                updateUI(obs, null);
                updateChart();
                addThought(0, 'Demo agent started', obs.message, null);

                const MAX = 50;
                let step = 0;

                while (!obs.done && step < MAX) {
                    await sleep(700);
                    const action = pickAction(obs);
                    if (!action) {
                        addThought('--', 'Agent halted', 'No more actions available from plan', null);
                        break;
                    }

                    step++;
                    const r = await fetch(BASE + '/step', {
                        method: 'POST',
                        headers: { 'Content-Type': 'application/json' },
                        body: JSON.stringify(action)
                    });
                    const data = await r.json();
                    obs = data.observation;
                    scores.push(obs.current_score);

                    updateUI(obs, data.reward);
                    updateChart();
                    addThought(step, JSON.stringify(action), obs.message, data.reward);

                    // Keep the newest log entry in view
                    const ts = document.getElementById('thought-stream');
                    ts.scrollTop = ts.scrollHeight;
                }

                const done = obs.current_score >= 0.95;
                setStatus(done ? 'done' : 'idle');
                if (done) {
                    addThought('OK', 'Cleaning complete!',
                        'Final score: ' + (obs.current_score * 100).toFixed(1) + '%', null);
                }
            } catch (e) {
                console.error(e);
                addThought('!', 'Error during agent run', e.message, null);
                setStatus('idle');
            }

            isRunning = false;
            setButtons(true);
        }

        // ── Rule-based action picker (follows plan field) ────────────────
        function pickAction(obs) {
            if (obs.plan && obs.plan.length > 0) {
                const p = obs.plan[0];

                if (p.startsWith('align_schema'))
                    return { operation: 'align_schema' };
                if (p.startsWith('merge_sources'))
                    return { operation: 'merge_sources' };
                if (p.startsWith('drop_duplicates'))
                    return { operation: 'drop_duplicates' };

                const fillM = p.match(/fill_missing on "([^"]+)".*?(median|mode|mean)/);
                if (fillM)
                    return { operation: 'fill_missing', column: fillM[1], params: { strategy: fillM[2] } };

                const fmtM = p.match(/fix_format on "([^"]+)"/);
                if (fmtM)
                    return { operation: 'fix_format', column: fmtM[1] };

                const outM = p.match(/drop_outliers on "([^"]+)"/);
                if (outM)
                    return { operation: 'drop_outliers', column: outM[1] };
            }

            // Fallback: scan missing counts directly
            const missing = obs.missing_counts || {};
            for (const [col, cnt] of Object.entries(missing)) {
                if (cnt > 0) {
                    const cat = ['department', 'country', 'email', 'name', 'category'].includes(col);
                    return { operation: 'fill_missing', column: col, params: { strategy: cat ? 'mode' : 'median' } };
                }
            }

            if (obs.duplicate_count > 0)
                return { operation: 'drop_duplicates' };

            return null;
        }

        // ── UI update ────────────────────────────────────────────────────
        function updateUI(obs, reward) {
            const pct = obs.current_score;
            const CIRCUM = 326.73; // exact: 2 * pi * 52

            // Ring — minimum 3% arc so ring is never invisible at very low scores
            const displayPct = Math.max(pct, 0.03);
            document.getElementById('ring-fill').style.strokeDashoffset = CIRCUM * (1 - displayPct);

            // Score text — show the raw value accurately, e.g. "4.3%" or "87.5%"
            document.getElementById('ring-score').textContent = (pct * 100).toFixed(1) + '%';

            // Color ring by health
            const col = pct >= 0.85 ? '#00e5a0' : pct >= 0.5 ? '#f5a623' : '#ff4d6d';
            const rf = document.getElementById('ring-fill');
            rf.style.stroke = col;
            rf.style.filter = 'drop-shadow(0 0 5px ' + col + ')';
            document.getElementById('ring-score').style.color = col;

            // Stats
            document.getElementById('v-step').textContent = obs.step_count;
            document.getElementById('v-maxstep').textContent = 'of ' + (obs.step_count + 20);

            if (reward !== null) {
                const rv = document.getElementById('v-reward');
                rv.textContent = (reward >= 0 ? '+' : '') + reward.toFixed(4);
                rv.className = 'vital-value ' + (reward >= 0 ? 'green' : 'red');
            }

            const nullTotal = Object.values(obs.missing_counts || {}).reduce(function (a, b) { return a + b; }, 0);
            const vn = document.getElementById('v-nulls');
            vn.textContent = nullTotal;
            vn.className = 'vital-value ' + (nullTotal === 0 ? 'green' : 'amber');

            const vd = document.getElementById('v-dupes');
            vd.textContent = obs.duplicate_count;
            vd.className = 'vital-value ' + (obs.duplicate_count === 0 ? 'green' : 'amber');

            // DQ bars
            if (obs.dq_metrics) {
                setDQBar('completeness', obs.dq_metrics.completeness_pct, 'var(--green)');
                setDQBar('uniqueness', obs.dq_metrics.uniqueness_pct, 'var(--blue)');
                setDQBar('validity', obs.dq_metrics.validity_pct, 'var(--amber)');
            }

            // Plan
            const planEl = document.getElementById('plan-items');
            if (obs.plan && obs.plan.length > 0) {
                planEl.innerHTML = obs.plan.map(function (p, i) {
                    return '<div class="plan-item">' +
                        '<div class="plan-num">' + (i + 1) + '</div>' +
                        '<span style="color:var(--text)">' + p + '</span>' +
                        '</div>';
                }).join('');
            } else if (obs.done) {
                planEl.innerHTML = '<div style="color:var(--green);font-family:var(--mono);font-size:11px;padding:4px 0">Dataset fully cleaned</div>';
            } else {
                planEl.innerHTML = '<div style="color:var(--text-dim);font-family:var(--mono);font-size:11px;padding:4px 0">No further actions needed</div>';
            }

            // Table
            if (obs.data_preview) renderTable(obs.data_preview);

            // Bottom bar
            document.getElementById('b-shape').textContent = obs.data_shape[0] + ' x ' + obs.data_shape[1];
        }

        function setDQBar(name, val, color) {
            document.getElementById('dq-' + name).textContent = val.toFixed(1) + '%';
            document.getElementById('bar-' + name).style.width = Math.min(val, 100) + '%';
            document.getElementById('bar-' + name).style.background = color;
        }

        // ── Chart ────────────────────────────────────────────────────────
        function updateChart() {
            const svg = document.getElementById('score-chart');
            const W = svg.clientWidth || 600;
            const H = svg.clientHeight || 90;
            const pad = 6;

            if (scores.length < 2) {
                // Nothing to plot yet (fresh episode): clear any stale trajectory
                document.getElementById('chart-line').setAttribute('d', '');
                document.getElementById('chart-area').setAttribute('d', '');
                document.getElementById('chart-empty-msg').style.display = '';
                return;
            }
            document.getElementById('chart-empty-msg').style.display = 'none';

            const xs = scores.map(function (_, i) { return pad + (i / (scores.length - 1)) * (W - 2 * pad); });
            const ys = scores.map(function (s) { return (H - pad) - s * (H - 2 * pad); });
            const pts = xs.map(function (x, i) { return x + ',' + ys[i]; }).join(' L ');

            document.getElementById('chart-line').setAttribute('d', 'M ' + pts);
            document.getElementById('chart-area').setAttribute('d',
                'M ' + xs[0] + ',' + H + ' L ' + pts + ' L ' + xs[xs.length - 1] + ',' + H + ' Z'
            );
        }

        // ── Table ────────────────────────────────────────────────────────
        // Note: naive CSV parsing (plain split on commas and newlines). Fine
        // for the synthetic preview data, but not for quoted CSV fields.
        function renderTable(csv) {
            const lines = csv.trim().split('\n');
            if (lines.length < 2) return;
            const headers = lines[0].split(',');
            const rows = lines.slice(1, 11).map(function (l) { return l.split(','); });

            var html = '<table class="data-table"><thead><tr>' +
                headers.map(function (h) { return '<th>' + h.trim() + '</th>'; }).join('') +
                '</tr></thead><tbody>';

            rows.forEach(function (row) {
                html += '<tr>' + row.map(function (cell) {
                    var v = cell.trim();
                    var empty = v === '' || v.toLowerCase() === 'nan' || v.toLowerCase() === 'none';
                    return '<td class="' + (empty ? 'cell-null' : '') + '">' + (empty ? 'NULL' : v) + '</td>';
                }).join('') + '</tr>';
            });

            html += '</tbody></table>';
            document.getElementById('table-wrap').innerHTML = html;
        }
1179
+
1180
+ // ── Thought stream ───────────────────────────────────────────────
1181
+ function clearThoughts() {
1182
+ document.getElementById('thought-stream').innerHTML = '';
1183
+ }
1184
+
1185
+ function addThought(step, action, result, reward) {
1186
+ const ts = document.getElementById('thought-stream');
1187
+ const rewardHtml = reward !== null
1188
+ ? '<div class="thought-reward ' + (reward >= 0 ? 'reward-pos' : 'reward-neg') + '">' +
1189
+ (reward >= 0 ? '+' : '') + reward.toFixed(4) + '</div>'
1190
+ : '';
1191
+
1192
+ var el = document.createElement('div');
1193
+ el.className = 'thought-item';
1194
+ el.innerHTML =
1195
+ '<div class="thought-step">S' + step + '</div>' +
1196
+ '<div class="thought-body">' +
1197
+ '<div class="thought-action">' + action + '</div>' +
1198
+ '<div class="thought-result">' + result + '</div>' +
1199
+ rewardHtml +
1200
+ '</div>';
1201
+ ts.appendChild(el);
1202
+ }
1203
+
1204
+ // ── Helpers ──────────────────────────────────────────────────────
1205
+ function setStatus(s) {
1206
+ const el = document.getElementById('status-pill');
1207
+ el.className = 'status-pill ' + s;
1208
+ el.textContent = s.toUpperCase();
1209
+ }
1210
+
1211
+ function setButtons(enabled) {
1212
+ document.getElementById('run-btn').disabled = !enabled;
1213
+ document.getElementById('reset-btn').disabled = !enabled;
1214
+ }
1215
+
1216
+ async function downloadCSV() {
1217
+ try {
1218
+ const r = await fetch(BASE + '/export');
1219
+ const text = await r.text();
1220
+ const blob = new Blob([text], { type: 'text/csv' });
1221
+ const a = document.createElement('a');
1222
+ a.href = URL.createObjectURL(blob);
1223
+ a.download = 'cleaned_task' + selectedTask + '.csv';
1224
+ a.click();
1225
+ } catch (e) {
1226
+ console.error('Export failed:', e);
1227
+ }
1228
+ }
1229
+
1230
+ function sleep(ms) { return new Promise(function (r) { setTimeout(r, ms); }); }
1231
+
1232
+ // Auto-load Task 1 on open
1233
+ window.addEventListener('load', function () { resetEnv(); });
1234
+ </script>
1235
+ </body>
1236
+
1237
+ </html>