Taniieeee83 committed
Commit d2d30e9 · 0 Parent(s)

feat: initial implementation of Data Cleaning OpenEnv environment


Complete OpenEnv-compliant data cleaning environment with:
- 3 tasks (easy/medium/hard): fill missing values, fix formats+duplicates, full pipeline
- Synthetic dataset generation with fixed seed (fully reproducible, no external downloads)
- Deterministic programmatic graders with partial progress rewards
- FastAPI server exposing /health /reset /step /state endpoints
- Baseline inference script using OpenAI client
- Dockerfile for containerised deployment
- openenv.yaml manifest, README with full API/task documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.gitignore ADDED
@@ -0,0 +1,11 @@
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ dist/
+ build/
+ .venv/
+ venv/
+ .env
+ *.env
+ baseline_scores.json
+ .DS_Store
Dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.11-slim
+
+ # Non-root user for HuggingFace Spaces compatibility
+ RUN useradd -m -u 1000 appuser
+
+ WORKDIR /app
+
+ # Install dependencies first (layer cache)
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy project files
+ COPY . .
+
+ # Switch to non-root
+ RUN chown -R appuser:appuser /app
+ USER appuser
+
+ EXPOSE 8000
+
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md ADDED
@@ -0,0 +1,194 @@
+ ---
+ title: Data Cleaning Environment
+ emoji: 🧹
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ tags:
+   - openenv
+   - rl
+   - data-cleaning
+ ---
+
+ # Data Cleaning OpenEnv
+
+ A **real-world data cleaning environment** for AI agent training, built for the Scaler × OpenEnv hackathon.
+
+ An agent interacts with a dirty DataFrame through a simple `reset() / step() / state()` API, learning to fix common data-quality issues: missing values, duplicate rows, format inconsistencies, outliers, and dtype errors.
+
+ ---
+
+ ## Environment Description
+
+ Real-world datasets are rarely clean. Data engineers spend a significant fraction of their time:
+ - Filling missing values with appropriate strategies (median/mean/mode)
+ - Removing duplicate records
+ - Standardising inconsistent formats (phone numbers, dates, country names)
+ - Detecting and removing statistical outliers
+
+ This environment turns those tasks into a reinforcement learning challenge with deterministic, programmatic graders and a meaningful partial-progress reward signal.
+
+ ---
+
+ ## Action Space
+
+ Actions are JSON objects sent to `POST /step`:
+
+ | `operation`       | `column` | `params`                                                     | Description                        |
+ |-------------------|----------|--------------------------------------------------------------|------------------------------------|
+ | `fill_missing`    | required | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values                    |
+ | `drop_duplicates` | —        | —                                                            | Remove duplicate rows              |
+ | `fix_format`      | required | —                                                            | Standardise phone/date/country col |
+ | `replace_value`   | required | `{"old": ..., "new": ...}`                                   | Replace a specific value           |
+ | `drop_outliers`   | required | —                                                            | Remove IQR outliers in numeric col |
+ | `fix_dtype`       | required | `{"dtype": "float\|int\|str"}`                               | Cast column to correct dtype       |
+
+ **Example:**
+ ```json
+ {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
+ {"operation": "drop_duplicates"}
+ {"operation": "fix_format", "column": "signup_date"}
+ ```
+
+ ---
+
+ ## Observation Space
+
+ The `POST /step` and `POST /reset` responses return:
+
+ ```json
+ {
+   "observation": {
+     "done": false,
+     "reward": 0.05,
+     "data_preview": "name,age,salary,...\n...",
+     "data_shape": [100, 5],
+     "missing_counts": {"salary": 18, "age": 20},
+     "duplicate_count": 0,
+     "dtype_issues": {},
+     "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
+     "message": "Filled 20 missing values in 'age' using median.",
+     "step_count": 1,
+     "current_score": 0.25
+   },
+   "reward": 0.05,
+   "done": false,
+   "info": {}
+ }
+ ```
+
+ ---
+
+ ## Tasks
+
+ ### Task 1 — Fill Missing Values (Easy)
+ - **Dataset:** 100-row employee records (name, age, salary, department, experience)
+ - **Issues:** ~20% NaN in `age` and `salary`, ~10% in `department`
+ - **Goal:** Fill all missing values
+ - **Grader:** `1.0 - remaining_nulls / original_nulls`
+ - **Max steps:** 20
+ - **Expected baseline score:** ~0.95
+
+ ### Task 2 — Fix Formats + Remove Duplicates (Medium)
+ - **Dataset:** 200-row product catalog (product_id, price, phone, listed_date, …)
+ - **Issues:** Mixed phone formats, mixed date formats, 15 duplicate rows
+ - **Goal:** Standardise all formats and remove duplicates
+ - **Grader:** `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` (worked example below)
+ - **Max steps:** 30
+ - **Expected baseline score:** ~0.80
+
+ ### Task 3 — Full Cleaning Pipeline (Hard)
+ - **Dataset:** 300-row customer database (name, age, purchase_amount, country, email, signup_date)
+ - **Issues:** Missing values (4 cols), 20 duplicates, outliers in `purchase_amount`, mixed country case, mixed date formats
+ - **Goal:** Clean all issues end-to-end
+ - **Grader:** `0.25 × null + 0.20 × dupe + 0.20 × outlier + 0.175 × country + 0.175 × date`
+ - **Max steps:** 40
+ - **Expected baseline score:** ~0.70
+
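+ To make partial-progress grading concrete, here is a worked example of the Task 2 grader. This is a sketch with hypothetical issue counts; the weights and the `max(..., 1)` guard come from `server/tasks/task2_format.py`:
+
+ ```python
+ # Hypothetical mid-episode state: all phones fixed, half the dates fixed,
+ # duplicates not yet dropped.
+ orig = {"phone": 120, "date": 120, "dupes": 15}   # issue counts at reset
+ left = {"phone": 0,   "date": 60,  "dupes": 15}   # issues remaining now
+
+ phone_score = 1.0 - left["phone"] / max(orig["phone"], 1)   # 1.00
+ date_score  = 1.0 - left["date"]  / max(orig["date"], 1)    # 0.50
+ dupe_score  = 1.0 - left["dupes"] / max(orig["dupes"], 1)   # 0.00
+
+ score = 0.35 * phone_score + 0.35 * date_score + 0.30 * dupe_score
+ print(round(score, 4))  # 0.525
+ ```
+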
+ ---
+
+ ## Reward Function
+
+ | Scenario                   | Reward                             |
+ |----------------------------|------------------------------------|
+ | Progress (score improves)  | `new_score - old_score` (≥ 0)      |
+ | No effect                  | `-0.01`                            |
+ | Invalid operation          | `-0.05`                            |
+ | Episode completion (≥0.95) | `delta + 0.20` terminal bonus      |
+
+ Rewards are bounded to `[-0.05, 1.2]`. Partial rewards are emitted every step.
+
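+ A minimal sketch of the per-step reward logic, mirroring `server/environment.py`:
+
+ ```python
+ def step_reward(score_before: float, score_after: float,
+                 applied: bool, step_count: int, max_steps: int):
+     """Reward shaping used by the environment (sketch of environment.py)."""
+     delta = score_after - score_before
+     if not applied:        # invalid operation, or an op that found nothing to fix
+         reward = -0.05
+     elif delta <= 0:       # applied, but the grader score did not improve
+         reward = -0.01
+     else:                  # progress: reward equals the score improvement
+         reward = round(delta, 4)
+
+     done = score_after >= 0.95 or step_count >= max_steps
+     if done and score_after >= 0.95:
+         reward = round(reward + 0.2, 4)   # terminal bonus on success
+     return reward, done
+ ```
+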
+ ---
+
+ ## API Endpoints
+
+ | Method | Path      | Description                       |
+ |--------|-----------|-----------------------------------|
+ | GET    | `/health` | Health check → `{"status":"ok"}`  |
+ | POST   | `/reset`  | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
+ | POST   | `/step`   | Execute action. Body: action JSON |
+ | POST   | `/state`  | Get episode state                 |
+ | GET    | `/docs`   | Interactive Swagger UI            |
+
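+ A minimal Python client for these endpoints (a sketch using `httpx`; assumes the server is already running on localhost:8000):
+
+ ```python
+ import httpx
+
+ BASE = "http://localhost:8000"
+
+ # Start an episode on Task 1 and apply one cleaning action.
+ obs = httpx.post(f"{BASE}/reset", json={"task_id": 1}).json()["observation"]
+ print(obs["missing_counts"])   # e.g. {"age": 20, "salary": 20, "department": 10}
+
+ action = {"operation": "fill_missing", "column": "age",
+           "params": {"strategy": "median"}}
+ result = httpx.post(f"{BASE}/step", json=action).json()
+ print(result["reward"], result["done"], result["observation"]["message"])
+ ```
+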
+ ---
+
+ ## Setup & Usage
+
+ ### Local (Python)
+ ```bash
+ pip install -r requirements.txt
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Docker
+ ```bash
+ docker build -t data-cleaning-env .
+ docker run -p 8000:8000 data-cleaning-env
+ ```
+
+ ### Run Baseline Inference
+ ```bash
+ export API_BASE_URL="https://api.openai.com/v1"
+ export MODEL_NAME="gpt-4o-mini"
+ export HF_TOKEN="your-api-key"
+ export ENV_URL="http://localhost:8000"
+
+ python inference.py
+ ```
+
+ ---
+
+ ## Baseline Scores
+
+ | Task | Difficulty | Score  |
+ |------|------------|--------|
+ | 1    | Easy       | ~0.950 |
+ | 2    | Medium     | ~0.800 |
+ | 3    | Hard       | ~0.700 |
+ | avg  | —          | ~0.817 |
+
+ *(Scores produced by `gpt-4o-mini` with greedy decoding, temperature=0)*
+
+ ---
+
+ ## Project Structure
+
+ ```
+ openenv-data-cleaning/
+ ├── server/
+ │   ├── environment.py        # Core env: reset/step/state + action dispatcher
+ │   ├── app.py                # FastAPI HTTP API
+ │   ├── data_generator.py     # Synthetic dataset generation (fixed seed=42)
+ │   └── tasks/
+ │       ├── task1_missing.py  # Task 1: missing values dataset + grader
+ │       ├── task2_format.py   # Task 2: format + duplicates dataset + grader
+ │       └── task3_pipeline.py # Task 3: full pipeline dataset + grader
+ ├── models.py                 # Pydantic models (Action, Observation, State)
+ ├── inference.py              # Baseline inference script
+ ├── openenv.yaml              # OpenEnv manifest
+ ├── Dockerfile
+ ├── requirements.txt
+ └── README.md
+ ```
inference.py ADDED
@@ -0,0 +1,200 @@
+ """
+ Baseline inference script for the Data Cleaning OpenEnv environment.
+ Uses the OpenAI client against all 3 tasks and reports scores.
+
+ Required environment variables:
+     API_BASE_URL — LLM API endpoint (OpenAI-compatible)
+     MODEL_NAME   — model identifier
+     HF_TOKEN     — API key
+     ENV_URL      — environment server URL (default: http://localhost:8000)
+ """
+
+ import json
+ import os
+ import re
+ import sys
+ import time
+
+ import httpx
+ from openai import OpenAI
+
+ # ------------------------------------------------------------------
+ # Config
+ # ------------------------------------------------------------------
+
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
+ MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+ HF_TOKEN     = os.environ.get("HF_TOKEN", "")
+ ENV_URL      = os.environ.get("ENV_URL", "http://localhost:8000")
+
+ if not HF_TOKEN:
+     print("[WARNING] HF_TOKEN is not set — LLM calls may fail.", file=sys.stderr)
+
+ client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
+
+ SYSTEM_PROMPT = """You are a data cleaning agent. You control a data cleaning environment
+ through JSON actions. Each turn you receive an observation JSON describing the current state
+ of a dataset (preview, missing counts, duplicate count, dtype issues, current score, etc.)
+ and a task description.
+
+ Your job is to pick the single best action to improve the dataset quality.
+
+ Respond ONLY with a valid JSON object — no markdown, no explanation, just the JSON.
+
+ Available operations and their required parameters:
+
+ 1. fill_missing
+    {"operation": "fill_missing", "column": "<col>", "params": {"strategy": "median|mean|mode|constant", "value": <only if constant>}}
+
+ 2. drop_duplicates
+    {"operation": "drop_duplicates"}
+
+ 3. fix_format
+    {"operation": "fix_format", "column": "phone|listed_date|signup_date|country"}
+
+ 4. replace_value
+    {"operation": "replace_value", "column": "<col>", "params": {"old": "<val>", "new": "<val>"}}
+
+ 5. drop_outliers
+    {"operation": "drop_outliers", "column": "<numeric_col>"}
+
+ 6. fix_dtype
+    {"operation": "fix_dtype", "column": "<col>", "params": {"dtype": "float|int|str"}}
+
+ Rules:
+ - Address the highest-impact issues first (missing values > duplicates > outliers > format).
+ - Do not repeat an operation that returned no effect (watch the 'message' field).
+ - Stop when current_score >= 0.95.
+ """
+
+
+ # ------------------------------------------------------------------
+ # HTTP helpers
+ # ------------------------------------------------------------------
+
+ def api_post(path: str, payload: dict | None = None) -> dict:
+     url = ENV_URL.rstrip("/") + path
+     resp = httpx.post(url, json=payload or {}, timeout=30)
+     resp.raise_for_status()
+     return resp.json()
+
+
+ def api_get(path: str) -> dict:
+     url = ENV_URL.rstrip("/") + path
+     resp = httpx.get(url, timeout=10)
+     resp.raise_for_status()
+     return resp.json()
+
+
+ # ------------------------------------------------------------------
+ # Agent loop
+ # ------------------------------------------------------------------
+
+ def obs_to_text(obs: dict) -> str:
+     lines = [
+         f"current_score: {obs['current_score']}",
+         f"step_count: {obs['step_count']}",
+         f"data_shape: {obs['data_shape']}",
+         f"duplicate_count: {obs['duplicate_count']}",
+         f"missing_counts: {json.dumps(obs['missing_counts'])}",
+         f"dtype_issues: {json.dumps(obs['dtype_issues'])}",
+         f"message: {obs['message']}",
+         "",
+         "=== DATA PREVIEW (first 10 rows) ===",
+         obs["data_preview"],
+         "",
+         "=== TASK DESCRIPTION ===",
+         obs["task_description"],
+     ]
+     return "\n".join(lines)
+
+
+ def run_task(task_id: int) -> float:
+     print(f"\n{'='*60}")
+     print(f" Running Task {task_id}")
+     print(f"{'='*60}")
+
+     result = api_post("/reset", {"task_id": task_id})
+     obs = result["observation"]
+     history = []
+
+     # Hard cap above the largest task budget (40); the env sets done itself.
+     for step_num in range(1, 50):
+         if obs["done"]:
+             break
+
+         obs_text = obs_to_text(obs)
+         history.append({"role": "user", "content": obs_text})
+
+         response = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
+             temperature=0.0,
+             max_tokens=256,
+         )
+         action_str = (response.choices[0].message.content or "").strip()
+         history.append({"role": "assistant", "content": action_str})
+
+         # Parse action; fall back to extracting JSON from a markdown code fence
+         try:
+             action = json.loads(action_str)
+         except json.JSONDecodeError:
+             m = re.search(r"\{.*\}", action_str, re.DOTALL)
+             if m:
+                 try:
+                     action = json.loads(m.group())
+                 except Exception:
+                     print(f"  Step {step_num}: Could not parse action JSON, aborting task.")
+                     break
+             else:
+                 print(f"  Step {step_num}: No JSON found in response, aborting task.")
+                 break
+
+         print(f"  Step {step_num:2d} | score={obs['current_score']:.4f} | action={json.dumps(action)}")
+
+         result = api_post("/step", action)
+         obs = result["observation"]
+         print(f"    → {obs['message']}")
+
+         # Slight delay to stay within rate limits on free-tier endpoints
+         time.sleep(0.3)
+
+     final_score = obs["current_score"]
+     print(f"\n  Task {task_id} final score: {final_score:.4f} (steps used: {obs['step_count']})")
+     return final_score
+
+
+ # ------------------------------------------------------------------
+ # Main
+ # ------------------------------------------------------------------
+
+ def main():
+     print("Data Cleaning OpenEnv — Baseline Inference")
+     print(f"Model : {MODEL_NAME}")
+     print(f"Env   : {ENV_URL}")
+
+     # Smoke-test health endpoint
+     health = api_get("/health")
+     assert health.get("status") == "ok", f"Health check failed: {health}"
+     print("Health check: OK\n")
+
+     scores = {}
+     for task_id in [1, 2, 3]:
+         scores[f"task{task_id}"] = run_task(task_id)
+
+     print("\n" + "="*60)
+     print(" BASELINE RESULTS")
+     print("="*60)
+     for k, v in scores.items():
+         print(f"  {k}: {v:.4f}")
+     avg = sum(scores.values()) / len(scores)
+     print(f"  average: {avg:.4f}")
+     print("="*60)
+
+     # Write scores to file for automated validators
+     with open("baseline_scores.json", "w") as f:
+         json.dump({"scores": scores, "average": avg}, f, indent=2)
+     print("\nScores written to baseline_scores.json")
+
+
+ if __name__ == "__main__":
+     main()
models.py ADDED
@@ -0,0 +1,42 @@
+ from typing import Any, Dict, List, Optional
+ from pydantic import BaseModel
+
+
+ class DataCleaningAction(BaseModel):
+     """
+     Action to apply to the current dirty DataFrame.
+
+     operation choices:
+         fill_missing    – fill NaN values in a column
+         drop_duplicates – remove duplicate rows
+         fix_format      – standardise string formats (phone, date, text)
+         replace_value   – replace a specific value with another
+         drop_outliers   – remove rows where column value is a statistical outlier
+         fix_dtype       – cast a column to the correct dtype
+     """
+     operation: str
+     column: Optional[str] = None
+     params: Dict[str, Any] = {}
+
+
+ class DataCleaningObservation(BaseModel):
+     done: bool
+     reward: float
+     data_preview: str               # First 10 rows as CSV string
+     data_shape: List[int]           # [rows, cols]
+     missing_counts: Dict[str, int]
+     duplicate_count: int
+     dtype_issues: Dict[str, str]
+     task_description: str
+     message: str
+     step_count: int
+     current_score: float            # Running grader score 0.0–1.0
+
+
+ class DataCleaningState(BaseModel):
+     episode_id: str
+     task_id: int
+     step_count: int
+     max_steps: int
+     total_errors: int
+     errors_remaining: int
openenv.yaml ADDED
@@ -0,0 +1,73 @@
+ name: data-cleaning-env
+ version: "0.1.0"
+ description: >
+   A real-world data cleaning environment where an AI agent fixes missing
+   values, duplicate rows, format inconsistencies, outliers, and dtype errors
+   across three progressively harder tasks.
+
+ author: openenv-hackathon
+ tags:
+   - openenv
+   - data-cleaning
+   - rl
+   - real-world
+
+ tasks:
+   - id: task1
+     name: "Fill Missing Values"
+     difficulty: easy
+     max_steps: 20
+     description: >
+       Fill all NaN values in an employee records dataset.
+       Columns with missing data: age, salary, department.
+
+   - id: task2
+     name: "Fix Formats and Remove Duplicates"
+     difficulty: medium
+     max_steps: 30
+     description: >
+       Standardise phone numbers (NNN-NNN-NNNN) and dates (YYYY-MM-DD)
+       in a product catalog, and remove ~15 duplicate rows.
+
+   - id: task3
+     name: "Full Cleaning Pipeline"
+     difficulty: hard
+     max_steps: 40
+     description: >
+       End-to-end pipeline on a customer database: fill missing values,
+       remove duplicates, drop outliers in purchase_amount, standardise
+       country capitalisation, and fix mixed date formats.
+
+ api:
+   health: GET /health
+   reset: POST /reset
+   step: POST /step
+   state: POST /state
+   docs: GET /docs
+
+ reward:
+   range: [-0.05, 1.2]
+   partial: true
+   terminal_bonus: 0.2
+
+ observation_space:
+   type: object
+   fields:
+     done: boolean
+     reward: float
+     data_preview: string      # First 10 rows as CSV
+     data_shape: list          # [rows, cols]
+     missing_counts: object    # {column: count}
+     duplicate_count: integer
+     dtype_issues: object      # {column: issue_description}
+     task_description: string
+     message: string
+     step_count: integer
+     current_score: float      # 0.0–1.0
+
+ action_space:
+   type: object
+   fields:
+     operation: string         # fill_missing | drop_duplicates | fix_format | replace_value | drop_outliers | fix_dtype
+     column: string            # optional depending on operation
+     params: object            # optional operation parameters
pyproject.toml ADDED
@@ -0,0 +1,22 @@
+ [project]
+ name = "data-cleaning-env"
+ version = "0.1.0"
+ description = "Real-world data cleaning environment for OpenEnv / Scaler hackathon"
+ requires-python = ">=3.11"
+ dependencies = [
+     "fastapi>=0.104.0",
+     "uvicorn[standard]>=0.24.0",
+     "pydantic>=2.0.0",
+     "pandas>=2.0.0",
+     "numpy>=1.24.0",
+     "faker>=18.0.0",
+     "openai>=1.0.0",
+     "httpx>=0.25.0",
+ ]
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["server"]
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ fastapi>=0.104.0
+ uvicorn[standard]>=0.24.0
+ pydantic>=2.0.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ faker>=18.0.0
+ openai>=1.0.0
+ httpx>=0.25.0
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,63 @@
+ """
+ FastAPI application exposing the OpenEnv-compatible HTTP API.
+ Endpoints: GET /health, POST /reset, POST /step, POST /state, GET /docs
+ """
+
+ from typing import Optional
+ from fastapi import FastAPI, HTTPException
+ from pydantic import BaseModel
+
+ from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
+ from server.environment import DataCleaningEnvironment
+
+ app = FastAPI(
+     title="Data Cleaning OpenEnv",
+     description="A real-world data cleaning environment for AI agent training.",
+     version="0.1.0",
+ )
+
+ # Single shared environment instance (stateful server)
+ env = DataCleaningEnvironment()
+
+
+ class ResetRequest(BaseModel):
+     task_id: Optional[int] = None
+
+
+ class StepResponse(BaseModel):
+     observation: DataCleaningObservation
+     reward: float
+     done: bool
+     info: dict = {}
+
+
+ # ------------------------------------------------------------------
+ # Routes
+ # ------------------------------------------------------------------
+
+ @app.get("/health")
+ def health():
+     return {"status": "ok"}
+
+
+ @app.post("/reset", response_model=StepResponse)
+ def reset(req: ResetRequest = ResetRequest()):
+     try:
+         obs = env.reset(task_id=req.task_id)
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     return StepResponse(observation=obs, reward=0.0, done=False)
+
+
+ @app.post("/step", response_model=StepResponse)
+ def step(action: DataCleaningAction):
+     try:
+         obs = env.step(action)
+     except RuntimeError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     return StepResponse(observation=obs, reward=obs.reward, done=obs.done)
+
+
+ @app.post("/state", response_model=DataCleaningState)
+ def state():
+     return env.state()
server/data_generator.py ADDED
@@ -0,0 +1,197 @@
+ """
+ Synthetic dataset generation with a fixed seed for full reproducibility.
+ All datasets are generated purely from numpy/random — no external downloads.
+ """
+
+ import random
+ import numpy as np
+ import pandas as pd
+
+ SEED = 42
+
+
+ # ---------------------------------------------------------------------------
+ # Task 1 — Employee records with missing values
+ # ---------------------------------------------------------------------------
+
+ def generate_task1_datasets():
+     """Returns (dirty_df, clean_df) for Task 1."""
+     rng = np.random.default_rng(SEED)
+     random.seed(SEED)
+
+     n = 100
+     departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
+     first_names = ["Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace",
+                    "Heidi", "Ivan", "Judy", "Karl", "Laura", "Mallory", "Niaj",
+                    "Oscar", "Peggy", "Quinn", "Romeo", "Sybil", "Trent"]
+     last_names = ["Smith", "Jones", "Brown", "Taylor", "Wilson", "Davis",
+                   "Miller", "Anderson", "Thomas", "Jackson"]
+
+     names = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n)]
+     ages = rng.integers(22, 60, size=n).astype(float)
+     salaries = rng.integers(40_000, 120_000, size=n).astype(float)
+     depts = rng.choice(departments, size=n)
+     experience = rng.integers(0, 30, size=n).astype(float)
+
+     clean_df = pd.DataFrame({
+         "name": names,
+         "age": ages,
+         "salary": salaries,
+         "department": depts,
+         "experience": experience,
+     })
+
+     dirty_df = clean_df.copy()
+
+     # Inject NaN: 20% into age and salary, 10% into department
+     for col, frac in [("age", 0.20), ("salary", 0.20), ("department", 0.10)]:
+         idx = rng.choice(n, size=int(n * frac), replace=False)
+         dirty_df.loc[idx, col] = np.nan
+
+     return dirty_df.reset_index(drop=True), clean_df.reset_index(drop=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Task 2 — Product catalog with format & duplicate issues
+ # ---------------------------------------------------------------------------
+
+ def _scramble_phone(phone: str, rng) -> str:
+     digits = phone.replace("-", "")
+     fmt = rng.integers(0, 3)
+     if fmt == 0:
+         return digits                          # 5551234567
+     elif fmt == 1:
+         return f"({digits[:3]}){digits[3:]}"   # (555)1234567
+     else:
+         return phone                           # 555-123-4567 (canonical)
+
+
+ def _scramble_date(date_str: str, rng) -> str:
+     dt = pd.to_datetime(date_str)
+     fmt = rng.integers(0, 3)
+     if fmt == 0:
+         return dt.strftime("%Y-%m-%d")
+     elif fmt == 1:
+         return dt.strftime("%b %d %Y")
+     else:
+         return dt.strftime("%d/%m/%Y")
+
+
+ def generate_task2_datasets():
+     """Returns (dirty_df, clean_df) for Task 2."""
+     rng = np.random.default_rng(SEED)
+     random.seed(SEED)
+
+     n = 200
+     categories = ["Electronics", "Clothing", "Food", "Books", "Toys"]
+
+     product_ids = [f"P{str(i).zfill(4)}" for i in range(1, n + 1)]
+     product_names = [f"Product_{i}" for i in range(1, n + 1)]
+     prices = np.round(rng.uniform(5.0, 500.0, size=n), 2)
+     categories_col = rng.choice(categories, size=n)
+     phones = [
+         f"{rng.integers(100, 999)}-{rng.integers(100, 999)}-{rng.integers(1000, 9999)}"
+         for _ in range(n)
+     ]
+     days_offset = rng.integers(0, 1000, size=n)
+     dates = [
+         (pd.Timestamp("2020-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+         for d in days_offset
+     ]
+
+     clean_df = pd.DataFrame({
+         "product_id": product_ids,
+         "product_name": product_names,
+         "price": prices,
+         "category": categories_col,
+         "phone": phones,
+         "listed_date": dates,
+     })
+
+     dirty_df = clean_df.copy()
+
+     # Scramble ~60% of phone formats
+     phone_idx = rng.choice(n, size=int(n * 0.6), replace=False)
+     dirty_df.loc[phone_idx, "phone"] = [
+         _scramble_phone(dirty_df.loc[i, "phone"], rng) for i in phone_idx
+     ]
+
+     # Scramble ~60% of date formats
+     date_idx = rng.choice(n, size=int(n * 0.6), replace=False)
+     dirty_df.loc[date_idx, "listed_date"] = [
+         _scramble_date(dirty_df.loc[i, "listed_date"], rng) for i in date_idx
+     ]
+
+     # Add 15 duplicate rows
+     dup_idx = rng.choice(n, size=15, replace=False)
+     dup_rows = dirty_df.iloc[dup_idx].copy()
+     dirty_df = pd.concat([dirty_df, dup_rows], ignore_index=True)
+
+     return dirty_df.reset_index(drop=True), clean_df.reset_index(drop=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Task 3 — Customer database: full pipeline
+ # ---------------------------------------------------------------------------
+
+ def generate_task3_datasets():
+     """Returns (dirty_df, clean_df) for Task 3."""
+     rng = np.random.default_rng(SEED)
+     random.seed(SEED)
+
+     n = 300
+     countries = ["USA", "UK", "Canada", "Australia", "Germany"]
+     first_names = ["Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace",
+                    "Heidi", "Ivan", "Judy"]
+     last_names = ["Smith", "Jones", "Brown", "Taylor", "Wilson"]
+
+     names = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n)]
+     ages = rng.integers(18, 75, size=n).astype(float)
+     purchase_amounts = np.round(rng.uniform(10.0, 500.0, size=n), 2)
+     countries_col = rng.choice(countries, size=n)
+     emails = [f"user{i}@example.com" for i in range(1, n + 1)]
+     days_offset = rng.integers(0, 730, size=n)
+     signup_dates = [
+         (pd.Timestamp("2022-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+         for d in days_offset
+     ]
+
+     clean_df = pd.DataFrame({
+         "name": names,
+         "age": ages,
+         "purchase_amount": purchase_amounts,
+         "country": countries_col,
+         "email": emails,
+         "signup_date": signup_dates,
+     })
+
+     dirty_df = clean_df.copy()
+
+     # Missing values: 15% in age and purchase_amount, 10% in country and signup_date
+     for col, frac in [("age", 0.15), ("purchase_amount", 0.15),
+                       ("country", 0.10), ("signup_date", 0.10)]:
+         idx = rng.choice(n, size=int(n * frac), replace=False)
+         dirty_df.loc[idx, col] = np.nan
+
+     # Outliers in purchase_amount (~3%)
+     out_idx = rng.choice(n, size=int(n * 0.03), replace=False)
+     dirty_df.loc[out_idx, "purchase_amount"] = (
+         dirty_df.loc[out_idx, "purchase_amount"] * 10
+     )
+
+     # Mixed case in country (~40%)
+     case_idx = rng.choice(n, size=int(n * 0.40), replace=False)
+     dirty_df.loc[case_idx, "country"] = dirty_df.loc[case_idx, "country"].str.lower()
+
+     # Mixed date formats (~50%) — only scramble non-null entries
+     date_idx = rng.choice(n, size=int(n * 0.50), replace=False)
+     valid_date_idx = [i for i in date_idx if pd.notna(dirty_df.loc[i, "signup_date"])]
+     for i in valid_date_idx:
+         dirty_df.loc[i, "signup_date"] = _scramble_date(dirty_df.loc[i, "signup_date"], rng)
+
+     # 20 duplicate rows
+     dup_idx = rng.choice(n, size=20, replace=False)
+     dup_rows = dirty_df.iloc[dup_idx].copy()
+     dirty_df = pd.concat([dirty_df, dup_rows], ignore_index=True)
+
+     return dirty_df.reset_index(drop=True), clean_df.reset_index(drop=True)
server/environment.py ADDED
@@ -0,0 +1,340 @@
+ """
+ Core environment implementing reset / step / state.
+ Each call to reset() picks a task (round-robin: 1 → 2 → 3 → 1 …)
+ or a specific task_id can be forced via reset(task_id=N).
+ """
+
+ import re
+ import uuid
+ import pandas as pd
+ from typing import Any, Dict, Optional, Tuple
+
+ from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
+ import server.tasks.task1_missing as t1
+ import server.tasks.task2_format as t2
+ import server.tasks.task3_pipeline as t3
+
+ TASK_MODULES = {1: t1, 2: t2, 3: t3}
+
+ PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
+
+
+ class DataCleaningEnvironment:
+
+     def __init__(self):
+         self._df: Optional[pd.DataFrame] = None
+         self._clean_df: Optional[pd.DataFrame] = None
+         self._meta: Any = None        # task-specific metadata
+         self._task_id: int = 1
+         self._episode_id: str = ""
+         self._step_count: int = 0
+         self._max_steps: int = 20
+         self._total_errors: int = 0
+         self._last_score: float = 0.0
+         self._task_cycle: int = 0     # for round-robin default
+
+     # ------------------------------------------------------------------
+     # Public API
+     # ------------------------------------------------------------------
+
+     def reset(self, task_id: Optional[int] = None) -> DataCleaningObservation:
+         if task_id is None:
+             self._task_cycle = (self._task_cycle % 3) + 1
+             task_id = self._task_cycle
+
+         if task_id not in TASK_MODULES:
+             raise ValueError(f"task_id must be 1, 2, or 3 — got {task_id}")
+
+         mod = TASK_MODULES[task_id]
+         self._task_id = task_id
+         self._episode_id = str(uuid.uuid4())
+         self._step_count = 0
+         self._max_steps = mod.MAX_STEPS
+
+         # All task modules share the same load() signature
+         self._df, self._clean_df, self._meta = mod.load()
+
+         self._last_score = self._compute_score()
+         self._total_errors = self._count_errors()
+
+         return self._build_obs(0.0, False, "Episode started. Begin cleaning.")
+
+     def step(self, action: DataCleaningAction) -> DataCleaningObservation:
+         if self._df is None:
+             raise RuntimeError("Call reset() before step().")
+
+         self._step_count += 1
+         score_before = self._last_score
+
+         message, applied = self._apply_action(action)
+
+         score_after = self._compute_score()
+         self._last_score = score_after
+
+         delta = score_after - score_before
+         if not applied:
+             reward = -0.05
+         elif delta <= 0:
+             reward = -0.01
+         else:
+             reward = round(delta, 4)
+
+         done = (score_after >= 0.95) or (self._step_count >= self._max_steps)
+         if done and score_after >= 0.95:
+             reward = round(reward + 0.2, 4)
+
+         return self._build_obs(reward, done, message)
+
+     def state(self) -> DataCleaningState:
+         if self._df is None:
+             return DataCleaningState(
+                 episode_id="", task_id=0, step_count=0,
+                 max_steps=0, total_errors=0, errors_remaining=0,
+             )
+         return DataCleaningState(
+             episode_id=self._episode_id,
+             task_id=self._task_id,
+             step_count=self._step_count,
+             max_steps=self._max_steps,
+             total_errors=self._total_errors,
+             errors_remaining=self._count_errors(),
+         )
+
+     # ------------------------------------------------------------------
+     # Internal helpers
+     # ------------------------------------------------------------------
+
+     def _compute_score(self) -> float:
+         if self._task_id == 1:
+             return t1.score(self._df, self._meta)
+         elif self._task_id == 2:
+             return t2.score(self._df, self._meta)
+         else:
+             return t3.score(self._df, self._meta)
+
+     def _count_errors(self) -> int:
+         if self._task_id == 1:
+             return t1.count_errors(self._df)
+         elif self._task_id == 2:
+             return t2.count_errors(self._df, self._meta)
+         else:
+             return t3.count_errors(self._df, self._meta)
+
+     def _build_obs(self, reward: float, done: bool, message: str) -> DataCleaningObservation:
+         mod = TASK_MODULES[self._task_id]
+         missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
+         dupes = len(self._df) - len(self._df.drop_duplicates())
+         dtype_issues = self._detect_dtype_issues()
+         preview = self._df.head(10).to_csv(index=False)
+
+         return DataCleaningObservation(
+             done=done,
+             reward=reward,
+             data_preview=preview,
+             data_shape=list(self._df.shape),
+             missing_counts=missing,
+             duplicate_count=dupes,
+             dtype_issues=dtype_issues,
+             task_description=mod.DESCRIPTION,
+             message=message,
+             step_count=self._step_count,
+             current_score=self._last_score,
+         )
+
+     def _detect_dtype_issues(self) -> Dict[str, str]:
+         issues: Dict[str, str] = {}
+         for col in self._df.columns:
+             series = self._df[col].dropna()
+             if series.empty:
+                 continue
+             if self._df[col].dtype == object:
+                 numeric_count = pd.to_numeric(series, errors="coerce").notna().sum()
+                 if numeric_count / len(series) > 0.8:
+                     issues[col] = "stored as string but appears numeric"
+         return issues
+
+     # ------------------------------------------------------------------
+     # Action dispatcher
+     # ------------------------------------------------------------------
+
+     def _apply_action(self, action: DataCleaningAction) -> Tuple[str, bool]:
+         op = action.operation.strip().lower()
+         col = action.column
+         p = action.params or {}
+
+         try:
+             if op == "fill_missing":
+                 return self._fill_missing(col, p)
+             elif op == "drop_duplicates":
+                 return self._drop_duplicates()
+             elif op == "fix_format":
+                 return self._fix_format(col)
+             elif op == "replace_value":
+                 return self._replace_value(col, p)
+             elif op == "drop_outliers":
+                 return self._drop_outliers(col)
+             elif op == "fix_dtype":
+                 return self._fix_dtype(col, p)
+             else:
+                 return (f"Unknown operation '{op}'. Choose from: fill_missing, "
+                         f"drop_duplicates, fix_format, replace_value, drop_outliers, "
+                         f"fix_dtype.", False)
+         except Exception as exc:
+             return f"Operation failed: {exc}", False
+
+     def _fill_missing(self, col, p) -> Tuple[str, bool]:
+         if col is None or col not in self._df.columns:
+             return f"Column '{col}' not found.", False
+         n_before = int(self._df[col].isnull().sum())
+         if n_before == 0:
+             return f"No missing values in '{col}'.", False
+
+         strategy = str(p.get("strategy", "median")).lower()
+         if strategy == "median":
+             fill_val = self._df[col].median(skipna=True)
+         elif strategy == "mean":
+             fill_val = self._df[col].mean(skipna=True)
+         elif strategy == "mode":
+             mode = self._df[col].mode(dropna=True)
+             fill_val = mode.iloc[0] if not mode.empty else None
+         elif strategy == "constant":
+             fill_val = p.get("value")
+         else:
+             return f"Unknown strategy '{strategy}'.", False
+
+         if fill_val is None:
+             return "Could not determine fill value.", False
+
+         self._df[col] = self._df[col].fillna(fill_val)
+         n_after = int(self._df[col].isnull().sum())
+         return f"Filled {n_before - n_after} missing values in '{col}' using {strategy}.", True
+
+     def _drop_duplicates(self) -> Tuple[str, bool]:
+         n_before = len(self._df)
+         self._df = self._df.drop_duplicates().reset_index(drop=True)
+         n_after = len(self._df)
+         removed = n_before - n_after
+         if removed == 0:
+             return "No duplicate rows found.", False
+         return f"Dropped {removed} duplicate rows.", True
+
+     def _fix_format(self, col) -> Tuple[str, bool]:
+         if col is None or col not in self._df.columns:
+             return f"Column '{col}' not found.", False
+
+         if col == "phone":
+             return self._fix_phone(col)
+         elif col in ("listed_date", "signup_date"):
+             return self._fix_date(col)
+         elif col == "country":
+             return self._fix_country(col)
+         else:
+             return f"No format rule defined for column '{col}'.", False
+
+     def _fix_phone(self, col) -> Tuple[str, bool]:
+         def normalise(val):
+             if pd.isna(val):
+                 return val
+             digits = re.sub(r"\D", "", str(val))
+             if len(digits) == 10:
+                 return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
+             return val
+
+         before = (~self._df[col].str.match(PHONE_RE, na=False)).sum()
+         self._df[col] = self._df[col].apply(normalise)
+         after = (~self._df[col].str.match(PHONE_RE, na=False)).sum()
+         fixed = int(before - after)
+         if fixed == 0:
+             return f"No phone format issues found in '{col}'.", False
+         return f"Fixed {fixed} phone numbers in '{col}' to NNN-NNN-NNNN format.", True
+
+     def _fix_date(self, col) -> Tuple[str, bool]:
+         def normalise(val):
+             if pd.isna(val):
+                 return val
+             try:
+                 return pd.to_datetime(str(val), dayfirst=False).strftime("%Y-%m-%d")
+             except Exception:
+                 try:
+                     return pd.to_datetime(str(val), dayfirst=True).strftime("%Y-%m-%d")
+                 except Exception:
+                     return val
+
+         before = (~self._df[col].apply(
+             lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+         )).sum()
+         self._df[col] = self._df[col].apply(normalise)
+         after = (~self._df[col].apply(
+             lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+         )).sum()
+         fixed = int(before - after)
+         if fixed == 0:
+             return f"No date format issues found in '{col}'.", False
+         return f"Fixed {fixed} dates in '{col}' to YYYY-MM-DD format.", True
+
+     def _fix_country(self, col) -> Tuple[str, bool]:
+         def normalise(val):
+             if pd.isna(val):
+                 return val
+             mapping = {
+                 "usa": "USA", "uk": "UK", "canada": "Canada",
+                 "australia": "Australia", "germany": "Germany",
+             }
+             return mapping.get(str(val).strip().lower(), val)
+
+         before = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
+         self._df[col] = self._df[col].apply(normalise)
+         after = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
+         fixed = int(before - after)
+         if fixed == 0:
+             return "No country capitalisation issues found.", False
+         return f"Fixed {fixed} country values to correct capitalisation.", True
+
+     def _replace_value(self, col, p) -> Tuple[str, bool]:
+         if col is None or col not in self._df.columns:
+             return f"Column '{col}' not found.", False
+         old = p.get("old")
+         new = p.get("new")
+         if old is None:
+             return "params.old is required for replace_value.", False
+         count = int((self._df[col] == old).sum())
+         if count == 0:
+             return f"Value '{old}' not found in '{col}'.", False
+         self._df[col] = self._df[col].replace(old, new)
+         return f"Replaced {count} occurrences of '{old}' with '{new}' in '{col}'.", True
+
+     def _drop_outliers(self, col) -> Tuple[str, bool]:
+         if col is None or col not in self._df.columns:
+             return f"Column '{col}' not found.", False
+         if not pd.api.types.is_numeric_dtype(self._df[col]):
+             return f"'{col}' is not numeric.", False
+         q1 = self._df[col].quantile(0.25)
+         q3 = self._df[col].quantile(0.75)
+         iqr = q3 - q1
+         mask = (self._df[col] >= q1 - 3 * iqr) & (self._df[col] <= q3 + 3 * iqr)
+         n_before = len(self._df)
+         self._df = self._df[mask | self._df[col].isna()].reset_index(drop=True)
+         removed = n_before - len(self._df)
+         if removed == 0:
+             return f"No outliers found in '{col}'.", False
+         return f"Removed {removed} outlier rows from '{col}' using IQR method.", True
+
+     def _fix_dtype(self, col, p) -> Tuple[str, bool]:
+         if col is None or col not in self._df.columns:
+             return f"Column '{col}' not found.", False
+         dtype = str(p.get("dtype", "float")).lower()
+         try:
+             if dtype == "float":
+                 self._df[col] = pd.to_numeric(self._df[col], errors="coerce").astype(float)
+             elif dtype == "int":
+                 # Nullable Int64 so the cast survives NaN values
+                 self._df[col] = pd.to_numeric(self._df[col], errors="coerce").astype("Int64")
+             elif dtype == "str":
+                 self._df[col] = self._df[col].astype(str)
+             else:
+                 return f"Unknown dtype '{dtype}'.", False
+             return f"Converted '{col}' to {dtype}.", True
+         except Exception as exc:
+             return f"dtype conversion failed: {exc}", False
server/tasks/__init__.py ADDED
File without changes
server/tasks/task1_missing.py ADDED
@@ -0,0 +1,39 @@
+ """
+ Task 1 — Easy: Fill Missing Values
+ Objective: Fill all NaN values in the employee records DataFrame.
+ Score: 1.0 - (remaining_nulls / original_nulls)
+ """
+
+ from server.data_generator import generate_task1_datasets
+
+ TASK_ID = 1
+ MAX_STEPS = 20
+ DESCRIPTION = (
+     "Task 1 (Easy) — Fill Missing Values\n"
+     "You have an employee records dataset with missing values (NaN) in "
+     "'age', 'salary', and 'department' columns. "
+     "Your goal is to fill all missing values so the dataset is complete.\n\n"
+     "Available operation: fill_missing\n"
+     "  params.strategy: 'median' | 'mean' | 'mode' | 'constant'\n"
+     "  params.value: (required when strategy='constant') the fill value\n"
+     "Example action: {\"operation\": \"fill_missing\", \"column\": \"age\", \"params\": {\"strategy\": \"median\"}}"
+ )
+
+
+ def load():
+     """Return (dirty_df, clean_df, original_null_count)."""
+     dirty, clean = generate_task1_datasets()
+     original_nulls = int(dirty.isnull().sum().sum())
+     return dirty.copy(), clean, original_nulls
+
+
+ def score(current_df, original_nulls: int) -> float:
+     """Score in [0, 1]: fraction of nulls filled."""
+     if original_nulls == 0:
+         return 1.0
+     remaining = int(current_df.isnull().sum().sum())
+     return round(max(0.0, 1.0 - remaining / original_nulls), 4)
+
+
+ def count_errors(current_df) -> int:
+     return int(current_df.isnull().sum().sum())
server/tasks/task2_format.py ADDED
@@ -0,0 +1,68 @@
+ """
+ Task 2 — Medium: Fix Formats + Remove Duplicates
+ Objective: Standardise phone & date formats and drop duplicate rows.
+ Score: 0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score
+ """
+
+ import re
+ import pandas as pd
+ from server.data_generator import generate_task2_datasets
+
+ TASK_ID = 2
+ MAX_STEPS = 30
+ DESCRIPTION = (
+     "Task 2 (Medium) — Fix Formats and Remove Duplicates\n"
+     "You have a product catalog with:\n"
+     "  • Phone numbers in mixed formats (need: NNN-NNN-NNNN)\n"
+     "  • Dates in mixed formats (need: YYYY-MM-DD)\n"
+     "  • Duplicate rows (~15)\n\n"
+     "Available operations:\n"
+     "  fix_format      — column: 'phone' | 'listed_date'\n"
+     "  drop_duplicates — no column needed\n\n"
+     "Example actions:\n"
+     '  {"operation": "fix_format", "column": "phone"}\n'
+     '  {"operation": "fix_format", "column": "listed_date"}\n'
+     '  {"operation": "drop_duplicates"}'
+ )
+
+ PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+
+
+ def load():
+     dirty, clean = generate_task2_datasets()
+     original_phone_issues = int((~dirty["phone"].str.match(PHONE_RE)).sum())
+     original_date_issues = int((~dirty["listed_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+     original_dupes = len(dirty) - len(dirty.drop_duplicates())
+     meta = {
+         "orig_phone": original_phone_issues,
+         "orig_date": original_date_issues,
+         "orig_dupes": original_dupes,
+     }
+     return dirty.copy(), clean, meta
+
+
+ def score(current_df, meta: dict) -> float:
+     phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+     date_issues = int((~current_df["listed_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+     dupes = len(current_df) - len(current_df.drop_duplicates())
+
+     phone_score = 1.0 - phone_issues / max(meta["orig_phone"], 1)
+     date_score = 1.0 - date_issues / max(meta["orig_date"], 1)
+     dupe_score = 1.0 - dupes / max(meta["orig_dupes"], 1)
+
+     combined = 0.35 * phone_score + 0.35 * date_score + 0.30 * dupe_score
+     return round(max(0.0, min(1.0, combined)), 4)
+
+
+ def count_errors(current_df, meta: dict) -> int:
+     phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+     date_issues = int((~current_df["listed_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+     dupes = len(current_df) - len(current_df.drop_duplicates())
+     return phone_issues + date_issues + dupes
server/tasks/task3_pipeline.py ADDED
@@ -0,0 +1,104 @@
+ """
+ Task 3 — Hard: Full Cleaning Pipeline
+ Objective: Fix missing values, remove duplicates, handle outliers, standardise
+ country capitalisation and date formats.
+ Score: 0.25 × nulls + 0.20 × dupes + 0.20 × outliers + 0.175 × country + 0.175 × dates
+ """
+
+ import re
+ import pandas as pd
+ from server.data_generator import generate_task3_datasets
+
+ TASK_ID = 3
+ MAX_STEPS = 40
+ DESCRIPTION = (
+     "Task 3 (Hard) — Full Cleaning Pipeline\n"
+     "You have a customer database with multiple issues:\n"
+     "  1. Missing values in 'age', 'purchase_amount', 'country', 'signup_date'\n"
+     "  2. ~20 duplicate rows\n"
+     "  3. Outliers in 'purchase_amount' (injected values ~10x normal)\n"
+     "  4. Mixed case in 'country' (need: canonical form, e.g. 'usa' → 'USA')\n"
+     "  5. Mixed date formats in 'signup_date' (need: YYYY-MM-DD)\n\n"
+     "Available operations:\n"
+     "  fill_missing    — column + params.strategy ('median'|'mean'|'mode'|'constant')\n"
+     "  drop_duplicates — no column needed\n"
+     "  drop_outliers   — column (numeric); uses IQR method\n"
+     "  fix_format      — column: 'country' | 'signup_date'\n"
+     "  fix_dtype       — column + params.dtype ('float'|'int'|'str')\n\n"
+     "Example actions:\n"
+     '  {"operation": "fill_missing", "column": "age", "params": {"strategy": "median"}}\n'
+     '  {"operation": "drop_duplicates"}\n'
+     '  {"operation": "drop_outliers", "column": "purchase_amount"}\n'
+     '  {"operation": "fix_format", "column": "signup_date"}\n'
+     '  {"operation": "fix_format", "column": "country"}'
+ )
+
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
+
+
+ def load():
+     dirty, clean = generate_task3_datasets()
+     orig_nulls = int(dirty.isnull().sum().sum())
+     orig_dupes = len(dirty) - len(dirty.drop_duplicates())
+
+     # Outlier baseline: count rows where purchase_amount > Q3 + 3*IQR
+     pa = dirty["purchase_amount"].dropna()
+     q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
+     iqr = q3 - q1
+     orig_outliers = int((pa > q3 + 3 * iqr).sum())
+
+     orig_country_issues = int((~dirty["country"].isin(VALID_COUNTRIES) &
+                                dirty["country"].notna()).sum())
+     orig_date_issues = int((~dirty["signup_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+
+     meta = {
+         "orig_nulls": orig_nulls,
+         "orig_dupes": orig_dupes,
+         "orig_outliers": max(orig_outliers, 1),
+         "orig_country_issues": max(orig_country_issues, 1),
+         "orig_date_issues": max(orig_date_issues, 1),
+         "q1": q1, "q3": q3, "iqr": iqr,
+     }
+     return dirty.copy(), clean, meta
+
+
+ def score(current_df, meta: dict) -> float:
+     remaining_nulls = int(current_df.isnull().sum().sum())
+     remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
+
+     pa = current_df["purchase_amount"].dropna()
+     remaining_outliers = int((pa > meta["q3"] + 3 * meta["iqr"]).sum())
+
+     remaining_country = int((~current_df["country"].isin(VALID_COUNTRIES) &
+                              current_df["country"].notna()).sum())
+     remaining_dates = int((~current_df["signup_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+
+     null_score = 1.0 - remaining_nulls / max(meta["orig_nulls"], 1)
+     dupe_score = 1.0 - remaining_dupes / max(meta["orig_dupes"], 1)
+     outlier_score = 1.0 - remaining_outliers / meta["orig_outliers"]
+     country_score = 1.0 - remaining_country / meta["orig_country_issues"]
+     date_score = 1.0 - remaining_dates / meta["orig_date_issues"]
+
+     combined = 0.25 * null_score + 0.20 * dupe_score + 0.20 * outlier_score \
+         + 0.175 * country_score + 0.175 * date_score
+     return round(max(0.0, min(1.0, combined)), 4)
+
+
+ def count_errors(current_df, meta: dict) -> int:
+     remaining_nulls = int(current_df.isnull().sum().sum())
+     remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
+     pa = current_df["purchase_amount"].dropna()
+     remaining_outliers = int((pa > meta["q3"] + 3 * meta["iqr"]).sum())
+     remaining_country = int((~current_df["country"].isin(VALID_COUNTRIES) &
+                              current_df["country"].notna()).sum())
+     remaining_dates = int((~current_df["signup_date"].apply(
+         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+     )).sum())
+     return remaining_nulls + remaining_dupes + remaining_outliers + \
+         remaining_country + remaining_dates