varb15 committed on
Commit
c338ce7
·
verified ·
1 Parent(s): 9996a16

Upload folder using huggingface_hub

Browse files
Dockerfile ADDED
@@ -0,0 +1,36 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system deps
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     git curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install uv for fast dependency management
+ RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+     mv /root/.local/bin/uv /usr/local/bin/uv && \
+     mv /root/.local/bin/uvx /usr/local/bin/uvx
+
+ # Copy project files
+ COPY pyproject.toml /app/
+ COPY openenv.yaml /app/
+ COPY dataqa_env/ /app/dataqa_env/
+ COPY inference.py /app/
+ COPY README.md /app/
+
+ # Install dependencies
+ RUN uv sync --no-editable 2>/dev/null || pip install -e .
+
+ # Set environment
+ ENV PATH="/app/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app:$PYTHONPATH"
+
+ # Health check — HF Spaces uses port 8000
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+ EXPOSE 8000
+
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,10 +1,351 @@
  ---
- title: Dataqa Env
- emoji: 💻
  colorFrom: blue
- colorTo: red
  sdk: docker
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: DataQA Environment Server
+ emoji: 🔍
  colorFrom: blue
+ colorTo: gray
  sdk: docker
  pinned: false
+ app_port: 8000
+ tags:
+   - openenv
+ base_path: /web
  ---
 
+ # DataQA Environment
+
+ A two-phase OpenEnv RL environment for **Data Quality Assurance** — an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
+
+ ### Demo: Agent Trajectory Replay
+
+ ```
+ EASY TASK (Step 2) — All 6 issues found + 5 fixes proposed
+ Reward: 0.87 | Identify: 1.00 | Fix: 0.67
+ ✓ row:4  name: empty → "David Kim"
+ ✓ row:7  salary: "seventy-five thousand" → "75000"
+ ✓ row:9  salary: "5000" → "73000"
+ ✓ row:15 email: mismatch → "oscar.rivera@company.com"
+ ✓ row:18 start_date: "2027-06-15" → "2022-01-19"
+ ✓ row:21 duplicate row detected
+
+ HARD TASK — ML experiment metadata
+ Step 1: Found 5/10, missed hard issues → Reward: 0.69
+ Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
+ Issues requiring ML knowledge:
+ • val_loss < train_loss (data leakage signal)
+ • resnet18 using 42.5GB GPU (impossible)
+ • 350 epochs on ImageNet in 30 min (impossible)
+ • wav2vec2 at 98.5% accuracy (exceeds SOTA)
+
+ ALIGNMENT TASK — NVIDIA HelpSteer data (hardest)
+ Step 1: Found 7/12, missed subtle issues → Reward: 0.58
+ Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
+ Issues requiring deep reasoning:
+ • Cerasus vs Prunus serrulata (wrong taxonomic name)
+ • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
+ • "does NOT learn via backprop" then describes backprop (self-contradiction)
+ • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
+ • "use bare except everywhere" rated helpfulness=3 (harmful advice)
+ • [SYSTEM] prompt leaked in response (pipeline contamination)
+ ```
+
+ > The interactive replay UI with color-coded dataset visualization is available on the HF Space.
+
+ ## Motivation
+
+ Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies — before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
+
+ DataQA turns this into a **two-phase RL challenge**:
+ 1. **Identify** — systematically inspect corrupted data and pinpoint every planted issue
+ 2. **Fix** — propose corrected values by reasoning about schema, constraints, and context
+
+ This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.
+
+ ## Environment API
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/reset` | POST | Start a new episode with a corrupted dataset |
+ | `/step` | POST | Submit identified issues + proposed fixes |
+ | `/state` | GET | Get current episode state |
+ | `/health` | GET | Health check |
+
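+ A minimal sketch of hitting these endpoints directly with `requests` (the exact JSON envelope is defined by openenv-core's `create_app`, so the field names below are assumptions — the bundled `DataQAEnv` client handles this for you):
+
+ ```python
+ import requests
+
+ BASE = "http://localhost:8000"
+
+ # Start an episode on the easy task; reset kwargs include task_id and seed.
+ obs = requests.post(f"{BASE}/reset", json={"task_id": "easy", "seed": 42}).json()
+
+ # Submit one identified issue plus a proposed fix (formats are described below).
+ result = requests.post(f"{BASE}/step", json={
+     "issues": ["row:4,col:name,issue:missing_value"],
+     "fixes": ["row:4,col:name,fix:David Kim"],
+     "task_id": "easy",
+ }).json()
+ print(result.get("reward"), result.get("done"))
+ ```
+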
+ ## Tasks
+
+ | Task | Issues | Difficulty | Domain | Description |
+ |------|--------|------------|--------|-------------|
+ | `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
+ | `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
+ | `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
+ | `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |
+
+ **Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set-membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
+
+ ### Alignment Task: LLM Training Data Quality (Expert)
+
+ Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)** — 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on a 0-4 scale).
+
+ This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
+
+ | Issue | Difficulty | Why It's Hard |
+ |---|---|---|
+ | Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym — sounds plausible, requires domain knowledge |
+ | Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
+ | Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion — trains confused models |
+ | Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics — most dangerous for training |
+ | Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
+ | Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip the prompt template |
+ | Semantic near-duplicate prompt (rephrased, not an exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
+ | Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding the rating criteria |
+ | Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
+ | Response in French for an English prompt | 2.0 | Language contamination from multilingual training data |
+ | Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
+ | Whitespace-only prompt | 2.0 | Empty training example from a pipeline artifact |
+
+ These issues are designed to challenge frontier models — they require factual recall, semantic reasoning, cross-row comparison, and understanding of what makes training data harmful.
+
+ ## Two-Phase Action Space
+
+ ### Phase 1: Identify Issues
+
+ Submit issues in the format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
+
+ - `row_number`: 1-indexed data row position (after the header)
+ - `column_name`: exact column header name, lowercase
+ - `issue_type`: one of the supported types below
+
+ ### Phase 2: Propose Fixes
+
+ Submit fixes in the format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
+
+ The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
+
+ Both phases can be submitted in the same step or across multiple steps, as the sketch below shows.
+
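+ For example, an action combining both phases, built with the `DataQAAction` model this package exports:
+
+ ```python
+ from dataqa_env import DataQAAction
+
+ action = DataQAAction(
+     task_id="easy",
+     issues=[
+         "row:4,col:name,issue:missing_value",
+         "row:7,col:salary,issue:wrong_type",
+     ],
+     fixes=[
+         "row:4,col:name,fix:David Kim",
+         "row:7,col:salary,fix:75000",
+     ],
+ )
+ ```
+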
+ **Supported Issue Types:**
+
+ | Type | Description | Example |
+ |------|-------------|---------|
+ | `missing_value` | Null, empty, or whitespace-only | Empty name field |
+ | `wrong_type` | Value doesn't match the expected type | Salary as "seventy-five thousand" |
+ | `duplicate_row` | Exact duplicate or duplicate key | Two rows with the same employee_id |
+ | `out_of_range` | Value outside the valid range | Salary of 5000 when the minimum is 50000 |
+ | `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
+ | `inconsistent_value` | Computed-field mismatch, logical inconsistency | total != qty * price |
+ | `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
+ | `referential_integrity` | Foreign key violation | (available for custom tasks) |
+
+ ## Observation Space
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `dataset_csv` | str | The corrupted dataset in CSV format |
+ | `schema_description` | str | Column types, ranges, and constraints |
+ | `validation_rules` | str | Business rules the data must satisfy |
+ | `task_description` | str | Task context and instructions |
+ | `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
+ | `num_issues_hint` | int | Exact count of planted issues |
+ | `max_steps` | int | Maximum attempts allowed |
+ | `done` | bool | Whether the episode has terminated |
+ | `reward` | float | Best combined reward so far (0.0-1.0) |
+
+ **Observation Metadata** (per step):
+ - Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
+ - Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
+ - Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
+
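+ A short sketch of reading these fields from a step result (assuming the openenv `Observation` base class exposes the per-step `metadata` dict):
+
+ ```python
+ result = env.step(action)   # env: a connected DataQAEnv client
+ obs = result.observation
+ print(obs.feedback)         # human-readable per-step diagnostics
+ meta = obs.metadata or {}
+ print(meta.get("identify_f1"), meta.get("fix_score"), meta.get("combined_reward"))
+ ```
+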
+ ## Reward Function
+
+ ### Combined Reward
+
+ ```
+ combined_reward = 0.6 * identify_score + 0.4 * fix_score
+ ```
+
+ If no fixes are submitted, `combined_reward = identify_score` (no penalty — backward compatible).
+
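+ For example, perfect identification with a mediocre fix pass (`identify_score = 1.00`, `fix_score = 0.67`) yields `0.6 × 1.00 + 0.4 × 0.67 ≈ 0.87` — exactly the easy-task replay shown above.
+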
+ ### Identify Score (Difficulty-Weighted F1)
+
+ Each planted issue has a **difficulty weight** (1.0-3.0):
+
+ | Weight | Category | Examples |
+ |--------|----------|----------|
+ | 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
+ | 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
+ | 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
+
+ - **Weighted recall** = (sum of difficulty weights of found issues) / (sum of all difficulty weights)
+ - **Weighted precision** = found difficulty / (found difficulty + false positives × average difficulty), so each false positive costs the average issue difficulty
+ - **Weighted F1** = harmonic mean of weighted precision and weighted recall (see the sketch below)
+
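+ A condensed sketch of the computation (mirroring `compute_weighted_reward` in `dataqa_env/server/environment.py`):
+
+ ```python
+ def weighted_f1(found: list[float], planted: list[float], n_fp: int) -> float:
+     """found: difficulty weights of correctly reported issues; planted: all weights."""
+     total = sum(planted)
+     recall = sum(found) / total if total else 0.0
+     # each false positive is penalized at the average planted difficulty
+     fp_weight = n_fp * (total / len(planted)) if planted else 0.0
+     denom = sum(found) + fp_weight
+     precision = sum(found) / denom if denom else 0.0
+     return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
+
+ # e.g. 5 of 6 easy issues found (weight 1.0 each) with one false positive:
+ # recall = 5/6, precision = 5/6, weighted F1 ≈ 0.83
+ ```
+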
+ ### Fix Score (Difficulty-Weighted Quality)
+
+ Each proposed fix is compared against the original clean value:
+
+ | Fix Quality | Score | Description |
+ |-------------|-------|-------------|
+ | Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
+ | Numeric close | 0.8 | Within 1% of the correct numeric value |
+ | Correct cell | 0.1 | Right location, wrong value |
+ | Non-issue cell | 0.0 | Fix targets a cell with no issue |
+
+ Fix score = Σ(best fix score per issue × its difficulty weight) / Σ(difficulty weights)
+
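+ For example, with three issues of weights 1.0, 2.0, and 3.0 (total 6.0), an exact fix (1.0) on the weight-3.0 issue plus a numeric-close fix (0.8) on the weight-1.0 issue scores `(3.0 × 1.0 + 1.0 × 0.8) / 6.0 ≈ 0.63`.
+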
+ ### Reward Properties
+
+ - **Per-step partial progress**: reward increases as more issues are found/fixed
+ - **Difficulty-aware**: finding subtle issues earns more than obvious ones
+ - **Penalizes bad behavior**: false positives reduce the score; fixing non-issues earns nothing
+ - **Monotonically non-decreasing**: the best score across all steps is the final reward
+ - **Always in [0.0, 1.0]**: meets the hackathon requirement
+
+ ### Episode Boundaries
+
+ - Each task allows up to 3 steps (attempts)
+ - The episode ends when F1 >= 0.999 (perfect identification) or max steps are reached
+ - The agent receives detailed feedback after each step to improve on the next attempt
+
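+ A minimal episode loop over these boundaries (a sketch — the `DataQAEnv` constructor and the reset/step return shapes come from openenv-core, so the names here are assumptions, and `propose()` is a hypothetical stand-in for your agent):
+
+ ```python
+ from dataqa_env import DataQAAction, DataQAEnv
+
+ env = DataQAEnv(base_url="http://localhost:8000")   # constructor args assumed
+ result = env.reset()
+ for _ in range(3):                                  # up to 3 attempts per episode
+     issues, fixes = propose(result.observation)     # your agent goes here
+     result = env.step(DataQAAction(issues=issues, fixes=fixes, task_id="easy"))
+     if result.done:
+         break
+ print("final best reward:", result.reward)
+ ```
+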
+ ## Baseline Scores
+
+ The baseline agent uses Qwen2.5-72B-Instruct via the HuggingFace Router:
+
+ | Task | Identify Score | Fix Score | Combined | Notes |
+ |------|----------------|-----------|----------|-------|
+ | `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
+ | `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
+ | `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |
+
+ Scores vary by model. The hard task is designed to challenge frontier models.
+
+ ## Extensibility
+
+ ### Custom Contamination Rules
+
+ ```python
+ from dataqa_env import register_contamination_rule
+ from dataqa_env.server.tasks import PlantedIssue
+
+ def swap_digits(rows, header, col_idx, row_idx, rng):
+     val = rows[row_idx][col_idx]
+     corrupted = val[::-1]  # reverse the cell's characters as a simple corruption
+     issue = PlantedIssue(
+         row=row_idx + 1, col=header[col_idx],
+         issue_type="format_violation",
+         description=f"Digits swapped in {header[col_idx]}",
+         difficulty=2.0,
+     )
+     return corrupted, issue
+
+ register_contamination_rule("swap_digits", swap_digits)
+ ```
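+
+ Once registered, the rule can be referenced by name from `create_task_from_config` contamination entries (next section), e.g. `{"rule": "swap_digits", "row": 1, "col": 0, "difficulty": 2.0}`.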
+
+ ### Custom Tasks from Config
+
+ ```python
+ from dataqa_env import create_task_from_config, register_task
+
+ task = create_task_from_config(
+     task_id="custom",
+     name="Custom Validation",
+     description="Find quality issues in this dataset.",
+     schema_description="id: int, name: str, score: int (0-100)",
+     validation_rules="No missing values. Scores must be 0-100.",
+     clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
+     contaminations=[
+         {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+         {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
+     ],
+ )
+ register_task("custom", lambda seed: task)
+ ```
+
+ ### Built-in Contamination Rules
+
+ | Rule | Effect | Default Difficulty |
+ |------|--------|--------------------|
+ | `missing_value` | Sets the field to an empty string | 1.0 |
+ | `whitespace_value` | Sets the field to a single space | 2.5 |
+ | `wrong_type_text` | Replaces the value with random text | 1.0 |
+ | `negative_value` | Negates the numeric value | 1.0 |
+
+ ## Setup & Quick Start
+
+ ```bash
+ # Install
+ pip install -e .
+
+ # Run server locally
+ uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
+
+ # Run inference (set your API credentials)
+ API_BASE_URL=https://router.huggingface.co/v1 \
+ MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
+ HF_TOKEN=your-token \
+ python inference.py
+ ```
+
+ ## Docker
+
+ ```bash
+ docker build -t dataqa-env .
+ docker run -p 8000:8000 dataqa-env
+ ```
+
+ ## Testing
+
+ ```bash
+ pip install -e ".[dev]"
+ pytest tests/ -v
+ ```
+
+ 118 tests covering:
+ - Task creation, corruption, and difficulty weights
+ - Issue key and fix parsing (standard, lenient, edge cases)
+ - F1, weighted reward, and fix quality computation
+ - Full environment lifecycle (identify-only and identify+fix)
+ - Combined reward calculation and weight verification
+ - Inference script parsing and prompt building
+ - Structured log format ([START], [STEP], [END])
+ - Score bounds (0.0-1.0), best-score monotonicity
+ - Extensibility API (custom rules, custom tasks)
+
+ ## Validation
+
+ ```bash
+ # OpenEnv spec validation
+ openenv validate .
+
+ # Pre-submission validation (requires HF Space URL)
+ ./prevalidation_script.sh https://your-space.hf.space
+ ```
+
+ ## Environment Variables
+
+ | Variable | Description | Default |
+ |----------|-------------|---------|
+ | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+ | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
+ | `HF_TOKEN` | HuggingFace token / API key | - |
+ | `ENV_URL` | Environment server URL | `http://localhost:8000` |
+
+ ## Architecture
+
+ ```
+ dataqa_env/
+ ├── __init__.py           # Public API + extensibility exports
+ ├── models.py             # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
+ ├── client.py             # DataQAEnv client wrapper for the environment server
+ ├── server/
+ │   ├── environment.py    # Two-phase DataQAEnvironment (identify + fix + combined reward)
+ │   ├── tasks.py          # Task definitions + contamination rules + extensibility API
+ │   ├── app.py            # FastAPI server (via openenv-core create_app)
+ │   └── Dockerfile
+ tests/
+ ├── test_tasks.py         # Task creation, corruption, difficulty weights
+ ├── test_environment.py   # Identify scoring, fix grading, combined reward, lifecycle
+ ├── test_inference.py     # LLM response parsing, fix parsing, prompt building, log format
+ └── test_extensibility.py # Custom rules, custom tasks, registration API
+ inference.py              # Two-phase baseline agent (identify → fix)
+ openenv.yaml              # OpenEnv/HF Spaces spec
+ pyproject.toml            # Package metadata and dependencies
+ Dockerfile                # Production container
+ ```
__init__.py ADDED
@@ -0,0 +1,4 @@
+ """Root-level package for OpenEnv compatibility."""
+ from dataqa_env import DataQAEnv, DataQAAction, DataQAObservation, DataQAState
+
+ __all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
client.py ADDED
@@ -0,0 +1,5 @@
+ """Root-level client for OpenEnv compatibility."""
+ from dataqa_env.client import DataQAEnv
+ from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
+
+ __all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
dataqa_env/__init__.py ADDED
@@ -0,0 +1,19 @@
+ from .client import DataQAEnv
+ from .models import DataQAAction, DataQAObservation, DataQAState
+ from .server.tasks import (
+     create_task_from_config,
+     register_task,
+     register_contamination_rule,
+     CONTAMINATION_RULES,
+ )
+
+ __all__ = [
+     "DataQAEnv",
+     "DataQAAction",
+     "DataQAObservation",
+     "DataQAState",
+     "create_task_from_config",
+     "register_task",
+     "register_contamination_rule",
+     "CONTAMINATION_RULES",
+ ]
dataqa_env/client.py ADDED
@@ -0,0 +1,37 @@
+ """
+ DataQAEnv Client
+ ----------------
+ Client-side wrapper for the DataQA environment server.
+ """
+
+ from __future__ import annotations
+
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_client import EnvClient
+
+ from .models import DataQAAction, DataQAObservation, DataQAState
+
+
+ class DataQAEnv(EnvClient[DataQAAction, DataQAObservation, DataQAState]):
+
+     def _step_payload(self, action: DataQAAction) -> dict:
+         # Include fixes so phase-2 proposals actually reach the server
+         return {"issues": action.issues, "fixes": action.fixes, "task_id": action.task_id}
+
+     def _parse_result(self, payload: dict) -> StepResult[DataQAObservation]:
+         obs = DataQAObservation(**payload["observation"])
+         return StepResult(
+             observation=obs,
+             reward=payload.get("reward"),
+             done=bool(payload.get("done", False)),
+         )
+
+     def _parse_state(self, payload: dict) -> DataQAState:
+         return DataQAState(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             task_id=payload.get("task_id", ""),
+             current_step=payload.get("current_step", 0),
+             max_steps=payload.get("max_steps", 3),
+             best_score=payload.get("best_score", 0.0),
+             total_planted_issues=payload.get("total_planted_issues", 0),
+         )
dataqa_env/models.py ADDED
@@ -0,0 +1,77 @@
+ """
+ DataQA Environment Models
+ -------------------------
+ Action/Observation/State types for the Data Quality Assurance environment.
+
+ The agent receives a dataset with planted quality issues and must identify them.
+ Grading is based on the F1 score (harmonic mean of precision and recall) of
+ correctly identified issues.
+ """
+
+ from __future__ import annotations
+
+ from typing import List, Optional
+
+ from openenv.core.env_server.interfaces import Action, Observation, State
+
+
+ class DataQAAction(Action):
+     """
+     Agent submits identified issues AND optional proposed fixes.
+
+     Two-phase action space:
+         Phase 1 (Identify): List issues in format "row:<N>,col:<name>,issue:<type>"
+         Phase 2 (Fix): List fixes in format "row:<N>,col:<name>,fix:<proposed_value>"
+
+     The agent can submit both in the same step or across multiple steps.
+     Combined reward = 0.6 * identify_score + 0.4 * fix_score
+
+     Supported issue types:
+         missing_value, wrong_type, duplicate_row, out_of_range,
+         format_violation, inconsistent_value, statistical_outlier,
+         referential_integrity
+     """
+
+     issues: List[str]
+     fixes: List[str] = []
+     # Include task_id so step() can reconstruct context in stateless HTTP mode
+     task_id: str = "easy"
+
+
+ class DataQAObservation(Observation):
+     """
+     What the agent sees: a dataset, its schema/rules, and feedback.
+     """
+
+     # The dataset as CSV text
+     dataset_csv: str = ""
+
+     # Schema description (column names, expected types, constraints)
+     schema_description: str = ""
+
+     # Validation rules in plain text
+     validation_rules: str = ""
+
+     # Task description
+     task_description: str = ""
+
+     # Feedback from the previous step (empty on reset)
+     feedback: str = ""
+
+     # Current task ID
+     task_id: str = ""
+
+     # Number of planted issues (hint for the agent)
+     num_issues_hint: int = 0
+
+     # Max allowed steps for this task
+     max_steps: int = 3
+
+
+ class DataQAState(State):
+     """Tracks episode progress."""
+
+     task_id: str = ""
+     current_step: int = 0
+     max_steps: int = 3
+     best_score: float = 0.0
+     total_planted_issues: int = 0
dataqa_env/server/Dockerfile ADDED
@@ -0,0 +1,33 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system deps
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     git curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install uv for fast dependency management
+ RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+     mv /root/.local/bin/uv /usr/local/bin/uv && \
+     mv /root/.local/bin/uvx /usr/local/bin/uvx
+
+ # Copy project files
+ COPY . /app/env
+
+ WORKDIR /app/env
+
+ # Install dependencies
+ RUN uv sync --frozen --no-editable 2>/dev/null || uv sync --no-editable
+
+ # Set environment
+ ENV PATH="/app/env/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+ EXPOSE 8000
+
+ CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
dataqa_env/server/__init__.py ADDED
File without changes
dataqa_env/server/app.py ADDED
@@ -0,0 +1,39 @@
+ """
+ FastAPI application for the DataQA Environment.
+
+ Usage:
+     uvicorn dataqa_env.server.app:app --reload --host 0.0.0.0 --port 8000
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+     from .environment import DataQAEnvironment
+     from ..models import DataQAAction, DataQAObservation
+ except ImportError:
+     from openenv.core.env_server.http_server import create_app
+     from dataqa_env.server.environment import DataQAEnvironment
+     from dataqa_env.models import DataQAAction, DataQAObservation
+
+ app = create_app(
+     DataQAEnvironment, DataQAAction, DataQAObservation, env_name="dataqa_env"
+ )
+
+
+ @app.get("/")
+ def root():
+     """Root endpoint — environment info."""
+     return {
+         "name": "DataQA Environment",
+         "description": "Two-phase data quality assurance environment: identify issues + propose fixes",
+         "tasks": ["easy", "medium", "hard", "alignment", "coding", "toolcalling"],
+         "endpoints": ["/health", "/reset", "/step", "/state"],
+     }
+
+
+ def main():
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
+
+
+ if __name__ == "__main__":
+     main()
dataqa_env/server/environment.py ADDED
@@ -0,0 +1,623 @@
+ """
+ DataQA Environment
+ ------------------
+ Server-side environment for data quality assurance tasks.
+
+ Two-phase RL environment:
+     Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+     Phase 2 (Fix): Agent proposes corrections for identified issues.
+
+ Combined reward = 0.6 * identify_score + 0.4 * fix_score
+ Both phases are scored with difficulty-weighted metrics for rich per-step signal.
+ """
+
+ from __future__ import annotations
+
+ import re
+ import uuid
+ from typing import Any, Optional, Set
+
+ from openenv.core.env_server.interfaces import Action, Environment, Observation
+
+ from ..models import DataQAAction, DataQAObservation, DataQAState
+ from .tasks import PlantedIssue, Task, get_task, list_tasks
+
+ # Reward weights for the two phases
+ IDENTIFY_WEIGHT = 0.6
+ FIX_WEIGHT = 0.4
+
+
+ def parse_issue_key(raw: str) -> Optional[str]:
+     """
+     Parse an agent-reported issue string into a normalized key.
+     Expected format: row:<N>,col:<name>,issue:<type>
+     Returns the normalized key or None if unparseable.
+     """
+     raw = raw.strip().lower()
+     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
+     col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
+     issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
+
+     if row_match and col_match and issue_match:
+         return f"row:{row_match.group(1)},col:{col_match.group(1)},issue:{issue_match.group(1)}"
+     return None
+
+
+ def parse_fix(raw: str) -> Optional[tuple[int, str, str]]:
+     """
+     Parse an agent-proposed fix into (row, col, proposed_value).
+     Expected format: row:<N>,col:<name>,fix:<value>
+     Returns (row, col, value) or None if unparseable.
+     """
+     raw = raw.strip()
+     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw, re.IGNORECASE)
+     col_match = re.search(r"col(?:umn)?\s*[:=]\s*([\w_]+)", raw, re.IGNORECASE)
+     fix_match = re.search(r"fix\s*[:=]\s*(.+?)$", raw, re.IGNORECASE)
+
+     if row_match and col_match and fix_match:
+         return (int(row_match.group(1)), col_match.group(1).lower(), fix_match.group(1).strip())
+     return None
+
+
+ def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
+     """Compute precision, recall, and F1 score."""
+     if not reported_keys and not planted_keys:
+         return {"precision": 1.0, "recall": 1.0, "f1": 1.0, "tp": 0, "fp": 0, "fn": 0}
+
+     if not reported_keys:
+         return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "tp": 0, "fp": 0, "fn": len(planted_keys)}
+
+     if not planted_keys:
+         return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "tp": 0, "fp": len(reported_keys), "fn": 0}
+
+     tp = len(reported_keys & planted_keys)
+     fp = len(reported_keys - planted_keys)
+     fn = len(planted_keys - reported_keys)
+
+     precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+     recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+     f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
+
+     return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
+
+
+ def compute_weighted_reward(
+     reported_keys: Set[str],
+     planted_issues: list,
+ ) -> dict:
+     """
+     Compute difficulty-weighted reward for richer per-step signal.
+
+     Each planted issue has a difficulty weight (1.0-3.0). Finding harder issues
+     earns more reward. False positives incur a penalty scaled by average difficulty.
+
+     Returns dict with weighted_reward (0.0-1.0), plus per-issue breakdown.
+     """
+     if not planted_issues and not reported_keys:
+         return {"weighted_reward": 1.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+
+     planted_by_key = {issue.to_key(): issue for issue in planted_issues}
+     planted_keys = set(planted_by_key.keys())
+
+     if not reported_keys:
+         total_weight = sum(i.difficulty for i in planted_issues)
+         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": total_weight}
+
+     if not planted_keys:
+         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+
+     found_keys = reported_keys & planted_keys
+     missed_keys = planted_keys - reported_keys
+     false_positive_count = len(reported_keys - planted_keys)
+
+     difficulty_found = sum(planted_by_key[k].difficulty for k in found_keys)
+     difficulty_missed = sum(planted_by_key[k].difficulty for k in missed_keys)
+     total_weight = sum(i.difficulty for i in planted_issues)
+
+     weighted_recall = difficulty_found / total_weight if total_weight > 0 else 0.0
+
+     avg_difficulty = total_weight / len(planted_issues)
+     fp_penalty_weight = false_positive_count * avg_difficulty
+     weighted_precision = difficulty_found / (difficulty_found + fp_penalty_weight) if (difficulty_found + fp_penalty_weight) > 0 else 0.0
+
+     if (weighted_precision + weighted_recall) > 0:
+         weighted_reward = 2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
+     else:
+         weighted_reward = 0.0
+
+     return {
+         "weighted_reward": round(weighted_reward, 4),
+         "difficulty_found": round(difficulty_found, 2),
+         "difficulty_missed": round(difficulty_missed, 2),
+     }
+
+
+ def grade_fixes(
+     fixes: list[tuple[int, str, str]],
+     task: Task,
+ ) -> dict:
+     """
+     Grade proposed fixes against the clean dataset.
+
+     For each fix (row, col, proposed_value), compare to the original clean value.
+     Scoring per fix:
+       - Exact match (case-insensitive, whitespace-stripped): 1.0
+       - Numeric close match (within 1%): 0.8
+       - Correct column but wrong value: 0.1
+       - Targets a non-issue cell: 0.0 (penalty)
+
+     Returns dict with fix_score (0.0-1.0), details per fix, and counts.
+     """
+     if not fixes and not task.planted_issues:
+         return {"fix_score": 1.0, "fixes_correct": 0, "fixes_partial": 0,
+                 "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+
+     if not fixes:
+         return {"fix_score": 0.0, "fixes_correct": 0, "fixes_partial": 0,
+                 "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+
+     issue_map = task.get_planted_issue_map()
+     # Build set of (row, col) that are actual issues
+     issue_cells = {(issue.row, issue.col) for issue in task.planted_issues}
+
+     total_weight = sum(i.difficulty for i in task.planted_issues) if task.planted_issues else 1.0
+     earned_weight = 0.0
+     fixes_correct = 0
+     fixes_partial = 0
+     fixes_wrong = 0
+     fix_details = []
+
+     # Track which issues have been fixed (best fix wins)
+     fixed_issues: dict[tuple[int, str], float] = {}
+
+     for row, col, proposed in fixes:
+         clean_value = task.get_clean_value(row, col)
+         cell_key = (row, col)
+
+         if cell_key not in issue_cells:
+             # Fix targets a non-issue cell — no credit
+             fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "not an issue cell"})
+             fixes_wrong += 1
+             continue
+
+         if clean_value is None:
+             fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "cell not found"})
+             fixes_wrong += 1
+             continue
+
+         # Find the planted issue for this cell to get its difficulty weight
+         matching_issue = None
+         for issue in task.planted_issues:
+             if issue.row == row and issue.col == col:
+                 matching_issue = issue
+                 break
+
+         difficulty = matching_issue.difficulty if matching_issue else 1.0
+
+         # Score the fix using tiered grading:
+         #   1.0 = exact match with clean value
+         #   0.8 = valid fix (right type, in range, addresses the issue) but not exact
+         #   0.4 = partially valid (reasonable attempt, right direction)
+         #   0.1 = targets correct cell but fix doesn't address the issue
+         #   0.0 = makes things worse or targets non-issue cell
+         score = 0.0
+         reason = "wrong value"
+         issue_type = matching_issue.issue_type if matching_issue else ""
+
+         # Exact match (case-insensitive, whitespace-stripped)
+         if proposed.strip().lower() == clean_value.lower():
+             score = 1.0
+             reason = "exact match"
+             fixes_correct += 1
+         else:
+             # Grade by issue type — check if the fix is VALID even if not exact
+             proposed_stripped = proposed.strip()
+
+             if issue_type == "missing_value":
+                 # Any non-empty value is a reasonable fix for a missing value
+                 if proposed_stripped and proposed_stripped != " ":
+                     score = 0.8
+                     reason = "valid fix (non-empty value for missing field)"
+                     fixes_partial += 1
+                 else:
+                     score = 0.0
+                     reason = "fix is still empty"
+                     fixes_wrong += 1
+
+             elif issue_type == "wrong_type":
+                 # Check if the proposed value is the correct type
+                 try:
+                     float(proposed_stripped)
+                     # Original was text, proposed is numeric — correct type fix
+                     score = 0.8
+                     reason = "valid fix (correct type)"
+                     fixes_partial += 1
+                 except ValueError:
+                     score = 0.1
+                     reason = "fix is still wrong type"
+                     fixes_partial += 1
+
+             elif issue_type == "out_of_range":
+                 # Check if proposed value is within a reasonable range
+                 try:
+                     proposed_num = float(proposed_stripped)
+                     clean_num = float(clean_value)
+                     # Within 50% of clean value = good estimate
+                     if clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.5:
+                         score = 0.8
+                         reason = "valid fix (in reasonable range)"
+                         fixes_partial += 1
+                     elif proposed_num > 0 and (clean_num > 0) == (proposed_num > 0):
+                         # At least right sign/direction
+                         score = 0.4
+                         reason = "partially valid (right direction)"
+                         fixes_partial += 1
+                     else:
+                         score = 0.1
+                         reason = "fix still out of reasonable range"
+                         fixes_partial += 1
+                 except ValueError:
+                     score = 0.1
+                     reason = "correct cell, wrong value"
+                     fixes_partial += 1
+
+             elif issue_type == "format_violation":
+                 # Check if proposed value matches the expected format
+                 # For dates: YYYY-MM-DD pattern
+                 if re.match(r"\d{4}-\d{2}-\d{2}", proposed_stripped):
+                     score = 0.8
+                     reason = "valid fix (correct format)"
+                     fixes_partial += 1
+                 elif proposed_stripped and proposed_stripped != clean_value:
+                     score = 0.4
+                     reason = "fix attempted but format unclear"
+                     fixes_partial += 1
+                 else:
+                     score = 0.1
+                     reason = "correct cell, wrong value"
+                     fixes_partial += 1
+
+             elif issue_type in ("inconsistent_value", "statistical_outlier"):
+                 # These require domain knowledge — any reasonable attempt gets partial credit
+                 try:
+                     proposed_num = float(proposed_stripped)
+                     clean_num = float(clean_value)
+                     # Within 20% = strong fix, within 50% = reasonable
+                     if clean_num != 0:
+                         pct_diff = abs(proposed_num - clean_num) / abs(clean_num)
+                         if pct_diff <= 0.01:
+                             score = 1.0
+                             reason = "exact numeric match"
+                             fixes_correct += 1
+                         elif pct_diff <= 0.2:
+                             score = 0.8
+                             reason = "valid fix (within 20% of correct value)"
+                             fixes_partial += 1
+                         elif pct_diff <= 0.5:
+                             score = 0.4
+                             reason = "partially valid (right ballpark)"
+                             fixes_partial += 1
+                         else:
+                             score = 0.1
+                             reason = "correct cell, value not close"
+                             fixes_partial += 1
+                     else:
+                         score = 0.4
+                         reason = "numeric fix attempted"
+                         fixes_partial += 1
+                 except ValueError:
+                     # Non-numeric fix for text fields — check similarity
+                     if len(proposed_stripped) > 10 and proposed_stripped != clean_value:
+                         score = 0.4
+                         reason = "text fix attempted (cannot verify automatically)"
+                         fixes_partial += 1
+                     else:
+                         score = 0.1
+                         reason = "correct cell, wrong value"
+                         fixes_partial += 1
+
+             else:
+                 # Fallback: numeric close match or partial credit
+                 try:
+                     proposed_num = float(proposed_stripped)
+                     clean_num = float(clean_value)
+                     if clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.01:
+                         score = 0.8
+                         reason = "numeric close match"
+                         fixes_partial += 1
+                     else:
+                         score = 0.1
+                         reason = "correct cell, wrong value"
+                         fixes_partial += 1
+                 except (ValueError, ZeroDivisionError):
+                     score = 0.1
+                     reason = "correct cell, wrong value"
+                     fixes_partial += 1
+
+         # Keep best fix per cell
+         if cell_key not in fixed_issues or score > fixed_issues[cell_key]:
+             fixed_issues[cell_key] = score
+
+         fix_details.append({"row": row, "col": col, "score": score, "reason": reason})
+
+     # Compute fix score: weighted sum of best fix per issue / total weight
+     for issue in task.planted_issues:
+         cell_key = (issue.row, issue.col)
+         if cell_key in fixed_issues:
+             earned_weight += issue.difficulty * fixed_issues[cell_key]
+
+     fix_score = earned_weight / total_weight if total_weight > 0 else 0.0
+     fix_score = min(max(fix_score, 0.0), 1.0)
+
+     return {
+         "fix_score": round(fix_score, 4),
+         "fixes_correct": fixes_correct,
+         "fixes_partial": fixes_partial,
+         "fixes_wrong": fixes_wrong,
+         "fixes_attempted": len(fixes),
+         "fix_details": fix_details,
+     }
+
+
+ class DataQAEnvironment(Environment):
+     """
+     Data Quality Assurance environment — two-phase identify + fix.
+
+     Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+     Phase 2 (Fix): Agent proposes corrections for identified issues.
+
+     Combined reward = 0.6 * identify_score + 0.4 * fix_score
+     Both phases use difficulty-weighted scoring for rich per-step reward signals.
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     def __init__(self):
+         self._state = DataQAState()
+         self._current_task: Optional[Task] = None
+         self._planted_keys: Set[str] = set()
+         self._best_score: float = 0.0
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs: Any,
+     ) -> Observation:
+         task_id = kwargs.get("task_id", "easy")
+         task_seed = seed if seed is not None else 42
+
+         self._current_task = get_task(task_id, seed=task_seed)
+         self._planted_keys = {issue.to_key() for issue in self._current_task.planted_issues}
+         self._best_score = 0.0
+
+         ep_id = episode_id or str(uuid.uuid4())
+         self._state = DataQAState(
+             episode_id=ep_id,
+             step_count=0,
+             task_id=task_id,
+             current_step=0,
+             max_steps=self._current_task.max_steps,
+             best_score=0.0,
+             total_planted_issues=len(self._current_task.planted_issues),
+         )
+
+         return DataQAObservation(
+             dataset_csv=self._current_task.corrupted_csv,
+             schema_description=self._current_task.schema_description,
+             validation_rules=self._current_task.validation_rules,
+             task_description=self._current_task.description,
+             feedback=(
+                 "Environment reset. Inspect the dataset and report all quality issues.\n"
+                 "You can also propose fixes in format: row:<N>,col:<name>,fix:<corrected_value>\n"
+                 "Combined reward = 0.6 * identify_score + 0.4 * fix_score"
+             ),
+             task_id=task_id,
+             num_issues_hint=len(self._current_task.planted_issues),
+             max_steps=self._current_task.max_steps,
+             done=False,
+             reward=0.0,
+         )
+
+     def step(
+         self,
+         action: Action,
+         timeout_s: Optional[float] = None,
+         **kwargs: Any,
+     ) -> Observation:
+         if not isinstance(action, DataQAAction):
+             raise ValueError(f"Expected DataQAAction, got {type(action)}")
+
+         # Auto-reset in stateless HTTP mode
+         if self._current_task is None:
+             self.reset(task_id=action.task_id)
+
+         self._state.step_count += 1
+         self._state.current_step += 1
+
+         # ── Phase 1: Parse and score issue identification ──
+         reported_keys: Set[str] = set()
+         parse_errors: list[str] = []
+         for raw_issue in action.issues:
+             key = parse_issue_key(raw_issue)
+             if key:
+                 reported_keys.add(key)
+             else:
+                 parse_errors.append(f"Could not parse issue: '{raw_issue}'")
+
+         metrics = compute_f1(reported_keys, self._planted_keys)
+         identify_f1 = metrics["f1"]
+
+         weighted = compute_weighted_reward(reported_keys, self._current_task.planted_issues)
+         identify_score = weighted["weighted_reward"]
+
+         # ── Phase 2: Parse and score proposed fixes ──
+         parsed_fixes: list[tuple[int, str, str]] = []
+         for raw_fix in action.fixes:
+             fix = parse_fix(raw_fix)
+             if fix:
+                 parsed_fixes.append(fix)
+             else:
+                 parse_errors.append(f"Could not parse fix: '{raw_fix}'")
+
+         fix_result = grade_fixes(parsed_fixes, self._current_task)
+         fix_score = fix_result["fix_score"]
+
+         # ── Combined reward ──
+         # If no fixes submitted, score is identify-only (no penalty for not fixing)
+         if action.fixes:
+             combined_reward = IDENTIFY_WEIGHT * identify_score + FIX_WEIGHT * fix_score
+         else:
+             combined_reward = identify_score  # backward compatible
+
+         self._best_score = max(self._best_score, combined_reward)
+         self._state.best_score = self._best_score
+
+         # ── Check if done ──
+         is_done = (
+             identify_f1 >= 0.999  # Perfect identification
+             or self._state.current_step >= self._state.max_steps
+         )
+
+         # ── Build feedback with actionable diagnostics ──
+         # Show the agent exactly which reported issues were correct (TP) and which were wrong (FP)
+         tp_keys = reported_keys & self._planted_keys
+         fp_keys = reported_keys - self._planted_keys
+
+         feedback_lines = [
+             f"Step {self._state.current_step}/{self._state.max_steps}",
+             "",
+             "--- Identification ---",
+             f"Issues reported: {len(reported_keys)}",
+             f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
+             f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {identify_f1:.3f}",
+             f"Identify score (weighted): {identify_score:.3f}",
+         ]
+
+         # Show which reported issues were correct vs wrong (helps agent self-correct)
+         if tp_keys:
+             feedback_lines.append(f"Correct issues: {', '.join(sorted(tp_keys))}")
+         if fp_keys:
+             feedback_lines.append(f"Incorrect issues (false positives): {', '.join(sorted(fp_keys))}")
+
+         if action.fixes:
+             feedback_lines += [
+                 "",
+                 "--- Fix Proposals ---",
+                 f"Fixes attempted: {fix_result['fixes_attempted']}",
+                 f"Correct: {fix_result['fixes_correct']}, Partial: {fix_result['fixes_partial']}, Wrong: {fix_result['fixes_wrong']}",
+                 f"Fix score: {fix_score:.3f}",
+             ]
+             # Show per-fix feedback so agent knows which fixes worked
+             for detail in fix_result["fix_details"]:
+                 status = "correct" if detail["score"] >= 0.99 else ("partial" if detail["score"] > 0 else "wrong")
+                 feedback_lines.append(
+                     f"  row:{detail['row']},col:{detail['col']} -> {status} ({detail['reason']})"
+                 )
+             feedback_lines.append(
+                 f"\n--- Combined Reward: {combined_reward:.3f} (identify={identify_score:.3f} x {IDENTIFY_WEIGHT} + fix={fix_score:.3f} x {FIX_WEIGHT}) ---"
+             )
+         else:
+             feedback_lines += [
+                 "",
+                 "Tip: Submit fixes with format row:<N>,col:<name>,fix:<value> for bonus reward.",
+             ]
+
+         if parse_errors:
+             feedback_lines.append(f"\nParse errors ({len(parse_errors)}): {'; '.join(parse_errors[:5])}")
+
+         if not is_done:
+             if metrics["fn"] > 0:
+                 feedback_lines.append(
+                     f"\nYou missed {metrics['fn']} issue(s). Review the dataset carefully."
+                 )
+             if metrics["fp"] > 0:
+                 feedback_lines.append(
+                     f"Remove the {metrics['fp']} false positive(s) listed above and look for real issues."
+                 )
+             feedback_lines.append("You can submit again with updated issues and/or fixes.")
+         else:
+             feedback_lines.append(f"\nTask complete! Final best reward: {self._best_score:.3f}")
+
+         # ── Flag items for human review ──
+         # In a production data QA pipeline, these would go to a human reviewer.
+         # The grader flags cases where automated scoring has low confidence.
+         human_review_flags: list[dict] = []
+
+         # 1. False positives that target real columns — could be legitimate issues
+         #    the task designer didn't plant (agent may be smarter than the grader)
+         issue_map = self._current_task.get_planted_issue_map()
+         valid_issue_types = {"missing_value", "wrong_type", "duplicate_row", "out_of_range",
+                              "format_violation", "inconsistent_value", "statistical_outlier",
+                              "referential_integrity"}
+         for fp_key in fp_keys:
+             parts = fp_key.split(",")
+             itype = parts[2].split(":")[1] if len(parts) >= 3 else ""
+             if itype in valid_issue_types:
+                 human_review_flags.append({
+                     "item": fp_key,
+                     "reason": "Agent reported this issue but it's not in ground truth — may be a real issue the grader missed",
+                     "type": "possible_unplanted_issue",
+                 })
+
+         # 2. Partial fix matches — fix was close but not exact, human should verify
+         for detail in fix_result["fix_details"]:
+             if 0 < detail["score"] < 0.99:
+                 human_review_flags.append({
+                     "item": f"row:{detail['row']},col:{detail['col']}",
+                     "reason": f"Fix scored {detail['score']:.2f} ({detail['reason']}) — human should verify if acceptable",
+                     "type": "partial_fix",
+                 })
+
+         # 3. High-difficulty issues that were missed — flag for training data review
+         planted_by_key = {i.to_key(): i for i in self._current_task.planted_issues}
+         fn_keys = self._planted_keys - reported_keys
+         for fn_key in fn_keys:
+             issue = planted_by_key.get(fn_key)
+             if issue and issue.difficulty >= 2.5:
+                 human_review_flags.append({
+                     "item": fn_key,
+                     "reason": f"High-difficulty issue (difficulty={issue.difficulty}) missed — {issue.description}",
+                     "type": "missed_hard_issue",
+                 })
+
+         if human_review_flags:
+             feedback_lines.append(f"\n--- Flagged for Human Review ({len(human_review_flags)}) ---")
+             for flag in human_review_flags:
+                 feedback_lines.append(f"  [{flag['type']}] {flag['item']}: {flag['reason']}")
+
+         return DataQAObservation(
+             dataset_csv=self._current_task.corrupted_csv,
+             schema_description=self._current_task.schema_description,
+             validation_rules=self._current_task.validation_rules,
+             task_description=self._current_task.description,
+             feedback="\n".join(feedback_lines),
+             task_id=self._current_task.task_id,
+             num_issues_hint=len(self._current_task.planted_issues),
+             max_steps=self._state.max_steps,
+             done=is_done,
+             reward=self._best_score,
+             metadata={
+                 "identify_f1": identify_f1,
+                 "identify_score": identify_score,
+                 "fix_score": fix_score,
+                 "combined_reward": combined_reward,
+                 "precision": metrics["precision"],
+                 "recall": metrics["recall"],
+                 "tp": metrics["tp"],
+                 "fp": metrics["fp"],
+                 "fn": metrics["fn"],
+                 "difficulty_found": weighted["difficulty_found"],
+                 "difficulty_missed": weighted["difficulty_missed"],
+                 "fixes_correct": fix_result["fixes_correct"],
+                 "fixes_partial": fix_result["fixes_partial"],
+                 "fixes_wrong": fix_result["fixes_wrong"],
+                 "fixes_attempted": fix_result["fixes_attempted"],
+                 "fix_details": fix_result["fix_details"],
+                 "human_review_flags": human_review_flags,
+             },
+         )
+
+     @property
+     def state(self) -> DataQAState:
+         return self._state
dataqa_env/server/gradio_ui.py ADDED
@@ -0,0 +1,568 @@
+ """
+ Gradio UI — Agent Trajectory Replay Viewer for DataQA.
+
+ Designed for judges: zero clicks needed, auto-plays on load.
+ Tab per task, step slider, prominent metric cards, color-coded dataset.
+ """
+
+ from __future__ import annotations
+
+ import csv
+ import io
+
+ import gradio as gr
+
+ from .environment import DataQAEnvironment, parse_issue_key
+ from .tasks import list_tasks, PlantedIssue
+ from ..models import DataQAAction
+
+
+ # ── Pre-built agent trajectories (simulates baseline agent) ──
+
+ AGENT_TRAJECTORIES = {
+     # Demo trajectories: fixes are ONLY proposed where the correct value
+     # is logically inferrable (computable, format conversion, or deducible from context).
+     # Ambiguous fixes (any valid salary, any past date) are NOT proposed.
+     "easy": [
+         {
+             "issues": [
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:9,col:salary,issue:out_of_range",
+                 "row:18,col:start_date,issue:out_of_range",
+                 "row:3,col:email,issue:format_violation",  # FP
+             ],
+             "fixes": [],
+         },
+         {
+             "issues": [
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:9,col:salary,issue:out_of_range",
+                 "row:21,col:employee_id,issue:duplicate_row",
+                 "row:15,col:email,issue:inconsistent_value",
+                 "row:18,col:start_date,issue:out_of_range",
+             ],
+             "fixes": [
+                 # Inferrable: name "David Kim" deduced from email david.kim@company.com
+                 "row:4,col:name,fix:David Kim",
+                 # Inferrable: "seventy-five thousand" is clearly 75000
+                 "row:7,col:salary,fix:75000",
+                 # Inferrable: email must match name pattern oscar.rivera@company.com
+                 "row:15,col:email,fix:oscar.rivera@company.com",
+                 # NOT proposed: row:9 salary (any valid salary 50000-150000 works)
+                 # NOT proposed: row:18 start_date (any past date works)
+                 # NOT proposed: row:21 duplicate (remove or reassign — ambiguous)
+             ],
+         },
+     ],
+     "medium": [
+         {
+             "issues": [
+                 "row:5,col:total,issue:inconsistent_value",
+                 "row:10,col:category,issue:format_violation",
+                 "row:14,col:product_name,issue:missing_value",
+                 "row:17,col:quantity,issue:out_of_range",
+                 "row:19,col:order_id,issue:duplicate_row",
+                 "row:12,col:order_date,issue:format_violation",
+                 "row:24,col:shipping_country,issue:format_violation",
+             ],
+             "fixes": [],
+         },
+         {
+             "issues": [
+                 "row:5,col:total,issue:inconsistent_value",
+                 "row:10,col:category,issue:format_violation",
+                 "row:14,col:product_name,issue:missing_value",
+                 "row:17,col:quantity,issue:out_of_range",
+                 "row:19,col:order_id,issue:duplicate_row",
+                 "row:12,col:order_date,issue:format_violation",
+                 "row:24,col:shipping_country,issue:format_violation",
+                 "row:29,col:order_date,issue:inconsistent_value",
+             ],
+             "fixes": [
+                 # Inferrable: total = qty(1) * price(42.00) = 42.00
+                 "row:5,col:total,fix:42.00",
+                 # Inferrable: "Fitness" is closest to "Sports" in allowed categories
+                 "row:10,col:category,fix:Sports",
+                 # Inferrable: 26/01/2024 reformatted to YYYY-MM-DD
+                 "row:12,col:order_date,fix:2024-01-26",
+                 # NOT proposed: row:14 product_name (any product name works)
+                 # NOT proposed: row:17 quantity (any positive int)
+                 # NOT proposed: row:19 duplicate order_id (reassign — ambiguous)
+                 # NOT proposed: row:24 country (could be any valid ISO code)
+                 # NOT proposed: row:29 future date (any past date works)
+             ],
+         },
+     ],
+     "hard": [
+         {
+             "issues": [
+                 "row:14,col:training_time_hours,issue:out_of_range",
+                 "row:13,col:learning_rate,issue:out_of_range",
+                 "row:15,col:model_name,issue:missing_value",
+                 "row:9,col:batch_size,issue:format_violation",
+                 "row:10,col:train_size,issue:inconsistent_value",
+             ],
+             "fixes": [],
+         },
+         {
+             "issues": [
+                 "row:14,col:training_time_hours,issue:out_of_range",
+                 "row:13,col:learning_rate,issue:out_of_range",
+                 "row:15,col:model_name,issue:missing_value",
+                 "row:9,col:batch_size,issue:format_violation",
+                 "row:10,col:train_size,issue:inconsistent_value",
+                 "row:5,col:val_loss,issue:inconsistent_value",
+                 "row:7,col:gpu_memory_gb,issue:statistical_outlier",
+                 "row:11,col:timestamp,issue:inconsistent_value",
+                 "row:9,col:training_time_hours,issue:statistical_outlier",
+                 "row:12,col:test_accuracy,issue:statistical_outlier",
+             ],
+             "fixes": [
+                 # Inferrable: batch_size 250 → nearest power of 2 = 256
+                 "row:9,col:batch_size,fix:256",
+                 # Inferrable: negative time -72.0 → absolute value 72.0
+                 "row:14,col:training_time_hours,fix:72.0",
+                 # NOT proposed: row:13 LR (any valid LR 1e-7 to 1.0)
+                 # NOT proposed: row:15 model_name (could be any model)
+                 # NOT proposed: row:5 val_loss (any val >= train_loss)
+                 # NOT proposed: row:7 GPU memory (any reasonable value)
+                 # NOT proposed: row:10 train_size (any value > test_size)
+                 # NOT proposed: row:11 timestamp (any date after prev)
+                 # NOT proposed: row:9 training_time (any reasonable hours)
+                 # NOT proposed: row:12 test_accuracy (any < SOTA)
+             ],
+         },
+     ],
+     "alignment": [
+         {
+             "issues": [
+                 "row:6,col:response,issue:inconsistent_value",
+                 "row:15,col:response,issue:inconsistent_value",
+                 "row:28,col:prompt,issue:missing_value",
+                 "row:20,col:response,issue:inconsistent_value",
+                 "row:7,col:prompt,issue:duplicate_row",
+                 "row:25,col:response,issue:missing_value",
+                 "row:3,col:response,issue:inconsistent_value",
+             ],
+             "fixes": [],
+         },
+         {
+             "issues": [
+                 "row:3,col:response,issue:inconsistent_value",
+                 "row:4,col:response,issue:inconsistent_value",
+                 "row:6,col:response,issue:inconsistent_value",
+                 "row:7,col:prompt,issue:duplicate_row",
+                 "row:8,col:response,issue:inconsistent_value",
+                 "row:11,col:response,issue:inconsistent_value",
+                 "row:15,col:response,issue:inconsistent_value",
+                 "row:17,col:helpfulness,issue:inconsistent_value",
+                 "row:20,col:response,issue:inconsistent_value",
+                 "row:25,col:response,issue:missing_value",
+                 "row:28,col:prompt,issue:missing_value",
+                 "row:29,col:response,issue:inconsistent_value",
+             ],
+             "fixes": [
+                 # Inferrable: Salvator Mundi facts are well-known ($450.3M at Christie's)
+                 "row:4,col:response,fix:The most expensive painting ever sold at auction is Salvator Mundi by Leonardo da Vinci. It was sold for $450.3 million at Christie's in New York City in 2017.",
+                 # Inferrable: strip leaked [SYSTEM] prompt prefix
+                 "row:3,col:response,fix:Kitsch is art or design that is overly sentimental or ornate while camp is a style that is over-the-top and exaggerated often used in satire or irony.",
+                 # NOT proposed: row:6 wrong scientific name (need taxonomy knowledge)
+                 # NOT proposed: row:8 harmful advice (need to write safe version)
+                 # NOT proposed: row:11 self-contradiction (need to rewrite coherently)
+                 # NOT proposed: row:15 French response (need English translation)
+                 # NOT proposed: row:29 hallucinated citation (need factual replacement)
+             ],
+ },
178
+ ],
179
+ }
180
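+ # Each trajectory step above is fed verbatim through DataQAEnvironment.step()
+ # in _replay_task() below; issue and fix strings use the same "row:R,col:C,..."
+ # format that parse_issue_key() / parse_fix() accept.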
+
181
+
182
+ # ── HTML rendering ──
183
+
184
+ def _metric_card(label: str, value: str, color: str = "#333") -> str:
185
+ return (
186
+ f'<div style="text-align:center;padding:12px 16px;background:#f8f9fa;'
187
+ f'border-radius:8px;min-width:100px;">'
188
+ f'<div style="font-size:11px;color:#666;text-transform:uppercase;letter-spacing:1px;">{label}</div>'
189
+ f'<div style="font-size:28px;font-weight:700;color:{color};margin-top:2px;">{value}</div>'
190
+ f'</div>'
191
+ )
192
+
193
+
194
+ def _csv_to_html(
195
+ csv_text: str,
196
+ planted: list[PlantedIssue],
197
+ correct: set[tuple[int, str]],
198
+ fp: set[tuple[int, str]],
199
+ missed: set[tuple[int, str]],
200
+ fixed: dict[tuple[int, str], str],
201
+ fix_values: dict[tuple[int, str], str] | None = None,
202
+ ) -> str:
203
+ """Render CSV as HTML with color-coded cells and inline fix proposals."""
204
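+     # Cell keys in correct/fp/missed/fixed are (row, col) pairs: 1-indexed data
+     # rows and lowercased column names, matching PlantedIssue coordinates.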
+ fix_values = fix_values or {}
205
+ desc_map = {(i.row, i.col): i for i in planted}
206
+ reader = csv.reader(io.StringIO(csv_text.strip()))
207
+ rows = list(reader)
208
+ if not rows:
209
+ return ""
210
+
211
+ header = rows[0]
212
+ header_lower = [h.strip().lower() for h in header]
213
+ data = rows[1:]
214
+
215
+ t = ['<table style="border-collapse:collapse;width:100%;font-size:12px;font-family:\'SF Mono\',monospace;">']
216
+ t.append('<tr>')
217
+ t.append('<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">Row</th>')
218
+ for h in header:
219
+ t.append(f'<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">{h}</th>')
220
+ t.append('</tr>')
221
+
222
+ for i, row in enumerate(data):
223
+ rn = i + 1
224
+ bg = "#fff" if i % 2 == 0 else "#f8f9fa"
225
+ t.append(f'<tr style="background:{bg};">')
226
+ t.append(f'<td style="border:1px solid #dee2e6;padding:4px 8px;color:#adb5bd;text-align:center;font-size:11px;">{rn}</td>')
227
+ for j, val in enumerate(row):
228
+ col = header_lower[j] if j < len(header_lower) else ""
229
+ ck = (rn, col)
230
+ s = "border:1px solid #dee2e6;padding:4px 8px;"
231
+ tip = ""
232
+ badge = ""
233
+
234
+ issue = desc_map.get(ck)
235
+
236
+ if ck in correct:
237
+ s += "background:#d4edda;"
238
+ tip = f"FOUND: {issue.description}" if issue else ""
239
+ badge = '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">TP</span>'
240
+ elif ck in fp:
241
+ s += "background:#f8d7da;"
242
+ badge = '<span style="font-size:9px;background:#dc3545;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">FP</span>'
243
+ elif ck in missed:
244
+ s += "background:#fff3cd;"
245
+ tip = f"MISSED: {issue.description}" if issue else ""
246
+ badge = '<span style="font-size:9px;background:#856404;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">MISS</span>'
247
+
248
+ fx = fixed.get(ck)
249
+ proposed = fix_values.get(ck)
250
+ if fx == "correct":
251
+ s += "box-shadow:inset 0 0 0 2px #28a745;"
252
+ badge += '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:2px;">FIX</span>'
253
+ elif fx == "partial":
254
+ s += "box-shadow:inset 0 0 0 2px #ffc107;"
255
+ badge += '<span style="font-size:9px;background:#ffc107;color:#333;padding:1px 4px;border-radius:3px;margin-left:2px;">~FIX</span>'
256
+
257
+ dv = val if val.strip() else '<em style="color:#dc3545;font-style:italic;">empty</em>'
258
+
259
+ # Show proposed fix value below the corrupted value
260
+ fix_line = ""
261
+ if proposed is not None:
262
+ fix_color = "#28a745" if fx == "correct" else ("#b8860b" if fx == "partial" else "#dc3545")
263
+ fix_line = (
264
+ f'<div style="font-size:10px;color:{fix_color};margin-top:2px;'
265
+ f'border-top:1px dashed {fix_color};padding-top:2px;">'
266
+ f'\u2192 {proposed}</div>'
267
+ )
268
+
269
+ t.append(f'<td style="{s}" title="{tip}">{dv}{badge}{fix_line}</td>')
270
+ t.append('</tr>')
271
+ t.append('</table>')
272
+ return "".join(t)
273
+
274
+
275
+ LEGEND_HTML = (
276
+ '<div style="display:flex;gap:12px;flex-wrap:wrap;margin-top:10px;font-size:11px;">'
277
+ '<span style="background:#d4edda;padding:2px 8px;border-radius:4px;">Found (TP)</span>'
278
+ '<span style="background:#f8d7da;padding:2px 8px;border-radius:4px;">False Positive</span>'
279
+ '<span style="background:#fff3cd;padding:2px 8px;border-radius:4px;">Missed</span>'
280
+ '<span style="box-shadow:inset 0 0 0 2px #28a745;padding:2px 8px;border-radius:4px;">Fix Correct</span>'
281
+ '<span style="box-shadow:inset 0 0 0 2px #ffc107;padding:2px 8px;border-radius:4px;">Fix Partial</span>'
282
+ '</div>'
283
+ )
284
+
285
+
286
+ # ── Core replay logic ──
287
+
288
+ def _replay_task(task_id: str) -> list[dict]:
289
+ """Run the agent trajectory and collect per-step data."""
290
+ env = DataQAEnvironment()
291
+ obs = env.reset(task_id=task_id)
292
+ task = env._current_task
293
+ planted_keys = {i.to_key() for i in task.planted_issues}
294
+ steps_data = []
295
+
296
+ # Step 0: initial state
297
+ steps_data.append({
298
+ "label": "Initial — corrupted dataset",
299
+ "html": _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {}),
300
+ "metrics": {"reward": 0.0, "tp": 0, "fp": 0, "fn": len(task.planted_issues),
301
+ "identify": 0.0, "fix": 0.0, "fixes_correct": 0},
302
+ "feedback": f"Task: {task.name}\nIssues to find: {obs.num_issues_hint}\n\n{task.description}",
303
+ })
304
+
305
+ trajectory = AGENT_TRAJECTORIES.get(task_id, [])
306
+ for i, step_data in enumerate(trajectory):
307
+ action = DataQAAction(
308
+ issues=step_data["issues"],
309
+ fixes=step_data.get("fixes", []),
310
+ task_id=task_id,
311
+ )
312
+ obs = env.step(action)
313
+
314
+ reported_keys = set()
315
+ for iss in step_data["issues"]:
316
+ key = parse_issue_key(iss)
317
+ if key:
318
+ reported_keys.add(key)
319
+
320
+ tp_keys = reported_keys & planted_keys
321
+ fp_keys = reported_keys - planted_keys
322
+ fn_keys = planted_keys - reported_keys
323
+
324
+ correct = {_kc(k) for k in tp_keys}
325
+ fp = {_kc(k) for k in fp_keys}
326
+ missed = {_kc(k) for k in fn_keys} if obs.done else set()
327
+
328
+ fixed: dict[tuple[int, str], str] = {}
329
+ for d in obs.metadata.get("fix_details", []):
330
+ c = (d["row"], d["col"])
331
+ fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
332
+
333
+ # Extract proposed fix values from the raw fix strings
334
+ fix_values: dict[tuple[int, str], str] = {}
336
+ for raw_fix in step_data.get("fixes", []):
337
+ parsed = parse_fix(raw_fix)
338
+ if parsed:
339
+ row, col, val = parsed
340
+ fix_values[(row, col)] = val
341
+
342
+ html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp, missed, fixed, fix_values)
343
+
344
+ has_fixes = bool(step_data.get("fixes"))
345
+ if has_fixes:
346
+ label = f"Step {i+1} — identify + fix"
347
+ else:
348
+ label = f"Step {i+1} — identify only"
349
+
350
+ steps_data.append({
351
+ "label": label,
352
+ "html": html,
353
+ "metrics": {
354
+ "reward": obs.reward,
355
+ "tp": obs.metadata["tp"],
356
+ "fp": obs.metadata["fp"],
357
+ "fn": obs.metadata["fn"],
358
+ "identify": obs.metadata["identify_score"],
359
+ "fix": obs.metadata["fix_score"],
360
+ "fixes_correct": obs.metadata["fixes_correct"],
361
+ },
362
+ "feedback": obs.feedback,
363
+ })
364
+
365
+ return steps_data
366
+
367
+
368
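+ # Convert an issue key ("row:R,col:C,issue:T") to a (row, col) cell coordinate.
+ # Example: _kc("row:4,col:name,issue:missing_value") -> (4, "name")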
+ def _kc(key: str) -> tuple[int, str]:
369
+ parts = key.split(",")
370
+ return (int(parts[0].split(":")[1]), parts[1].split(":")[1])
371
+
372
+
373
+ # ── Gradio app ──
374
+
375
+ def build_gradio_ui():
376
+ # Pre-compute all replays at startup
377
+ all_replays: dict[str, list[dict]] = {}
378
+ for tid in list_tasks():
379
+ all_replays[tid] = _replay_task(tid)
380
+
381
+ def show_step(task_id: str, step_idx: int):
382
+         replay = all_replays.get(task_id, [])
+         if not replay:  # unknown task id; nothing to render
+             return "", ""
383
+ step_idx = int(step_idx)
384
+ if step_idx >= len(replay):
385
+ step_idx = len(replay) - 1
386
+ sd = replay[step_idx]
387
+ m = sd["metrics"]
388
+
389
+ # Reward color
390
+ r = m["reward"]
391
+ rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
392
+
393
+ cards = (
394
+ '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
395
+ + _metric_card("Reward", f"{r:.2f}", rc)
396
+ + _metric_card("Found", str(m["tp"]), "#28a745")
397
+ + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
398
+ + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
399
+ + _metric_card("Identify", f"{m['identify']:.2f}", "#333")
400
+ + _metric_card("Fix", f"{m['fix']:.2f}", "#333")
401
+ + '</div>'
402
+ )
403
+
404
+ full_html = (
405
+ f'<div style="font-size:14px;font-weight:600;margin-bottom:8px;color:#495057;">'
406
+ f'{sd["label"]}</div>'
407
+ + cards + sd["html"] + LEGEND_HTML
408
+ )
409
+
410
+ return full_html, sd["feedback"]
411
+
412
+ def on_task_change(task_id):
413
+ replay = all_replays.get(task_id, [])
414
+ max_step = len(replay) - 1
415
+ html, fb = show_step(task_id, 0)
416
+ return (
417
+ gr.update(maximum=max_step, value=0),
418
+ html,
419
+ fb,
420
+ )
421
+
422
+ def on_step_change(task_id, step_idx):
423
+ html, fb = show_step(task_id, step_idx)
424
+ return html, fb
425
+
426
+     # ── Live agent runner (drives a local DataQAEnvironment instance) ──
427
+
428
+ live_env = DataQAEnvironment()
429
+ live_state: dict = {"obs": None, "task_id": "easy", "steps": []}
430
+
431
+ def live_reset(task_id):
432
+ obs = live_env.reset(task_id=task_id)
433
+ task = live_env._current_task
434
+ live_state["obs"] = obs
435
+ live_state["task_id"] = task_id
436
+ live_state["steps"] = []
437
+ html = _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {})
438
+ info = f"**{task.name}** — {obs.num_issues_hint} issues to find, {obs.max_steps} steps max"
439
+ return html, info, "", "0.000"
440
+
441
+ def live_step(issues_text, fixes_text):
442
+ if live_state["obs"] is None:
443
+ return "Reset first.", "", "", ""
444
+ obs = live_state["obs"]
445
+ task = live_env._current_task
446
+ planted_keys = {i.to_key() for i in task.planted_issues}
447
+
448
+ issues = [l.strip() for l in issues_text.strip().split("\n") if l.strip()]
449
+ fixes = [l.strip() for l in fixes_text.strip().split("\n") if l.strip()] if fixes_text.strip() else []
450
+
451
+ action = DataQAAction(issues=issues, fixes=fixes, task_id=live_state["task_id"])
452
+ obs = live_env.step(action)
453
+ live_state["obs"] = obs
454
+
455
+ reported_keys = set()
456
+ for iss in issues:
457
+ key = parse_issue_key(iss)
458
+ if key:
459
+ reported_keys.add(key)
460
+
461
+ tp_keys = reported_keys & planted_keys
462
+ fp_keys = reported_keys - planted_keys
463
+ fn_keys = planted_keys - reported_keys
464
+
465
+ correct = {_kc(k) for k in tp_keys}
466
+ fp_set = {_kc(k) for k in fp_keys}
467
+ missed = {_kc(k) for k in fn_keys} if obs.done else set()
468
+
469
+ fixed: dict[tuple[int, str], str] = {}
470
+ for d in obs.metadata.get("fix_details", []):
471
+ c = (d["row"], d["col"])
472
+ fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
473
+
475
+ fix_values: dict[tuple[int, str], str] = {}
476
+ for raw in fixes:
477
+ parsed = parse_fix(raw)
478
+ if parsed:
479
+ fix_values[(parsed[0], parsed[1])] = parsed[2]
480
+
481
+ html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp_set, missed, fixed, fix_values)
482
+
483
+ m = obs.metadata
484
+ r = obs.reward
485
+ rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
486
+ cards = (
487
+ '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
488
+ + _metric_card("Reward", f"{r:.2f}", rc)
489
+ + _metric_card("Found", str(m["tp"]), "#28a745")
490
+ + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
491
+ + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
492
+ + '</div>'
493
+ )
494
+ full_html = cards + html + LEGEND_HTML
495
+ return full_html, obs.feedback, f"{r:.3f}", ""
496
+
497
+ # ── Build the UI ──
498
+
499
+ with gr.Blocks(title="DataQA Environment") as demo:
500
+ gr.Markdown(
501
+ "# DataQA — Data Quality Assurance Environment\n"
502
+ "Two-phase RL environment: **Identify** data quality issues, then **Fix** them."
503
+ )
504
+
505
+ with gr.Tabs():
506
+ # ── Tab 1: Demo replay ──
507
+ with gr.Tab("Demo (Baseline Agent)"):
508
+ gr.Markdown(
509
+ "*Replay of the baseline Qwen-72B agent. "
510
+ "Use the slider to step through the agent's trajectory.*"
511
+ )
512
+ with gr.Row():
513
+ task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
514
+ step_slider = gr.Slider(minimum=0, maximum=2, step=1, value=0, label="Step", scale=3)
515
+
516
+ viz_html = gr.HTML()
517
+ feedback_box = gr.Textbox(label="Agent Feedback", lines=10, interactive=False)
518
+
519
+ task_dd.change(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
520
+ step_slider.change(on_step_change, inputs=[task_dd, step_slider], outputs=[viz_html, feedback_box])
521
+ demo.load(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
522
+
523
+ # ── Tab 2: Try your own agent ──
524
+ with gr.Tab("Try Your Own Agent"):
525
+ gr.Markdown(
526
+ "*Submit your own issues and fixes to see how the environment scores them. "
527
+ "This is the same environment the baseline agent talks to.*"
528
+ )
529
+ with gr.Row():
530
+ live_task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
531
+ live_reset_btn = gr.Button("Reset", variant="primary", scale=1)
532
+
533
+ with gr.Row():
534
+ live_info = gr.Markdown()
535
+ live_reward = gr.Textbox(label="Reward", interactive=False, scale=1)
536
+
537
+ live_viz = gr.HTML()
538
+
539
+ with gr.Row():
540
+ live_issues = gr.Textbox(
541
+ label="Issues (one per line)",
542
+ placeholder="row:4,col:name,issue:missing_value\nrow:7,col:salary,issue:wrong_type",
543
+ lines=5,
544
+ )
545
+ live_fixes = gr.Textbox(
546
+ label="Fixes (one per line, optional)",
547
+ placeholder="row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000",
548
+ lines=5,
549
+ )
550
+
551
+ live_step_btn = gr.Button("Submit Step", variant="primary")
552
+ live_feedback = gr.Textbox(label="Feedback", lines=10, interactive=False)
553
+
554
+ live_reset_btn.click(
555
+ live_reset, inputs=[live_task_dd],
556
+ outputs=[live_viz, live_info, live_feedback, live_reward],
557
+ )
558
+ live_step_btn.click(
559
+ live_step, inputs=[live_issues, live_fixes],
560
+ outputs=[live_viz, live_feedback, live_reward, live_issues],
561
+ )
562
+
563
+ return demo
564
+
565
+
566
+ if __name__ == "__main__":
567
+ demo = build_gradio_ui()
568
+ demo.launch()
dataqa_env/server/tasks.py ADDED
@@ -0,0 +1,1159 @@
1
+ """
2
+ Task definitions for the DataQA environment.
3
+
4
+ Each task provides:
5
+ - A clean dataset (CSV)
6
+ - A schema + validation rules
7
+ - A set of planted issues (ground truth)
8
+ - A function to inject those issues into the clean data
9
+ """
10
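+ # Typical usage, as a minimal sketch (everything referenced is defined below):
+ #   task = create_task_easy()
+ #   agent_view = task.corrupted_csv                      # what the agent sees
+ #   truth = {i.to_key() for i in task.planted_issues}    # ground-truth issue keys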
+
11
+ from __future__ import annotations
12
+
13
+ import csv
14
+ import io
15
+ import random
16
+ from dataclasses import dataclass, field
17
+ from typing import List, Set
18
+
19
+
20
+ @dataclass
21
+ class PlantedIssue:
22
+ """A single planted data quality issue."""
23
+
24
+ row: int
25
+ col: str
26
+ issue_type: str
27
+ description: str
28
+ difficulty: float = 1.0 # 1.0=easy, 2.0=medium, 3.0=hard (for weighted reward)
29
+
30
+ def to_key(self) -> str:
31
+ return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
32
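+     # Example:
+     #   PlantedIssue(row=4, col="name", issue_type="missing_value",
+     #                description="Empty name field").to_key()
+     #   -> "row:4,col:name,issue:missing_value"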
+
33
+
34
+ @dataclass
35
+ class Task:
36
+ task_id: str
37
+ name: str
38
+ description: str
39
+ schema_description: str
40
+ validation_rules: str
41
+ clean_csv: str
42
+ planted_issues: List[PlantedIssue] = field(default_factory=list)
43
+ corrupted_csv: str = ""
44
+ max_steps: int = 3
45
+
46
+ def get_clean_value(self, row: int, col: str) -> str | None:
47
+ """
48
+ Look up the original clean value for a given (row, col).
49
+ Row is 1-indexed (data row after header).
50
+ Returns None if row/col is out of bounds or column not found.
51
+ """
52
+ rows = _csv_to_rows(self.clean_csv)
53
+ if len(rows) < 2:
54
+ return None
55
+ header = [h.strip().lower() for h in rows[0]]
56
+ if col.lower() not in header:
57
+ return None
58
+ col_idx = header.index(col.lower())
59
+ data_row_idx = row # row is 1-indexed, rows[0] is header, so rows[row] is the data row
60
+ if data_row_idx < 1 or data_row_idx >= len(rows):
61
+ return None
62
+ return rows[data_row_idx][col_idx].strip()
63
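+     # Example (using the easy task's clean CSV defined below):
+     #   create_task_easy().get_clean_value(4, "name") -> "David Kim"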
+
64
+ def get_planted_issue_map(self) -> dict:
65
+ """Return dict mapping issue key -> PlantedIssue for quick lookups."""
66
+ return {issue.to_key(): issue for issue in self.planted_issues}
67
+
68
+
69
+ def _csv_to_rows(csv_text: str) -> List[List[str]]:
70
+ reader = csv.reader(io.StringIO(csv_text.strip()))
71
+ return [row for row in reader]
72
+
73
+
74
+ def _rows_to_csv(rows: List[List[str]]) -> str:
75
+ output = io.StringIO()
76
+ writer = csv.writer(output)
77
+ writer.writerows(rows)
78
+ return output.getvalue()
79
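+ # Note: csv.writer defaults to "\r\n" line endings, so corrupted_csv built via
+ # _rows_to_csv differs from the "\n"-joined clean_csv literals in line endings
+ # only; csv.reader accepts both, so round-tripping through these helpers is safe.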
+
80
+
81
+ # ---------------------------------------------------------------------------
82
+ # TASK 1: Easy — Employee directory with obvious issues
83
+ # ---------------------------------------------------------------------------
84
+
85
+ def create_task_easy(seed: int = 42) -> Task:
86
+ rng = random.Random(seed)
87
+
88
+ clean_csv = """employee_id,name,email,department,salary,start_date
89
+ 101,Alice Chen,alice.chen@company.com,Engineering,95000,2022-03-15
90
+ 102,Bob Martinez,bob.martinez@company.com,Marketing,72000,2021-07-01
91
+ 103,Carol Davis,carol.davis@company.com,Engineering,98000,2020-11-20
92
+ 104,David Kim,david.kim@company.com,Sales,68000,2023-01-10
93
+ 105,Eve Johnson,eve.johnson@company.com,HR,71000,2022-06-05
94
+ 106,Frank Wilson,frank.wilson@company.com,Engineering,102000,2019-08-12
95
+ 107,Grace Lee,grace.lee@company.com,Marketing,75000,2021-12-01
96
+ 108,Hank Brown,hank.brown@company.com,Sales,65000,2023-04-18
97
+ 109,Iris Patel,iris.patel@company.com,HR,73000,2020-02-28
98
+ 110,Jack Taylor,jack.taylor@company.com,Engineering,97000,2022-09-14
99
+ 111,Kevin Zhang,kevin.zhang@company.com,Engineering,91000,2021-05-22
100
+ 112,Laura Adams,laura.adams@company.com,Sales,69000,2022-11-03
101
+ 113,Mike Torres,mike.torres@company.com,Marketing,74000,2020-08-17
102
+ 114,Nina Sharma,nina.sharma@company.com,HR,76000,2019-04-30
103
+ 115,Oscar Rivera,oscar.rivera@company.com,Engineering,105000,2018-12-10
104
+ 116,Paula Green,paula.green@company.com,Sales,67000,2023-06-25
105
+ 117,Quinn Murphy,quinn.murphy@company.com,Marketing,78000,2021-03-08
106
+ 118,Rosa Diaz,rosa.diaz@company.com,Engineering,99000,2022-01-19
107
+ 119,Sam Cooper,sam.cooper@company.com,HR,70000,2020-10-05
108
+ 120,Tara Singh,tara.singh@company.com,Sales,66000,2023-02-14"""
109
+
110
+ schema_desc = """Columns:
111
+ - employee_id: integer, unique, range 100-999
112
+ - name: string, non-empty, format "FirstName LastName"
113
+ - email: string, valid email format, must match pattern firstname.lastname@company.com
114
+ - department: string, one of [Engineering, Marketing, Sales, HR]
115
+ - salary: integer, range 50000-150000
116
+ - start_date: string, format YYYY-MM-DD, must be between 2015-01-01 and 2025-12-31"""
117
+
118
+ rules = """1. No missing values in any column
119
+ 2. employee_id must be unique
120
+ 3. email must follow the pattern: lowercase(firstname).lowercase(lastname)@company.com
121
+ 4. salary must be within the valid range
122
+ 5. No duplicate rows"""
123
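+     # Rule 3 spelled out as a sketch (illustrative; the env's checker may differ):
+     #   first, last = name.lower().split()
+     #   expected_email = f"{first}.{last}@company.com"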
+
124
+ rows = _csv_to_rows(clean_csv)
125
+ header = rows[0]
126
+ data = rows[1:]
127
+ issues: List[PlantedIssue] = []
128
+
129
+ # Issue 1: Missing value - null out a name (easy to spot)
130
+ r = 3 # row index in data (0-based), displayed as row 4 in CSV
131
+ data[r][1] = ""
132
+ issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
133
+ description="Empty name field", difficulty=1.0))
134
+
135
+ # Issue 2: Wrong type - salary as text (easy to spot)
136
+ r = 6
137
+ data[r][4] = "seventy-five thousand"
138
+ issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
139
+ description="Salary is text instead of integer", difficulty=1.0))
140
+
141
+ # Issue 3: Duplicate row (moderate — requires cross-row comparison)
142
+ dup_source = 1
143
+ data.append(list(data[dup_source]))
144
+ issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
145
+ description=f"Exact duplicate of row {dup_source + 1}", difficulty=1.5))
146
+
147
+ # Issue 4: Out of range salary (easy to spot)
148
+ r = 8
149
+ data[r][4] = "5000"
150
+ issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
151
+ description="Salary 5000 is below minimum 50000", difficulty=1.0))
152
+
153
+ # Issue 5: Email doesn't match name pattern (moderate — cross-column check)
154
+ r = 14 # Oscar Rivera -> email should be oscar.rivera@company.com
155
+ data[r][2] = "john.doe@company.com"
156
+ issues.append(PlantedIssue(row=r + 1, col="email", issue_type="inconsistent_value",
157
+ description="Email john.doe@company.com doesn't match name Oscar Rivera",
158
+ difficulty=1.5))
159
+
160
+ # Issue 6: Future start date (requires knowing current date context)
161
+ r = 17 # Rosa Diaz
162
+ data[r][5] = "2027-06-15"
163
+ issues.append(PlantedIssue(row=r + 1, col="start_date", issue_type="out_of_range",
164
+ description="Start date 2027-06-15 is in the future (beyond 2025-12-31)",
165
+ difficulty=1.5))
166
+
167
+ corrupted = _rows_to_csv([header] + data)
168
+
169
+ return Task(
170
+ task_id="easy",
171
+ name="Employee Directory Validation",
172
+ description=(
173
+ "You are given an employee directory dataset. "
174
+ "Find all data quality issues based on the schema and validation rules. "
175
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
176
+ ),
177
+ schema_description=schema_desc,
178
+ validation_rules=rules,
179
+ clean_csv=clean_csv,
180
+ planted_issues=issues,
181
+ corrupted_csv=corrupted,
182
+ max_steps=3,
183
+ )
184
+
185
+
186
+ # ---------------------------------------------------------------------------
187
+ # TASK 2: Medium — E-commerce orders with moderate issues
188
+ # ---------------------------------------------------------------------------
189
+
190
+ def create_task_medium(seed: int = 42) -> Task:
191
+ rng = random.Random(seed)
192
+
193
+ clean_csv = """order_id,customer_id,product_name,category,quantity,unit_price,order_date,shipping_country,status,total
194
+ ORD-001,CUST-100,Wireless Mouse,Electronics,2,29.99,2024-01-15,US,delivered,59.98
195
+ ORD-002,CUST-101,Python Cookbook,Books,1,45.50,2024-01-16,UK,delivered,45.50
196
+ ORD-003,CUST-102,USB-C Hub,Electronics,1,35.00,2024-01-17,US,shipped,35.00
197
+ ORD-004,CUST-103,Yoga Mat,Sports,1,25.99,2024-01-18,CA,delivered,25.99
198
+ ORD-005,CUST-104,Desk Lamp,Home,1,42.00,2024-01-19,US,processing,42.00
199
+ ORD-006,CUST-105,Running Shoes,Sports,1,89.99,2024-01-20,DE,delivered,89.99
200
+ ORD-007,CUST-106,Mechanical Keyboard,Electronics,1,129.99,2024-01-21,US,shipped,129.99
201
+ ORD-008,CUST-100,Monitor Stand,Home,1,55.00,2024-01-22,US,delivered,55.00
202
+ ORD-009,CUST-107,Data Science Handbook,Books,2,39.99,2024-01-23,UK,delivered,79.98
203
+ ORD-010,CUST-108,Resistance Bands,Sports,3,12.99,2024-01-24,CA,shipped,38.97
204
+ ORD-011,CUST-109,Webcam HD,Electronics,1,65.00,2024-01-25,US,delivered,65.00
205
+ ORD-012,CUST-110,Standing Desk,Home,1,299.99,2024-01-26,US,processing,299.99
206
+ ORD-013,CUST-111,Tennis Racket,Sports,1,75.00,2024-01-27,AU,delivered,75.00
207
+ ORD-014,CUST-112,LED Strip Lights,Home,2,18.50,2024-01-28,US,shipped,37.00
208
+ ORD-015,CUST-113,AI Textbook,Books,1,59.99,2024-01-29,DE,delivered,59.99
209
+ ORD-016,CUST-114,Bluetooth Speaker,Electronics,1,49.99,2024-01-30,UK,delivered,49.99
210
+ ORD-017,CUST-115,Jump Rope,Sports,2,8.99,2024-01-31,US,shipped,17.98
211
+ ORD-018,CUST-116,Coffee Table Book,Books,1,32.00,2024-02-01,CA,delivered,32.00
212
+ ORD-019,CUST-117,Ergonomic Chair,Home,1,450.00,2024-02-02,US,processing,450.00
213
+ ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.99
214
+ ORD-021,CUST-119,Laptop Sleeve,Electronics,1,24.99,2024-02-04,US,delivered,24.99
215
+ ORD-022,CUST-120,Hiking Backpack,Sports,1,65.00,2024-02-05,CA,shipped,65.00
216
+ ORD-023,CUST-121,Machine Learning Book,Books,1,54.99,2024-02-06,UK,delivered,54.99
217
+ ORD-024,CUST-122,Plant Pot Set,Home,3,15.00,2024-02-07,US,delivered,45.00
218
+ ORD-025,CUST-123,Noise Cancelling Headphones,Electronics,1,199.99,2024-02-08,DE,shipped,199.99
219
+ ORD-026,CUST-124,Basketball,Sports,1,29.99,2024-02-09,US,delivered,29.99
220
+ ORD-027,CUST-125,Cookbook Collection,Books,2,22.50,2024-02-10,AU,delivered,45.00
221
+ ORD-028,CUST-126,Smart Plug,Home,4,12.99,2024-02-11,US,processing,51.96
222
+ ORD-029,CUST-127,Wireless Charger,Electronics,1,34.99,2024-02-12,UK,delivered,34.99
223
+ ORD-030,CUST-128,Dumbbells Set,Sports,1,89.00,2024-02-13,US,shipped,89.00"""
224
+
225
+ schema_desc = """Columns:
226
+ - order_id: string, unique, format ORD-NNN
227
+ - customer_id: string, format CUST-NNN
228
+ - product_name: string, non-empty
229
+ - category: string, one of [Electronics, Books, Sports, Home]
230
+ - quantity: integer, range 1-100
231
+ - unit_price: float, range 0.01-10000.00
232
+ - order_date: string, format YYYY-MM-DD
233
+ - shipping_country: string, ISO 2-letter country code
234
+ - status: string, one of [processing, shipped, delivered, cancelled, returned]
235
+ - total: float, must equal quantity * unit_price"""
236
+
237
+ rules = """1. No missing values in any column
238
+ 2. order_id must be unique
239
+ 3. total must equal quantity * unit_price (tolerance: 0.01)
240
+ 4. order_date must be in valid chronological order for sequential order_ids
241
+ 5. category must be from the allowed set
242
+ 6. All monetary values must have at most 2 decimal places
243
+ 7. shipping_country must be a valid ISO 2-letter code"""
244
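+     # Rule 3 as an illustrative check (not necessarily the env's exact scorer):
+     #   ok = abs(float(total) - int(quantity) * float(unit_price)) <= 0.01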
+
245
+ rows = _csv_to_rows(clean_csv)
246
+ header = rows[0]
247
+ data = rows[1:]
248
+ issues: List[PlantedIssue] = []
249
+
250
+ # Issue 1: total doesn't match quantity * unit_price (requires cross-column check)
251
+ r = 4 # ORD-005
252
+ data[r][9] = "84.00" # should be 42.00 (qty=1, price=42.00)
253
+ issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
254
+ description="total (84.00) != quantity (1) * unit_price (42.00)", difficulty=2.0))
255
+
256
+ # Issue 2: Invalid category (requires knowing the allowed set)
257
+ r = 9 # ORD-010
258
+ data[r][3] = "Fitness" # should be Sports
259
+ issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
260
+ description="'Fitness' is not in allowed categories", difficulty=1.5))
261
+
262
+ # Issue 3: Missing value in product_name (easy to spot)
263
+ r = 13 # ORD-014
264
+ data[r][2] = ""
265
+ issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
266
+ description="Empty product_name", difficulty=1.0))
267
+
268
+ # Issue 4: Out of range quantity (easy to spot)
269
+ r = 16 # ORD-017
270
+ data[r][4] = "-1"
271
+ issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
272
+ description="Negative quantity", difficulty=1.0))
273
+
274
+ # Issue 5: Duplicate order_id (requires cross-row comparison)
275
+ r = 18 # ORD-019
276
+ data[r][0] = "ORD-003"
277
+ issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
278
+ description="Duplicate order_id ORD-003", difficulty=1.5))
279
+
280
+ # Issue 6: Wrong date format (moderate — format mismatch)
281
+ r = 11 # ORD-012
282
+ data[r][6] = "26/01/2024"
283
+ issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
284
+ description="Date format DD/MM/YYYY instead of YYYY-MM-DD", difficulty=1.5))
285
+
286
+ # Issue 7: Invalid country code (requires ISO knowledge)
287
+ r = 23 # ORD-024
288
+ data[r][7] = "XX" # not a valid ISO country code
289
+ issues.append(PlantedIssue(row=r + 1, col="shipping_country", issue_type="format_violation",
290
+ description="'XX' is not a valid ISO 2-letter country code", difficulty=1.5))
291
+
292
+     # Issue 8: Status-date inconsistency. An order still "processing" weeks after
+     # placement would merely be suspicious; a "delivered" order with a future
+     # date is a clear cross-column contradiction.
294
+ r = 28 # ORD-029
295
+ data[r][6] = "2025-12-25" # future date but status is "delivered"
296
+ issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="inconsistent_value",
297
+ description="Order date 2025-12-25 is in the future but status is 'delivered'",
298
+ difficulty=2.0))
299
+
300
+ corrupted = _rows_to_csv([header] + data)
301
+
302
+ return Task(
303
+ task_id="medium",
304
+ name="E-commerce Orders Validation",
305
+ description=(
306
+ "You are given an e-commerce orders dataset. "
307
+ "Find all data quality issues based on the schema and validation rules. "
308
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
309
+ ),
310
+ schema_description=schema_desc,
311
+ validation_rules=rules,
312
+ clean_csv=clean_csv,
313
+ planted_issues=issues,
314
+ corrupted_csv=corrupted,
315
+ max_steps=3,
316
+ )
317
+
318
+
319
+ # ---------------------------------------------------------------------------
320
+ # TASK 3: Hard — ML training metadata with subtle issues
321
+ # ---------------------------------------------------------------------------
322
+
323
+ def create_task_hard(seed: int = 42) -> Task:
324
+ rng = random.Random(seed)
325
+
326
+ clean_csv = """experiment_id,model_name,dataset,train_size,val_size,test_size,learning_rate,batch_size,epochs,train_loss,val_loss,test_accuracy,gpu_memory_gb,training_time_hours,timestamp
327
+ EXP-001,resnet50,imagenet-1k,1281167,50000,100000,0.001,256,90,0.85,1.12,76.3,12.4,48.5,2024-03-01T10:00:00
328
+ EXP-002,bert-base,squad-v2,130319,11873,8862,0.00003,32,3,0.45,0.52,81.2,7.8,2.1,2024-03-02T14:30:00
329
+ EXP-003,gpt2-small,openwebtext,8013769,100000,100000,0.0003,64,1,3.12,3.28,0.0,14.2,72.0,2024-03-03T09:15:00
330
+ EXP-004,vit-base,imagenet-1k,1281167,50000,100000,0.001,512,300,0.72,0.98,79.8,15.6,96.0,2024-03-05T08:00:00
331
+ EXP-005,distilbert,mnli,392702,9815,9796,0.00005,16,5,0.28,0.35,84.6,5.2,1.5,2024-03-06T11:00:00
332
+ EXP-006,llama2-7b,alpaca-52k,51760,500,500,0.00002,4,3,1.05,1.18,0.0,38.5,8.2,2024-03-07T16:00:00
333
+ EXP-007,resnet18,cifar10,50000,5000,10000,0.01,128,200,0.15,0.28,93.5,3.2,1.8,2024-03-08T10:30:00
334
+ EXP-008,t5-small,cnn-dailymail,287113,13368,11490,0.0001,16,10,1.45,1.62,0.0,6.8,4.5,2024-03-09T13:00:00
335
+ EXP-009,efficientnet-b0,imagenet-1k,1281167,50000,100000,0.005,256,350,0.68,0.89,77.1,8.4,36.0,2024-03-10T07:45:00
336
+ EXP-010,roberta-large,sst2,67349,872,1821,0.00001,8,10,0.08,0.12,95.1,14.8,3.2,2024-03-11T15:00:00
337
+ EXP-011,yolov5-m,coco-2017,118287,5000,40670,0.01,32,300,0.032,0.045,0.0,10.2,24.0,2024-03-12T09:00:00
338
+ EXP-012,wav2vec2,librispeech,281241,5567,2620,0.0001,8,20,0.92,1.05,0.0,12.6,15.0,2024-03-13T11:30:00
339
+ EXP-013,clip-base,cc3m,2818102,15000,15000,0.00001,256,32,2.15,2.38,0.0,22.4,48.0,2024-03-14T08:00:00
340
+ EXP-014,detr,coco-2017,118287,5000,40670,0.0001,4,500,1.85,2.12,0.0,16.0,72.0,2024-03-15T10:00:00
341
+ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0,7.4,6.5,2024-03-16T14:00:00
342
+ EXP-016,mobilenet-v3,imagenet-1k,1281167,50000,100000,0.004,128,150,0.92,1.05,72.8,4.1,18.0,2024-03-17T08:30:00
343
+ EXP-017,albert-base,mnli,392702,9815,9796,0.00002,32,5,0.32,0.41,83.1,6.2,1.8,2024-03-18T11:00:00
344
+ EXP-018,gpt-neo-1.3b,pile-subset,1500000,50000,50000,0.0002,8,2,2.85,2.98,0.0,18.5,36.0,2024-03-19T14:00:00
345
+ EXP-019,swin-tiny,imagenet-1k,1281167,50000,100000,0.001,256,300,0.78,0.95,78.2,8.6,42.0,2024-03-20T09:00:00
346
+ EXP-020,deberta-large,squad-v2,130319,11873,8862,0.00001,16,5,0.35,0.42,85.7,15.2,4.5,2024-03-21T10:30:00
347
+ EXP-021,yolov8-s,coco-2017,118287,5000,40670,0.01,64,200,0.028,0.038,0.0,6.8,16.0,2024-03-22T13:00:00
348
+ EXP-022,bart-base,xsum,204045,11332,11334,0.0001,32,10,1.22,1.38,0.0,8.4,6.2,2024-03-23T15:30:00
349
+ EXP-023,convnext-tiny,imagenet-1k,1281167,50000,100000,0.002,256,300,0.74,0.92,79.5,7.2,38.0,2024-03-24T08:00:00
350
+ EXP-024,xlm-roberta,xnli,392702,2490,5010,0.00002,16,10,0.41,0.48,82.3,12.4,5.8,2024-03-25T11:00:00
351
+ EXP-025,stable-diffusion,laion-400m,400000000,10000,10000,0.0001,4,1,0.45,0.52,0.0,24.0,168.0,2024-03-26T09:00:00
352
+ EXP-026,phi-2,dolly-15k,15011,500,500,0.00005,8,3,0.82,0.95,0.0,10.2,2.5,2024-03-27T14:00:00
353
+ EXP-027,dino-v2,imagenet-1k,1281167,50000,100000,0.0005,64,100,0.42,0.58,0.0,11.8,28.0,2024-03-28T10:00:00
354
+ EXP-028,electra-small,glue-mrpc,3668,408,1725,0.0001,32,10,0.38,0.44,87.2,3.8,0.8,2024-03-29T16:00:00
355
+ EXP-029,sam-base,sa-1b,11000000,50000,50000,0.0001,4,1,0.95,1.08,0.0,16.4,96.0,2024-03-30T08:00:00
356
+ EXP-030,llama2-13b,oasst1,84437,4401,4401,0.00001,2,3,0.78,0.88,0.0,52.0,12.0,2024-03-31T12:00:00"""
357
+
358
+ schema_desc = """Columns:
359
+ - experiment_id: string, unique, format EXP-NNN
360
+ - model_name: string, non-empty
361
+ - dataset: string, non-empty
362
+ - train_size: integer, positive, must be > val_size and > test_size
363
+ - val_size: integer, positive
364
+ - test_size: integer, positive
365
+ - learning_rate: float, range 1e-7 to 1.0
366
+ - batch_size: integer, must be power of 2, range 1-1024
367
+ - epochs: integer, positive, range 1-1000
368
+ - train_loss: float, non-negative
369
+ - val_loss: float, non-negative, typically >= train_loss (if not, may indicate data leakage)
370
+ - test_accuracy: float, range 0-100 (percentage), 0.0 is valid for generative models
371
+ - gpu_memory_gb: float, positive
372
+ - training_time_hours: float, positive
373
+ - timestamp: string, ISO 8601 format, chronological order by experiment_id"""
374
+
375
+ rules = """1. No missing values
376
+ 2. experiment_id must be unique
377
+ 3. val_loss should be >= train_loss (if val_loss < train_loss significantly, flag as potential data leakage)
378
+ 4. batch_size must be a power of 2
379
+ 5. train_size must be larger than both val_size and test_size
380
+ 6. learning_rate must be within valid range
381
+ 7. gpu_memory_gb should be reasonable for the model size (e.g., resnet18 shouldn't need 40GB)
382
+ 8. training_time should be proportional to dataset size and epochs (flag major inconsistencies)
383
+ 9. timestamps must be in chronological order"""
384
+
385
+ rows = _csv_to_rows(clean_csv)
386
+ header = rows[0]
387
+ data = rows[1:]
388
+ issues: List[PlantedIssue] = []
389
+
390
+ # Issue 1: Data leakage signal — val_loss much lower than train_loss (hard — requires ML knowledge)
391
+ r = 4 # EXP-005
392
+ data[r][10] = "0.15" # val_loss=0.15 but train_loss=0.28 → suspicious
393
+ issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
394
+ description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage",
395
+ difficulty=3.0))
396
+
397
+ # Issue 2: Batch size not power of 2 (moderate — domain convention)
398
+ r = 8 # EXP-009
399
+ data[r][7] = "250" # not a power of 2
400
+ issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
401
+ description="batch_size 250 is not a power of 2", difficulty=2.0))
402
+
403
+ # Issue 3: GPU memory unreasonable for model (hard — requires model size reasoning)
404
+ r = 6 # EXP-007 resnet18 on cifar10
405
+ data[r][12] = "42.5" # resnet18 shouldn't need 42.5 GB
406
+ issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
407
+ description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable",
408
+ difficulty=3.0))
409
+
410
+ # Issue 4: Timestamp out of order (moderate — requires sequential comparison)
411
+ r = 10 # EXP-011
412
+ data[r][14] = "2024-03-02T09:00:00" # should be after EXP-010's timestamp
413
+ issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
414
+ description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11",
415
+ difficulty=2.0))
416
+
417
+ # Issue 5: Train size smaller than test size (moderate — cross-column logic)
418
+ r = 9 # EXP-010
419
+ data[r][3] = "500" # train_size=500 but test_size=1821
420
+ issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
421
+ description="train_size (500) is smaller than test_size (1821)",
422
+ difficulty=2.0))
423
+
424
+ # Issue 6: Negative training time (easy to spot)
425
+ r = 13 # EXP-014
426
+ data[r][13] = "-72.0"
427
+ issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
428
+ description="Negative training time", difficulty=1.0))
429
+
430
+ # Issue 7: Learning rate out of range (easy to spot)
431
+ r = 12 # EXP-013
432
+ data[r][6] = "2.5" # way too high
433
+ issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
434
+ description="Learning rate 2.5 exceeds maximum of 1.0", difficulty=1.5))
435
+
436
+ # Issue 8: Missing model name (hard — whitespace-only is subtle)
437
+ r = 14 # EXP-015
438
+ data[r][1] = " "
439
+ issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
440
+ description="model_name is whitespace-only", difficulty=2.5))
441
+
442
+     # Issue 9: Training time impossibly fast for the dataset size and epoch count
+     # (hard, requires throughput reasoning). For contrast, EXP-004's 96 hours for
+     # vit-base at 300 epochs on imagenet-1k is plausible; EXP-009 (efficientnet-b0,
+     # also imagenet-1k, 350 epochs) is set to 0.5 hours below, which is impossible
+     # for 1.2M images x 350 epochs.
446
+ r = 8 # EXP-009 (same row as batch_size issue, different column)
447
+ data[r][13] = "0.5" # 30 minutes for 350 epochs on imagenet? impossible
448
+ issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="statistical_outlier",
449
+ description="0.5 hours for 350 epochs on imagenet-1k (1.2M images) is impossibly fast",
450
+ difficulty=3.0))
451
+
452
+     # Issue 10: test_accuracy above known results (hard, requires benchmark
+     # knowledge). Note that EXP-010's train_size corruption (Issue 5) already
+     # leaves its original full-dataset test_accuracy of 95.1% looking like
+     # contamination; that cross-column inconsistency emerges for free, so the
+     # explicitly planted issue targets a different row: EXP-012, wav2vec2 on
+     # librispeech, set to 98.5% when ~96% is the best reported with far more
+     # training, which points to an evaluation error.
+     r = 11  # EXP-012
463
+ data[r][11] = "98.5" # wav2vec2 with 20 epochs shouldn't hit 98.5% — SOTA is ~96%
464
+ issues.append(PlantedIssue(row=r + 1, col="test_accuracy", issue_type="statistical_outlier",
465
+ description="test_accuracy 98.5% for wav2vec2 with only 20 epochs exceeds known SOTA (~96%), likely evaluation error",
466
+ difficulty=3.0))
467
+
468
+ corrupted = _rows_to_csv([header] + data)
469
+
470
+ return Task(
471
+ task_id="hard",
472
+ name="ML Experiment Metadata Validation",
473
+ description=(
474
+ "You are given an ML experiment tracking dataset. "
475
+ "Find all data quality issues based on the schema and validation rules. "
476
+ "This dataset contains subtle issues including potential data leakage signals, "
477
+ "unreasonable resource usage, and logical inconsistencies. "
478
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
479
+ ),
480
+ schema_description=schema_desc,
481
+ validation_rules=rules,
482
+ clean_csv=clean_csv,
483
+ planted_issues=issues,
484
+ corrupted_csv=corrupted,
485
+ max_steps=3,
486
+ )
487
+
488
+
489
+ # ---------------------------------------------------------------------------
490
+ # TASK 4: Expert — LLM Alignment / Instruction-tuning data quality
491
+ # ---------------------------------------------------------------------------
492
+
493
+ def _build_alignment_csv() -> str:
494
+ """Build alignment task CSV from real NVIDIA HelpSteer dataset.
495
+
496
+ Uses real prompt-response pairs with human-rated quality scores
497
+ from nvidia/HelpSteer (37K samples, human-annotated).
498
+ Scores: helpfulness, correctness, coherence, complexity, verbosity (0-4 each).
499
+ """
500
+ # Real prompt-response pairs from nvidia/HelpSteer dataset
501
+ # Selected: diverse topics, high quality (helpfulness>=3, correctness>=3), moderate length
502
+ HELPSTEER_SAMPLES = [
503
+ ("What is gelatin made of and how is it made?", "Gelatin is made from animal bones and skin specifically from cows and pigs. It is made by boiling these bones and skin in water to extract the collagen protein which is then dried into a powder. Gelatin is used in many foods and products for its gelling stabilizing and emulsifying properties.", 3, 3, 4, 2, 1),
504
+ ("What are the ingredients of a cappuccino and how does it differ from other coffee drinks?", "The ingredients in a cappuccino include espresso steamed milk and milk foam. It has a strong coffee flavor with a creamy texture. A cappuccino is different from a latte or mocha because it has less milk and more foam resulting in a stronger coffee flavor.", 3, 3, 3, 1, 1),
505
+ ("What is the difference between kitsch and camp?", "Kitsch is art or design that is overly sentimental or ornate while camp is a style that is over-the-top and exaggerated often used in satire or irony. Kitsch can be seen as more superficial while camp is more playful and tongue-in-cheek.", 3, 3, 3, 1, 1),
506
+ ("What is the most expensive painting ever sold at auction?", "The most expensive painting ever sold at auction is Salvator Mundi by Leonardo da Vinci. It was sold for $450.3 million at Christie's in New York City in 2017.", 4, 4, 4, 1, 1),
507
+ ("What is the average temperature of Venus?", "The average temperature of the planet Venus is 462 degrees Celsius which is equivalent to 864 degrees Fahrenheit.", 3, 3, 3, 1, 1),
508
+ ("What is the scientific name for cherry blossom and what colors can it be?", "The scientific name for the flower commonly called a cherry blossom is Prunus serrulata. It can be found in white pink and red colors.", 3, 3, 4, 2, 1),
509
+ ("What are the three main duties of a centurion in ancient Rome?", "The centurion was responsible for the military unit in the legion. They were responsible for safety and loyalty to Rome. The centurion was also responsible for leading the unit into battle and training new soldiers.", 3, 3, 3, 2, 1),
510
+ ("What is the best way to defrost a refrigerator?", "There are a few ways to defrost a refrigerator. You can unplug the fridge take out all contents and leave the door open. Another way is to use a blow dryer to melt the ice. Make sure to use a cloth to soak up the water as it melts.", 3, 3, 3, 1, 1),
511
+ ("What are common reasons for Nobel Prize nominations?", "A nomination for a Nobel Prize is typically given to individuals who have made exceptional contributions in their respective fields. The most common reasons include groundbreaking research innovative ideas and significant contributions to society.", 3, 3, 3, 2, 2),
512
+ ("What is a culvert and how is it constructed?", "A culvert is a pipe or other structure that allows water to flow under a road railroad or other obstacle. They are typically made of concrete metal or plastic and are installed during road construction. Culverts can handle a wide range of water flows.", 3, 3, 3, 1, 1),
513
+ ("What is the difference between morbidity and mortality rates?", "Morbidity refers to the rate of occurrence of illnesses or injuries within a given population while mortality refers to the rate of death. Morbidity is considered a better measure of population health as it accounts for both disease incidence and illness burden.", 4, 4, 4, 2, 3),
514
+ ("What are the symptoms of menopause and how can they be managed?", "Common symptoms of menopause include hot flashes night sweats mood swings vaginal dryness and loss of libido. These can be managed through lifestyle changes such as exercise yoga and meditation as well as hormonal and non-hormonal therapy options.", 3, 3, 3, 2, 1),
515
+ ("What are the 12 constellations of the zodiac?", "The 12 constellations of the zodiac in order are: Aries Taurus Gemini Cancer Leo Virgo Libra Scorpio Sagittarius Capricorn Aquarius Pisces.", 3, 3, 4, 1, 1),
516
+ ("What is parole and how does it differ from other supervised release?", "Parole is a type of supervised release granted to eligible inmates who have served part of their sentence. Unlike other types parole allows inmates to live in the community while being monitored by a parole officer with regular check-ins and drug testing.", 4, 3, 4, 2, 2),
517
+ ("What is the function of a fibroblast?", "Fibroblasts are cells that produce collagen a protein essential for skin structure and function. Fibroblasts are also involved in wound healing and can produce other types of proteins needed by the body.", 3, 3, 4, 1, 1),
518
+ ("When was the first flight of the Wright Flyer?", "The Wright brothers made four brief flights on December 17 1903. The Flyer had a length of 40 feet and a wingspan of 40 feet 6 inches.", 4, 4, 4, 3, 4),
519
+ ("What was the most destructive natural disaster in human history?", "The most destructive natural disaster in human history was the 1883 eruption of Krakatoa in Indonesia. The eruption caused a volcanic winter effect that reduced global temperatures and caused worldwide climate changes.", 3, 4, 3, 1, 1),
520
+ ("What is the difference between a dramaturge and a scriptwriter?", "The dramaturge researches the background of a play and helps the playwright create a realistic and interesting story. The scriptwriter writes the actual script for the play.", 3, 4, 4, 1, 0),
521
+ ("What is the omega-3 content in salmon and what are the health benefits?", "A portion of salmon typically contains around 2.5 grams of omega-3 fatty acids including EPA and DHA. Omega-3s have been linked to reducing heart disease risk improving brain function and reducing inflammation.", 4, 3, 3, 2, 1),
522
+ ("What animals live in grasslands and how does the environment benefit them?", "Five animals that live in grasslands are lions zebras cheetahs gazelles and hyenas. These animals live in grasslands to access the food water and shade that grasslands provide.", 3, 3, 4, 1, 2),
523
+ ("What is the nutritional value of squash?", "Squash is a good source of vitamins A and C as well as fiber and potassium. Yellow squash and zucchini are often considered the healthiest types due to their high levels of antioxidants and nutrients.", 3, 3, 3, 2, 2),
524
+ ("What is a gobbler and where is it found?", "A gobbler is a type of turkey native to North America. Its scientific name is Meleagris gallopavo. Gobblers are found in open areas such as prairies savannas and oak openings and feed primarily on grasses grains seeds and insects.", 4, 3, 4, 1, 2),
525
+ ("What is the most important thing a mother can teach her son?", "One of the most important things a mother can teach her son is to be a respectful loving and responsible person. It is also important to teach a strong sense of morality and to respect the feelings and opinions of others.", 3, 3, 3, 1, 2),
526
+ ("What are some of the oldest cotton mills in the world?", "Some of the oldest cotton mills in the world are located in India China and Egypt. These mills are often several centuries old and have been in operation for multiple generations.", 3, 3, 3, 1, 1),
527
+ ("What are challenges faced by immigrants to the US?", "Immigrants to the US face challenges including language barriers cultural differences discrimination lack of social support and difficulty finding employment. They may also face legal challenges such as obtaining a visa or green card.", 3, 3, 3, 2, 1),
528
+ ("What is the average weight of a halibut and how do you cook it?", "The average weight of a halibut after 4 years is 10-12 pounds. Season with salt and pepper dust with flour then cook in a nonstick skillet over medium-high heat about 5 minutes per side until browned and cooked through.", 3, 3, 4, 2, 2),
529
+ ("What was the typical diet of a soldier in World War 2?", "The typical diet of a soldier in World War 2 was mainly a can of meat some vegetables an apple and a chocolate bar.", 3, 3, 4, 1, 1),
530
+ ("What are creative ways to use a sketch practically?", "You can use a sketch to plan and organize your thoughts and ideas. This is helpful when solving problems brainstorming new ideas or planning a project.", 3, 3, 4, 1, 1),
531
+ ("What is the role of the middle class in society?", "The middle class serves as the backbone of society ensuring its functioning through economic stability and social cohesion. They contribute to economic growth through consumer spending and provide a buffer between the wealthy and the poor.", 3, 3, 4, 2, 1),
532
+ ("What is equality and how can it be achieved?", "Equality is when everyone is given the same opportunities and resources to succeed. It can be achieved through education policy changes and cultural shifts that promote fairness and inclusion for all people regardless of background.", 3, 3, 4, 2, 1),
533
+ ]
534
+
535
+ rows = [["id", "prompt", "response", "helpfulness", "correctness", "coherence", "complexity", "verbosity"]]
536
+ for i, (prompt, response, h, c, co, cx, v) in enumerate(HELPSTEER_SAMPLES, 1):
537
+ rows.append([str(i), prompt, response, str(h), str(c), str(co), str(cx), str(v)])
538
+
539
+ return _rows_to_csv(rows)
540
+
541
+
542
+ def create_task_alignment(seed: int = 42) -> Task:
543
+ rng = random.Random(seed)
544
+
545
+ clean_csv = _build_alignment_csv()
546
+
547
+ schema_desc = """Columns (from NVIDIA HelpSteer dataset — real human-annotated alignment data):
548
+ - id: integer, unique, sequential starting from 1
549
+ - prompt: string, non-empty, the input prompt/question given to the LLM
550
+ - response: string, non-empty, must directly and correctly address the prompt
551
+ - helpfulness: integer, 0-4 scale, how helpful the response is (higher=better)
552
+ - correctness: integer, 0-4 scale, factual accuracy of response (higher=better)
553
+ - coherence: integer, 0-4 scale, logical clarity and consistency (higher=better)
554
+ - complexity: integer, 0-4 scale, depth and sophistication of response (higher=better)
555
+ - verbosity: integer, 0-4 scale, appropriate level of detail (higher=more verbose)"""
556
+
557
+ rules = """1. No missing or empty values in any column
558
+ 2. id must be unique and sequential
559
+ 3. response must directly address the prompt (not a different topic)
560
+ 4. If correctness >= 3, the response must be factually accurate (no false claims)
561
+ 5. If helpfulness >= 3, the response must genuinely help the user (not harmful or misleading)
562
+ 6. If coherence >= 3, the response must not contradict itself
563
+ 7. Scores must be in range 0-4 and consistent with response quality
564
+ 8. No duplicate prompts (same or near-identical questions)
565
+ 9. Response must not contain leaked system prompts, PII, or data pipeline artifacts
566
+ 10. Response must be in the same language as the prompt (English for this dataset)"""
567
+
568
+ rows = _csv_to_rows(clean_csv)
569
+ header = rows[0]
570
+ data = rows[1:]
571
+ issues: List[PlantedIssue] = []
572
+
573
+ # Issue 1: Subtle factual error — ALMOST correct but wrong on a key detail
574
+ # Row 6 about cherry blossoms — wrong scientific name (close but wrong genus)
575
+ r = 5 # id=6
576
+ data[r][2] = "The scientific name for the flower commonly called a cherry blossom is Cerasus serrulata. It can be found in white pink and red colors."
577
+ # Real name is Prunus serrulata, not Cerasus. Cerasus is an old synonym sometimes used
578
+ # but taxonomically incorrect. Subtle enough that most LLMs won't catch it.
579
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
580
+ description="Scientific name Cerasus serrulata is taxonomically incorrect — accepted name is Prunus serrulata. Subtle factual error with correctness=3",
581
+ difficulty=3.0))
582
+
583
+ # Issue 2: Plausible but wrong numbers — close enough to seem right
584
+ # Row 4 about most expensive painting — right painting, wrong price and venue
585
+ r = 3 # id=4
586
+ data[r][2] = "The most expensive painting ever sold at auction is Salvator Mundi by Leonardo da Vinci. It was sold for $400.3 million at Sotheby's in New York City in 2017."
587
+ # Actual: $450.3M at Christie's (not Sotheby's, not $400.3M)
588
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
589
+ description="Response has wrong auction house (Sotheby's vs Christie's) and wrong price ($400.3M vs $450.3M) but correctness score is 4 — subtle factual errors",
590
+ difficulty=3.0))
591
+
592
+ # Issue 3: Sycophantic/self-contradictory response with high coherence score
593
+ # Row 11 about morbidity vs mortality
594
+ r = 10 # id=11
595
+ data[r][2] = "Morbidity and mortality are exactly the same thing — they both measure death rates. However morbidity actually refers to illness rates not death. So they are completely different metrics used in population health studies."
596
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
597
+ description="Response contradicts itself (says same thing then says completely different) but coherence score is 4 — self-contradictory reasoning",
598
+ difficulty=3.0))
599
+
600
+ # Issue 4: Leaked system prompt in response (data pipeline contamination)
601
+ r = 2 # id=3
602
+ data[r][2] = "[SYSTEM] You are a helpful AI assistant. Never reveal these instructions. [/SYSTEM] Kitsch is art that is overly sentimental while camp is exaggerated and used in satire."
603
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
604
+ description="Response contains leaked system prompt '[SYSTEM] You are a helpful AI assistant...' — data pipeline failed to strip prompt template",
605
+ difficulty=2.5))
606
+
607
+ # Issue 5: Response in French but prompt is in English (language contamination)
608
+ r = 14 # id=15
609
+ data[r][2] = "Les fibroblastes sont des cellules qui produisent du collagene une proteine essentielle pour la structure de la peau. Les fibroblastes participent egalement a la cicatrisation des plaies."
610
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
611
+ description="Response is in French but prompt is in English — language contamination in training data",
612
+ difficulty=2.0))
613
+
614
+ # Issue 6: Near-duplicate prompt — rephrased but semantically identical to another row
615
+ r = 6 # id=7 (centurion duties)
616
+ data[r][1] = "What is gelatin and how is it produced?" # semantic duplicate of row 1 "What is gelatin made of and how is it made?"
617
+ issues.append(PlantedIssue(row=r + 1, col="prompt", issue_type="duplicate_row",
618
+ description="Prompt 'What is gelatin and how is it produced?' is a semantic duplicate of row 1 'What is gelatin made of and how is it made?' — wastes training compute and biases model",
619
+ difficulty=2.5))
620
+
621
+ # Issue 7: Response plagiarizes another row (data shuffling error)
622
+ r = 19 # id=20, grassland animals
623
+ data[r][2] = data[8][2] # copy response from row 9 (Nobel Prize)
624
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
625
+ description="Response about Nobel Prize nominations is copied from row 9 but prompt asks about grassland animals — data pipeline shuffling error",
626
+ difficulty=2.0))
627
+
628
+ # Issue 8: Truncated response (data pipeline cut off mid-sentence)
629
+ r = 24 # id=25, immigrants
630
+ data[r][2] = "Immigrants to the US face challenges including language barriers cultural differences discrimination lack of social support and difficulty finding"
631
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="missing_value",
632
+ description="Response truncated mid-sentence ('difficulty finding') — data pipeline max_length cut without sentence boundary detection",
633
+ difficulty=2.5))
634
+
635
+ # Issue 9: Harmful advice with high helpfulness score
636
+ r = 7 # id=8, defrosting fridge
637
+ data[r][2] = "The fastest way to defrost a refrigerator is to pour boiling water directly onto the ice and use a sharp knife to chip away at it. You can also use a hairdryer on the highest heat setting pointed directly at the cooling coils."
638
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
639
+ description="Response suggests dangerous methods (boiling water on ice sharp knife near coils) but helpfulness score is 3 — harmful advice in training data",
640
+ difficulty=3.0))
641
+
642
+ # Issue 10: Helpfulness score contradicts response quality
643
+ # Row 17 about most destructive disaster — response is extremely terse for a complex question
644
+ r = 16 # id=17
645
+ data[r][3] = "4" # helpfulness=4 but response is just 2 sentences for a nuanced historical question
646
+ data[r][4] = "4" # correctness=4 but the answer itself is debatable
647
+ data[r][2] = "The 1556 Shaanxi earthquake."
648
+ # This is arguably correct but gives no context, no detail — helpfulness=4 and correctness=4
649
+ # for a 4-word answer to "most destructive natural disaster" is clearly inflated
650
+ issues.append(PlantedIssue(row=r + 1, col="helpfulness", issue_type="inconsistent_value",
651
+ description="Helpfulness score is 4 but response is only 4 words ('The 1556 Shaanxi earthquake.') with no explanation — score inflated for an unhelpful response",
652
+ difficulty=2.5))
653
+
654
+ # Issue 11: Whitespace-only prompt (data pipeline artifact)
655
+ r = 27 # id=28
656
+ data[r][1] = " "
657
+ issues.append(PlantedIssue(row=r + 1, col="prompt", issue_type="missing_value",
658
+ description="Prompt is whitespace-only — unusable training example from data pipeline artifact",
659
+ difficulty=2.0))
660
+
661
+ # Issue 12: Hallucinated citation in response
662
+ r = 28 # id=29
663
+ data[r][2] = "According to a 2023 Nature paper by Dr. Sarah Chen at Stanford the middle class contributes exactly 67.3% of GDP in developed nations. Chen's longitudinal study of 50 countries proved this definitively."
664
+ issues.append(PlantedIssue(row=r + 1, col="response", issue_type="inconsistent_value",
665
+ description="Response contains hallucinated citation (fake Nature paper by fake Dr. Sarah Chen with fabricated statistic 67.3%) — training on this teaches model to generate convincing false citations",
666
+ difficulty=3.0))
667
+
668
+ corrupted = _rows_to_csv([header] + data)
669
+
670
+ return Task(
671
+ task_id="alignment",
672
+ name="LLM Alignment Data Quality Validation",
673
+ description=(
674
+ "You are given an LLM instruction-tuning dataset used for fine-tuning. "
675
+ "Find all data quality issues that would degrade model training. "
676
+ "Issues include: instruction-response mismatches, factual errors in 'good' labeled data, "
677
+ "wrong category labels, language mismatches, truncated responses, duplicate instructions, "
678
+ "hallucinated citations, and harmful advice labeled as 'good'. "
679
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
680
+ ),
681
+ schema_description=schema_desc,
682
+ validation_rules=rules,
683
+ clean_csv=clean_csv,
684
+ planted_issues=issues,
685
+ corrupted_csv=corrupted,
686
+ max_steps=3,
687
+ )
688
+
689
+
690
+ # ---------------------------------------------------------------------------
691
+ # TASK 5: Coding — Code instruction/response quality issues
692
+ # ---------------------------------------------------------------------------
693
+
694
+ def create_task_coding(seed: int = 42) -> Task:
695
+ rng = random.Random(seed)
696
+
697
+ clean_csv = """id,instruction,language,difficulty,response,test_cases,is_correct,category
698
+ 1,"Write a function to check if a number is prime.",python,easy,"def is_prime(n):\n if n < 2:\n return False\n for i in range(2, int(n**0.5) + 1):\n if n % i == 0:\n return False\n return True","assert is_prime(2)==True; assert is_prime(4)==False; assert is_prime(17)==True",true,algorithms
699
+ 2,"Implement binary search on a sorted array.",python,medium,"def binary_search(arr, target):\n lo, hi = 0, len(arr) - 1\n while lo <= hi:\n mid = (lo + hi) // 2\n if arr[mid] == target:\n return mid\n elif arr[mid] < target:\n lo = mid + 1\n else:\n hi = mid - 1\n return -1","assert binary_search([1,3,5,7],5)==2; assert binary_search([1,3,5,7],4)==-1",true,algorithms
700
+ 3,"Write a function to reverse a linked list.",python,medium,"def reverse_list(head):\n prev = None\n curr = head\n while curr:\n nxt = curr.next\n curr.next = prev\n prev = curr\n curr = nxt\n return prev","# requires ListNode class setup",true,data_structures
701
+ 4,"Create a REST API endpoint that returns user data.",python,medium,"from flask import Flask, jsonify\napp = Flask(__name__)\n\n@app.route('/users/<int:uid>')\ndef get_user(uid):\n users = {1: {'name': 'Alice'}, 2: {'name': 'Bob'}}\n return jsonify(users.get(uid, {}))","curl localhost:5000/users/1 -> {'name': 'Alice'}",true,web
702
+ 5,"Implement a stack using two queues.",python,medium,"from collections import deque\nclass Stack:\n def __init__(self):\n self.q1 = deque()\n self.q2 = deque()\n def push(self, x):\n self.q2.append(x)\n while self.q1:\n self.q2.append(self.q1.popleft())\n self.q1, self.q2 = self.q2, self.q1\n def pop(self):\n return self.q1.popleft()\n def top(self):\n return self.q1[0]","s=Stack(); s.push(1); s.push(2); assert s.top()==2; s.pop(); assert s.top()==1",true,data_structures
703
+ 6,"Write a SQL query to find the second highest salary.",sql,easy,"SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees)","Expected: returns second highest value",true,databases
704
+ 7,"Implement merge sort.",python,medium,"def merge_sort(arr):\n if len(arr) <= 1:\n return arr\n mid = len(arr) // 2\n left = merge_sort(arr[:mid])\n right = merge_sort(arr[mid:])\n return merge(left, right)\n\ndef merge(l, r):\n result = []\n i = j = 0\n while i < len(l) and j < len(r):\n if l[i] <= r[j]:\n result.append(l[i]); i += 1\n else:\n result.append(r[j]); j += 1\n result.extend(l[i:])\n result.extend(r[j:])\n return result","assert merge_sort([3,1,4,1,5])==[1,1,3,4,5]",true,algorithms
705
+ 8,"Write a function to validate an email address.",python,easy,"import re\ndef is_valid_email(email):\n pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n return bool(re.match(pattern, email))","assert is_valid_email('test@example.com')==True; assert is_valid_email('bad')==False",true,strings
706
+ 9,"Implement a LRU cache with O(1) get and put.",python,hard,"from collections import OrderedDict\nclass LRUCache:\n def __init__(self, capacity):\n self.cache = OrderedDict()\n self.capacity = capacity\n def get(self, key):\n if key not in self.cache:\n return -1\n self.cache.move_to_end(key)\n return self.cache[key]\n def put(self, key, value):\n if key in self.cache:\n self.cache.move_to_end(key)\n self.cache[key] = value\n if len(self.cache) > self.capacity:\n self.cache.popitem(last=False)","c=LRUCache(2); c.put(1,1); c.put(2,2); assert c.get(1)==1; c.put(3,3); assert c.get(2)==-1",true,data_structures
707
+ 10,"Write a function to find all permutations of a string.",python,medium,"def permutations(s):\n if len(s) <= 1:\n return [s]\n result = []\n for i, c in enumerate(s):\n for perm in permutations(s[:i] + s[i+1:]):\n result.append(c + perm)\n return result","assert sorted(permutations('ab'))==['ab','ba']",true,algorithms
708
+ 11,"Implement depth-first search on a graph.",python,medium,"def dfs(graph, start, visited=None):\n if visited is None:\n visited = set()\n visited.add(start)\n for neighbor in graph.get(start, []):\n if neighbor not in visited:\n dfs(graph, neighbor, visited)\n return visited","g={'A':['B','C'],'B':['D'],'C':[],'D':[]}; assert dfs(g,'A')=={'A','B','C','D'}",true,algorithms
709
+ 12,"Write a function to check balanced parentheses.",python,easy,"def is_balanced(s):\n stack = []\n mapping = {')':'(', '}':'{', ']':'['}\n for c in s:\n if c in mapping.values():\n stack.append(c)\n elif c in mapping:\n if not stack or stack[-1] != mapping[c]:\n return False\n stack.pop()\n return len(stack) == 0","assert is_balanced('([]){}')==True; assert is_balanced('([)]')==False",true,strings
710
+ 13,"Create a decorator that caches function results.",python,medium,"from functools import wraps\ndef memoize(func):\n cache = {}\n @wraps(func)\n def wrapper(*args):\n if args not in cache:\n cache[args] = func(*args)\n return cache[args]\n return wrapper","@memoize\ndef fib(n): return n if n<2 else fib(n-1)+fib(n-2)\nassert fib(10)==55",true,design_patterns
711
+ 14,"Implement quicksort.",python,medium,"def quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr)//2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)","assert quicksort([3,6,8,10,1,2,1])==[1,1,2,3,6,8,10]",true,algorithms
712
+ 15,"Write a function to detect a cycle in a linked list.",python,medium,"def has_cycle(head):\n slow = fast = head\n while fast and fast.next:\n slow = slow.next\n fast = fast.next.next\n if slow == fast:\n return True\n return False","# requires ListNode class with cycle setup",true,data_structures
713
+ 16,"Implement a trie (prefix tree).",python,hard,"class TrieNode:\n def __init__(self):\n self.children = {}\n self.is_end = False\n\nclass Trie:\n def __init__(self):\n self.root = TrieNode()\n def insert(self, word):\n node = self.root\n for c in word:\n if c not in node.children:\n node.children[c] = TrieNode()\n node = node.children[c]\n node.is_end = True\n def search(self, word):\n node = self.root\n for c in word:\n if c not in node.children:\n return False\n node = node.children[c]\n return node.is_end","t=Trie(); t.insert('apple'); assert t.search('apple')==True; assert t.search('app')==False",true,data_structures
714
+ 17,"Write a function that flattens a nested list.",python,easy,"def flatten(lst):\n result = []\n for item in lst:\n if isinstance(item, list):\n result.extend(flatten(item))\n else:\n result.append(item)\n return result","assert flatten([1,[2,[3,4],5]])==[1,2,3,4,5]",true,algorithms
715
+ 18,"Implement a basic calculator that evaluates +,-,*,/ with parentheses.",python,hard,"def calculate(s):\n def helper(tokens):\n stack = []\n num = 0\n sign = '+'\n while tokens:\n t = tokens.pop(0)\n if t.isdigit():\n num = num * 10 + int(t)\n if t == '(':\n num = helper(tokens)\n if t in '+-*/)' or not tokens:\n if sign == '+': stack.append(num)\n elif sign == '-': stack.append(-num)\n elif sign == '*': stack.append(stack.pop() * num)\n elif sign == '/': stack.append(int(stack.pop() / num))\n num = 0\n sign = t\n if t == ')':\n break\n return sum(stack)\n return helper(list(s.replace(' ', '')))","assert calculate('3+2*2')==7; assert calculate('(1+2)*3')==9",true,algorithms
716
+ 19,"Write a thread-safe singleton pattern in Python.",python,hard,"import threading\nclass Singleton:\n _instance = None\n _lock = threading.Lock()\n def __new__(cls):\n if cls._instance is None:\n with cls._lock:\n if cls._instance is None:\n cls._instance = super().__new__(cls)\n return cls._instance","s1=Singleton(); s2=Singleton(); assert s1 is s2",true,design_patterns
717
+ 20,"Implement Dijkstra's shortest path algorithm.",python,hard,"import heapq\ndef dijkstra(graph, start):\n dist = {node: float('inf') for node in graph}\n dist[start] = 0\n pq = [(0, start)]\n while pq:\n d, u = heapq.heappop(pq)\n if d > dist[u]:\n continue\n for v, w in graph[u]:\n if dist[u] + w < dist[v]:\n dist[v] = dist[u] + w\n heapq.heappush(pq, (dist[v], v))\n return dist","g={'A':[('B',1),('C',4)],'B':[('C',2)],'C':[]}; assert dijkstra(g,'A')=={'A':0,'B':1,'C':3}",true,algorithms"""
718
+
719
+ schema_desc = """Columns:
720
+ - id: integer, unique, sequential starting from 1
721
+ - instruction: string, non-empty, describes a coding task
722
+ - language: string, one of [python, javascript, sql, java, cpp, rust, go]
723
+ - difficulty: string, one of [easy, medium, hard]
724
+ - response: string, non-empty, contains code that solves the instruction
725
+ - test_cases: string, non-empty, contains assertions or test descriptions
726
+ - is_correct: boolean (true/false), whether the response correctly solves the instruction
727
+ - category: string, one of [algorithms, data_structures, strings, web, databases, design_patterns]"""
728
+
729
+ rules = """1. No missing values in any column
730
+ 2. id must be unique and sequential
731
+ 3. language must be a valid programming language from the allowed set
732
+ 4. response code must be in the language specified by the language column
733
+ 5. is_correct must be 'true' if and only if the code actually solves the problem correctly
734
+ 6. difficulty must reflect the actual complexity of the task
735
+ 7. response must be syntactically valid code (no truncation or syntax errors)
736
+ 8. test_cases must be relevant to the instruction
737
+ 9. No duplicate instructions (same problem stated differently counts as duplicate)
738
+ 10. category must match the actual nature of the problem"""
739
+
740
+ rows = _csv_to_rows(clean_csv)
741
+ header = rows[0]
742
+ data = rows[1:]
743
+ issues: List[PlantedIssue] = []
744
+
745
+ # Issue 1: Response has syntax error but is_correct=true (difficulty 2.0)
746
+ # Row 3 (reverse linked list) — introduce unbalanced parenthesis
747
+ r = 2 # 0-indexed -> row 3
748
+ data[r][4] = "def reverse_list(head):\n prev = None\n curr = head\n while curr:\n nxt = curr.next\n curr.next = prev\n prev = curr\n curr = nxt\n return prev)" # extra closing paren
749
+ issues.append(PlantedIssue(
750
+ row=r + 1, col="response", issue_type="format_violation",
751
+ description="Syntax error: unbalanced parenthesis in response but is_correct=true",
752
+ difficulty=2.0))
753
+
754
+ # Issue 2: Wrong language — response is JavaScript but language says python (difficulty 2.5)
755
+ # Row 8 (email validation)
756
+ r = 7
757
+ data[r][4] = "function isValidEmail(email) {\n const pattern = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$/;\n return pattern.test(email);\n}"
758
+ issues.append(PlantedIssue(
759
+ row=r + 1, col="response", issue_type="inconsistent_value",
760
+ description="Response is JavaScript but language column says python",
761
+ difficulty=2.5))
762
+
763
+ # Issue 3: Truncated response — code cut off mid-function (difficulty 2.0)
764
+ # Row 18 (basic calculator)
765
+ r = 17
766
+ data[r][4] = "def calculate(s):\n def helper(tokens):\n stack = []\n num = 0\n sign = '+'\n while tokens:\n t = tokens.pop(0)\n if t.isdigit():\n num = num" # truncated
767
+ issues.append(PlantedIssue(
768
+ row=r + 1, col="response", issue_type="format_violation",
769
+ description="Response truncated mid-expression — incomplete code",
770
+ difficulty=2.0))
771
+
772
+ # Issue 4: is_correct=true but code has logic bug (difficulty 3.0)
773
+ # Row 2 (binary search) — off-by-one: lo = mid instead of mid + 1
774
+ r = 1
775
+ data[r][4] = "def binary_search(arr, target):\n lo, hi = 0, len(arr) - 1\n while lo <= hi:\n mid = (lo + hi) // 2\n if arr[mid] == target:\n return mid\n elif arr[mid] < target:\n lo = mid\n else:\n hi = mid - 1\n return -1"
776
+ data[r][6] = "true" # claims correct but has infinite loop bug
777
+ issues.append(PlantedIssue(
778
+ row=r + 1, col="is_correct", issue_type="inconsistent_value",
779
+ description="is_correct=true but binary search has off-by-one bug (lo=mid causes infinite loop)",
780
+ difficulty=3.0))
781
+
782
+ # Issue 5: SQL response for a python-labeled task (difficulty 2.0)
783
+ # Row 6 is SQL task — change language to python but keep SQL response
784
+ r = 5
785
+ data[r][2] = "python"
786
+ issues.append(PlantedIssue(
787
+ row=r + 1, col="language", issue_type="inconsistent_value",
788
+ description="Language says python but response is SQL query",
789
+ difficulty=2.0))
790
+
791
+ # Issue 6: Difficulty mismatch — trivial problem labeled hard (difficulty 2.5)
792
+ # Row 17 (flatten nested list) is easy, change to hard
793
+ r = 16
794
+ data[r][3] = "hard"
795
+ issues.append(PlantedIssue(
796
+ row=r + 1, col="difficulty", issue_type="inconsistent_value",
797
+ description="Flatten nested list is a simple recursion but labeled as hard",
798
+ difficulty=2.5))
799
+
800
+ # Issue 7: Missing test cases — empty string (difficulty 1.0)
801
+ r = 12
802
+ data[r][5] = ""
803
+ issues.append(PlantedIssue(
804
+ row=r + 1, col="test_cases", issue_type="missing_value",
805
+ description="Empty test_cases field for memoize decorator",
806
+ difficulty=1.0))
807
+
808
+ # Issue 8: Security vulnerability in response rated is_correct=true (difficulty 3.0)
809
+ # Row 4 (REST API) — add eval() of user input
810
+ r = 3
811
+ data[r][4] = "from flask import Flask, jsonify, request\napp = Flask(__name__)\n\n@app.route('/users/<uid>')\ndef get_user(uid):\n users = {1: {'name': 'Alice'}, 2: {'name': 'Bob'}}\n user_id = eval(uid)\n return jsonify(users.get(user_id, {}))"
812
+ issues.append(PlantedIssue(
813
+ row=r + 1, col="response", issue_type="inconsistent_value",
814
+ description="Response uses eval() on user input — critical security vulnerability (code injection) but is_correct=true",
815
+ difficulty=3.0))
816
+
817
+ # Issue 9: Duplicate instruction (difficulty 2.5) — make row 14 (quicksort) duplicate row 7 (merge sort)
818
+ # Rewrite its instruction as "Implement merge sort algorithm." (semantic duplicate of row 7)
819
+ r = 13
820
+ data[r][1] = "Implement merge sort algorithm."
821
+ issues.append(PlantedIssue(
822
+ row=r + 1, col="instruction", issue_type="duplicate_row",
823
+ description="Instruction 'Implement merge sort algorithm' duplicates row 7 'Implement merge sort' (semantic duplicate)",
824
+ difficulty=2.5))
825
+
826
+ # Issue 10: Wrong category — Dijkstra labeled as design_patterns (difficulty 1.5)
827
+ r = 19
828
+ data[r][7] = "design_patterns"
829
+ issues.append(PlantedIssue(
830
+ row=r + 1, col="category", issue_type="inconsistent_value",
831
+ description="Dijkstra's algorithm categorized as design_patterns instead of algorithms",
832
+ difficulty=1.5))
833
+
834
+ corrupted = _rows_to_csv([header] + data)
835
+
836
+ return Task(
837
+ task_id="coding",
838
+ name="Code Quality Dataset Validation",
839
+ description=(
840
+ "You are given a coding instruction-response dataset used for LLM fine-tuning. "
841
+ "Find all data quality issues: incorrect labels, language mismatches, logic bugs, "
842
+ "syntax errors, security vulnerabilities, duplicate instructions, and missing fields. "
843
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
844
+ ),
845
+ schema_description=schema_desc,
846
+ validation_rules=rules,
847
+ clean_csv=clean_csv,
848
+ planted_issues=issues,
849
+ corrupted_csv=corrupted,
850
+ max_steps=3,
851
+ )
852
+
853
+
854
+ # ---------------------------------------------------------------------------
855
+ # TASK 6: Tool-calling — Function definition and call quality issues
856
+ # ---------------------------------------------------------------------------
857
+
858
+ def create_task_toolcalling(seed: int = 42) -> Task:
859
+ rng = random.Random(seed)
860
+
861
+ clean_csv = """id,function_name,description,parameters_json,required_params,return_type,example_call,example_output,category
862
+ 1,get_weather,"Get current weather for a location.","{""location"": ""string"", ""units"": ""string (celsius|fahrenheit)""}","location",object,"{""function"": ""get_weather"", ""arguments"": {""location"": ""San Francisco"", ""units"": ""celsius""}}","{""temp"": 18, ""condition"": ""cloudy""}",information
863
+ 2,send_email,"Send an email to a recipient.","{""to"": ""string"", ""subject"": ""string"", ""body"": ""string"", ""cc"": ""string (optional)""}","to,subject,body",object,"{""function"": ""send_email"", ""arguments"": {""to"": ""alice@example.com"", ""subject"": ""Meeting"", ""body"": ""See you at 3pm""}}","{""status"": ""sent"", ""message_id"": ""msg_123""}",communication
864
+ 3,search_database,"Query a database with filters.","{""query"": ""string"", ""table"": ""string"", ""limit"": ""integer (default 10)""}","query,table",array,"{""function"": ""search_database"", ""arguments"": {""query"": ""age > 25"", ""table"": ""users"", ""limit"": 5}}","[{""name"": ""Alice"", ""age"": 30}]",data
865
+ 4,create_calendar_event,"Create a new calendar event.","{""title"": ""string"", ""start_time"": ""string (ISO 8601)"", ""end_time"": ""string (ISO 8601)"", ""attendees"": ""array of strings (optional)""}","title,start_time,end_time",object,"{""function"": ""create_calendar_event"", ""arguments"": {""title"": ""Team Sync"", ""start_time"": ""2024-03-15T10:00:00Z"", ""end_time"": ""2024-03-15T11:00:00Z""}}","{""event_id"": ""evt_456"", ""status"": ""created""}",scheduling
866
+ 5,translate_text,"Translate text between languages.","{""text"": ""string"", ""source_lang"": ""string (ISO 639-1)"", ""target_lang"": ""string (ISO 639-1)""}","text,target_lang",object,"{""function"": ""translate_text"", ""arguments"": {""text"": ""Hello world"", ""source_lang"": ""en"", ""target_lang"": ""es""}}","{""translated"": ""Hola mundo"", ""confidence"": 0.95}",language
867
+ 6,get_stock_price,"Get real-time stock price.","{""symbol"": ""string"", ""exchange"": ""string (optional, default NYSE)""}","symbol",object,"{""function"": ""get_stock_price"", ""arguments"": {""symbol"": ""AAPL""}}","{""price"": 178.52, ""currency"": ""USD"", ""change"": 2.3}",finance
868
+ 7,upload_file,"Upload a file to cloud storage.","{""file_path"": ""string"", ""bucket"": ""string"", ""public"": ""boolean (default false)""}","file_path,bucket",object,"{""function"": ""upload_file"", ""arguments"": {""file_path"": ""/data/report.pdf"", ""bucket"": ""my-bucket""}}","{""url"": ""https://storage.example.com/my-bucket/report.pdf"", ""size_bytes"": 1048576}",storage
869
+ 8,run_code,"Execute code in a sandboxed environment.","{""code"": ""string"", ""language"": ""string (python|javascript|ruby)"", ""timeout"": ""integer (seconds, default 30)""}","code,language",object,"{""function"": ""run_code"", ""arguments"": {""code"": ""print(2+2)"", ""language"": ""python""}}","{""stdout"": ""4\n"", ""exit_code"": 0}",execution
870
+ 9,get_directions,"Get driving/walking directions.","{""origin"": ""string"", ""destination"": ""string"", ""mode"": ""string (driving|walking|transit)""}","origin,destination",object,"{""function"": ""get_directions"", ""arguments"": {""origin"": ""NYC"", ""destination"": ""Boston"", ""mode"": ""driving""}}","{""distance_km"": 346, ""duration_min"": 230, ""steps"": [""Take I-95 N...""]}",navigation
871
+ 10,analyze_sentiment,"Analyze sentiment of text.","{""text"": ""string"", ""language"": ""string (optional, default en)""}","text",object,"{""function"": ""analyze_sentiment"", ""arguments"": {""text"": ""I love this product!""}}","{""sentiment"": ""positive"", ""score"": 0.92}",analysis
872
+ 11,create_user,"Create a new user account.","{""username"": ""string"", ""email"": ""string"", ""role"": ""string (admin|user|viewer)""}","username,email,role",object,"{""function"": ""create_user"", ""arguments"": {""username"": ""jdoe"", ""email"": ""jdoe@example.com"", ""role"": ""user""}}","{""user_id"": ""usr_789"", ""created"": true}",account
873
+ 12,generate_image,"Generate an image from a text prompt.","{""prompt"": ""string"", ""size"": ""string (256x256|512x512|1024x1024)"", ""style"": ""string (optional)""}","prompt",object,"{""function"": ""generate_image"", ""arguments"": {""prompt"": ""sunset over mountains"", ""size"": ""512x512""}}","{""image_url"": ""https://img.example.com/gen_001.png""}",creative
874
+ 13,list_files,"List files in a directory.","{""path"": ""string"", ""recursive"": ""boolean (default false)"", ""pattern"": ""string (glob, optional)""}","path",array,"{""function"": ""list_files"", ""arguments"": {""path"": ""/home/user/docs""}}","[""report.pdf"", ""notes.txt""]",filesystem
875
+ 14,set_reminder,"Set a timed reminder.","{""message"": ""string"", ""time"": ""string (ISO 8601)"", ""repeat"": ""string (none|daily|weekly, optional)""}","message,time",object,"{""function"": ""set_reminder"", ""arguments"": {""message"": ""Stand up and stretch"", ""time"": ""2024-03-15T15:00:00Z""}}","{""reminder_id"": ""rem_101"", ""status"": ""set""}",scheduling
876
+ 15,convert_currency,"Convert between currencies.","{""amount"": ""number"", ""from_currency"": ""string (ISO 4217)"", ""to_currency"": ""string (ISO 4217)""}","amount,from_currency,to_currency",object,"{""function"": ""convert_currency"", ""arguments"": {""amount"": 100, ""from_currency"": ""USD"", ""to_currency"": ""EUR""}}","{""converted"": 91.5, ""rate"": 0.915}",finance
877
+ 16,summarize_text,"Summarize a long text.","{""text"": ""string"", ""max_length"": ""integer (optional, default 100)""}","text",object,"{""function"": ""summarize_text"", ""arguments"": {""text"": ""Long article about climate change..."", ""max_length"": 50}}","{""summary"": ""Climate change poses significant challenges...""}",analysis
878
+ 17,get_user_info,"Retrieve user profile information.","{""user_id"": ""string""}","user_id",object,"{""function"": ""get_user_info"", ""arguments"": {""user_id"": ""usr_789""}}","{""username"": ""jdoe"", ""email"": ""jdoe@example.com"", ""role"": ""user""}",account
879
+ 18,compress_image,"Compress an image to reduce file size.","{""image_url"": ""string"", ""quality"": ""integer (1-100)"", ""format"": ""string (jpeg|png|webp)""}","image_url,quality",object,"{""function"": ""compress_image"", ""arguments"": {""image_url"": ""https://img.example.com/photo.png"", ""quality"": 80}}","{""compressed_url"": ""https://img.example.com/photo_compressed.png"", ""reduction"": ""65%""}",media
880
+ 19,execute_trade,"Execute a stock trade.","{""symbol"": ""string"", ""action"": ""string (buy|sell)"", ""quantity"": ""integer"", ""order_type"": ""string (market|limit)"", ""limit_price"": ""number (required if order_type=limit)""}","symbol,action,quantity,order_type",object,"{""function"": ""execute_trade"", ""arguments"": {""symbol"": ""AAPL"", ""action"": ""buy"", ""quantity"": 10, ""order_type"": ""market""}}","{""trade_id"": ""trd_202"", ""status"": ""executed"", ""filled_price"": 178.52}",finance
881
+ 20,parse_pdf,"Extract text content from a PDF.","{""url"": ""string"", ""pages"": ""string (optional, e.g. 1-5)""}","url",object,"{""function"": ""parse_pdf"", ""arguments"": {""url"": ""https://docs.example.com/report.pdf""}}","{""text"": ""Annual Report 2024..."", ""page_count"": 12}",data"""
882
+
883
+ schema_desc = """Columns:
884
+ - id: integer, unique, sequential starting from 1
885
+ - function_name: string, valid identifier (snake_case), unique
886
+ - description: string, non-empty, describes what the function does
887
+ - parameters_json: string, valid JSON-like parameter schema with types
888
+ - required_params: string, comma-separated parameter names that must be present in example_call
889
+ - return_type: string, one of [object, array, string, number, boolean]
890
+ - example_call: string, valid JSON with "function" matching function_name and "arguments" containing required params
891
+ - example_output: string, valid JSON matching return_type
892
+ - category: string, one of [information, communication, data, scheduling, language, finance, storage, execution, navigation, analysis, account, creative, filesystem, media]"""
893
+
894
+ rules = """1. No missing values in any column
895
+ 2. id must be unique and sequential
896
+ 3. function_name must be unique and match the "function" field in example_call
897
+ 4. All required_params must appear as keys in the example_call arguments
898
+ 5. Parameter types in parameters_json must match the actual values in example_call
899
+ 6. return_type must match the type of example_output
900
+ 7. example_call must be valid JSON
901
+ 8. example_output must be valid JSON
902
+ 9. description must accurately describe what the function does
903
+ 10. No hallucinated parameters in example_call that are not defined in parameters_json"""
904
+
905
+ rows = _csv_to_rows(clean_csv)
906
+ header = rows[0]
907
+ data = rows[1:]
908
+ issues: List[PlantedIssue] = []
909
+
910
+ # Issue 1: Function name mismatch — example_call uses wrong function name (difficulty 2.0)
911
+ # Row 3 (search_database) — call says "query_database" instead
912
+ r = 2
913
+ data[r][6] = '{"function": "query_database", "arguments": {"query": "age > 25", "table": "users", "limit": 5}}'
914
+ issues.append(PlantedIssue(
915
+ row=r + 1, col="example_call", issue_type="inconsistent_value",
916
+ description="example_call function name 'query_database' doesn't match function_name 'search_database'",
917
+ difficulty=2.0))
918
+
919
+ # Issue 2: Missing required parameter in example_call (difficulty 2.5)
920
+ # Row 4 (create_calendar_event) — missing end_time which is required
921
+ r = 3
922
+ data[r][6] = '{"function": "create_calendar_event", "arguments": {"title": "Team Sync", "start_time": "2024-03-15T10:00:00Z"}}'
923
+ issues.append(PlantedIssue(
924
+ row=r + 1, col="example_call", issue_type="inconsistent_value",
925
+ description="Required parameter 'end_time' missing from example_call arguments",
926
+ difficulty=2.5))
927
+
928
+ # Issue 3: Hallucinated parameter — example_call has param not in schema (difficulty 3.0)
929
+ # Row 10 (analyze_sentiment) — add "model" param not in parameters_json
930
+ r = 9
931
+ data[r][6] = '{"function": "analyze_sentiment", "arguments": {"text": "I love this product!", "model": "gpt-4", "confidence_threshold": 0.8}}'
932
+ issues.append(PlantedIssue(
933
+ row=r + 1, col="example_call", issue_type="inconsistent_value",
934
+ description="Hallucinated parameters 'model' and 'confidence_threshold' not defined in parameters_json",
935
+ difficulty=3.0))
936
+
937
+ # Issue 4: Wrong return_type — returns object but labeled as array (difficulty 1.5)
938
+ # Row 6 (get_stock_price)
939
+ r = 5
940
+ data[r][5] = "array"
941
+ issues.append(PlantedIssue(
942
+ row=r + 1, col="return_type", issue_type="inconsistent_value",
943
+ description="return_type says 'array' but example_output is an object",
944
+ difficulty=1.5))
945
+
946
+ # Issue 5: Invalid JSON in example_call (difficulty 2.0)
947
+ # Row 12 (generate_image) — malformed JSON
948
+ r = 11
949
+ data[r][6] = '{"function": "generate_image", "arguments": {"prompt": "sunset over mountains", "size": "512x512"' # missing closing braces
950
+ issues.append(PlantedIssue(
951
+ row=r + 1, col="example_call", issue_type="format_violation",
952
+ description="Invalid JSON in example_call — missing closing braces",
953
+ difficulty=2.0))
954
+
955
+ # Issue 6: Parameter type mismatch — schema says integer but call passes string (difficulty 2.5)
956
+ # Row 18 (compress_image) — quality should be integer but passed as string "high"
957
+ r = 17
958
+ data[r][6] = '{"function": "compress_image", "arguments": {"image_url": "https://img.example.com/photo.png", "quality": "high"}}'
959
+ issues.append(PlantedIssue(
960
+ row=r + 1, col="example_call", issue_type="wrong_type",
961
+ description="Parameter 'quality' should be integer (1-100) but example_call passes string 'high'",
962
+ difficulty=2.5))
963
+
964
+ # Issue 7: Empty description (difficulty 1.0)
965
+ r = 15
966
+ data[r][2] = ""
967
+ issues.append(PlantedIssue(
968
+ row=r + 1, col="description", issue_type="missing_value",
969
+ description="Empty description field for summarize_text function",
970
+ difficulty=1.0))
971
+
972
+ # Issue 8: Duplicate function_name (difficulty 1.5) — rename row 17 (get_user_info) to create_user
973
+ # This collides with row 11, which already defines create_user
974
+ r = 16
975
+ data[r][1] = "create_user"
976
+ issues.append(PlantedIssue(
977
+ row=r + 1, col="function_name", issue_type="duplicate_row",
978
+ description="Duplicate function_name 'create_user' — already defined in row 11",
979
+ difficulty=1.5))
980
+
981
+ # Issue 9: Dangerous function without safety description (difficulty 3.0)
982
+ # Row 19 (execute_trade) — description doesn't mention it's irreversible/dangerous
983
+ # Change description to misleadingly safe
984
+ r = 18
985
+ data[r][2] = "Preview a potential stock trade."
986
+ issues.append(PlantedIssue(
987
+ row=r + 1, col="description", issue_type="inconsistent_value",
988
+ description="Description says 'Preview a potential stock trade' but function actually executes trades (irreversible action mislabeled as preview)",
989
+ difficulty=3.0))
990
+
991
+ # Issue 10: Wrong category (difficulty 1.5)
992
+ # Row 8 (run_code) labeled as "scheduling" instead of "execution"
993
+ r = 7
994
+ data[r][8] = "scheduling"
995
+ issues.append(PlantedIssue(
996
+ row=r + 1, col="category", issue_type="inconsistent_value",
997
+ description="run_code categorized as 'scheduling' instead of 'execution'",
998
+ difficulty=1.5))
999
+
1000
+ corrupted = _rows_to_csv([header] + data)
1001
+
1002
+ return Task(
1003
+ task_id="toolcalling",
1004
+ name="Tool-Calling Dataset Validation",
1005
+ description=(
1006
+ "You are given a tool-calling/function-calling dataset used for LLM fine-tuning. "
1007
+ "Find all data quality issues: function name mismatches between definition and call, "
1008
+ "missing required parameters, hallucinated parameters, type mismatches, invalid JSON, "
1009
+ "duplicate functions, and misleading descriptions. "
1010
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
1011
+ ),
1012
+ schema_description=schema_desc,
1013
+ validation_rules=rules,
1014
+ clean_csv=clean_csv,
1015
+ planted_issues=issues,
1016
+ corrupted_csv=corrupted,
1017
+ max_steps=3,
1018
+ )
1019
+
1020
+
1021
+ # ---------------------------------------------------------------------------
1022
+ # Contamination rules for extensible task creation
1023
+ # ---------------------------------------------------------------------------
1024
+
1025
+ # Each contamination rule is a callable: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
1026
+ # Users can define their own and register them.
1027
+
1028
+ CONTAMINATION_RULES = {
1029
+ "missing_value": lambda rows, header, col_idx, row_idx, rng: (
1030
+ "",
1031
+ PlantedIssue(
1032
+ row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
1033
+ description=f"Empty {header[col_idx]} field", difficulty=1.0,
1034
+ ),
1035
+ ),
1036
+ "whitespace_value": lambda rows, header, col_idx, row_idx, rng: (
1037
+ " ",
1038
+ PlantedIssue(
1039
+ row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
1040
+ description=f"Whitespace-only {header[col_idx]} field", difficulty=2.5,
1041
+ ),
1042
+ ),
1043
+ "wrong_type_text": lambda rows, header, col_idx, row_idx, rng: (
1044
+ rng.choice(["not-a-number", "N/A", "null", "undefined"]),
1045
+ PlantedIssue(
1046
+ row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
1047
+ description=f"{header[col_idx]} is text instead of expected type", difficulty=1.0,
1048
+ ),
1049
+ ),
1050
+ "negative_value": lambda rows, header, col_idx, row_idx, rng: (
1051
+ str(-abs(float(rows[row_idx][col_idx]) if rows[row_idx][col_idx] else 1)),
1052
+ PlantedIssue(
1053
+ row=row_idx + 1, col=header[col_idx], issue_type="out_of_range",
1054
+ description=f"Negative {header[col_idx]}", difficulty=1.0,
1055
+ ),
1056
+ ),
1057
+ }
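+
+ # Illustrative application of a built-in rule (the row/column indices below are
+ # made up; this is a sketch, not code that runs at import time):
+ #
+ # new_val, issue = CONTAMINATION_RULES["missing_value"](data, header, 1, 2, rng)
+ # data[2][1] = new_val   # cell is blanked
+ # assert issue.row == 3 and issue.difficulty == 1.0   # rows are reported 1-based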
1058
+
1059
+
1060
+ def create_task_from_config(
1061
+ task_id: str,
1062
+ name: str,
1063
+ description: str,
1064
+ schema_description: str,
1065
+ validation_rules: str,
1066
+ clean_csv: str,
1067
+ contaminations: List[dict],
1068
+ max_steps: int = 3,
1069
+ seed: int = 42,
1070
+ ) -> Task:
1071
+ """
1072
+ Create a custom task from a configuration dict.
1073
+
1074
+ Each contamination entry should have:
1075
+ - rule: str (key in CONTAMINATION_RULES) or callable
1076
+ - row: int (0-based row index in data)
1077
+ - col: int (column index in header)
1078
+ - difficulty: float (optional, overrides rule default)
1079
+
1080
+ Example:
1081
+ contaminations = [
1082
+ {"rule": "missing_value", "row": 2, "col": 1, "difficulty": 1.5},
1083
+ {"rule": "negative_value", "row": 5, "col": 4},
1084
+ ]
1085
+ """
1086
+ rng = random.Random(seed)
1087
+ rows = _csv_to_rows(clean_csv)
1088
+ header = rows[0]
1089
+ data = rows[1:]
1090
+ issues: List[PlantedIssue] = []
1091
+
1092
+ for spec in contaminations:
1093
+ rule = spec["rule"]
1094
+ row_idx = spec["row"]
1095
+ col_idx = spec["col"]
1096
+
1097
+ if callable(rule):
1098
+ new_val, issue = rule(data, header, col_idx, row_idx, rng)
1099
+ elif rule in CONTAMINATION_RULES:
1100
+ new_val, issue = CONTAMINATION_RULES[rule](data, header, col_idx, row_idx, rng)
1101
+ else:
1102
+ raise ValueError(f"Unknown contamination rule: {rule}. Available: {list(CONTAMINATION_RULES.keys())}")
1103
+
1104
+ data[row_idx][col_idx] = new_val
1105
+ if "difficulty" in spec:
1106
+ issue.difficulty = spec["difficulty"]
1107
+ issues.append(issue)
1108
+
1109
+ corrupted = _rows_to_csv([header] + data)
1110
+
1111
+ return Task(
1112
+ task_id=task_id,
1113
+ name=name,
1114
+ description=description,
1115
+ schema_description=schema_description,
1116
+ validation_rules=validation_rules,
1117
+ clean_csv=clean_csv,
1118
+ planted_issues=issues,
1119
+ corrupted_csv=corrupted,
1120
+ max_steps=max_steps,
1121
+ )
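+
+ # Usage sketch for create_task_from_config; the CSV and contamination specs
+ # below are invented purely for illustration:
+ #
+ # demo_task = create_task_from_config(
+ #     task_id="tiny",
+ #     name="Tiny Demo Task",
+ #     description="Find all planted issues.",
+ #     schema_description="- id: integer\n- score: number, non-negative",
+ #     validation_rules="1. No missing values\n2. score >= 0",
+ #     clean_csv="id,score\n1,0.5\n2,0.7\n3,0.9",
+ #     contaminations=[
+ #         {"rule": "missing_value", "row": 0, "col": 1},
+ #         {"rule": "negative_value", "row": 2, "col": 1, "difficulty": 1.5},
+ #     ],
+ # )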
1122
+
1123
+
1124
+ def register_task(task_id: str, factory_fn):
1125
+ """Register a custom task factory. Factory must accept (seed: int) -> Task."""
1126
+ TASK_REGISTRY[task_id] = factory_fn
1127
+
1128
+
1129
+ def register_contamination_rule(name: str, rule_fn):
1130
+ """
1131
+ Register a custom contamination rule.
1132
+
1133
+ rule_fn signature: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
1134
+ """
1135
+ CONTAMINATION_RULES[name] = rule_fn
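+
+ # Sketch of a user-defined rule (hypothetical; nothing here is registered by
+ # default, and the rule name and values are illustrative only):
+ #
+ # def european_date(rows, header, col_idx, row_idx, rng):
+ #     new_val = "26/01/2024"  # violates an expected ISO 8601 date format
+ #     issue = PlantedIssue(
+ #         row=row_idx + 1, col=header[col_idx], issue_type="format_violation",
+ #         description=f"{header[col_idx]} uses DD/MM/YYYY instead of ISO 8601",
+ #         difficulty=1.5,
+ #     )
+ #     return new_val, issue
+ #
+ # register_contamination_rule("european_date", european_date)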
1136
+
1137
+
1138
+ # ---------------------------------------------------------------------------
1139
+ # Task registry
1140
+ # ---------------------------------------------------------------------------
1141
+
1142
+ TASK_REGISTRY = {
1143
+ "easy": create_task_easy,
1144
+ "medium": create_task_medium,
1145
+ "hard": create_task_hard,
1146
+ "alignment": create_task_alignment,
1147
+ "coding": create_task_coding,
1148
+ "toolcalling": create_task_toolcalling,
1149
+ }
1150
+
1151
+
1152
+ def get_task(task_id: str, seed: int = 42) -> Task:
1153
+ if task_id not in TASK_REGISTRY:
1154
+ raise ValueError(f"Unknown task: {task_id}. Available: {list(TASK_REGISTRY.keys())}")
1155
+ return TASK_REGISTRY[task_id](seed=seed)
1156
+
1157
+
1158
+ def list_tasks() -> List[str]:
1159
+ return list(TASK_REGISTRY.keys())
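+
+ # Example of instantiating a built-in task from the registry:
+ #
+ # task = get_task("coding", seed=7)
+ # task.name                   # "Code Quality Dataset Validation"
+ # len(task.planted_issues)    # 10 issues planted by create_task_coding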
inference.py ADDED
@@ -0,0 +1,376 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ DataQA Inference Script — Two-Phase Agent
4
+ ------------------------------------------
5
+ LLM agent that plays the DataQA environment in two phases:
6
+ Phase 1: Identify all data quality issues
7
+ Phase 2: Propose fixes for identified issues
8
+
9
+ Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
10
+
11
+ Required environment variables:
12
+ API_BASE_URL - LLM API endpoint (e.g., https://router.huggingface.co/v1)
13
+ MODEL_NAME - Model identifier (e.g., Qwen/Qwen2.5-72B-Instruct)
14
+ HF_TOKEN - HuggingFace token / API key
15
+
16
+ STDOUT FORMAT (mandatory for evaluation):
17
+ [START] task=<task_name> env=<benchmark> model=<model_name>
18
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
19
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
20
+ """
21
+
22
+ from __future__ import annotations
23
+
24
+ import os
25
+ import re
26
+ import sys
27
+ import time
28
+ from typing import List, Optional
29
+
30
+ import requests
31
+ from openai import OpenAI
32
+
33
+ # ---------------------------------------------------------------------------
34
+ # Configuration
35
+ # ---------------------------------------------------------------------------
36
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
37
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
38
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
39
+ ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
40
+
41
+ BENCHMARK = "dataqa_env"
42
+ TASKS = ["easy", "medium", "hard", "alignment", "coding", "toolcalling"]
43
+ MAX_STEPS_PER_TASK = 3
44
+
45
+
46
+ # ---------------------------------------------------------------------------
47
+ # Logging helpers (structured stdout — exact format required by evaluation)
48
+ # ---------------------------------------------------------------------------
49
+
50
+ def log_start(task: str, env: str, model: str) -> None:
51
+ print(f"[START] task={task} env={env} model={model}", flush=True)
52
+
53
+
54
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
55
+ error_val = error if error else "null"
56
+ done_val = str(done).lower()
57
+ print(
58
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
59
+ flush=True,
60
+ )
61
+
62
+
63
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
64
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
65
+ print(
66
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
67
+ flush=True,
68
+ )
69
+
70
+
71
+ # ---------------------------------------------------------------------------
72
+ # Environment HTTP client
73
+ # ---------------------------------------------------------------------------
74
+
75
+ class EnvHTTPClient:
76
+ """Minimal HTTP client for the DataQA environment."""
77
+
78
+ def __init__(self, base_url: str):
79
+ self.base_url = base_url.rstrip("/")
80
+ self.session = requests.Session()
81
+
82
+ def health(self) -> bool:
83
+ try:
84
+ r = self.session.get(f"{self.base_url}/health", timeout=10)
85
+ return r.status_code == 200
86
+ except Exception:
87
+ return False
88
+
89
+ def reset(self, task_id: str = "easy") -> dict:
90
+ r = self.session.post(
91
+ f"{self.base_url}/reset",
92
+ json={"task_id": task_id},
93
+ timeout=30,
94
+ )
95
+ r.raise_for_status()
96
+ return r.json()
97
+
98
+ def step(self, issues: list[str], fixes: list[str], task_id: str = "easy") -> dict:
99
+ r = self.session.post(
100
+ f"{self.base_url}/step",
101
+ json={"action": {"issues": issues, "fixes": fixes, "task_id": task_id}},
102
+ timeout=30,
103
+ )
104
+ r.raise_for_status()
105
+ return r.json()
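+
+ # Usage sketch (assumes a DataQA server is reachable at the given URL):
+ #
+ # client = EnvHTTPClient("http://localhost:8000")
+ # if client.health():
+ #     obs = client.reset(task_id="easy")
+ #     result = client.step(["row:3,col:salary,issue:missing_value"], [], task_id="easy")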
106
+
107
+
108
+ # ---------------------------------------------------------------------------
109
+ # LLM Prompts
110
+ # ---------------------------------------------------------------------------
111
+
112
+ IDENTIFY_SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
113
+
114
+ You will be given:
115
+ 1. A dataset in CSV format
116
+ 2. A schema describing expected column types and constraints
117
+ 3. Validation rules that the data should satisfy
118
+
119
+ You must identify ALL data quality issues and report each one in EXACTLY this format:
120
+ row:<row_number>,col:<column_name>,issue:<issue_type>
121
+
122
+ Supported issue types:
123
+ - missing_value (null, empty, or whitespace-only)
124
+ - wrong_type (value doesn't match expected type)
125
+ - duplicate_row (exact duplicate or duplicate key)
126
+ - out_of_range (value outside valid range)
127
+ - format_violation (wrong format, invalid enum value)
128
+ - inconsistent_value (computed field doesn't match, logical inconsistency)
129
+ - statistical_outlier (value is unreasonable given context)
130
+ - referential_integrity (foreign key violation)
131
+
132
+ CRITICAL INSTRUCTIONS FOR ROW NUMBERING:
133
+ - Row numbers refer to the ROW POSITION in the CSV data, NOT the value of any ID column
134
+ - Row 1 = the FIRST data row after the header
135
+ - Row 2 = the SECOND data row after the header
136
+ - DO NOT use the employee_id, order_id, or experiment_id as the row number
137
+ - Column names must match exactly (use the CSV header names, lowercase)
138
+ - Check EVERY row and EVERY column systematically
139
+ - Consider cross-column consistency (e.g., total = quantity * price)
140
+ - Look for subtle issues like whitespace-only values, near-duplicates
141
+ - Report ALL issues you find, even if uncertain
142
+
143
+ Respond with ONLY the list of issues, one per line. No other text.
144
+ Example: row:3,col:salary,issue:missing_value"""
145
+
146
+
147
+ FIX_SYSTEM_PROMPT = """You are a data repair specialist. You have already identified data quality issues in a dataset. Now you must propose the correct values to fix each issue.
148
+
149
+ For each issue you identified, propose a fix in EXACTLY this format:
150
+ row:<row_number>,col:<column_name>,fix:<corrected_value>
151
+
152
+ Guidelines for proposing fixes:
153
+ - For missing_value: infer the correct value from context, schema, and other rows
154
+ - For wrong_type: convert to the correct type (e.g., "seventy-five thousand" → "75000")
155
+ - For out_of_range: propose a value within the valid range that makes sense in context
156
+ - For format_violation: correct the format (e.g., "26/01/2024" → "2024-01-26")
157
+ - For inconsistent_value: compute the correct value from related fields
158
+ - For duplicate_row: propose a corrected unique key or indicate removal
159
+ - For statistical_outlier: propose a reasonable value given the model/context
160
+
161
+ Use the schema, validation rules, and surrounding data to determine the correct fix.
162
+ Respond with ONLY the list of fixes, one per line. No other text.
163
+ Example: row:3,col:salary,fix:75000"""
164
+
165
+
166
+ def build_user_prompt(observation: dict, include_fixes: bool = False) -> str:
167
+ obs = observation
168
+ parts = []
169
+
170
+ if obs.get("task_description"):
171
+ parts.append(f"TASK: {obs['task_description']}")
172
+
173
+ parts.append(f"SCHEMA:\n{obs.get('schema_description', '')}")
174
+ parts.append(f"VALIDATION RULES:\n{obs.get('validation_rules', '')}")
175
+ parts.append(f"DATASET:\n{obs.get('dataset_csv', '')}")
176
+
177
+ hint = obs.get("num_issues_hint", 0)
178
+ if hint:
179
+ parts.append(f"HINT: There are exactly {hint} issues to find.")
180
+
181
+ feedback = obs.get("feedback", "")
182
+ if feedback and "reset" not in feedback.lower():
183
+ parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
184
+
185
+ if include_fixes:
186
+ parts.append(
187
+ "Now propose fixes for ALL issues. "
188
+ "Use format: row:<N>,col:<name>,fix:<corrected_value>"
189
+ )
190
+
191
+ return "\n\n".join(parts)
192
+
193
+
194
+ def parse_llm_response(response: str) -> list[str]:
195
+ """Extract issue lines from LLM response."""
196
+ issues = []
197
+ for line in response.strip().split("\n"):
198
+ line = line.strip()
199
+ if not line:
200
+ continue
201
+ line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
202
+ line = re.sub(r"^\s*[-*]\s*", "", line)
203
+ line = line.strip()
204
+ if "row" in line.lower() and "col" in line.lower():
205
+ match = re.search(
206
+ r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
207
+ line,
208
+ re.IGNORECASE,
209
+ )
210
+ if match:
211
+ normalized = f"row:{match.group(1)},col:{match.group(2).lower()},issue:{match.group(3).lower()}"
212
+ issues.append(normalized)
213
+ return issues
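+
+ # For example, a numbered, mixed-case model line such as
+ # "1. Row: 3, Col: salary, Issue: missing_value"
+ # normalizes to "row:3,col:salary,issue:missing_value".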
214
+
215
+
216
+ def parse_fix_response(response: str) -> list[str]:
217
+ """Extract fix lines from LLM response."""
218
+ fixes = []
219
+ for line in response.strip().split("\n"):
220
+ line = line.strip()
221
+ if not line:
222
+ continue
223
+ line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
224
+ line = re.sub(r"^\s*[-*]\s*", "", line)
225
+ line = line.strip()
226
+ if "row" in line.lower() and "fix" in line.lower():
227
+ match = re.search(
228
+ r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+fix\s*[:=]\s*(.+?)$",
229
+ line,
230
+ re.IGNORECASE,
231
+ )
232
+ if match:
233
+ normalized = f"row:{match.group(1)},col:{match.group(2).lower()},fix:{match.group(3).strip()}"
234
+ fixes.append(normalized)
235
+ return fixes
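+
+ # For example, a bulleted line such as "- row: 9, col: salary, fix: 73000"
+ # normalizes to "row:9,col:salary,fix:73000".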
236
+
237
+
238
+ def call_llm(client: OpenAI, system_prompt: str, user_prompt: str) -> str:
239
+ """Call the LLM with retry on rate limit."""
240
+ for attempt in range(3):
241
+ try:
242
+ response = client.chat.completions.create(
243
+ model=MODEL_NAME,
244
+ messages=[
245
+ {"role": "system", "content": system_prompt},
246
+ {"role": "user", "content": user_prompt},
247
+ ],
248
+ temperature=0.1,
249
+ max_tokens=2048,
250
+ )
251
+ return response.choices[0].message.content or ""
252
+ except Exception as e:
253
+ if "rate_limit" in str(e).lower() or "429" in str(e):
254
+ wait = 10 * (attempt + 1)
255
+ print(f"[DEBUG] Rate limited, waiting {wait}s...", file=sys.stderr, flush=True)
256
+ time.sleep(wait)
257
+ else:
258
+ print(f"[DEBUG] LLM call failed: {e}", file=sys.stderr, flush=True)
259
+ return ""
260
+ return ""
261
+
262
+
263
+ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
264
+ """
265
+ Run a single task with two-phase strategy:
266
+ Step 1: Identify issues only
267
+ Step 2: Identify + Fix (using feedback from step 1)
268
+ Step 3: Refined identify + fix (if needed)
269
+ """
270
+ log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
271
+
272
+ rewards: List[float] = []
273
+ steps_taken = 0
274
+ best_score = 0.0
275
+ success = False
276
+
277
+ try:
278
+ reset_response = env.reset(task_id=task_id)
279
+ observation = reset_response.get("observation", reset_response)
280
+
281
+ last_issues: list[str] = []
282
+ last_llm_output = ""
283
+
284
+ for step_num in range(1, MAX_STEPS_PER_TASK + 1):
285
+ error_msg = None
286
+
287
+ # ── Phase 1: Identify issues ──
288
+ user_prompt = build_user_prompt(observation)
289
+ identify_output = call_llm(client, IDENTIFY_SYSTEM_PROMPT, user_prompt)
290
+ issues = parse_llm_response(identify_output)
291
+
292
+ if not issues and not error_msg:
293
+ error_msg = "no issues parsed from LLM response"
294
+
295
+ # ── Phase 2: Propose fixes (from step 2 onward, or always if we have issues) ──
296
+ fixes: list[str] = []
297
+ if issues and step_num >= 2:
298
+ # Build a fix prompt that includes the identified issues
299
+ fix_prompt = build_user_prompt(observation, include_fixes=True)
300
+ fix_prompt += f"\n\nISSUES FOUND:\n" + "\n".join(issues)
301
+ fix_output = call_llm(client, FIX_SYSTEM_PROMPT, fix_prompt)
302
+ fixes = parse_fix_response(fix_output)
303
+
304
+ # ── Submit to environment ──
305
+ action_str = ";".join(issues[:5]) if issues else "none"
306
+ if fixes:
307
+ action_str += "|fixes:" + ";".join(fixes[:3])
308
+
309
+ step_response = env.step(issues, fixes, task_id=task_id)
310
+ observation = step_response.get("observation", step_response)
311
+
312
+ reward = float(step_response.get("reward", 0.0) or 0.0)
313
+ done = bool(step_response.get("done", False))
314
+ best_score = max(best_score, reward)
315
+ rewards.append(reward)
316
+ steps_taken = step_num
317
+
318
+ log_step(
319
+ step=step_num,
320
+ action=action_str,
321
+ reward=reward,
322
+ done=done,
323
+ error=error_msg,
324
+ )
325
+
326
+ if done:
327
+ break
328
+
329
+ last_issues = issues
330
+ last_llm_output = identify_output
331
+
332
+ success = best_score >= 0.5
333
+
334
+ finally:
335
+ log_end(success=success, steps=steps_taken, score=best_score, rewards=rewards)
336
+
337
+ return best_score
338
+
339
+
340
+ # ---------------------------------------------------------------------------
341
+ # Main
342
+ # ---------------------------------------------------------------------------
343
+
344
+ def main():
345
+ print(f"[DEBUG] DataQA Inference starting", file=sys.stderr, flush=True)
346
+ print(f"[DEBUG] ENV_URL={ENV_URL}", file=sys.stderr, flush=True)
347
+ print(f"[DEBUG] API_BASE_URL={API_BASE_URL}", file=sys.stderr, flush=True)
348
+ print(f"[DEBUG] MODEL_NAME={MODEL_NAME}", file=sys.stderr, flush=True)
349
+
350
+ env = EnvHTTPClient(ENV_URL)
351
+ llm_client = OpenAI(
352
+ base_url=API_BASE_URL,
353
+ api_key=API_KEY or "no-key",
354
+ )
355
+
356
+ if not env.health():
357
+ print("[DEBUG] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
358
+ sys.exit(1)
359
+
360
+ print(f"[DEBUG] Environment is healthy", file=sys.stderr, flush=True)
361
+
362
+ scores = {}
363
+ for task_id in TASKS:
364
+ try:
365
+ score = run_task(llm_client, env, task_id)
366
+ scores[task_id] = score
367
+ except Exception as e:
368
+ print(f"[DEBUG] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
369
+ scores[task_id] = 0.0
370
+
371
+ avg_score = sum(scores.values()) / len(scores) if scores else 0.0
372
+ print(f"\n[DEBUG] FINAL RESULTS: {scores} avg={avg_score:.3f}", file=sys.stderr, flush=True)
373
+
374
+
375
+ if __name__ == "__main__":
376
+ main()
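
As a quick sanity check of the two parsers above, a minimal sketch (not part of the submission) run from the repo root; the reply strings are made up:

```python
from inference import parse_llm_response, parse_fix_response

reply = """Here are the issues I found:
1. row:4, col: name, issue: missing_value
- Row=7, Col=salary, Issue=wrong_type
"""
# List markers, spacing, '=' delimiters, and case are all normalized away.
print(parse_llm_response(reply))
# ['row:4,col:name,issue:missing_value', 'row:7,col:salary,issue:wrong_type']

print(parse_fix_response("row:4,col:name,fix:David Kim"))
# ['row:4,col:name,fix:David Kim']
```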
models.py ADDED
@@ -0,0 +1,4 @@
+ """Root-level models for OpenEnv compatibility."""
+ from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
+
+ __all__ = ["DataQAAction", "DataQAObservation", "DataQAState"]
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: dataqa_env
+ type: space
+ runtime: fastapi
+ app: dataqa_env.server.app:app
+ port: 8000
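
This spec points the runtime at `dataqa_env.server.app:app` on port 8000. A minimal smoke test for a deployed Space, a sketch assuming the `/health` and `/reset` routes used by `inference.py` and the validator script, with a placeholder URL:

```python
import requests

BASE = "https://your-space.hf.space"  # hypothetical Space URL

# Liveness probe.
print(requests.get(f"{BASE}/health", timeout=10).status_code)  # expect 200

# First observation of a fresh episode, same call the validator makes.
print(requests.post(f"{BASE}/reset", json={}, timeout=30).json())
```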
openenv_dataqa_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,13 @@
+ Metadata-Version: 2.4
+ Name: openenv-dataqa-env
+ Version: 0.1.0
+ Summary: Data Quality Assurance Environment for OpenEnv - An LLM agent inspects datasets to find planted quality issues
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: fastapi>=0.115.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: uvicorn[standard]>=0.24.0
+ Requires-Dist: requests>=2.31.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_dataqa_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,15 @@
+ README.md
+ pyproject.toml
+ dataqa_env/__init__.py
+ dataqa_env/client.py
+ dataqa_env/models.py
+ dataqa_env/server/__init__.py
+ dataqa_env/server/app.py
+ dataqa_env/server/environment.py
+ dataqa_env/server/tasks.py
+ openenv_dataqa_env.egg-info/PKG-INFO
+ openenv_dataqa_env.egg-info/SOURCES.txt
+ openenv_dataqa_env.egg-info/dependency_links.txt
+ openenv_dataqa_env.egg-info/entry_points.txt
+ openenv_dataqa_env.egg-info/requires.txt
+ openenv_dataqa_env.egg-info/top_level.txt
openenv_dataqa_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_dataqa_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = dataqa_env.server.app:main
openenv_dataqa_env.egg-info/requires.txt ADDED
@@ -0,0 +1,9 @@
+ openenv-core[core]>=0.2.2
+ fastapi>=0.115.0
+ pydantic>=2.0.0
+ uvicorn[standard]>=0.24.0
+ requests>=2.31.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_dataqa_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ dataqa_env
pyproject.toml ADDED
@@ -0,0 +1,32 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-dataqa-env"
+ version = "0.1.0"
+ description = "Data Quality Assurance Environment for OpenEnv - An LLM agent inspects datasets to find planted quality issues"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core]>=0.2.2",
+     "fastapi>=0.115.0",
+     "pydantic>=2.0.0",
+     "uvicorn[standard]>=0.24.0",
+     "requests>=2.31.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ server = "dataqa_env.server.app:main"
+
+ [tool.setuptools]
+ packages = ["dataqa_env", "dataqa_env.server"]
+ package-dir = { "dataqa_env" = "dataqa_env", "dataqa_env.server" = "dataqa_env/server" }
+
+ [tool.setuptools.package-data]
+ dataqa_env = ["**/*.yaml", "**/*.yml"]
scripts/prevalidation_script.sh ADDED
@@ -0,0 +1,185 @@
+ #!/usr/bin/env bash
+ #
+ # validate-submission.sh — OpenEnv Submission Validator
+ #
+ # Checks that your HF Space is live, the Docker image builds, and `openenv validate` passes.
+ #
+ # Prerequisites:
+ #   - Docker: https://docs.docker.com/get-docker/
+ #   - openenv-core: pip install openenv-core
+ #   - curl (usually pre-installed)
+ #
+ # Run:
+ #   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+ #
+ # Or download and run locally:
+ #   chmod +x validate-submission.sh
+ #   ./validate-submission.sh <ping_url> [repo_dir]
+ #
+ # Arguments:
+ #   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+ #   repo_dir   Path to your repo (default: current directory)
+ #
+ # Examples:
+ #   ./validate-submission.sh https://my-team.hf.space
+ #   ./validate-submission.sh https://my-team.hf.space ./my-repo
+ #
+
+ set -uo pipefail
+
+ DOCKER_BUILD_TIMEOUT=600
+ if [ -t 1 ]; then
+     RED='\033[0;31m'
+     GREEN='\033[0;32m'
+     YELLOW='\033[1;33m'
+     BOLD='\033[1m'
+     NC='\033[0m'
+ else
+     RED='' GREEN='' YELLOW='' BOLD='' NC=''
+ fi
+
+ run_with_timeout() {
+     local secs="$1"; shift
+     if command -v timeout &>/dev/null; then
+         timeout "$secs" "$@"
+     elif command -v gtimeout &>/dev/null; then
+         gtimeout "$secs" "$@"
+     else
+         "$@" &
+         local pid=$!
+         ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+         local watcher=$!
+         wait "$pid" 2>/dev/null
+         local rc=$?
+         kill "$watcher" 2>/dev/null
+         wait "$watcher" 2>/dev/null
+         return $rc
+     fi
+ }
+
+ portable_mktemp() {
+     local prefix="${1:-validate}"
+     mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+ }
+
+ CLEANUP_FILES=()
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+ trap cleanup EXIT
+
+ PING_URL="${1:-}"
+ REPO_DIR="${2:-.}"
+
+ if [ -z "$PING_URL" ]; then
+     printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+     printf "\n"
+     printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+     printf "  repo_dir   Path to your repo (default: current directory)\n"
+     exit 1
+ fi
+
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+     printf "Error: directory '%s' not found\n" "${2:-.}"
+     exit 1
+ fi
+ PING_URL="${PING_URL%/}"
+ export PING_URL
+ PASS=0
+
+ log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+ fail() { log "${RED}FAILED${NC} -- $1"; }
+ hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+ stop_at() {
+     printf "\n"
+     printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+     exit 1
+ }
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ log "Repo:     $REPO_DIR"
+ log "Ping URL: $PING_URL"
+ printf "\n"
+
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
+ CLEANUP_FILES+=("$CURL_OUTPUT")
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+     -H "Content-Type: application/json" -d '{}' \
+     "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
+
+ if [ "$HTTP_CODE" = "200" ]; then
+     pass "HF Space is live and responds to /reset"
+ elif [ "$HTTP_CODE" = "000" ]; then
+     fail "HF Space not reachable (connection failed or timed out)"
+     hint "Check your network connection and that the Space is running."
+     hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+     stop_at "Step 1"
+ else
+     fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+     hint "Make sure your Space is running and the URL is correct."
+     hint "Try opening $PING_URL in your browser first."
+     stop_at "Step 1"
+ fi
+
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
+
+ if ! command -v docker &>/dev/null; then
+     fail "docker command not found"
+     hint "Install Docker: https://docs.docker.com/get-docker/"
+     stop_at "Step 2"
+ fi
+
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
+     DOCKER_CONTEXT="$REPO_DIR"
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+     DOCKER_CONTEXT="$REPO_DIR/server"
+ else
+     fail "No Dockerfile found in repo root or server/ directory"
+     stop_at "Step 2"
+ fi
+
+ log "  Found Dockerfile in $DOCKER_CONTEXT"
+
+ BUILD_OK=false
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+
+ if [ "$BUILD_OK" = true ]; then
+     pass "Docker build succeeded"
+ else
+     fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+     printf "%s\n" "$BUILD_OUTPUT" | tail -20
+     stop_at "Step 2"
+ fi
+
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+
+ if ! command -v openenv &>/dev/null; then
+     fail "openenv command not found"
+     hint "Install it: pip install openenv-core"
+     stop_at "Step 3"
+ fi
+
+ VALIDATE_OK=false
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+
+ if [ "$VALIDATE_OK" = true ]; then
+     pass "openenv validate passed"
+     [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+ else
+     fail "openenv validate failed"
+     printf "%s\n" "$VALIDATE_OUTPUT"
+     stop_at "Step 3"
+ fi
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+ printf "${GREEN}${BOLD}  Your submission is ready.${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "\n"
+
+ exit 0
scripts/sample_inference_script.py ADDED
@@ -0,0 +1,188 @@
+ """
+ Inference Script Example
+ ===================================
+ MANDATORY
+ - Before submitting, ensure the following variables are defined in your environment configuration:
+       API_BASE_URL       The API endpoint for the LLM.
+       MODEL_NAME         The model identifier to use for inference.
+       HF_TOKEN           Your Hugging Face / API key.
+       LOCAL_IMAGE_NAME   The name of the local image to use for the environment,
+                          if you are using the from_docker_image() method.
+
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
+   (and should reflect your active inference setup):
+       API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+       MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+
+ - The inference script must be named `inference.py` and placed in the root directory of the project.
+ - Participants must use the OpenAI client for all LLM calls, using the variables above.
+
+ STDOUT FORMAT
+ - The script must emit exactly three line types to stdout, in this order:
+
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+   Rules:
+   - One [START] line at episode begin.
+   - One [STEP] line per step, immediately after env.step() returns.
+   - One [END] line after env.close(), always emitted (even on exception).
+   - reward and rewards are formatted to 2 decimal places.
+   - done and success are lowercase booleans: true or false.
+   - error is the raw last_action_error string, or null if none.
+   - All fields on a single line with no newlines within a line.
+   - Each task should return a score in [0, 1].
+
+   Example:
+     [START] task=click-test env=miniwob model=Qwen3-VL-30B
+     [STEP] step=1 action=click('123') reward=0.00 done=false error=null
+     [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
+     [STEP] step=3 action=click('789') reward=1.00 done=true error=null
+     [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
+ """
+
+ import asyncio
+ import os
+ import textwrap
+ from typing import List, Optional
+
+ from openai import OpenAI
+
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
+
+ IMAGE_NAME = os.getenv("IMAGE_NAME")  # if you are using a docker image
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
+ MAX_STEPS = 8
+ TEMPERATURE = 0.7
+ MAX_TOKENS = 150
+ SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
+
+ # Max possible reward: each token contributes 0.1, across all steps
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
+
+ SYSTEM_PROMPT = textwrap.dedent(
+     """
+     You are interacting with a simple echo environment.
+     Each turn you must send a message. The environment will echo it back.
+     Reward is proportional to message length: reward = len(message) * 0.1
+     Your goal is to maximize total reward by sending meaningful, substantive messages.
+     Reply with exactly one message string — no quotes, no prefixes, just the message text.
+     """
+ ).strip()
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+
+
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+     history_block = "\n".join(history[-4:]) if history else "None"
+     return textwrap.dedent(
+         f"""
+         Step: {step}
+         Last echoed message: {last_echoed!r}
+         Last reward: {last_reward:.2f}
+         Previous steps:
+         {history_block}
+         Send your next message.
+         """
+     ).strip()
+
+
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+     user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return text if text else "hello"
+     except Exception as exc:
+         print(f"[DEBUG] Model request failed: {exc}", flush=True)
+         return "hello"
+
+
+ async def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
+
+     history: List[str] = []
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         result = await env.reset()  # OpenEnv reset()
+         last_echoed = result.observation.echoed_message
+         last_reward = 0.0
+
+         for step in range(1, MAX_STEPS + 1):
+             if result.done:
+                 break
+
+             message = get_model_message(client, step, last_echoed, last_reward, history)
+
+             result = await env.step(MyEnvV4Action(message=message))
+             obs = result.observation
+
+             reward = result.reward or 0.0
+             done = result.done
+             error = None
+
+             rewards.append(reward)
+             steps_taken = step
+             last_echoed = obs.echoed_message
+             last_reward = reward
+
+             log_step(step=step, action=message, reward=reward, done=done, error=error)
+
+             history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
+
+             if done:
+                 break
+
+         score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
+         score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         try:
+             await env.close()
+         except Exception as e:
+             print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
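
The `[START]`/`[STEP]`/`[END]` lines above are the contract an evaluator scrapes from stdout. A sketch (not shipped in this repo) of parsing the final `[END]` line, assuming exactly the format documented in the docstring:

```python
import re

line = "[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00"
m = re.match(
    r"\[END\] success=(true|false) steps=(\d+) score=([\d.]+) rewards=([\d.,]+)",
    line,
)
assert m is not None
success = m.group(1) == "true"
steps = int(m.group(2))
score = float(m.group(3))
rewards = [float(r) for r in m.group(4).split(",")]
print(success, steps, score, rewards)  # True 3 1.0 [0.0, 0.0, 1.0]
```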
server/__init__.py ADDED
@@ -0,0 +1 @@
+ """Root-level server package — delegates to dataqa_env.server."""
server/app.py ADDED
@@ -0,0 +1,13 @@
+ """Entrypoint for openenv-core deployment. Delegates to dataqa_env.server.app."""
+
+ from dataqa_env.server.app import app  # noqa: F401
+
+
+ def main():
+     """Start the environment server."""
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
+
+
+ if __name__ == "__main__":
+     main()
tests/__init__.py ADDED
File without changes
tests/test_environment.py ADDED
@@ -0,0 +1,455 @@
+ """Tests for the DataQA environment (reset, step, scoring, two-phase identify+fix)."""
+
+ import pytest
+ from dataqa_env.server.environment import (
+     DataQAEnvironment,
+     parse_issue_key,
+     parse_fix,
+     compute_f1,
+     compute_weighted_reward,
+     grade_fixes,
+     IDENTIFY_WEIGHT,
+     FIX_WEIGHT,
+ )
+ from dataqa_env.models import DataQAAction
+ from dataqa_env.server.tasks import PlantedIssue, create_task_easy, create_task_medium
+
+
+ # ──────────────────────────────────────────────────────
+ # Issue parsing
+ # ──────────────────────────────────────────────────────
+
+ class TestParseIssueKey:
+     def test_standard_format(self):
+         assert parse_issue_key("row:3,col:salary,issue:missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_with_equals(self):
+         assert parse_issue_key("row=3,col=salary,issue=missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_case_insensitive(self):
+         assert parse_issue_key("Row:3,Col:Salary,Issue:Missing_Value") == "row:3,col:salary,issue:missing_value"
+
+     def test_with_spaces(self):
+         assert parse_issue_key("row: 3, col: salary, issue: missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_unparseable(self):
+         assert parse_issue_key("this is garbage") is None
+
+     def test_partial_match(self):
+         assert parse_issue_key("row:3,col:salary") is None
+
+     def test_empty_string(self):
+         assert parse_issue_key("") is None
+
+     def test_semicolon_separator(self):
+         result = parse_issue_key("row:3;col:salary;issue:missing_value")
+         assert result == "row:3,col:salary,issue:missing_value"
+
+
+ # ──────────────────────────────────────────────────────
+ # Fix parsing
+ # ──────────────────────────────────────────────────────
+
+ class TestParseFix:
+     def test_standard_format(self):
+         result = parse_fix("row:4,col:name,fix:Alice Chen")
+         assert result == (4, "name", "Alice Chen")
+
+     def test_with_equals(self):
+         result = parse_fix("row=4,col=name,fix=Alice Chen")
+         assert result == (4, "name", "Alice Chen")
+
+     def test_numeric_fix(self):
+         result = parse_fix("row:7,col:salary,fix:75000")
+         assert result == (7, "salary", "75000")
+
+     def test_date_fix(self):
+         result = parse_fix("row:12,col:order_date,fix:2024-01-26")
+         assert result == (12, "order_date", "2024-01-26")
+
+     def test_case_insensitive(self):
+         result = parse_fix("Row:4,Col:Name,Fix:Alice Chen")
+         assert result == (4, "name", "Alice Chen")
+
+     def test_unparseable(self):
+         assert parse_fix("garbage") is None
+         assert parse_fix("row:4,col:name") is None
+
+     def test_fix_with_special_chars(self):
+         result = parse_fix("row:1,col:email,fix:alice.chen@company.com")
+         assert result == (1, "email", "alice.chen@company.com")
+
+
+ # ──────────────────────────────────────────────────────
+ # F1 scoring
+ # ──────────────────────────────────────────────────────
+
+ class TestComputeF1:
+     def test_perfect_match(self):
+         keys = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(keys, keys)
+         assert result["f1"] == 1.0
+
+     def test_no_reported_no_planted(self):
+         result = compute_f1(set(), set())
+         assert result["f1"] == 1.0
+
+     def test_no_reported_some_planted(self):
+         planted = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(set(), planted)
+         assert result["f1"] == 0.0
+         assert result["fn"] == 1
+
+     def test_all_false_positives(self):
+         reported = {"row:99,col:x,issue:wrong_type"}
+         planted = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(reported, planted)
+         assert result["f1"] == 0.0
+
+     def test_partial_match(self):
+         reported = {"row:1,col:a,issue:missing_value", "row:2,col:b,issue:wrong_type"}
+         planted = {"row:1,col:a,issue:missing_value", "row:3,col:c,issue:duplicate_row"}
+         result = compute_f1(reported, planted)
+         assert result["tp"] == 1
+         assert result["fp"] == 1
+         assert result["fn"] == 1
+         assert 0 < result["f1"] < 1
+
+     def test_precision_recall_calculation(self):
+         reported = {"a", "b", "c"}
+         planted = {"a", "b", "d"}
+         result = compute_f1(reported, planted)
+         assert result["precision"] == pytest.approx(2 / 3)
+         assert result["recall"] == pytest.approx(2 / 3)
+
+
+ # ──────────────────────────────────────────────────────
+ # Weighted reward
+ # ──────────────────────────────────────────────────────
+
+ class TestComputeWeightedReward:
+     def test_perfect_match(self):
+         issues = [
+             PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0),
+             PlantedIssue(row=2, col="b", issue_type="wrong_type", description="", difficulty=3.0),
+         ]
+         reported = {i.to_key() for i in issues}
+         result = compute_weighted_reward(reported, issues)
+         assert result["weighted_reward"] == 1.0
+
+     def test_empty_both(self):
+         result = compute_weighted_reward(set(), [])
+         assert result["weighted_reward"] == 1.0
+
+     def test_no_reported(self):
+         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=2.0)]
+         result = compute_weighted_reward(set(), issues)
+         assert result["weighted_reward"] == 0.0
+
+     def test_hard_issue_worth_more(self):
+         easy = PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)
+         hard = PlantedIssue(row=2, col="b", issue_type="statistical_outlier", description="", difficulty=3.0)
+         issues = [easy, hard]
+         hard_found = compute_weighted_reward({hard.to_key()}, issues)
+         easy_found = compute_weighted_reward({easy.to_key()}, issues)
+         assert hard_found["weighted_reward"] > easy_found["weighted_reward"]
+
+     def test_false_positives_reduce_reward(self):
+         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)]
+         correct = {issues[0].to_key()}
+         with_fp = correct | {"row:99,col:x,issue:wrong_type"}
+         r_correct = compute_weighted_reward(correct, issues)
+         r_with_fp = compute_weighted_reward(with_fp, issues)
+         assert r_correct["weighted_reward"] > r_with_fp["weighted_reward"]
+
+
+ # ──────────────────────────────────────────────────────
+ # Fix grading
+ # ──────────────────────────────────────────────────────
+
+ class TestGradeFixes:
+     @pytest.fixture
+     def easy_task(self):
+         return create_task_easy()
+
+     def test_no_fixes_no_issues(self):
+         from dataqa_env.server.tasks import Task
+         task = Task(task_id="empty", name="", description="", schema_description="",
+                     validation_rules="", clean_csv="a\n1")
+         result = grade_fixes([], task)
+         assert result["fix_score"] == 1.0
+
+     def test_no_fixes_submitted(self, easy_task):
+         result = grade_fixes([], easy_task)
+         assert result["fix_score"] == 0.0
+         assert result["fixes_attempted"] == 0
+
+     def test_exact_fix_for_missing_name(self, easy_task):
+         # Row 4 has an empty name — the clean value is "David Kim"
+         fixes = [(4, "name", "David Kim")]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fix_score"] > 0.0
+         assert result["fixes_correct"] == 1
+
+     def test_exact_fix_for_wrong_type_salary(self, easy_task):
+         # Row 7 has "seventy-five thousand" — the clean value is "75000"
+         fixes = [(7, "salary", "75000")]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fixes_correct"] == 1
+
+     def test_numeric_close_match(self, easy_task):
+         # Row 9 has salary "5000" — the clean value is "73000".
+         # Propose 73100 (within 1% of 73000).
+         fixes = [(9, "salary", "73100")]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fixes_partial"] == 1
+
+     def test_wrong_value_for_issue_cell(self, easy_task):
+         # Row 4 name is empty — propose the wrong name
+         fixes = [(4, "name", "Wrong Person")]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fixes_partial"] == 1  # correct cell, wrong value
+         assert result["fix_score"] > 0.0  # gets partial credit
+
+     def test_fix_for_non_issue_cell(self, easy_task):
+         # Row 1 col name is fine — no issue there
+         fixes = [(1, "name", "Some Name")]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fixes_wrong"] == 1
+         assert result["fix_score"] == 0.0
+
+     def test_multiple_fixes_best_wins(self, easy_task):
+         # Submit two fixes for the same cell — the best one should count
+         fixes = [
+             (4, "name", "Wrong Person"),  # partial credit
+             (4, "name", "David Kim"),  # exact match
+         ]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fixes_correct"] >= 1
+
+     def test_all_fixes_correct(self, easy_task):
+         # Fix most issues with exact values
+         fixes = [
+             (4, "name", "David Kim"),
+             (7, "salary", "75000"),
+             (9, "salary", "73000"),
+             (15, "email", "oscar.rivera@company.com"),
+             (18, "start_date", "2022-01-19"),
+         ]
+         result = grade_fixes(fixes, easy_task)
+         assert result["fix_score"] > 0.7  # 5 of 6 issues fixed (the duplicate can't be fixed)
+
+     def test_fix_score_bounded(self, easy_task):
+         fixes = [(4, "name", "David Kim"), (99, "x", "bad")]
+         result = grade_fixes(fixes, easy_task)
+         assert 0.0 <= result["fix_score"] <= 1.0
+
+
+ # ──────────────────────────────────────────────────────
+ # Full environment lifecycle
+ # ──────────────────────────────────────────────────────
+
+ class TestDataQAEnvironment:
+     @pytest.fixture
+     def env(self):
+         return DataQAEnvironment()
+
+     def test_reset_returns_observation(self, env):
+         obs = env.reset(task_id="easy")
+         assert obs.dataset_csv
+         assert obs.schema_description
+         assert obs.validation_rules
+         assert obs.task_description
+         assert obs.num_issues_hint == 6
+         assert obs.max_steps == 3
+         assert obs.done is False
+         assert obs.reward == 0.0
+         assert "fix" in obs.feedback.lower()  # mentions the fix phase
+
+     def test_reset_medium(self, env):
+         obs = env.reset(task_id="medium")
+         assert obs.num_issues_hint == 8
+
+     def test_reset_hard(self, env):
+         obs = env.reset(task_id="hard")
+         assert obs.num_issues_hint == 10
+
+     def test_step_identify_only(self, env):
+         """Backward compatible: only issues, no fixes."""
+         env.reset(task_id="easy")
+         # Submit all 6 correct issues for the easy task
+         action = DataQAAction(
+             issues=[
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:21,col:employee_id,issue:duplicate_row",
+                 "row:9,col:salary,issue:out_of_range",
+                 "row:15,col:email,issue:inconsistent_value",
+                 "row:18,col:start_date,issue:out_of_range",
+             ],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert obs.done is True
+         assert obs.reward >= 0.999  # identify-only uses identify_score directly
+
+     def test_step_with_fixes_increases_reward(self, env):
+         """Submitting correct fixes should produce a high combined reward."""
+         env.reset(task_id="easy")
+         # All 6 issues + 3 fixes
+         action = DataQAAction(
+             issues=[
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:21,col:employee_id,issue:duplicate_row",
+                 "row:9,col:salary,issue:out_of_range",
+                 "row:15,col:email,issue:inconsistent_value",
+                 "row:18,col:start_date,issue:out_of_range",
+             ],
+             fixes=[
+                 "row:4,col:name,fix:David Kim",
+                 "row:7,col:salary,fix:75000",
+                 "row:9,col:salary,fix:73000",
+             ],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         # Perfect identify + partial fixes -> high combined reward
+         assert obs.metadata["combined_reward"] > 0.7
+
+     def test_step_with_partial_issues(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert 0 < obs.reward < 1.0
+         assert obs.done is False
+
+     def test_step_with_no_issues(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(issues=[], task_id="easy")
+         obs = env.step(action)
+         assert obs.reward == 0.0
+
+     def test_step_exhausts_max_steps(self, env):
+         env.reset(task_id="easy")
+         for _ in range(3):
+             action = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+             obs = env.step(action)
+         assert obs.done is True
+
+     def test_auto_reset_on_step(self, env):
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert obs.task_id == "easy"
+
+     def test_state_tracking(self, env):
+         env.reset(task_id="easy")
+         assert env.state.task_id == "easy"
+         assert env.state.current_step == 0
+         assert env.state.best_score == 0.0
+
+         action = DataQAAction(issues=["row:4,col:name,issue:missing_value"], task_id="easy")
+         env.step(action)
+         assert env.state.current_step == 1
+         assert env.state.best_score > 0.0
+
+     def test_best_score_monotonic(self, env):
+         env.reset(task_id="easy")
+         action1 = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value", "row:7,col:salary,issue:wrong_type"],
+             task_id="easy",
+         )
+         env.step(action1)
+         score_after_1 = env.state.best_score
+
+         action2 = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+         env.step(action2)
+         assert env.state.best_score >= score_after_1
+
+     def test_metadata_includes_both_phases(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             fixes=["row:4,col:name,fix:David Kim"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         m = obs.metadata
+         assert "identify_f1" in m
+         assert "identify_score" in m
+         assert "fix_score" in m
+         assert "combined_reward" in m
+         assert "tp" in m
+         assert "fixes_correct" in m
+         assert "fixes_attempted" in m
+
+     def test_parse_error_in_feedback(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(issues=["garbage input"], task_id="easy")
+         obs = env.step(action)
+         assert "Parse error" in obs.feedback
+
+     def test_concurrent_sessions_flag(self):
+         assert DataQAEnvironment.SUPPORTS_CONCURRENT_SESSIONS is True
+
+     def test_reward_between_0_and_1(self, env):
+         """Hackathon requirement: scores must be 0.0-1.0."""
+         env.reset(task_id="hard")
+         for _ in range(3):
+             action = DataQAAction(
+                 issues=["row:1,col:x,issue:wrong_type", "row:99,col:y,issue:missing_value"],
+                 fixes=["row:1,col:x,fix:wrong"],
+                 task_id="hard",
+             )
+             obs = env.step(action)
+             assert 0.0 <= obs.reward <= 1.0
+
+     def test_combined_reward_weights(self, env):
+         """Verify combined = IDENTIFY_WEIGHT * identify + FIX_WEIGHT * fix."""
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             fixes=["row:4,col:name,fix:David Kim"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         m = obs.metadata
+         expected = IDENTIFY_WEIGHT * m["identify_score"] + FIX_WEIGHT * m["fix_score"]
+         assert abs(m["combined_reward"] - expected) < 0.01
+
+     def test_fix_feedback_shown_when_fixes_submitted(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             fixes=["row:4,col:name,fix:David Kim"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert "Fix Proposals" in obs.feedback
+         assert "Combined Reward" in obs.feedback
+
+     def test_no_fix_penalty_when_no_fixes_submitted(self, env):
+         """If the agent submits no fixes, reward = identify_score (no penalty)."""
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=[
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:21,col:employee_id,issue:duplicate_row",
+                 "row:9,col:salary,issue:out_of_range",
+                 "row:15,col:email,issue:inconsistent_value",
+                 "row:18,col:start_date,issue:out_of_range",
+             ],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         # identify_score should be ~1.0 since all 6 issues were found
+         assert obs.reward >= 0.99
+         # combined_reward equals identify_score when there are no fixes
+         assert obs.metadata["combined_reward"] == obs.metadata["identify_score"]
tests/test_extensibility.py ADDED
@@ -0,0 +1,215 @@
+ """Tests for the extensibility API — custom tasks and contamination rules."""
+
+ import pytest
+ from dataqa_env.server.tasks import (
+     PlantedIssue,
+     create_task_from_config,
+     register_task,
+     register_contamination_rule,
+     CONTAMINATION_RULES,
+     get_task,
+     list_tasks,
+ )
+ from dataqa_env.server.environment import DataQAEnvironment, compute_weighted_reward
+ from dataqa_env.models import DataQAAction
+
+
+ SIMPLE_CSV = "id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92\n4,Dave,78"
+
+
+ class TestCreateTaskFromConfig:
+     def test_basic_creation(self):
+         task = create_task_from_config(
+             task_id="test_custom",
+             name="Test Task",
+             description="Test",
+             schema_description="id: int, name: str, score: int",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1},
+             ],
+         )
+         assert task.task_id == "test_custom"
+         assert len(task.planted_issues) == 1
+         assert task.planted_issues[0].issue_type == "missing_value"
+         assert task.planted_issues[0].col == "name"
+
+     def test_multiple_contaminations(self):
+         task = create_task_from_config(
+             task_id="multi",
+             name="Multi",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1},
+                 {"rule": "missing_value", "row": 2, "col": 1},
+             ],
+         )
+         assert len(task.planted_issues) == 2
+
+     def test_custom_difficulty_override(self):
+         task = create_task_from_config(
+             task_id="custom_diff",
+             name="Custom Difficulty",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 2.5},
+             ],
+         )
+         assert task.planted_issues[0].difficulty == 2.5
+
+     def test_callable_rule(self):
+         def custom_rule(rows, header, col_idx, row_idx, rng):
+             return "CORRUPTED", PlantedIssue(
+                 row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
+                 description="Custom corruption", difficulty=1.5,
+             )
+
+         task = create_task_from_config(
+             task_id="callable",
+             name="Callable Rule",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": custom_rule, "row": 1, "col": 2},
+             ],
+         )
+         assert task.planted_issues[0].issue_type == "wrong_type"
+         assert "CORRUPTED" in task.corrupted_csv
+
+     def test_unknown_rule_raises(self):
+         with pytest.raises(ValueError, match="Unknown contamination rule"):
+             create_task_from_config(
+                 task_id="bad",
+                 name="Bad",
+                 description="",
+                 schema_description="",
+                 validation_rules="",
+                 clean_csv=SIMPLE_CSV,
+                 contaminations=[{"rule": "nonexistent_rule", "row": 0, "col": 0}],
+             )
+
+
+ class TestRegisterContaminationRule:
+     def test_register_and_use(self):
+         def reverse_value(rows, header, col_idx, row_idx, rng):
+             val = rows[row_idx][col_idx]
+             return val[::-1], PlantedIssue(
+                 row=row_idx + 1, col=header[col_idx], issue_type="format_violation",
+                 description="Reversed value", difficulty=1.5,
+             )
+
+         register_contamination_rule("reverse", reverse_value)
+         assert "reverse" in CONTAMINATION_RULES
+
+         task = create_task_from_config(
+             task_id="rev_test",
+             name="Reverse Test",
+             description="",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[{"rule": "reverse", "row": 0, "col": 1}],
+         )
+         assert task.planted_issues[0].issue_type == "format_violation"
+         # "Alice" reversed is "ecilA"
+         assert "ecilA" in task.corrupted_csv
+
+         # Cleanup
+         del CONTAMINATION_RULES["reverse"]
+
+
+ class TestRegisterTask:
+     def test_register_and_get(self):
+         task = create_task_from_config(
+             task_id="registered",
+             name="Registered Task",
+             description="Test registered task",
+             schema_description="id: int, name: str",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[{"rule": "missing_value", "row": 1, "col": 1}],
+         )
+         register_task("registered", lambda seed: task)
+         assert "registered" in list_tasks()
+
+         fetched = get_task("registered")
+         assert fetched.task_id == "registered"
+         assert len(fetched.planted_issues) == 1
+
+         # Cleanup
+         from dataqa_env.server.tasks import TASK_REGISTRY
+         del TASK_REGISTRY["registered"]
+
+
+ class TestCustomTaskInEnvironment:
+     def test_full_lifecycle_identify_only(self):
+         """Custom task works end-to-end with identify-only."""
+         task = create_task_from_config(
+             task_id="e2e_custom",
+             name="E2E Custom",
+             description="End-to-end test",
+             schema_description="id: int, name: str, score: int",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+                 {"rule": "whitespace_value", "row": 2, "col": 1, "difficulty": 2.5},
+             ],
+         )
+         register_task("e2e_custom", lambda seed: task)
+
+         env = DataQAEnvironment()
+         obs = env.reset(task_id="e2e_custom")
+         assert obs.num_issues_hint == 2
+
+         action = DataQAAction(
+             issues=[i.to_key() for i in task.planted_issues],
+             task_id="e2e_custom",
+         )
+         obs = env.step(action)
+         assert obs.done is True
+         assert obs.reward >= 0.999
+
+         from dataqa_env.server.tasks import TASK_REGISTRY
+         del TASK_REGISTRY["e2e_custom"]
+
+     def test_full_lifecycle_identify_and_fix(self):
+         """Custom task works end-to-end with both identify and fix."""
+         task = create_task_from_config(
+             task_id="e2e_fix",
+             name="E2E Fix",
+             description="End-to-end test with fixes",
+             schema_description="id: int, name: str, score: int",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+             ],
+         )
+         register_task("e2e_fix", lambda seed: task)
+
+         env = DataQAEnvironment()
+         env.reset(task_id="e2e_fix")
+
+         # Submit issues + fix
+         action = DataQAAction(
+             issues=[task.planted_issues[0].to_key()],
+             fixes=["row:1,col:name,fix:Alice"],  # clean value is "Alice"
+             task_id="e2e_fix",
+         )
+         obs = env.step(action)
+         assert obs.done is True
+         assert obs.metadata["fix_score"] > 0.0
+         assert obs.metadata["combined_reward"] > 0.0
+
+         from dataqa_env.server.tasks import TASK_REGISTRY
+         del TASK_REGISTRY["e2e_fix"]
tests/test_inference.py ADDED
@@ -0,0 +1,191 @@
+ """Tests for the inference script's parsing, prompt building, and log format."""
+
+ import pytest
+ import sys
+ import os
+
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+ from inference import parse_llm_response, parse_fix_response, build_user_prompt, log_start, log_step, log_end
+
+
+ class TestParseLLMResponse:
+     def test_standard_format(self):
+         response = "row:1,col:name,issue:missing_value\nrow:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+         assert "row:1,col:name,issue:missing_value" in issues
+
+     def test_numbered_list(self):
+         response = "1. row:1,col:name,issue:missing_value\n2. row:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+
+     def test_bullet_list(self):
+         response = "- row:1,col:name,issue:missing_value\n* row:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+
+     def test_equals_delimiter(self):
+         response = "row=1,col=name,issue=missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+         assert issues[0] == "row:1,col:name,issue:missing_value"
+
+     def test_mixed_case(self):
+         response = "Row:1,Col:Name,Issue:Missing_Value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+         assert issues[0] == "row:1,col:name,issue:missing_value"
+
+     def test_empty_response(self):
+         assert parse_llm_response("") == []
+         assert parse_llm_response(" ") == []
+
+     def test_garbage_lines_skipped(self):
+         response = "Here are the issues:\nrow:1,col:name,issue:missing_value\nNo more issues."
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+
+     def test_deduplication_not_applied(self):
+         response = "row:1,col:name,issue:missing_value\nrow:1,col:name,issue:missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+
+     def test_with_column_variant(self):
+         response = "row:1,column:name,issue:missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+
+
+ class TestParseFixResponse:
+     def test_standard_format(self):
+         response = "row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000"
+         fixes = parse_fix_response(response)
+         assert len(fixes) == 2
+         assert "row:4,col:name,fix:David Kim" in fixes
+
+     def test_numbered_list(self):
+         response = "1. row:4,col:name,fix:David Kim\n2. row:7,col:salary,fix:75000"
+         fixes = parse_fix_response(response)
+         assert len(fixes) == 2
+
+     def test_with_special_chars(self):
+         response = "row:1,col:email,fix:alice.chen@company.com"
+         fixes = parse_fix_response(response)
+         assert len(fixes) == 1
+         assert "alice.chen@company.com" in fixes[0]
+
+     def test_empty_response(self):
+         assert parse_fix_response("") == []
+
+     def test_date_fix(self):
+         response = "row:12,col:order_date,fix:2024-01-26"
+         fixes = parse_fix_response(response)
+         assert len(fixes) == 1
+
+     def test_ignores_issue_lines(self):
+         response = "row:4,col:name,issue:missing_value\nrow:4,col:name,fix:David Kim"
+         fixes = parse_fix_response(response)
+         assert len(fixes) == 1  # only the fix line
+
+
+ class TestBuildUserPrompt:
+     def test_includes_all_fields(self):
+         obs = {
+             "task_description": "Find issues",
+             "schema_description": "col: int",
+             "validation_rules": "no nulls",
+             "dataset_csv": "a,b\n1,2",
+             "num_issues_hint": 3,
+             "feedback": "",
+         }
+         prompt = build_user_prompt(obs)
+         assert "Find issues" in prompt
+         assert "col: int" in prompt
+         assert "no nulls" in prompt
+         assert "a,b" in prompt
+         assert "3 issues" in prompt
+
+     def test_includes_feedback_on_retry(self):
+         obs = {
+             "task_description": "Find issues",
+             "schema_description": "",
+             "validation_rules": "",
+             "dataset_csv": "a\n1",
+             "num_issues_hint": 0,
+             "feedback": "Step 1/3: You missed 2 issues",
+         }
+         prompt = build_user_prompt(obs)
+         assert "FEEDBACK" in prompt
+         assert "missed 2" in prompt
+
+     def test_excludes_reset_feedback(self):
+         obs = {
+             "task_description": "",
+             "schema_description": "",
+             "validation_rules": "",
+             "dataset_csv": "",
+             "num_issues_hint": 0,
+             "feedback": "Environment reset. Start inspecting.",
+         }
+         prompt = build_user_prompt(obs)
+         assert "FEEDBACK" not in prompt
+
+     def test_include_fixes_flag(self):
+         obs = {
+             "task_description": "Find issues",
+             "schema_description": "",
+             "validation_rules": "",
+             "dataset_csv": "a\n1",
+             "num_issues_hint": 0,
+             "feedback": "",
+         }
+         prompt = build_user_prompt(obs, include_fixes=True)
+         assert "fix" in prompt.lower()
+
+
+ class TestLogFormat:
+     """Verify the stdout log format matches the hackathon evaluation requirements."""
+
+     def test_log_start_format(self, capsys):
+         log_start(task="easy", env="dataqa_env", model="test-model")
+         out = capsys.readouterr().out.strip()
+         assert out == "[START] task=easy env=dataqa_env model=test-model"
+
+     def test_log_step_format(self, capsys):
+         log_step(step=1, action="row:1,col:name,issue:missing_value", reward=0.50, done=False, error=None)
+         out = capsys.readouterr().out.strip()
+         assert out == "[STEP] step=1 action=row:1,col:name,issue:missing_value reward=0.50 done=false error=null"
+
+     def test_log_step_with_error(self, capsys):
+         log_step(step=2, action="none", reward=0.00, done=True, error="timeout")
+         out = capsys.readouterr().out.strip()
+         assert "error=timeout" in out
+         assert "done=true" in out
+
+     def test_log_end_format(self, capsys):
+         log_end(success=True, steps=3, score=0.85, rewards=[0.25, 0.50, 0.85])
+         out = capsys.readouterr().out.strip()
+         assert out == "[END] success=true steps=3 score=0.850 rewards=0.25,0.50,0.85"
+
+     def test_log_end_failure(self, capsys):
+         log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+         out = capsys.readouterr().out.strip()
+         assert "success=false" in out
+         assert "score=0.000" in out
+
+     def test_reward_format_2_decimal(self, capsys):
+         log_step(step=1, action="test", reward=0.123456, done=False, error=None)
+         out = capsys.readouterr().out.strip()
+         assert "reward=0.12" in out
+
+     def test_no_newlines_within_line(self, capsys):
+         log_start(task="easy", env="dataqa_env", model="model")
+         log_step(step=1, action="act", reward=0.0, done=False, error=None)
+         log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+         out = capsys.readouterr().out
+         lines = [l for l in out.split("\n") if l.strip()]
+         assert len(lines) == 3
+         assert lines[0].startswith("[START]")
+         assert lines[1].startswith("[STEP]")
+         assert lines[2].startswith("[END]")
tests/test_tasks.py ADDED
@@ -0,0 +1,212 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Tests for task definitions, data corruption, and issue planting."""
+
+ import pytest
+ from dataqa_env.server.tasks import (
+     PlantedIssue,
+     Task,
+     create_task_easy,
+     create_task_medium,
+     create_task_hard,
+     get_task,
+     list_tasks,
+     _csv_to_rows,
+     _rows_to_csv,
+ )
+
+
+ class TestPlantedIssue:
+     def test_to_key(self):
+         issue = PlantedIssue(row=3, col="salary", issue_type="missing_value", description="test")
+         assert issue.to_key() == "row:3,col:salary,issue:missing_value"
+
+     def test_difficulty_default(self):
+         issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test")
+         assert issue.difficulty == 1.0
+
+     def test_difficulty_custom(self):
+         issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test", difficulty=3.0)
+         assert issue.difficulty == 3.0
+
+
+ class TestCSVHelpers:
+     def test_roundtrip(self):
+         csv_text = "a,b,c\n1,2,3\n4,5,6"
+         rows = _csv_to_rows(csv_text)
+         assert len(rows) == 3
+         result = _rows_to_csv(rows)
+         assert "1,2,3" in result
+
+     def test_empty_csv(self):
+         rows = _csv_to_rows("a,b\n")
+         assert len(rows) == 1  # header only
+
+
+ class TestTaskEasy:
+     @pytest.fixture
+     def task(self):
+         return create_task_easy()
+
+     def test_task_id(self, task):
+         assert task.task_id == "easy"
+
+     def test_has_6_issues(self, task):
+         assert len(task.planted_issues) == 6
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "missing_value" in types
+         assert "wrong_type" in types
+         assert "duplicate_row" in types
+         assert "out_of_range" in types
+         assert "inconsistent_value" in types
+
+     def test_corrupted_csv_differs_from_clean(self, task):
+         assert task.corrupted_csv != task.clean_csv
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+     def test_max_steps(self, task):
+         assert task.max_steps == 3
+
+     def test_corrupted_csv_has_more_rows(self, task):
+         clean_rows = _csv_to_rows(task.clean_csv)
+         corrupt_rows = _csv_to_rows(task.corrupted_csv)
+         assert len(corrupt_rows) > len(clean_rows)  # duplicate row added
+
+     def test_difficulty_weights(self, task):
+         for issue in task.planted_issues:
+             assert 1.0 <= issue.difficulty <= 3.0
+
+
+ class TestTaskMedium:
+     @pytest.fixture
+     def task(self):
+         return create_task_medium()
+
+     def test_task_id(self, task):
+         assert task.task_id == "medium"
+
+     def test_has_8_issues(self, task):
+         assert len(task.planted_issues) == 8
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "inconsistent_value" in types
+         assert "format_violation" in types
+         assert "missing_value" in types
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+     def test_difficulty_weights(self, task):
+         for issue in task.planted_issues:
+             assert 1.0 <= issue.difficulty <= 3.0
+
+
+ class TestTaskHard:
+     @pytest.fixture
+     def task(self):
+         return create_task_hard()
+
+     def test_task_id(self, task):
+         assert task.task_id == "hard"
+
+     def test_has_10_issues(self, task):
+         assert len(task.planted_issues) == 10
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "inconsistent_value" in types
+         assert "format_violation" in types
+         assert "statistical_outlier" in types
+         assert "out_of_range" in types
+         assert "missing_value" in types
+
+     def test_has_high_difficulty_issues(self, task):
+         hard_issues = [i for i in task.planted_issues if i.difficulty >= 2.5]
+         assert len(hard_issues) >= 2  # data leakage, GPU outlier, whitespace
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+
+ class TestTaskAlignment:
+     @pytest.fixture
+     def task(self):
+         return create_task_hard()  # unused placeholder fixture; each test below loads the alignment task via get_task
+
+     def test_alignment_task(self):
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         assert task.task_id == "alignment"
+         assert len(task.planted_issues) == 12
+
+     def test_alignment_issue_types(self):
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         types = {i.issue_type for i in task.planted_issues}
+         assert "inconsistent_value" in types  # factual errors, mismatches, hallucinations
+         assert "missing_value" in types  # truncated, whitespace-only
+         assert "duplicate_row" in types  # duplicate instruction
+
+     def test_alignment_has_high_difficulty(self):
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         hard_issues = [i for i in task.planted_issues if i.difficulty >= 2.5]
+         assert len(hard_issues) >= 3  # hallucinated citation, harmful advice, factual error
+
+     def test_alignment_issue_keys_unique(self):
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+     def test_alignment_corrupted_differs(self):
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         assert task.corrupted_csv != task.clean_csv
+
+     def test_alignment_in_env(self):
+         from dataqa_env.server.environment import DataQAEnvironment
+         from dataqa_env.models import DataQAAction
+         env = DataQAEnvironment()
+         obs = env.reset(task_id="alignment")
+         assert obs.num_issues_hint == 12
+         # Perfect submission
+         from dataqa_env.server.tasks import get_task
+         task = get_task("alignment")
+         action = DataQAAction(issues=[i.to_key() for i in task.planted_issues], task_id="alignment")
+         obs = env.step(action)
+         assert obs.reward >= 0.99
+
+
+ class TestTaskRegistry:
+     def test_list_tasks(self):
+         tasks = list_tasks()
+         assert set(tasks) == {"easy", "medium", "hard", "alignment", "coding", "toolcalling"}
+
+     def test_get_task_easy(self):
+         task = get_task("easy")
+         assert task.task_id == "easy"
+
+     def test_get_task_medium(self):
+         task = get_task("medium")
+         assert task.task_id == "medium"
+
+     def test_get_task_hard(self):
+         task = get_task("hard")
+         assert task.task_id == "hard"
+
+     def test_get_task_unknown_raises(self):
+         with pytest.raises(ValueError, match="Unknown task"):
+             get_task("nonexistent")
+
+     def test_seed_determinism(self):
+         t1 = get_task("easy", seed=42)
+         t2 = get_task("easy", seed=42)
+         assert t1.corrupted_csv == t2.corrupted_csv
+         assert [i.to_key() for i in t1.planted_issues] == [i.to_key() for i in t2.planted_issues]
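Beyond regression coverage, this suite doubles as usage documentation for the task API: `get_task(task_id, seed=...)` is deterministic, every task exposes `planted_issues` with stable `to_key()` identifiers, and a submission listing every planted issue earns a near-perfect reward. Below is a minimal driver sketch assembled strictly from the calls exercised above; the suite itself can be run with `pytest tests/test_tasks.py`.

```python
# Sketch of a client loop using only the API surface the tests exercise.
from dataqa_env.models import DataQAAction
from dataqa_env.server.environment import DataQAEnvironment
from dataqa_env.server.tasks import get_task, list_tasks

print(list_tasks())  # easy, medium, hard, alignment, coding, toolcalling

# Seeded task creation is deterministic (see test_seed_determinism above).
assert get_task("easy", seed=42).corrupted_csv == get_task("easy", seed=42).corrupted_csv

# Oracle run on the alignment task, mirroring test_alignment_in_env:
task = get_task("alignment")
env = DataQAEnvironment()
obs = env.reset(task_id="alignment")
print(obs.num_issues_hint)  # 12 planted issues

# Report each issue by its canonical key, e.g. "row:3,col:salary,issue:missing_value".
action = DataQAAction(issues=[i.to_key() for i in task.planted_issues], task_id="alignment")
obs = env.step(action)
assert obs.reward >= 0.99  # perfect identification scores ~1.0
```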
uv.lock ADDED
The diff for this file is too large to render. See raw diff