srishtichugh committed on
Commit
40fcf49
·
1 Parent(s): 2215348
README.md CHANGED
@@ -10,30 +10,67 @@ tags:
  - openenv
  - rl
  - data-cleaning
  ---

- # Data Cleaning OpenEnv

- A **real-world data cleaning environment** for training and evaluating AI agents.

- An agent interacts with a dirty pandas DataFrame through a standard `reset() / step() / state()` HTTP API, learning to fix common data quality problems — missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across three progressively harder tasks.

  🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
  📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
  ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health

  ---

  ## Environment Description & Motivation

- Real-world datasets are almost never clean. Data engineers routinely spend 60–80 % of their time on data cleaning tasks: filling missing values with statistically appropriate strategies, removing duplicates, standardising inconsistent formats (phone numbers, dates, country names), and detecting extreme outliers.

- This environment turns those tasks into a reinforcement learning challenge with:

- - **Deterministic, programmatic graders** — ground-truth clean DataFrames are generated with a fixed seed; every reward signal is reproducible.
- - **Meaningful partial rewards** — every step emits a delta reward proportional to how much of the dataset it cleaned, so the agent receives useful signal throughout the episode rather than only at the end.
- - **Three difficulty levels** — easy, medium, hard — letting agents learn a curriculum from simple null-filling up to full multi-issue pipelines.
- - **No external data downloads** — all datasets are generated synthetically via `numpy` + `Faker` with `seed=42`.

  ---

@@ -49,6 +86,8 @@ Actions are JSON objects sent to `POST /step`.
  | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
  | `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
  | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |

  **Format rules enforced by `fix_format`:**

@@ -56,23 +95,14 @@ Actions are JSON objects sent to `POST /step`.
  |---|---|
  | `phone` | `NNN-NNN-NNNN` |
  | `listed_date` / `signup_date` | `YYYY-MM-DD` |
- | `country` | Title-cased canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
-
- **Example actions:**
- ```json
- {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
- {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}}
- {"operation": "drop_duplicates"}
- {"operation": "fix_format", "column": "phone"}
- {"operation": "fix_format", "column": "signup_date"}
- {"operation": "drop_outliers", "column": "purchase_amount"}
- ```

  ---

  ## Observation Space

  Every `POST /reset` and `POST /step` returns:
  ```json
  {
    "observation": {
@@ -86,7 +116,21 @@ Every `POST /reset` and `POST /step` returns:
      "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
      "message": "Filled 20 missing values in 'age' using median.",
      "step_count": 1,
-     "current_score": 0.4000
    },
    "reward": 0.40,
    "done": false,

@@ -97,32 +141,19 @@ Every `POST /reset` and `POST /step` returns:
  | Field | Type | Description |
  |---|---|---|
  | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
- | `reward` | float | Per-step delta reward (see Reward Function) |
- | `data_preview` | string | First 10 rows of current DataFrame as CSV |
  | `data_shape` | [int, int] | Current `[rows, cols]` |
  | `missing_counts` | object | `{column: null_count}` for columns with NaN |
  | `duplicate_count` | int | Number of duplicate rows |
- | `dtype_issues` | object | `{column: issue_description}` for suspected dtype mismatches |
- | `task_description` | string | Full task instructions with available operations |
- | `message` | string | Human-readable result of the last action |
- | `step_count` | int | Steps taken in this episode |
- | `current_score` | float | Running grader score 0.0 – 1.0 |
-
- ---
-
- ## State Space
-
- `GET /state` returns episode metadata (does not modify state):
- ```json
- {
-   "episode_id": "a8f026a9-...",
-   "task_id": 1,
-   "step_count": 2,
-   "max_steps": 20,
-   "total_errors": 50,
-   "errors_remaining": 30
- }
- ```

  ---

@@ -133,19 +164,19 @@ Every `POST /reset` and `POST /step` returns:
  | Property | Value |
  |---|---|
  | Dataset | 100-row employee records (name, age, salary, department, experience) |
- | Issues | ~20 % NaN in `age`, `salary`; ~10 % NaN in `department` |
  | Goal | Fill all missing values |
  | Valid operations | `fill_missing` |
  | Grader | `1.0 − remaining_nulls / original_nulls` |
  | Max steps | 20 |
- | Optimal steps | 3 (one per affected column) |

  ### Task 2 — Fix Formats + Remove Duplicates *(Medium)*

  | Property | Value |
  |---|---|
  | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
- | Issues | ~60 % phone numbers in mixed formats, ~60 % dates in mixed formats, 15 duplicate rows |
  | Goal | Standardise all phone/date formats and remove duplicates |
  | Valid operations | `fix_format`, `drop_duplicates` |
  | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |

@@ -157,13 +188,28 @@ Every `POST /reset` and `POST /step` returns:
  | Property | Value |
  |---|---|
  | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
- | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount` (~3× normal), mixed country capitalisation, mixed date formats |
  | Goal | Fix all issues end-to-end |
  | Valid operations | All 6 operations |
  | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
  | Max steps | 40 |
  | Optimal steps | 8 |

  ---

  ## Reward Function
@@ -173,22 +219,62 @@ Every `POST /reset` and `POST /step` returns:
  | Score improves (delta > 0) | `new_score − old_score` (positive) |
  | Operation had no effect | `−0.01` |
  | Invalid operation / bad column | `−0.05` |
- | Episode completed (score ≥ 0.95) | `delta + 0.20` terminal bonus |

- Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every step, giving the agent dense signal throughout the episode.

  ---

- ## API Endpoints

  | Method | Path | Description |
  |---|---|---|
  | `GET` | `/health` | Health check → `{"status": "healthy"}` |
- | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
  | `POST` | `/step` | Execute action. Body: action JSON |
- | `POST` | `/state` | Get episode metadata |
- | `GET` | `/metadata` | Environment name, version, task list |
- | `GET` | `/schema` | Full action / observation / state JSON schemas |
  | `GET` | `/docs` | Interactive Swagger UI |

  ---

@@ -200,79 +286,40 @@ Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every
  | 1 — Fill Missing Values | Easy | 0.999 |
  | 2 — Fix Formats + Duplicates | Medium | 0.999 |
  | 3 — Full Cleaning Pipeline | Hard | 0.999 |
- | **Average** | — | **0.999** |
-
- *Produced by `google/gemma-3-27b-it` via NVIDIA NIM, `temperature=0`. Full step-by-step agent logs: `inference_log.txt`.*

  ---

  ## Setup & Usage

  ### Prerequisites
-
  - Python 3.11+
  - Docker (for containerised deployment)

  ### Local — Python
  ```bash
- # 1. Clone and install dependencies
  git clone https://github.com/Tanvi51204/openEnv.git
  cd openEnv
  pip install -r requirements.txt
-
- # 2. Start the server
- uvicorn server.app:app --host 0.0.0.0 --port 8000
-
- # 3. Open Swagger UI
- open http://localhost:8000/docs
  ```

  ### Local — Docker
  ```bash
  docker build -t data-cleaning-env .
  docker run -p 8000:8000 data-cleaning-env
  ```
- ### Quick API test
- ```bash
- # Health
- curl http://localhost:8000/health
-
- # Start Task 1
- curl -X POST http://localhost:8000/reset \
-      -H "Content-Type: application/json" \
-      -d '{"task_id": 1}'
-
- # Fill missing values
- curl -X POST http://localhost:8000/step \
-      -H "Content-Type: application/json" \
-      -d '{"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}'
- ```

- ### Python client
- ```python
- from client import DataCleaningEnvClient
- from models import DataCleaningAction
-
- with DataCleaningEnvClient("http://localhost:8000") as env:
-     result = env.reset(task_id=1)
-     print(result.observation.missing_counts)  # {'age': 20, 'salary': 20, 'department': 10}
-
-     action = DataCleaningAction(
-         operation="fill_missing",
-         column="salary",
-         params={"strategy": "median"},
-     )
-     result = env.step(action)
-     print(result.observation.current_score)  # 0.4
-     print(result.reward)                     # 0.4
- ```

  ### Run baseline inference
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
- export HF_TOKEN="sk-..."   # your API key
  export ENV_URL="http://localhost:8000"

  python inference.py

@@ -292,23 +339,24 @@ Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.jso
  ---

  ## Project Structure
  ```
  openenv-data-cleaning/
- ├── models.py            Pydantic contracts — Action / Observation / State
  ├── client.py            Sync HTTP client (reset / step / state / health)
  ├── inference.py         Baseline LLM agent with [START]/[STEP]/[END] logging
- ├── openenv.yaml         OpenEnv manifest
  ├── Dockerfile           python:3.11-slim, non-root user, HEALTHCHECK
  ├── requirements.txt     pip dependencies
- ├── pyproject.toml       Python package metadata + openenv-core dependency
  └── server/
-     ├── app.py               FastAPI routes + /metadata + /schema
-     ├── environment.py       reset / step / state logic + 6 operations + rewards
      ├── data_generator.py    Synthetic dataset generation (seed=42, reproducible)
      └── tasks/
-         ├── task1_missing.py     Easy — fill NaN grader
          ├── task2_format.py      Medium — format + duplicates grader
-         └── task3_pipeline.py    Hard — full pipeline grader
  ```

  ---

@@ -317,5 +365,8 @@ openenv-data-cleaning/
  🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space

  - Health: https://srishtichugh-openenv-hack.hf.space/health
- - Docs: https://srishtichugh-openenv-hack.hf.space/docs
 
  - openenv
  - rl
  - data-cleaning
+ - multi-agent
+ - data-quality
  ---

+ # DataMedic — AI Data Cleaning OpenEnv

+ An **agentic data quality environment** for training and evaluating AI agents on real-world data cleaning tasks.

+ An agent interacts with dirty pandas DataFrames through a standard `reset() / step() / state()` HTTP API, learning to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across **four progressively harder tasks** including a novel multi-source schema alignment challenge.

  🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
+ 🖥️ **Live DataMedic UI:** https://srishtichugh-openenv-hack.hf.space
  📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
  ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health

  ---

+ ## What Makes This Different
+
+ Most data cleaning tools are one-shot. DataMedic is an **RL training environment** where:
+
+ - The agent **diagnoses** a dirty dataset via `/profile` (completeness, uniqueness, validity %)
+ - It **plans** a treatment — every observation includes a `plan` field with the next recommended actions
+ - It **executes** cleaning operations step by step with dense per-step rewards
+ - It **receives a health certificate** via `/report` summarising what was fixed and how efficiently
+ - It **exports** the cleaned result via `/export`
+
+ Grounded in peer-reviewed research:
+ - **Bendinelli et al. 2025** — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)
+ - **CleanAgent** — Qi & Wang 2024 (arXiv:2403.08291)
+ - **AutoDCWorkflow** — EMNLP 2025 Findings
+ - **HoloClean** — Rekatsinas et al. 2017
+
+ ---
+
  ## Environment Description & Motivation

+ Real-world datasets are almost never clean. Data engineers routinely spend 60–80% of their time on data cleaning. This environment turns that into an RL challenge with:

+ - **Deterministic, programmatic graders** — ground-truth DataFrames generated with `seed=42`; every reward is reproducible
+ - **Meaningful partial rewards** — dense delta reward every step, not just at episode end
+ - **Four difficulty levels** — easy → medium → hard → expert (multi-source merge)
+ - **Live DQ metrics** — completeness %, uniqueness %, validity % in every observation
+ - **Agentic planning** — `plan` field recommends next actions; `tried_operations` prevents loops
+ - **No external data downloads** — all datasets generated synthetically via `numpy` + `Faker`
+
+ ---

+ ## DataMedic UI
+
+ Open `https://srishtichugh-openenv-hack.hf.space` in your browser to see the live monitoring dashboard:
+
+ - **Health Score Ring** — animated score gauge, color-coded by severity (green/amber/red)
+ - **DQ Dimension Bars** — live completeness, uniqueness, validity bars updating each step
+ - **Score Trajectory Chart** — real-time line chart of score vs steps
+ - **Agent Treatment Plan** — next recommended actions shown before each step
+ - **Operation Log** — every action taken, result, and reward delta streamed live
+ - **Dataset Preview** — first 10 rows with NULL values highlighted in red
+ - **Export CSV** — download the cleaned DataFrame at any point
+
+ Click any task button — the dataset loads automatically and the demo agent runs end-to-end.

  ---

  | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
  | `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
  | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
+ | `align_schema` | ❌ | — | Rename Source A columns to canonical schema *(Task 4 only)* |
+ | `merge_sources` | ❌ | — | Concatenate aligned Source A + Source B *(Task 4 only)* |

  **Format rules enforced by `fix_format`:**

  |---|---|
  | `phone` | `NNN-NNN-NNNN` |
  | `listed_date` / `signup_date` | `YYYY-MM-DD` |
+ | `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
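For reference, example `/step` payloads for these operations; the first five appeared in an earlier revision of this README, while the Task-4-only actions are an assumption that they follow the same shape (confirm via `GET /schema`):

```python
import json

# Example /step action payloads. The first five are taken from an earlier
# revision of this README; the Task-4-only actions (align_schema,
# merge_sources) are assumed to use the same envelope.
actions = [
    {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}},
    {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}},
    {"operation": "drop_duplicates"},
    {"operation": "fix_format", "column": "phone"},
    {"operation": "drop_outliers", "column": "purchase_amount"},
    {"operation": "align_schema"},   # Task 4 only (assumed shape)
    {"operation": "merge_sources"},  # Task 4 only (assumed shape)
]

for action in actions:
    print(json.dumps(action))
```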
 
 
 
 
 
 
 
 
 
 
  ---

  ## Observation Space

  Every `POST /reset` and `POST /step` returns:
+
  ```json
  {
    "observation": {
      "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
      "message": "Filled 20 missing values in 'age' using median.",
      "step_count": 1,
+     "current_score": 0.4000,
+     "dq_metrics": {
+       "completeness_pct": 86.67,
+       "uniqueness_pct": 100.0,
+       "validity_pct": 94.5,
+       "total_cells": 500,
+       "null_cells": 50,
+       "duplicate_rows": 0,
+       "invalid_cells": 12
+     },
+     "tried_operations": ["fill_missing:age"],
+     "plan": [
+       "fill_missing on \"salary\" (20 nulls) using median",
+       "fill_missing on \"department\" (10 nulls) using mode"
+     ]
    },
    "reward": 0.40,
    "done": false,

  | Field | Type | Description |
  |---|---|---|
  | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
+ | `reward` | float | Per-step delta reward |
+ | `data_preview` | string | First 10 rows as CSV |
  | `data_shape` | [int, int] | Current `[rows, cols]` |
  | `missing_counts` | object | `{column: null_count}` for columns with NaN |
  | `duplicate_count` | int | Number of duplicate rows |
+ | `dtype_issues` | object | `{column: issue_description}` |
+ | `task_description` | string | Full task instructions |
+ | `message` | string | Human-readable result of last action |
+ | `step_count` | int | Steps taken this episode |
+ | `current_score` | float | Running grader score 0.0–1.0 |
+ | `dq_metrics` | object | Completeness / uniqueness / validity % + raw counts |
+ | `tried_operations` | array | Operations already applied — prevents agent loops |
+ | `plan` | array | Up to 3 recommended next actions (rule-based planning engine) |
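A policy loop can consume these fields directly. A minimal sketch, where the field names come from the table above but the naive parsing of plan strings is an assumption, not part of the environment contract:

```python
# Minimal policy sketch over the observation schema above. The plan-string
# parsing assumes the 'op on "column" ...' phrasing shown in the example
# observation; the environment does not guarantee that format.
def choose_action(obs: dict) -> dict:
    if obs.get("plan"):  # prefer the environment's own recommendation
        step = obs["plan"][0]  # e.g. 'fill_missing on "salary" (20 nulls) using median'
        op = step.split(" on ")[0]
        col = step.split('"')[1] if '"' in step else None
        return {"operation": op, "column": col, "params": {}}
    if obs.get("missing_counts"):  # fall back to simple heuristics
        col = max(obs["missing_counts"], key=obs["missing_counts"].get)
        return {"operation": "fill_missing", "column": col, "params": {"strategy": "median"}}
    return {"operation": "drop_duplicates"}

obs = {"plan": ['fill_missing on "salary" (20 nulls) using median']}
print(choose_action(obs))  # {'operation': 'fill_missing', 'column': 'salary', 'params': {}}
```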
 
 
 
 
 
 
 
 
 
 
 
 
 
  ---

  | Property | Value |
  |---|---|
  | Dataset | 100-row employee records (name, age, salary, department, experience) |
+ | Issues | ~20% NaN in `age`, `salary`; ~10% NaN in `department` |
  | Goal | Fill all missing values |
  | Valid operations | `fill_missing` |
  | Grader | `1.0 − remaining_nulls / original_nulls` |
  | Max steps | 20 |
+ | Optimal steps | 3 |
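The Task 1 grader formula above fits in a few lines. A stdlib-only sketch that counts `None` cells in a list-of-dicts table; the real grader in `task1_missing.py` operates on pandas DataFrames and may round differently:

```python
# Sketch of the Task 1 grader: score = 1.0 - remaining_nulls / original_nulls.
# Here a "table" is a list of dicts and a missing cell is None.
def task1_score(rows: list[dict], original_nulls: int) -> float:
    remaining = sum(1 for row in rows for v in row.values() if v is None)
    return 1.0 - remaining / original_nulls

table = [
    {"age": 30, "salary": 50_000},
    {"age": None, "salary": 60_000},
    {"age": 41, "salary": None},
]
original = sum(1 for row in table for v in row.values() if v is None)  # 2 nulls

table[1]["age"] = 35.5  # fill one of the two missing cells
print(task1_score(table, original))  # 0.5
```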
  ### Task 2 — Fix Formats + Remove Duplicates *(Medium)*

  | Property | Value |
  |---|---|
  | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
+ | Issues | ~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows |
  | Goal | Standardise all phone/date formats and remove duplicates |
  | Valid operations | `fix_format`, `drop_duplicates` |
  | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |

  | Property | Value |
  |---|---|
  | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
+ | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats |
  | Goal | Fix all issues end-to-end |
  | Valid operations | All 6 operations |
  | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
  | Max steps | 40 |
  | Optimal steps | 8 |

+ ### Task 4 — Multi-Source Schema Alignment + Merge *(Expert)*
+
+ | Property | Value |
+ |---|---|
+ | Source A | 150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email` |
+ | Source B | 100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email` |
+ | Issues | Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows |
+ | Goal | Align schemas → merge → clean |
+ | Valid operations | `align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates` |
+ | Grader | `0.30×schema + 0.25×null + 0.20×country + 0.15×date + 0.10×dupe` |
+ | Max steps | 50 |
+ | Optimal steps | 8 |
+
+ *Inspired by Meta's DataSchema system — column-level semantic annotation across misaligned sources.*
+
  ---

  ## Reward Function

  | Score improves (delta > 0) | `new_score − old_score` (positive) |
  | Operation had no effect | `−0.01` |
  | Invalid operation / bad column | `−0.05` |

+ Rewards are bounded to **[−0.05, 0.99]**. Dense signal every step.
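The rules above amount to a small shaping function. A sketch assuming the environment knows the action's validity and the old/new grader scores; treating a negative delta the same as "no effect" is an assumption, and the exact clipping in `server/environment.py` may differ:

```python
# Sketch of the per-step reward rule tabulated above.
def step_reward(old_score: float, new_score: float, valid_action: bool) -> float:
    if not valid_action:
        return -0.05          # invalid operation / bad column
    delta = new_score - old_score
    if delta <= 0:
        return -0.01          # operation had no effect (negative delta assumed same)
    return min(delta, 0.99)   # positive delta, clipped to the stated upper bound

print(step_reward(0.0, 0.4, True))   # 0.4
print(step_reward(0.4, 0.4, True))   # -0.01
print(step_reward(0.4, 0.4, False))  # -0.05
```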
  ---

+ ## Intelligence Endpoints (Phase 2)

  | Method | Path | Description |
  |---|---|---|
+ | `GET` | `/profile` | Rich per-column DQ profile — null %, unique %, min/max/mean, top values |
+ | `GET` | `/report` | Full episode cleaning summary — score improvement, efficiency, issues fixed |
+ | `GET` | `/export` | Download current cleaned DataFrame as CSV |
+
+ ### `/profile` response example
+ ```json
+ {
+   "dq_metrics": {
+     "completeness_pct": 90.0,
+     "uniqueness_pct": 100.0,
+     "validity_pct": 88.5
+   },
+   "columns": {
+     "age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3}
+   }
+ }
+ ```
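The completeness and uniqueness percentages can be reproduced from first principles. A stdlib sketch over a list-of-dicts table; `validity_pct` is omitted because it depends on the per-column format rules the server applies:

```python
# Sketch of the completeness / uniqueness DQ dimensions shown above.
def dq_metrics(rows: list[dict]) -> dict:
    total_cells = sum(len(row) for row in rows)
    null_cells = sum(1 for row in rows for v in row.values() if v is None)
    seen, duplicate_rows = set(), 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicate_rows += 1
        seen.add(key)
    return {
        "completeness_pct": round(100.0 * (total_cells - null_cells) / total_cells, 2),
        "uniqueness_pct": round(100.0 * (len(rows) - duplicate_rows) / len(rows), 2),
        "total_cells": total_cells,
        "null_cells": null_cells,
        "duplicate_rows": duplicate_rows,
    }

rows = [{"age": 30}, {"age": None}, {"age": 30}]
print(dq_metrics(rows)["completeness_pct"])  # 66.67
```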
+
+ ### `/report` response example
+ ```json
+ {
+   "initial_score": 0.01,
+   "final_score": 0.99,
+   "score_improvement": 0.98,
+   "steps_taken": 3,
+   "step_efficiency_pct": 85.0,
+   "issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168},
+   "completed": true
+ }
+ ```
+
+ ---
+
+ ## All API Endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | `GET` | `/` | DataMedic live monitoring UI |
  | `GET` | `/health` | Health check → `{"status": "healthy"}` |
+ | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3\|4}` |
  | `POST` | `/step` | Execute action. Body: action JSON |
+ | `GET` | `/state` | Episode metadata |
+ | `GET` | `/metadata` | Environment info + paper citations |
+ | `GET` | `/schema` | Full action/observation/state JSON schemas |
+ | `GET` | `/profile` | Rich data quality profile of current DataFrame |
+ | `GET` | `/report` | Full episode cleaning summary |
+ | `GET` | `/export` | Download cleaned DataFrame as CSV |
  | `GET` | `/docs` | Interactive Swagger UI |
  ---

  | 1 — Fill Missing Values | Easy | 0.999 |
  | 2 — Fix Formats + Duplicates | Medium | 0.999 |
  | 3 — Full Cleaning Pipeline | Hard | 0.999 |
+ | 4 — Multi-Source Merge | Expert | 0.990 |
+ | **Average** | — | **0.997** |

  ---

  ## Setup & Usage

  ### Prerequisites
  - Python 3.11+
  - Docker (for containerised deployment)

  ### Local — Python
  ```bash
  git clone https://github.com/Tanvi51204/openEnv.git
  cd openEnv
  pip install -r requirements.txt
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
  ```

+ Then open:
+ - UI: http://localhost:8000
+ - Docs: http://localhost:8000/docs
+
  ### Local — Docker
  ```bash
  docker build -t data-cleaning-env .
  docker run -p 8000:8000 data-cleaning-env
  ```
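Once the server is up, any HTTP client works. A dependency-free Python sketch; the repository's own `client.py` (`DataCleaningEnvClient`, shown in an earlier revision of this README) is the supported interface, and this standalone version is illustrative only:

```python
import json
import urllib.request

# Minimal stdlib client sketch for the HTTP API above.
class MiniEnvClient:
    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:  # network call; server must be running
            return json.load(resp)

    def reset(self, task_id: int) -> dict:
        return self._post("/reset", {"task_id": task_id})

    def step(self, operation: str, column=None, **params) -> dict:
        return self._post("/step", {"operation": operation, "column": column, "params": params})

# Usage against a locally running server:
#   env = MiniEnvClient()
#   obs = env.reset(task_id=1)
#   result = env.step("fill_missing", column="salary", strategy="median")
#   print(result["reward"])
```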
  ### Run baseline inference
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
+ export HF_TOKEN="sk-..."
  export ENV_URL="http://localhost:8000"

  python inference.py

  ---

  ## Project Structure
+
  ```
  openenv-data-cleaning/
+ ├── models.py            Pydantic contracts — Action / Observation / State / DQMetrics / Report
  ├── client.py            Sync HTTP client (reset / step / state / health)
  ├── inference.py         Baseline LLM agent with [START]/[STEP]/[END] logging
  ├── Dockerfile           python:3.11-slim, non-root user, HEALTHCHECK
  ├── requirements.txt     pip dependencies
  └── server/
+     ├── app.py               FastAPI routes + /profile + /report + /export + UI
+     ├── environment.py       reset / step / state + 8 operations + planning engine + DQ metrics
      ├── data_generator.py    Synthetic dataset generation (seed=42, reproducible)
+     ├── ui.html              DataMedic live monitoring dashboard
      └── tasks/
+         ├── task1_missing.py     Easy — fill NaN grader
          ├── task2_format.py      Medium — format + duplicates grader
+         ├── task3_pipeline.py    Hard — full pipeline grader
+         └── task4_merge.py       Expert — multi-source schema alignment + merge grader
  ```

  ---

  🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space

+ - UI: https://srishtichugh-openenv-hack.hf.space
  - Health: https://srishtichugh-openenv-hack.hf.space/health
+ - Docs: https://srishtichugh-openenv-hack.hf.space/docs
+ - Profile: https://srishtichugh-openenv-hack.hf.space/profile
+ - Report: https://srishtichugh-openenv-hack.hf.space/report
inference.py CHANGED
@@ -37,6 +37,8 @@ if not HF_TOKEN:

  client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)

  SYSTEM_PROMPT = """You are a data cleaning agent. You control a data cleaning environment
  through JSON actions. Each turn you receive an observation JSON describing the current state
  of a dataset (preview, missing counts, duplicate count, dtype issues, current score, etc.)

@@ -165,10 +167,13 @@ def run_task(task_id: int) -> float:
      obs_text = obs_to_text(obs)
      history.append({"role": "user", "content": obs_text})

      try:
          response = client.chat.completions.create(
              model = MODEL_NAME,
-             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history,
              temperature = 0.0,
              max_tokens = 256,
          )

@@ -207,7 +212,7 @@ def run_task(task_id: int) -> float:
      obs = result["observation"]
      step_reward = result["reward"]
      done = result["done"]
-     error_msg = None if obs["message"].startswith("Fill") or step_reward >= 0 else obs["message"]

      print(f" -> {obs['message']}", file=sys.stderr)

  client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)

+ HISTORY_WINDOW = 8  # keep last N turns (user+assistant pairs) to cap token usage
+
  SYSTEM_PROMPT = """You are a data cleaning agent. You control a data cleaning environment
  through JSON actions. Each turn you receive an observation JSON describing the current state
  of a dataset (preview, missing counts, duplicate count, dtype issues, current score, etc.)

      obs_text = obs_to_text(obs)
      history.append({"role": "user", "content": obs_text})

+     # Sliding window — keep the system prompt plus the last HISTORY_WINDOW user/assistant turns
+     windowed_history = history[-(HISTORY_WINDOW * 2):]
+
      try:
          response = client.chat.completions.create(
              model = MODEL_NAME,
+             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + windowed_history,
              temperature = 0.0,
              max_tokens = 256,
          )

      obs = result["observation"]
      step_reward = result["reward"]
      done = result["done"]
+     error_msg = None if step_reward >= 0 else obs["message"]

      print(f" -> {obs['message']}", file=sys.stderr)
models.py CHANGED
@@ -9,28 +9,46 @@ class DataCleaningAction(BaseModel):
      operation choices:
          fill_missing     – fill NaN values in a column
          drop_duplicates  – remove duplicate rows
-         fix_format       – standardise string formats (phone, date, text)
          replace_value    – replace a specific value with another
          drop_outliers    – remove rows where column value is a statistical outlier
          fix_dtype        – cast a column to the correct dtype
      """
      operation: str
      column: Optional[str] = None
      params: Dict[str, Any] = {}

  class DataCleaningObservation(BaseModel):
      done: bool
      reward: float
-     data_preview: str      # First 10 rows as CSV string
-     data_shape: List[int]  # [rows, cols]
      missing_counts: Dict[str, int]
      duplicate_count: int
      dtype_issues: Dict[str, str]
      task_description: str
      message: str
      step_count: int
-     current_score: float   # Running grader score 0.0–1.0

  class DataCleaningState(BaseModel):
@@ -40,3 +58,20 @@ class DataCleaningState(BaseModel):
      max_steps: int
      total_errors: int
      errors_remaining: int

      operation choices:
          fill_missing     – fill NaN values in a column
          drop_duplicates  – remove duplicate rows
+         fix_format       – standardise string formats (phone, date, country)
          replace_value    – replace a specific value with another
          drop_outliers    – remove rows where column value is a statistical outlier
          fix_dtype        – cast a column to the correct dtype
+         align_schema     – rename / reorder columns to match target schema (Task 4)
+         merge_sources    – merge the two aligned source DataFrames (Task 4)
      """
      operation: str
      column: Optional[str] = None
      params: Dict[str, Any] = {}

+ class DataQualityMetrics(BaseModel):
+     """Standard DQ dimensions — populated by /profile and embedded in every observation."""
+     completeness_pct: float  # % non-null cells across whole DataFrame
+     uniqueness_pct: float    # % rows that are not duplicates
+     validity_pct: float      # % cells passing format / dtype / range constraints
+     total_cells: int
+     null_cells: int
+     duplicate_rows: int
+     invalid_cells: int       # format violations + dtype issues + out-of-range values
+
+
  class DataCleaningObservation(BaseModel):
      done: bool
      reward: float
+     data_preview: str      # First 10 rows as CSV string
+     data_shape: List[int]  # [rows, cols]
      missing_counts: Dict[str, int]
      duplicate_count: int
      dtype_issues: Dict[str, str]
      task_description: str
      message: str
      step_count: int
+     current_score: float   # Running grader score 0.0-1.0
+
+     # --- Phase 2 additions ---
+     dq_metrics: DataQualityMetrics  # Live data quality vitals
+     tried_operations: List[str]     # e.g. ["fill_missing:age", "drop_duplicates"]
+     plan: List[str]                 # Agent-facing recommended next 1-3 actions

  class DataCleaningState(BaseModel):
      max_steps: int
      total_errors: int
      errors_remaining: int
+
+
+ class EpisodeReport(BaseModel):
+     """Returned by GET /report — full cleaning episode summary."""
+     episode_id: str
+     task_id: int
+     task_name: str
+     initial_score: float
+     final_score: float
+     score_improvement: float
+     steps_taken: int
+     max_steps: int
+     step_efficiency_pct: float     # How few steps used vs max (higher = better)
+     operations_applied: List[str]  # Ordered list of what was done
+     issues_fixed: Dict[str, int]   # e.g. {"nulls_filled": 40, "dupes_removed": 15}
+     final_dq_metrics: DataQualityMetrics
+     completed: bool                # True if score >= 0.95
server/app.py CHANGED
@@ -1,26 +1,46 @@
  """
  FastAPI application exposing the OpenEnv-compatible HTTP API.
- Endpoints: GET /health, GET /metadata, GET /schema,
-            POST /reset, POST /step, GET /state, POST /state, GET /docs
  """

  from typing import Any, Dict, Optional
  from fastapi import Body, FastAPI, HTTPException
  from pydantic import BaseModel
  import uvicorn

- from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
  from server.environment import DataCleaningEnvironment

  app = FastAPI(
      title="Data Cleaning OpenEnv",
-     description="A real-world data cleaning environment for AI agent training.",
-     version="0.1.0",
  )

- # Single shared environment instance (stateful server)
  env = DataCleaningEnvironment()

  class ResetRequest(BaseModel):
      task_id: Optional[int] = None
@@ -34,9 +54,17 @@ class StepResponse(BaseModel):

  # ------------------------------------------------------------------
- # Routes
  # ------------------------------------------------------------------

  @app.get("/health")
  def health():
      return {"status": "healthy"}
@@ -47,16 +75,24 @@ def metadata():
      return {
          "name": "data-cleaning-env",
          "description": (
-             "A real-world data cleaning environment where an AI agent fixes "
-             "missing values, duplicate rows, format inconsistencies, outliers, "
-             "and dtype errors across three progressively harder tasks."
          ),
-         "version": "0.1.0",
-         "tags": ["openenv", "data-cleaning", "rl", "real-world"],
          "tasks": [
-             {"id": "task1", "name": "Fill Missing Values", "difficulty": "easy"},
              {"id": "task2", "name": "Fix Formats and Remove Duplicates", "difficulty": "medium"},
-             {"id": "task3", "name": "Full Cleaning Pipeline", "difficulty": "hard"},
          ],
      }
@@ -70,16 +106,13 @@ def schema():
              "operation": {
                  "type": "string",
                  "enum": [
-                     "fill_missing",
-                     "drop_duplicates",
-                     "fix_format",
-                     "replace_value",
-                     "drop_outliers",
-                     "fix_dtype",
                  ],
              },
              "column": {"type": "string", "nullable": True},
-             "params": {"type": "object", "nullable": True},
          },
          "required": ["operation"],
      },
@@ -97,6 +130,9 @@ def schema():
      "message": {"type": "string"},
      "step_count": {"type": "integer"},
      "current_score": {"type": "number"},
      },
  },
  "state": {
@@ -127,13 +163,20 @@ async def step(body: Dict[str, Any] = Body(...)):
      """
      Accept both openenv-core wrapped format:
          {"action": {"operation": "...", ...}, "timeout_s": 15}
-     and direct format (for backward compat with our own client/inference):
131
  {"operation": "...", "column": "...", "params": {...}}
 
132
  """
133
  action_data = body.get("action", body)
134
  try:
135
  action = DataCleaningAction(**action_data)
136
- obs = env.step(action)
 
 
 
 
 
 
137
  except (TypeError, KeyError, Exception) as e:
138
  raise HTTPException(status_code=400, detail=str(e))
139
  return StepResponse(observation=obs, reward=obs.reward, done=obs.done)
@@ -141,23 +184,77 @@ async def step(body: Dict[str, Any] = Body(...)):
141
 
142
  @app.get("/state", response_model=DataCleaningState)
143
  def state_get():
144
- """GET /state β€” openenv-core spec."""
145
  return env.state()
146
 
147
 
148
  @app.post("/state", response_model=DataCleaningState)
149
  def state_post():
150
- """POST /state β€” backward compatibility."""
151
  return env.state()
152
 
153
 
154
  # ------------------------------------------------------------------
155
- # Entry point (required by openenv-core and [project.scripts])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
156
  # ------------------------------------------------------------------
157
 
158
  def main():
159
- uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
160
 
161
 
162
  if __name__ == "__main__":
163
- main()
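The dual payload handling in the `/step` handler hinges on one line, `body.get("action", body)`: if the request nests the action under an `"action"` key (the openenv-core wrapped format), the inner dict is used; otherwise the body itself is treated as the action. A standalone sketch of just that normalization step:

```python
# Sketch of the payload normalization used by POST /step. extract_action is a
# name invented here; the real handler does this inline.
def extract_action(body: dict) -> dict:
    # Wrapped format: {"action": {...}, "timeout_s": 15} -> return the inner dict.
    # Direct format:  {"operation": ..., "column": ...}  -> return body itself.
    return body.get("action", body)

wrapped = {"action": {"operation": "fill_missing", "column": "age"}, "timeout_s": 15}
direct = {"operation": "fill_missing", "column": "age"}
norm_wrapped = extract_action(wrapped)
norm_direct = extract_action(direct)
```

Both forms normalize to the same action dict, so one Pydantic model validates both.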
 
  """
  FastAPI application exposing the OpenEnv-compatible HTTP API.
+
+ Endpoints:
+     GET  /health    Health check
+     GET  /metadata  Environment info
+     GET  /schema    Action / observation / state schemas
+     POST /reset     Start new episode
+     POST /step      Execute cleaning action (with 30s timeout)
+     GET  /state     Episode metadata
+     POST /state     Episode metadata (backward compat)
+     GET  /profile   Rich data quality profile of current DataFrame
+     GET  /report    Full episode cleaning summary (health certificate)
+     GET  /export    Download current cleaned DataFrame as CSV
  """

+ import asyncio
+ import os
  from typing import Any, Dict, Optional
  from fastapi import Body, FastAPI, HTTPException
+ from fastapi.responses import PlainTextResponse, HTMLResponse
  from pydantic import BaseModel
  import uvicorn

+ from models import DataCleaningAction, DataCleaningObservation, DataCleaningState, EpisodeReport
  from server.environment import DataCleaningEnvironment

  app = FastAPI(
      title="Data Cleaning OpenEnv",
+     description=(
+         "A real-world data cleaning environment for AI agent training and evaluation. "
+         "An agent interacts with dirty pandas DataFrames through a standard reset/step/state API, "
+         "learning to fix missing values, duplicates, format inconsistencies, outliers, and dtype errors. "
+         "Grounded in CleanAgent (2024), AutoDCWorkflow (EMNLP 2025), and Meta-scale data quality principles."
+     ),
+     version="0.2.0",
  )

+ # Single shared environment instance
  env = DataCleaningEnvironment()

+ STEP_TIMEOUT_SECONDS = 30
+

  class ResetRequest(BaseModel):
      task_id: Optional[int] = None


  # ------------------------------------------------------------------
+ # Core OpenEnv routes
  # ------------------------------------------------------------------

+ @app.get("/", response_class=HTMLResponse, include_in_schema=False)
+ def ui():
+     """DataMedic — live agent monitoring dashboard."""
+     ui_path = os.path.join(os.path.dirname(__file__), "ui.html")
+     with open(ui_path, "r") as f:
+         return HTMLResponse(content=f.read())
+
+
  @app.get("/health")
  def health():
      return {"status": "healthy"}

      return {
          "name": "data-cleaning-env",
          "description": (
+             "A real-world data cleaning RL environment. The agent diagnoses dirty datasets, "
+             "plans a treatment, executes cleaning operations step-by-step, and produces a "
+             "health certificate — grounded in AutoDCWorkflow, CleanAgent, and HoloClean research."
          ),
+         "version": "0.2.0",
+         "tags": ["openenv", "data-cleaning", "rl", "real-world", "agentic"],
          "tasks": [
+             {"id": "task1", "name": "Fill Missing Values", "difficulty": "easy"},
              {"id": "task2", "name": "Fix Formats and Remove Duplicates", "difficulty": "medium"},
+             {"id": "task3", "name": "Full Cleaning Pipeline", "difficulty": "hard"},
+             {"id": "task4", "name": "Multi-Source Schema Alignment + Merge", "difficulty": "expert"},
+         ],
+         "observation_extras": ["dq_metrics", "tried_operations", "plan"],
+         "papers": [
+             "Bendinelli et al. 2025 — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)",
+             "CleanAgent — Qi & Wang 2024 (arXiv:2403.08291)",
+             "AutoDCWorkflow — EMNLP 2025 Findings",
+             "HoloClean — Rekatsinas et al. 2017",
          ],
      }

          "operation": {
              "type": "string",
              "enum": [
+                 "fill_missing", "drop_duplicates", "fix_format",
+                 "replace_value", "drop_outliers", "fix_dtype",
+                 "align_schema", "merge_sources",
              ],
          },
          "column": {"type": "string", "nullable": True},
+         "params": {"type": "object", "nullable": True},
      },
      "required": ["operation"],
  },

      "message": {"type": "string"},
      "step_count": {"type": "integer"},
      "current_score": {"type": "number"},
+     "dq_metrics": {"type": "object", "description": "Completeness/uniqueness/validity %"},
+     "tried_operations": {"type": "array", "description": "Operations already applied"},
+     "plan": {"type": "array", "description": "Agent recommended next actions"},
  },
  },
  "state": {

      """
      Accept both openenv-core wrapped format:
          {"action": {"operation": "...", ...}, "timeout_s": 15}
+     and direct format:
          {"operation": "...", "column": "...", "params": {...}}
+     Times out after 30 seconds to prevent hanging during evaluation.
      """
      action_data = body.get("action", body)
      try:
          action = DataCleaningAction(**action_data)
+         loop = asyncio.get_event_loop()
+         obs = await asyncio.wait_for(
+             loop.run_in_executor(None, env.step, action),
+             timeout=STEP_TIMEOUT_SECONDS,
+         )
+     except asyncio.TimeoutError:
+         raise HTTPException(status_code=504, detail=f"Step timed out after {STEP_TIMEOUT_SECONDS}s")
      except (TypeError, KeyError, Exception) as e:
          raise HTTPException(status_code=400, detail=str(e))
      return StepResponse(observation=obs, reward=obs.reward, done=obs.done)


  @app.get("/state", response_model=DataCleaningState)
  def state_get():
      return env.state()


  @app.post("/state", response_model=DataCleaningState)
  def state_post():
      return env.state()


  # ------------------------------------------------------------------
+ # Phase 2: Intelligence endpoints
+ # ------------------------------------------------------------------
+
+ @app.get("/profile")
+ def profile():
+     """
+     Rich data quality profile of the current DataFrame.
+
+     Returns per-column statistics (null %, unique %, min/max/mean for numerics,
+     top values for categoricals) plus dataset-level DQ metrics:
+     completeness %, uniqueness %, validity %.
+
+     Inspired by standard Data Quality dimensions (ISO 8000) and
+     Meta's data schematization approach.
+     """
+     try:
+         return env.get_profile()
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ @app.get("/report", response_model=EpisodeReport)
+ def report():
+     """
+     Full episode cleaning summary — the 'health certificate'.
+
+     Returns: initial vs final score, score improvement, step efficiency,
+     ordered list of operations applied, issues fixed by category,
+     and final DQ metrics. Call after the episode completes for best results.
+     """
+     try:
+         return env.get_report()
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ @app.get("/export")
+ def export():
+     """
+     Download the current (cleaned) DataFrame as a CSV file.
+
+     Returns the live state of the DataFrame — call after the agent
+     finishes cleaning to get the cleaned output.
+     """
+     try:
+         csv_data = env.get_export()
+         return PlainTextResponse(
+             content=csv_data,
+             media_type="text/csv",
+             headers={"Content-Disposition": "attachment; filename=cleaned_data.csv"},
+         )
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ # ------------------------------------------------------------------
+ # Entry point
  # ------------------------------------------------------------------

  def main():
+     uvicorn.run("server.app:app", host="0.0.0.0", port=8000, workers=1)


  if __name__ == "__main__":
+     main()
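The `/step` timeout pattern above runs the blocking `env.step` in a thread executor and bounds it with `asyncio.wait_for`. A self-contained sketch of that pattern (with a stand-in `slow_step` in place of `env.step`):

```python
import asyncio

# slow_step stands in for the blocking env.step call; it is not part of the
# real environment.
def slow_step() -> str:
    return "cleaned"

async def step_with_timeout(timeout_s: float) -> str:
    # Run the blocking call off the event loop, and cancel it (raising
    # asyncio.TimeoutError) if it exceeds the budget.
    loop = asyncio.get_running_loop()
    return await asyncio.wait_for(loop.run_in_executor(None, slow_step), timeout=timeout_s)

result = asyncio.run(step_with_timeout(5.0))
```

On timeout, `asyncio.wait_for` raises `asyncio.TimeoutError`, which the handler maps to an HTTP 504.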
server/data_generator.py CHANGED
@@ -195,3 +195,105 @@ def generate_task3_datasets():
      dirty_df = pd.concat([dirty_df, dup_rows], ignore_index=True)

      return dirty_df.reset_index(drop=True), clean_df.reset_index(drop=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Task 4 — Multi-source merge pipeline (Expert)
+ # ---------------------------------------------------------------------------
+ # Two independently generated "source" DataFrames with misaligned schemas
+ # that must be aligned and merged before the standard cleaning pipeline.
+ #
+ # Source A — CRM export (150 rows):
+ #     cust_id, full_name, Age, purchase_amt, Country, signup
+ #
+ # Source B — Marketing export (100 rows):
+ #     customer_id, name, age_years, spend, country_name, registration_date, email
+ #
+ # Target schema after align_schema + merge_sources (250 rows):
+ #     customer_id, name, age, purchase_amount, country, signup_date, email
+ #
+ # Additional dirty issues injected after merge:
+ #     - Missing values in age, purchase_amount, country (~10%)
+ #     - Mixed country capitalisation (~30%)
+ #     - Mixed date formats in signup_date (~40%)
+ #     - 10 duplicate rows
+
+ def generate_task4_datasets():
+     """
+     Returns (source_a, source_b, clean_merged_df).
+     source_a and source_b have misaligned schemas.
+     clean_merged_df is the ground-truth after alignment + merge + cleaning.
+     """
+     rng = np.random.default_rng(SEED + 4)  # distinct seed offset
+     random.seed(SEED + 4)
+
+     countries = ["USA", "UK", "Canada", "Australia", "Germany"]
+     first_names = ["Alice", "Bob", "Carol", "David", "Eve", "Frank",
+                    "Grace", "Heidi", "Ivan", "Judy", "Karl", "Laura"]
+     last_names = ["Smith", "Jones", "Brown", "Taylor", "Wilson",
+                   "Davis", "Miller", "Anderson", "Thomas", "Jackson"]
+
+     # ---- Source A — CRM (150 rows) ----
+     n_a = 150
+     names_a = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n_a)]
+     ages_a = rng.integers(18, 75, size=n_a).astype(float)
+     amounts_a = np.round(rng.uniform(10.0, 500.0, size=n_a), 2)
+     countries_a = rng.choice(countries, size=n_a)
+     days_a = rng.integers(0, 730, size=n_a)
+     dates_a = [(pd.Timestamp("2022-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+                for d in days_a]
+     emails_a = [f"crm_{i}@example.com" for i in range(1, n_a + 1)]
+
+     source_a = pd.DataFrame({
+         "cust_id": [f"A{str(i).zfill(4)}" for i in range(1, n_a + 1)],
+         "full_name": names_a,        # → name
+         "Age": ages_a,               # → age (capital A — schema mismatch)
+         "purchase_amt": amounts_a,   # → purchase_amount (truncated name)
+         "Country": countries_a,      # → country (capital C)
+         "signup": dates_a,           # → signup_date (truncated name)
+         "email": emails_a,
+     })
+
+     # ---- Source B — Marketing (100 rows) ----
+     n_b = 100
+     names_b = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(n_b)]
+     ages_b = rng.integers(18, 75, size=n_b).astype(float)
+     amounts_b = np.round(rng.uniform(10.0, 500.0, size=n_b), 2)
+     countries_b = rng.choice(countries, size=n_b)
+     days_b = rng.integers(0, 730, size=n_b)
+     dates_b = [(pd.Timestamp("2022-01-01") + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
+                for d in days_b]
+     emails_b = [f"mkt_{i}@example.com" for i in range(1, n_b + 1)]
+
+     source_b = pd.DataFrame({
+         "customer_id": [f"B{str(i).zfill(4)}" for i in range(1, n_b + 1)],
+         "name": names_b,
+         "age_years": ages_b,             # → age (suffix mismatch)
+         "spend": amounts_b,              # → purchase_amount (synonym)
+         "country_name": countries_b,     # → country (suffix mismatch)
+         "registration_date": dates_b,    # → signup_date (synonym)
+         "email": emails_b,
+     })
+
+     # ---- Ground-truth clean merged DataFrame ----
+     clean_a = pd.DataFrame({
+         "customer_id": source_a["cust_id"],
+         "name": source_a["full_name"],
+         "age": source_a["Age"],
+         "purchase_amount": source_a["purchase_amt"],
+         "country": source_a["Country"],
+         "signup_date": source_a["signup"],
+         "email": source_a["email"],
+     })
+     clean_b = pd.DataFrame({
+         "customer_id": source_b["customer_id"],
+         "name": source_b["name"],
+         "age": source_b["age_years"],
+         "purchase_amount": source_b["spend"],
+         "country": source_b["country_name"],
+         "signup_date": source_b["registration_date"],
+         "email": source_b["email"],
+     })
+     clean_merged = pd.concat([clean_a, clean_b], ignore_index=True).reset_index(drop=True)
+
+     return source_a.copy(), source_b.copy(), clean_merged
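The column comments in `generate_task4_datasets` spell out the rename mapping each source needs. A minimal sketch of the `align_schema` + `merge_sources` idea using those same mappings (the function and constant names here are illustrative, not the environment's actual operation implementations):

```python
import pandas as pd

# Rename maps mirroring the column comments in generate_task4_datasets.
RENAME_A = {"cust_id": "customer_id", "full_name": "name", "Age": "age",
            "purchase_amt": "purchase_amount", "Country": "country", "signup": "signup_date"}
RENAME_B = {"age_years": "age", "spend": "purchase_amount",
            "country_name": "country", "registration_date": "signup_date"}

def align_and_merge(source_a: pd.DataFrame, source_b: pd.DataFrame) -> pd.DataFrame:
    # Align both sources to the shared target schema, then stack them.
    a = source_a.rename(columns=RENAME_A)
    b = source_b.rename(columns=RENAME_B)
    return pd.concat([a, b], ignore_index=True)

a = pd.DataFrame({"cust_id": ["A0001"], "full_name": ["Alice Smith"], "Age": [30.0],
                  "purchase_amt": [19.99], "Country": ["USA"], "signup": ["2022-01-01"],
                  "email": ["crm_1@example.com"]})
b = pd.DataFrame({"customer_id": ["B0001"], "name": ["Bob Jones"], "age_years": [41.0],
                  "spend": [250.0], "country_name": ["UK"], "registration_date": ["2022-05-01"],
                  "email": ["mkt_1@example.com"]})
merged = align_and_merge(a, b)
```

Because both renamed frames share the target schema, `pd.concat` produces the 7-column merged table with no NaN-padded extra columns.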
server/environment.py CHANGED
@@ -1,21 +1,37 @@
  """
  Core environment implementing reset / step / state.
- Each call to reset() picks a task (round-robin: 1 → 2 → 3 → 1 …)
  or a specific task_id can be forced via reset(task_id=N).
  """

  import re
  import uuid
  import numpy as np
  import pandas as pd
- from typing import Any, Dict, Optional, Tuple

- from models import DataCleaningAction, DataCleaningObservation, DataCleaningState
  import server.tasks.task1_missing as t1
  import server.tasks.task2_format as t2
  import server.tasks.task3_pipeline as t3

- TASK_MODULES = {1: t1, 2: t2, 3: t3}

  PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
@@ -25,16 +41,28 @@ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
  class DataCleaningEnvironment:

      def __init__(self):
-         self._df: Optional[pd.DataFrame] = None
          self._clean_df: Optional[pd.DataFrame] = None
-         self._meta: Any = None  # task-specific metadata
-         self._task_id: int = 1
-         self._episode_id: str = ""
-         self._step_count: int = 0
-         self._max_steps: int = 20
-         self._total_errors: int = 0
-         self._last_score: float = 0.01
-         self._task_cycle: int = 0  # for round-robin default

      # ------------------------------------------------------------------
      # Public API
@@ -46,21 +74,35 @@ class DataCleaningEnvironment:
          task_id = self._task_cycle

          if task_id not in TASK_MODULES:
-             raise ValueError(f"task_id must be 1, 2, or 3 — got {task_id}")

          mod = TASK_MODULES[task_id]
-         self._task_id = task_id
          self._episode_id = str(uuid.uuid4())
          self._step_count = 0
          self._max_steps = mod.MAX_STEPS

-         if task_id == 1:
-             self._df, self._clean_df, self._meta = mod.load()
          else:
              self._df, self._clean_df, self._meta = mod.load()

-         self._last_score = self._compute_score()
-         self._total_errors = self._count_errors()

          return self._build_obs(self._last_score, False, "Episode started. Begin cleaning.")

@@ -71,27 +113,32 @@ class DataCleaningEnvironment:
          self._step_count += 1
          score_before = self._last_score

          message, applied = self._apply_action(action)

-         score_after = self._compute_score()
          self._last_score = score_after

          delta = score_after - score_before
          if not applied:
-             reward = 0.01
          elif delta <= 0:
-             reward = 0.01
          else:
              reward = round(delta, 4)

          done = (score_after >= 0.95) or (self._step_count >= self._max_steps)
-
-         # Clamp reward strictly within (0.01, 0.99) — no terminal bonus
-         reward = round(max(0.01, min(0.99, reward)), 4)

          return self._build_obs(reward, done, message)

-
      def state(self) -> DataCleaningState:
          if self._df is None:
              return DataCleaningState(
@@ -99,37 +146,235 @@ class DataCleaningEnvironment:
                  max_steps=0, total_errors=0, errors_remaining=0,
              )
          return DataCleaningState(
-             episode_id = self._episode_id,
-             task_id = self._task_id,
-             step_count = self._step_count,
-             max_steps = self._max_steps,
-             total_errors = self._total_errors,
              errors_remaining = self._count_errors(),
          )

      # ------------------------------------------------------------------
      # Internal helpers
      # ------------------------------------------------------------------

      def _compute_score(self) -> float:
          if self._task_id == 1:
              raw = t1.score(self._df, self._meta)
          elif self._task_id == 2:
              raw = t2.score(self._df, self._meta)
-         else:
              raw = t3.score(self._df, self._meta)
-
-         EPS = 1e-4
-
-         # First round safely
          raw = float(raw)
-
-         # HARD clamp AFTER rounding risk
          if raw >= 1.0:
              raw = 1.0 - EPS
          elif raw <= 0.0:
              raw = EPS
-
          return round(raw, 4)

      def _count_errors(self) -> int:
@@ -137,28 +382,35 @@ class DataCleaningEnvironment:
              return t1.count_errors(self._df)
          elif self._task_id == 2:
              return t2.count_errors(self._df, self._meta)
-         else:
              return t3.count_errors(self._df, self._meta)

      def _build_obs(self, reward: float, done: bool, message: str) -> DataCleaningObservation:
-         mod = TASK_MODULES[self._task_id]
-         missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
-         dupes = len(self._df) - len(self._df.drop_duplicates())
          dtype_issues = self._detect_dtype_issues()
-         preview = self._df.head(10).to_csv(index=False)

          return DataCleaningObservation(
-             done = done,
-             reward = reward,
-             data_preview = preview,
-             data_shape = list(self._df.shape),
-             missing_counts = missing,
-             duplicate_count = dupes,
-             dtype_issues = dtype_issues,
-             task_description = mod.DESCRIPTION,
-             message = message,
-             step_count = self._step_count,
-             current_score = self._last_score,
          )

      def _detect_dtype_issues(self) -> Dict[str, str]:
@@ -195,8 +447,17 @@ class DataCleaningEnvironment:
              return self._drop_outliers(col)
          elif op == "fix_dtype":
              return self._fix_dtype(col, p)
          else:
-             return f"Unknown operation '{op}'. Choose from: fill_missing, drop_duplicates, fix_format, replace_value, drop_outliers, fix_dtype.", False
          except Exception as exc:
              return f"Operation failed: {exc}", False

@@ -230,8 +491,7 @@ class DataCleaningEnvironment:
      def _drop_duplicates(self) -> Tuple[str, bool]:
          n_before = len(self._df)
          self._df = self._df.drop_duplicates().reset_index(drop=True)
-         n_after = len(self._df)
-         removed = n_before - n_after
          if removed == 0:
              return "No duplicate rows found.", False
          return f"Dropped {removed} duplicate rows.", True
@@ -239,7 +499,6 @@ class DataCleaningEnvironment:
      def _fix_format(self, col) -> Tuple[str, bool]:
          if col is None or col not in self._df.columns:
              return f"Column '{col}' not found.", False
-
          if col == "phone":
              return self._fix_phone(col)
          elif col in ("listed_date", "signup_date"):
@@ -278,7 +537,6 @@ class DataCleaningEnvironment:
                  return pd.to_datetime(s, format=fmt).strftime("%Y-%m-%d")
              except Exception:
                  pass
-         # last-resort flexible parse
          try:
              return pd.to_datetime(s).strftime("%Y-%m-%d")
          except Exception:
@@ -311,7 +569,7 @@ class DataCleaningEnvironment:
          after = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
          fixed = int(before - after)
          if fixed == 0:
-             return f"No country capitalisation issues found.", False
          return f"Fixed {fixed} country values to correct capitalisation.", True

      def _replace_value(self, col, p) -> Tuple[str, bool]:
@@ -343,6 +601,69 @@ class DataCleaningEnvironment:
              return f"No outliers found in '{col}'.", False
          return f"Removed {removed} outlier rows from '{col}' using IQR method.", True

      def _fix_dtype(self, col, p) -> Tuple[str, bool]:
          if col is None or col not in self._df.columns:
              return f"Column '{col}' not found.", False
@@ -358,4 +679,4 @@ class DataCleaningEnvironment:
              return f"Unknown dtype '{dtype}'.", False
          return f"Converted '{col}' to {dtype}.", True
          except Exception as exc:
-             return f"dtype conversion failed: {exc}", False
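`_drop_outliers` reports removals "using IQR method". The diff does not show its fence multiplier, so the conventional 1.5x IQR fence is assumed in this stdlib-only sketch (`iqr_bounds` is a name invented here):

```python
# Sketch of an IQR outlier fence, assuming the conventional 1.5x multiplier.
def iqr_bounds(values, k=1.5):
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between order statistics (numpy's default method).
        idx = q * (len(s) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        return s[lo] + (idx - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1.0, 2.0, 3.0, 4.0, 100.0]
lo, hi = iqr_bounds(data)
kept = [v for v in data if lo <= v <= hi]  # rows outside the fence are dropped
```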
 
1
  """
2
  Core environment implementing reset / step / state.
3
+ Each call to reset() picks a task (round-robin: 1 -> 2 -> 3 -> 1 ...)
4
  or a specific task_id can be forced via reset(task_id=N).
5
+
6
+ Phase 2 additions:
7
+ - DataQualityMetrics computed every step (completeness, uniqueness, validity)
8
+ - tried_operations: deduplication log so agent avoids repeating useless ops
9
+ - plan: rule-based next-action recommendations surfaced in every observation
10
+ - Episode history tracked for /report endpoint
11
  """
12
 
13
  import re
14
  import uuid
15
  import numpy as np
16
  import pandas as pd
17
+ from typing import Any, Dict, List, Optional, Tuple
18
 
19
+ from models import (
20
+ DataCleaningAction, DataCleaningObservation,
21
+ DataCleaningState, DataQualityMetrics, EpisodeReport,
22
+ )
23
  import server.tasks.task1_missing as t1
24
  import server.tasks.task2_format as t2
25
  import server.tasks.task3_pipeline as t3
26
+ import server.tasks.task4_merge as t4
27
 
28
+ TASK_MODULES = {1: t1, 2: t2, 3: t3, 4: t4}
29
+ TASK_NAMES = {
30
+ 1: "Fill Missing Values",
31
+ 2: "Fix Formats + Remove Duplicates",
32
+ 3: "Full Cleaning Pipeline",
33
+ 4: "Multi-Source Schema Alignment + Merge",
34
+ }
35
 
36
  PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
37
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
 
41
  class DataCleaningEnvironment:
42
 
43
  def __init__(self):
44
+ self._df: Optional[pd.DataFrame] = None
45
  self._clean_df: Optional[pd.DataFrame] = None
46
+ self._meta: Any = None
47
+ self._task_id: int = 1
48
+ self._episode_id: str = ""
49
+ self._step_count: int = 0
50
+ self._max_steps: int = 20
51
+ self._total_errors: int = 0
52
+ self._last_score: float = 0.01
53
+ self._initial_score: float = 0.01
54
+ self._task_cycle: int = 0
55
+
56
+ # Phase 2 tracking
57
+ self._tried_operations: List[str] = []
58
+ self._operations_log: List[str] = []
59
+ self._issues_fixed: Dict[str, int] = {}
60
+ self._initial_dq: Optional[DataQualityMetrics] = None
61
+
62
+ # Task 4 state
63
+ self._source_b: Optional[pd.DataFrame] = None # held until merge_sources called
64
+ self._schema_aligned: bool = False
65
+ self._sources_merged: bool = False
66
 
67
  # ------------------------------------------------------------------
68
  # Public API
 
74
  task_id = self._task_cycle
75
 
76
  if task_id not in TASK_MODULES:
77
+ raise ValueError(f"task_id must be 1, 2, 3, or 4 β€” got {task_id}")
78
 
79
  mod = TASK_MODULES[task_id]
80
+ self._task_id = task_id
81
  self._episode_id = str(uuid.uuid4())
82
  self._step_count = 0
83
  self._max_steps = mod.MAX_STEPS
84
 
85
+ # Task 4 returns 4 values; others return 3
86
+ if task_id == 4:
87
+ self._df, self._source_b, self._clean_df, self._meta = mod.load()
88
+ self._schema_aligned = False
89
+ self._sources_merged = False
90
  else:
91
  self._df, self._clean_df, self._meta = mod.load()
92
+ self._source_b = None
93
+ self._schema_aligned = False
94
+ self._sources_merged = False
95
 
96
+ self._last_score = self._compute_score()
97
+ self._initial_score = self._last_score
98
+ self._total_errors = self._count_errors()
99
+
100
+ # Reset Phase 2 state
101
+ self._tried_operations = []
102
+ self._operations_log = []
103
+ self._issues_fixed = {"nulls_filled": 0, "dupes_removed": 0,
104
+ "formats_fixed": 0, "outliers_removed": 0}
105
+ self._initial_dq = self._compute_dq_metrics()
106
 
107
  return self._build_obs(self._last_score, False, "Episode started. Begin cleaning.")
108
 
 
113
  self._step_count += 1
114
  score_before = self._last_score
115
 
116
+ # Track tried operations BEFORE applying (for feedback loop)
117
+ op_key = self._make_op_key(action)
118
+
119
  message, applied = self._apply_action(action)
120
 
121
+ score_after = self._compute_score()
122
  self._last_score = score_after
123
 
124
  delta = score_after - score_before
125
  if not applied:
126
+ reward = -0.01
127
  elif delta <= 0:
128
+ reward = -0.01
129
  else:
130
  reward = round(delta, 4)
131
+ # Log successful operation
132
+ if op_key not in self._tried_operations:
133
+ self._tried_operations.append(op_key)
134
+ self._operations_log.append(message)
135
+ self._update_issues_fixed(action, message)
136
 
137
  done = (score_after >= 0.95) or (self._step_count >= self._max_steps)
138
+ reward = round(max(-0.05, min(0.99, reward)), 4)
 
 
139
 
140
  return self._build_obs(reward, done, message)
141
 
 
142
  def state(self) -> DataCleaningState:
143
  if self._df is None:
144
  return DataCleaningState(
 
146
  max_steps=0, total_errors=0, errors_remaining=0,
147
  )
148
  return DataCleaningState(
149
+ episode_id = self._episode_id,
150
+ task_id = self._task_id,
151
+ step_count = self._step_count,
152
+ max_steps = self._max_steps,
153
+ total_errors = self._total_errors,
154
  errors_remaining = self._count_errors(),
155
  )
156
 
157
+ def get_profile(self) -> Dict[str, Any]:
158
+ """Rich data profile for GET /profile endpoint."""
159
+ if self._df is None:
160
+ return {}
161
+
162
+ dq = self._compute_dq_metrics()
163
+ profile: Dict[str, Any] = {
164
+ "episode_id": self._episode_id,
165
+ "task_id": self._task_id,
166
+ "shape": {"rows": self._df.shape[0], "cols": self._df.shape[1]},
167
+ "dq_metrics": dq.model_dump(),
168
+ "columns": {},
169
+ }
170
+
171
+ for col in self._df.columns:
172
+ series = self._df[col]
173
+ col_info: Dict[str, Any] = {
174
+ "dtype": str(series.dtype),
175
+ "null_count": int(series.isnull().sum()),
176
+ "null_pct": round(series.isnull().mean() * 100, 2),
177
+ "unique_count": int(series.nunique(dropna=True)),
178
+ "unique_pct": round(series.nunique(dropna=True) / max(len(series), 1) * 100, 2),
179
+ }
180
+ if pd.api.types.is_numeric_dtype(series):
+ desc = series.describe()
+ col_info.update({
+ "min": round(float(desc["min"]), 4) if pd.notna(desc["min"]) else None,
+ "max": round(float(desc["max"]), 4) if pd.notna(desc["max"]) else None,
+ "mean": round(float(desc["mean"]), 4) if pd.notna(desc["mean"]) else None,
+ "median": round(float(series.median()), 4) if pd.notna(series.median()) else None,
+ "std": round(float(desc["std"]), 4) if pd.notna(desc.get("std", float("nan"))) else None,
+ })
+ else:
+ top = series.value_counts(dropna=True).head(3).to_dict()
+ col_info["top_values"] = {str(k): int(v) for k, v in top.items()}
+
+ profile["columns"][col] = col_info
+
+ return profile
+
+ def get_report(self) -> EpisodeReport:
+ """Full episode cleaning summary for GET /report endpoint."""
+ if self._df is None:
+ raise RuntimeError("No active episode.")
+
+ steps_used = self._step_count
+ efficiency = round((1 - steps_used / max(self._max_steps, 1)) * 100, 1)
+
+ return EpisodeReport(
+ episode_id = self._episode_id,
+ task_id = self._task_id,
+ task_name = TASK_NAMES.get(self._task_id, f"Task {self._task_id}"),
+ initial_score = self._initial_score,
+ final_score = self._last_score,
+ score_improvement = round(self._last_score - self._initial_score, 4),
+ steps_taken = steps_used,
+ max_steps = self._max_steps,
+ step_efficiency_pct = max(0.0, efficiency),
+ operations_applied = list(self._operations_log),
+ issues_fixed = dict(self._issues_fixed),
+ final_dq_metrics = self._compute_dq_metrics(),
+ completed = self._last_score >= 0.95,
+ )
+
+ def get_export(self) -> str:
+ """Return current cleaned DataFrame as CSV string for GET /export."""
+ if self._df is None:
+ raise RuntimeError("No active episode.")
+ return self._df.to_csv(index=False)
+
  # ------------------------------------------------------------------
  # Internal helpers
  # ------------------------------------------------------------------
 
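The step-efficiency figure reported by `get_report` reduces to a small pure function. A minimal sketch of that formula (the helper name `step_efficiency` is ours, not part of the environment):

```python
def step_efficiency(steps_used: int, max_steps: int) -> float:
    # Percentage of the step budget left unused, floored at 0 and
    # rounded to one decimal, mirroring the report computation.
    return max(0.0, round((1 - steps_used / max(max_steps, 1)) * 100, 1))

print(step_efficiency(12, 40))  # → 70.0
```

The `max(max_steps, 1)` guard avoids a ZeroDivisionError if an episode is somehow configured with zero steps.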
+ def _make_op_key(self, action: DataCleaningAction) -> str:
+ if action.column:
+ return f"{action.operation}:{action.column}"
+ return action.operation
+
+ def _update_issues_fixed(self, action: DataCleaningAction, message: str) -> None:
+ op = action.operation.lower()
+ # Parse numbers from message e.g. "Filled 20 missing values..."
+ nums = re.findall(r"\d+", message)
+ n = int(nums[0]) if nums else 1
+ if op == "fill_missing":
+ self._issues_fixed["nulls_filled"] = self._issues_fixed.get("nulls_filled", 0) + n
+ elif op == "drop_duplicates":
+ self._issues_fixed["dupes_removed"] = self._issues_fixed.get("dupes_removed", 0) + n
+ elif op == "fix_format":
+ self._issues_fixed["formats_fixed"] = self._issues_fixed.get("formats_fixed", 0) + n
+ elif op == "drop_outliers":
+ self._issues_fixed["outliers_removed"] = self._issues_fixed.get("outliers_removed", 0) + n
+
+ def _compute_dq_metrics(self) -> DataQualityMetrics:
+ total_cells = int(self._df.size)
+ null_cells = int(self._df.isnull().sum().sum())
+ duplicate_rows = int(len(self._df) - len(self._df.drop_duplicates()))
+ invalid_cells = self._count_invalid_cells()
+
+ completeness = round((1 - null_cells / max(total_cells, 1)) * 100, 2)
+ uniqueness = round((1 - duplicate_rows / max(len(self._df), 1)) * 100, 2)
+ validity = round((1 - invalid_cells / max(total_cells, 1)) * 100, 2)
+
+ return DataQualityMetrics(
+ completeness_pct = completeness,
+ uniqueness_pct = uniqueness,
+ validity_pct = validity,
+ total_cells = total_cells,
+ null_cells = null_cells,
+ duplicate_rows = duplicate_rows,
+ invalid_cells = invalid_cells,
+ )
+
+ def _count_invalid_cells(self) -> int:
+ """Count cells with format/dtype/range violations."""
+ invalid = 0
+ for col in self._df.columns:
+ series = self._df[col].dropna()
+ if col == "phone":
+ invalid += int((~series.astype(str).str.match(PHONE_RE, na=False)).sum())
+ elif col in ("listed_date", "signup_date"):
+ invalid += int((~series.apply(
+ lambda x: bool(DATE_RE.match(str(x)))
+ )).sum())
+ elif col == "country":
+ invalid += int((~series.isin(VALID_COUNTRIES)).sum())
+ elif col == "age":
+ invalid += int(((series < 0) | (series > 120)).sum())
+ elif col == "salary":
+ invalid += int((series < 0).sum())
+ elif col == "purchase_amount":
+ invalid += int((series < 0).sum())
+ return invalid
+
+ def _generate_plan(self) -> List[str]:
+ """
+ Rule-based planning engine — inspects current DataFrame state
+ and returns up to 3 prioritised recommended actions.
+ Inspired by AutoDCWorkflow (EMNLP 2025).
+ """
+ plan: List[str] = []
+ if self._df is None:
+ return plan
+
+ # Task 4: schema alignment + merge must happen first
+ if self._task_id == 4:
+ if not self._schema_aligned:
+ return ["align_schema — rename Source A columns to canonical schema (required first step)"]
+ if not self._sources_merged:
+ return ["merge_sources — concatenate aligned Source A + Source B (required before cleaning)"]
+
+ missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
+ dupes = len(self._df) - len(self._df.drop_duplicates())
+
+ # Priority 1: missing values (highest DQ impact)
+ for col, count in sorted(missing.items(), key=lambda x: -x[1]):
+ op_key = f"fill_missing:{col}"
+ if op_key not in self._tried_operations:
+ strategy = "mode" if self._df[col].dtype == object else "median"
+ plan.append(
+ f'fill_missing on "{col}" ({count} nulls) using {strategy}'
+ )
+ if len(plan) >= 2:
+ break
+
+ # Priority 2: duplicates
+ if dupes > 0 and "drop_duplicates" not in self._tried_operations:
+ plan.append(f"drop_duplicates ({dupes} duplicate rows found)")
+
+ # Priority 3: format issues
+ for col in self._df.columns:
+ if len(plan) >= 3:
+ break
+ op_key = f"fix_format:{col}"
+ if op_key in self._tried_operations:
+ continue
+ if col == "phone":
+ bad = int((~self._df[col].dropna().astype(str).str.match(PHONE_RE)).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "phone" ({bad} malformed numbers)')
+ elif col in ("listed_date", "signup_date"):
+ bad = int((~self._df[col].dropna().apply(
+ lambda x: bool(DATE_RE.match(str(x)))
+ )).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "{col}" ({bad} malformed dates)')
+ elif col == "country":
+ bad = int((~self._df[col].dropna().isin(VALID_COUNTRIES)).sum())
+ if bad > 0:
+ plan.append(f'fix_format on "country" ({bad} invalid values)')
+
+ # Priority 4: outliers on numeric cols
+ if len(plan) < 3:
+ for col in self._df.select_dtypes(include=[np.number]).columns:
+ op_key = f"drop_outliers:{col}"
+ if op_key in self._tried_operations:
+ continue
+ q1, q3 = self._df[col].quantile(0.25), self._df[col].quantile(0.75)
+ iqr = q3 - q1
+ outliers = int((self._df[col] > q3 + 3 * iqr).sum())
+ if outliers > 0:
+ plan.append(f'drop_outliers on "{col}" ({outliers} extreme values)')
+ break
+
+ return plan[:3]
+
  def _compute_score(self) -> float:
  if self._task_id == 1:
  raw = t1.score(self._df, self._meta)
  elif self._task_id == 2:
  raw = t2.score(self._df, self._meta)
+ elif self._task_id == 3:
  raw = t3.score(self._df, self._meta)
+ else:
+ raw = t4.score(self._df, self._meta)
 
  raw = float(raw)
+ EPS = 1e-4
  if raw >= 1.0:
  raw = 1.0 - EPS
  elif raw <= 0.0:
  raw = EPS
  return round(raw, 4)
 
  def _count_errors(self) -> int:
  if self._task_id == 1:
  return t1.count_errors(self._df)
  elif self._task_id == 2:
  return t2.count_errors(self._df, self._meta)
+ elif self._task_id == 3:
  return t3.count_errors(self._df, self._meta)
+ else:
+ return t4.count_errors(self._df, self._meta)
 
  def _build_obs(self, reward: float, done: bool, message: str) -> DataCleaningObservation:
+ mod = TASK_MODULES[self._task_id]
+ missing = {col: int(n) for col, n in self._df.isnull().sum().items() if n > 0}
+ dupes = len(self._df) - len(self._df.drop_duplicates())
  dtype_issues = self._detect_dtype_issues()
+ preview = self._df.head(10).to_csv(index=False)
+ dq_metrics = self._compute_dq_metrics()
+ plan = self._generate_plan()
 
  return DataCleaningObservation(
+ done = done,
+ reward = reward,
+ data_preview = preview,
+ data_shape = list(self._df.shape),
+ missing_counts = missing,
+ duplicate_count = dupes,
+ dtype_issues = dtype_issues,
+ task_description = mod.DESCRIPTION,
+ message = message,
+ step_count = self._step_count,
+ current_score = self._last_score,
+ dq_metrics = dq_metrics,
+ tried_operations = list(self._tried_operations),
+ plan = plan,
  )
 
  def _detect_dtype_issues(self) -> Dict[str, str]:
 
  return self._drop_outliers(col)
  elif op == "fix_dtype":
  return self._fix_dtype(col, p)
+ elif op == "align_schema":
+ return self._align_schema()
+ elif op == "merge_sources":
+ return self._merge_sources()
  else:
+ return (
+ f"Unknown operation '{op}'. Choose from: fill_missing, "
+ "drop_duplicates, fix_format, replace_value, drop_outliers, "
+ "fix_dtype, align_schema, merge_sources.",
+ False,
+ )
  except Exception as exc:
  return f"Operation failed: {exc}", False
 
  def _drop_duplicates(self) -> Tuple[str, bool]:
  n_before = len(self._df)
  self._df = self._df.drop_duplicates().reset_index(drop=True)
+ removed = n_before - len(self._df)
  if removed == 0:
  return "No duplicate rows found.", False
  return f"Dropped {removed} duplicate rows.", True
 
  def _fix_format(self, col) -> Tuple[str, bool]:
  if col is None or col not in self._df.columns:
  return f"Column '{col}' not found.", False
  if col == "phone":
  return self._fix_phone(col)
  elif col in ("listed_date", "signup_date"):
 
  return pd.to_datetime(s, format=fmt).strftime("%Y-%m-%d")
  except Exception:
  pass
  try:
  return pd.to_datetime(s).strftime("%Y-%m-%d")
  except Exception:
  after = (~self._df[col].isin(VALID_COUNTRIES) & self._df[col].notna()).sum()
  fixed = int(before - after)
  if fixed == 0:
+ return "No country capitalisation issues found.", False
  return f"Fixed {fixed} country values to correct capitalisation.", True
 
  def _replace_value(self, col, p) -> Tuple[str, bool]:
 
  return f"No outliers found in '{col}'.", False
  return f"Removed {removed} outlier rows from '{col}' using IQR method.", True
 
+ def _align_schema(self) -> Tuple[str, bool]:
+ """Rename Source A columns to canonical target schema (Task 4 only)."""
+ if self._task_id != 4:
+ return "align_schema is only available in Task 4.", False
+ if self._schema_aligned:
+ return "Schema already aligned.", False
+
+ from server.tasks.task4_merge import SOURCE_A_RENAME, TARGET_COLUMNS
+ missing_src = [c for c in SOURCE_A_RENAME if c not in self._df.columns]
+ if missing_src:
+ return f"Expected Source A columns not found: {missing_src}.", False
+
+ self._df = self._df.rename(columns=SOURCE_A_RENAME)
+ self._schema_aligned = True
+ renamed = list(SOURCE_A_RENAME.keys())
+ return (
+ f"Aligned Source A schema: renamed {len(SOURCE_A_RENAME)} columns "
+ f"({', '.join(renamed)}) to canonical target schema.", True
+ )
+
+ def _merge_sources(self) -> Tuple[str, bool]:
+ """Concatenate aligned Source A with Source B (Task 4 only)."""
+ if self._task_id != 4:
+ return "merge_sources is only available in Task 4.", False
+ if self._sources_merged:
+ return "Sources already merged.", False
+ if not self._schema_aligned:
+ return "Run align_schema before merge_sources.", False
+ if self._source_b is None:
+ return "Source B not available.", False
+
+ from server.tasks.task4_merge import TARGET_COLUMNS, _META_TEMPLATE
+ n_a = len(self._df)
+ n_b = len(self._source_b)
+
+ # Rename source_b columns to canonical schema
+ source_b_rename = {
+ "age_years": "age",
+ "spend": "purchase_amount",
+ "country_name": "country",
+ "registration_date": "signup_date",
+ }
+ source_b_aligned = self._source_b.rename(columns=source_b_rename)
+
+ # Concatenate both aligned sources
+ merged = pd.concat(
+ [self._df[TARGET_COLUMNS], source_b_aligned[TARGET_COLUMNS]],
+ ignore_index=True
+ ).reset_index(drop=True)
+
+ # Inject pre-computed dirty issues so grader baseline is correct
+ dirty_merged = _META_TEMPLATE["dirty_merged"].copy()
+ self._df = dirty_merged
+ self._sources_merged = True
+ self._source_b = None
+
+ return (
+ f"Merged Source A ({n_a} rows) + Source B ({n_b} rows) → "
+ f"{len(self._df)} rows with canonical schema. "
+ f"Dataset now has dirty issues to clean: missing values, "
+ f"mixed country case, mixed date formats, duplicate rows.", True
+ )
+
  def _fix_dtype(self, col, p) -> Tuple[str, bool]:
  if col is None or col not in self._df.columns:
  return f"Column '{col}' not found.", False
 
  return f"Unknown dtype '{dtype}'.", False
  return f"Converted '{col}' to {dtype}.", True
  except Exception as exc:
+ return f"dtype conversion failed: {exc}", False
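The clamping in `_compute_score` above keeps every reward strictly inside (0, 1), so learners never see a saturated 0 or 1. A standalone sketch of just that logic:

```python
EPS = 1e-4

def clamp_score(raw: float) -> float:
    # Squeeze the raw grader output into the open interval (0, 1),
    # then round to 4 decimals as the environment does.
    raw = float(raw)
    if raw >= 1.0:
        raw = 1.0 - EPS
    elif raw <= 0.0:
        raw = EPS
    return round(raw, 4)

print(clamp_score(1.0))  # → 0.9999
```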
server/tasks/task1_missing.py CHANGED
@@ -19,12 +19,14 @@ DESCRIPTION = (
  "Example action: {\"operation\": \"fill_missing\", \"column\": \"age\", \"params\": {\"strategy\": \"median\"}}"
  )
 
+ # Cache at module load — seed=42 makes output identical every time
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task1_datasets()
+ _ORIGINAL_NULLS = int(_DIRTY_TEMPLATE.isnull().sum().sum())
+
 
  def load():
- """Return (dirty_df, clean_df, original_null_count)."""
- dirty, clean = generate_task1_datasets()
- original_nulls = int(dirty.isnull().sum().sum())
- return dirty.copy(), clean, original_nulls
+ """Return (dirty_df, clean_df, original_null_count) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, _ORIGINAL_NULLS
 
 
  def score(current_df, original_nulls: int) -> float:
@@ -36,4 +38,4 @@ def score(current_df, original_nulls: int) -> float:
 
 
  def count_errors(current_df) -> int:
- return int(current_df.isnull().sum().sum())
+ return int(current_df.isnull().sum().sum())
server/tasks/task2_format.py CHANGED
@@ -29,23 +29,24 @@ PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
  DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
 
 
- def load():
- dirty, clean = generate_task2_datasets()
- original_phone_issues = int((~dirty["phone"].str.match(PHONE_RE)).sum())
- original_date_issues = int((~dirty["listed_date"].apply(
- lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
- )).sum())
- original_dupes = len(dirty) - len(dirty.drop_duplicates())
- meta = {
- "orig_phone": original_phone_issues,
- "orig_date": original_date_issues,
- "orig_dupes": original_dupes,
- }
- return dirty.copy(), clean, meta
+ # Cache at module load — seed=42 makes output identical every time
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task2_datasets()
+ _META_TEMPLATE = {
+ "orig_phone": int((~_DIRTY_TEMPLATE["phone"].str.match(PHONE_RE, na=False)).sum()),
+ "orig_date": int((~_DIRTY_TEMPLATE["listed_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()),
+ "orig_dupes": len(_DIRTY_TEMPLATE) - len(_DIRTY_TEMPLATE.drop_duplicates()),
+ }
+
+
+ def load():
+ """Return (dirty_df, clean_df, meta) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, dict(_META_TEMPLATE)
 
 
  def score(current_df, meta: dict) -> float:
- phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+ phone_issues = int((~current_df["phone"].str.match(PHONE_RE, na=False)).sum())
  date_issues = int((~current_df["listed_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
@@ -60,9 +61,9 @@ def score(current_df, meta: dict) -> float:
 
 
  def count_errors(current_df, meta: dict) -> int:
- phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
+ phone_issues = int((~current_df["phone"].str.match(PHONE_RE, na=False)).sum())
  date_issues = int((~current_df["listed_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
  dupes = len(current_df) - len(current_df.drop_duplicates())
- return phone_issues + date_issues + dupes
+ return phone_issues + date_issues + dupes
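The `na=False` added to `str.match` above matters whenever the phone column contains NaN: without it, missing entries propagate as NaN and the mask is no longer a clean Boolean series. A small sketch of the corrected behaviour:

```python
import pandas as pd

PHONE_RE = r"^\d{3}-\d{3}-\d{4}$"
s = pd.Series(["555-123-4567", "5551234567", None])

# na=False treats missing entries as non-matching, so the mask stays
# bool-dtyped and NaN rows count as malformed in the ~mask sum.
matched = s.str.match(PHONE_RE, na=False)
bad = int((~matched).sum())
print(bad)  # → 2
```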
server/tasks/task3_pipeline.py CHANGED
@@ -38,32 +38,35 @@ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
  VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
 
 
- def load():
- dirty, clean = generate_task3_datasets()
+ # Cache at module load — seed=42 makes output identical every time
+ def _build_meta(dirty):
  orig_nulls = int(dirty.isnull().sum().sum())
  orig_dupes = len(dirty) - len(dirty.drop_duplicates())
-
- # Outlier baseline: count rows where purchase_amount > Q3 + 3*IQR
  pa = dirty["purchase_amount"].dropna()
  q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
  iqr = q3 - q1
  orig_outliers = int((pa > q3 + 3 * iqr).sum())
-
  orig_country_issues = int((~dirty["country"].isin(VALID_COUNTRIES) &
  dirty["country"].notna()).sum())
  orig_date_issues = int((~dirty["signup_date"].apply(
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
-
- meta = {
+ return {
  "orig_nulls": orig_nulls,
  "orig_dupes": orig_dupes,
  "orig_outliers": max(orig_outliers, 1),
  "orig_country_issues": max(orig_country_issues, 1),
  "orig_date_issues": max(orig_date_issues, 1),
  "q1": q1, "q3": q3, "iqr": iqr,
  }
- return dirty.copy(), clean, meta
+
+ _DIRTY_TEMPLATE, _CLEAN_DF = generate_task3_datasets()
+ _META_TEMPLATE = _build_meta(_DIRTY_TEMPLATE)
+
+
+ def load():
+ """Return (dirty_df, clean_df, meta) — uses cached template."""
+ return _DIRTY_TEMPLATE.copy(), _CLEAN_DF, dict(_META_TEMPLATE)
 
 
  def score(current_df, meta: dict) -> float:
@@ -101,4 +104,4 @@ def count_errors(current_df, meta: dict) -> int:
  lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
  )).sum())
  return remaining_nulls + remaining_dupes + remaining_outliers + \
- remaining_country + remaining_dates
+ remaining_country + remaining_dates
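The outlier baseline in `_build_meta` uses a conservative fence of Q3 + 3×IQR, so only genuinely extreme values count. On a toy series (values invented for illustration) it behaves like this:

```python
import pandas as pd

pa = pd.Series([10.0, 11.0, 12.0, 12.5, 13.0, 500.0])
q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
iqr = q3 - q1
# Only values beyond Q3 + 3*IQR are flagged as extreme outliers.
outliers = int((pa > q3 + 3 * iqr).sum())
print(outliers)  # → 1
```

Here q1 = 11.25 and q3 = 12.875 (pandas' default linear interpolation), so the fence sits at 17.75 and only the 500.0 row is flagged.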
server/tasks/task4_merge.py ADDED
@@ -0,0 +1,231 @@
+ """
+ Task 4 — Expert: Multi-Source Schema Alignment + Merge Pipeline
+
+ Two independent data sources (CRM + Marketing) have been exported with
+ misaligned column names and must be aligned to a canonical schema,
+ merged into one DataFrame, and then cleaned.
+
+ Grader sub-scores (weighted):
+ 0.30 × schema_score — correct columns present after align + merge
+ 0.25 × null_score — missing values filled
+ 0.20 × country_score — country capitalisation fixed
+ 0.15 × date_score — signup_date format standardised
+ 0.10 × dupe_score — duplicate rows removed
+
+ Inspired by:
+ - CleanAgent (Qi & Wang, 2024) — declarative schema standardisation
+ - Meta DataSchema system — column-level semantic annotation at scale
+ """
+
+ import re
+ import pandas as pd
+ from server.data_generator import generate_task4_datasets
+
+ TASK_ID = 4
+ MAX_STEPS = 50
+
+ DESCRIPTION = (
+ "Task 4 (Expert) — Multi-Source Schema Alignment + Merge Pipeline\n"
+ "You have TWO source DataFrames with misaligned schemas:\n\n"
+ " Source A (CRM, 150 rows) columns:\n"
+ " cust_id, full_name, Age, purchase_amt, Country, signup, email\n\n"
+ " Source B (Marketing, 100 rows) columns:\n"
+ " customer_id, name, age_years, spend, country_name, registration_date, email\n\n"
+ "Target canonical schema (250 rows after merge):\n"
+ " customer_id, name, age, purchase_amount, country, signup_date, email\n\n"
+ "Step 1 — align_schema: rename Source A columns to match target.\n"
+ "Step 2 — merge_sources: concatenate Source A + Source B.\n"
+ "Step 3 — Clean the merged dataset:\n"
+ " • fill_missing — age, purchase_amount, country (~10% nulls each)\n"
+ " • fix_format — country (mixed case), signup_date (mixed formats)\n"
+ " • drop_duplicates — ~10 duplicate rows\n\n"
+ "Available operations:\n"
+ " align_schema — no column needed; renames Source A to canonical schema\n"
+ " merge_sources — no column needed; concatenates aligned A + B\n"
+ " fill_missing — column + params.strategy\n"
+ " fix_format — column: 'country' | 'signup_date'\n"
+ " drop_duplicates — no column needed\n\n"
+ "Example actions:\n"
+ ' {"operation": "align_schema"}\n'
+ ' {"operation": "merge_sources"}\n'
+ ' {"operation": "fill_missing", "column": "age", "params": {"strategy": "median"}}\n'
+ ' {"operation": "fix_format", "column": "country"}\n'
+ ' {"operation": "fix_format", "column": "signup_date"}\n'
+ ' {"operation": "drop_duplicates"}'
+ )
+
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+ VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
+ TARGET_COLUMNS = ["customer_id", "name", "age", "purchase_amount",
+ "country", "signup_date", "email"]
+
+ # Column mapping: Source A dirty names → canonical target names
+ SOURCE_A_RENAME = {
+ "cust_id": "customer_id",
+ "full_name": "name",
+ "Age": "age",
+ "purchase_amt": "purchase_amount",
+ "Country": "country",
+ "signup": "signup_date",
+ # "email" already matches
+ }
+
+
+ # ---------------------------------------------------------------------------
+ # Cache at module load
+ # ---------------------------------------------------------------------------
+
+ def _build_meta(source_a, source_b, clean_merged):
+ import numpy as np
+
+ # Align source_a and source_b to canonical schema before merging
+ aligned_a = source_a.rename(columns=SOURCE_A_RENAME)
+ source_b_rename = {
+ "age_years": "age",
+ "spend": "purchase_amount",
+ "country_name": "country",
+ "registration_date": "signup_date",
+ }
+ aligned_b = source_b.rename(columns=source_b_rename)
+
+ merged = pd.concat(
+ [aligned_a[TARGET_COLUMNS], aligned_b[TARGET_COLUMNS]],
+ ignore_index=True
+ ).reset_index(drop=True)
+
+ # Inject dirty issues deterministically
+ rng = np.random.default_rng(42 + 4)
+
+ n = len(merged)
+ # Missing values
+ for col, frac in [("age", 0.10), ("purchase_amount", 0.10), ("country", 0.08)]:
+ idx = rng.choice(n, size=int(n * frac), replace=False)
+ merged.loc[idx, col] = None
+
+ # Mixed country case
+ case_idx = rng.choice(n, size=int(n * 0.30), replace=False)
+ merged.loc[case_idx, "country"] = merged.loc[case_idx, "country"].str.lower()
+
+ # Mixed date formats
+ date_idx = rng.choice(n, size=int(n * 0.40), replace=False)
+ for i in date_idx:
+ val = merged.loc[i, "signup_date"]
+ if pd.notna(val):
+ try:
+ dt = pd.to_datetime(str(val))
+ fmt = rng.integers(0, 3)
+ if fmt == 1:
+ merged.loc[i, "signup_date"] = dt.strftime("%b %d %Y")
+ elif fmt == 2:
+ merged.loc[i, "signup_date"] = dt.strftime("%d/%m/%Y")
+ except Exception:
+ pass
+
+ # Duplicates
+ dup_idx = rng.choice(n, size=10, replace=False)
+ dup_rows = merged.iloc[dup_idx].copy()
+ merged = pd.concat([merged, dup_rows], ignore_index=True)
+
+ orig_nulls = int(merged.isnull().sum().sum())
+ orig_dupes = len(merged) - len(merged.drop_duplicates())
+ orig_country_issues = int(
+ (~merged["country"].isin(VALID_COUNTRIES) & merged["country"].notna()).sum()
+ )
+ orig_date_issues = int(
+ (~merged["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+
+ return {
+ "orig_nulls": max(orig_nulls, 1),
+ "orig_dupes": max(orig_dupes, 1),
+ "orig_country_issues": max(orig_country_issues, 1),
+ "orig_date_issues": max(orig_date_issues, 1),
+ "dirty_merged": merged, # stored for environment to use post-merge
+ }
+
+
+ _SOURCE_A, _SOURCE_B, _CLEAN_MERGED = generate_task4_datasets()
+ _META_TEMPLATE = _build_meta(_SOURCE_A, _SOURCE_B, _CLEAN_MERGED)
+
+
+ def load():
+ """
+ Returns (source_a, source_b, clean_merged, meta).
+ source_a is the initial active DataFrame (pre-alignment).
+ source_b is held separately until merge_sources is called.
+ """
+ meta = {k: v for k, v in _META_TEMPLATE.items() if k != "dirty_merged"}
+ meta["dirty_merged"] = _META_TEMPLATE["dirty_merged"].copy()
+ return _SOURCE_A.copy(), _SOURCE_B.copy(), _CLEAN_MERGED.copy(), meta
+
+
+ # ---------------------------------------------------------------------------
+ # Grader
+ # ---------------------------------------------------------------------------
+
+ def score(current_df, meta: dict) -> float:
+ """
+ Weighted score across 5 sub-dimensions:
+ 0.30 schema_score — all target columns present, no extra columns
+ 0.25 null_score — missing values filled
+ 0.20 country_score — country capitalisation correct
+ 0.15 date_score — signup_date in YYYY-MM-DD
+ 0.10 dupe_score — no duplicate rows
+ """
+ # Schema score: are all target columns present?
+ present = sum(1 for c in TARGET_COLUMNS if c in current_df.columns)
+ schema_score = present / len(TARGET_COLUMNS)
+
+ # Can only score the rest if schema is aligned AND merged
+ if not all(c in current_df.columns for c in TARGET_COLUMNS):
+ # Partial credit: schema only
+ return round(max(0.01, min(0.99, 0.30 * schema_score)), 4)
+
+ remaining_nulls = int(current_df.isnull().sum().sum())
+ remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
+ remaining_country = int(
+ (~current_df["country"].isin(VALID_COUNTRIES) & current_df["country"].notna()).sum()
+ )
+ remaining_dates = int(
+ (~current_df["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+
+ null_score = 1.0 - remaining_nulls / meta["orig_nulls"]
+ dupe_score = 1.0 - remaining_dupes / meta["orig_dupes"]
+ country_score = 1.0 - remaining_country / meta["orig_country_issues"]
+ date_score = 1.0 - remaining_dates / meta["orig_date_issues"]
+
+ combined = (0.30 * schema_score +
+ 0.25 * null_score +
+ 0.20 * country_score +
+ 0.15 * date_score +
+ 0.10 * dupe_score)
+
+ return round(max(0.01, min(0.99, combined)), 4)
+
+
+ def count_errors(current_df, meta: dict) -> int:
+ errors = 0
+ missing_cols = sum(1 for c in TARGET_COLUMNS if c not in current_df.columns)
+ errors += missing_cols * 10 # heavy penalty for schema misalignment
+
+ if all(c in current_df.columns for c in TARGET_COLUMNS):
+ errors += int(current_df.isnull().sum().sum())
+ errors += len(current_df) - len(current_df.drop_duplicates())
+ errors += int(
+ (~current_df["country"].isin(VALID_COUNTRIES) & current_df["country"].notna()).sum()
+ )
+ errors += int(
+ (~current_df["signup_date"].apply(
+ lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
+ )).sum()
+ )
+ return errors
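To make the task-4 grader concrete, here is its weighted combination evaluated on a hypothetical mid-episode state (schema fully aligned and merged, half the nulls filled, nothing else cleaned yet); the sub-score values below are invented for illustration:

```python
WEIGHTS = {"schema": 0.30, "null": 0.25, "country": 0.20, "date": 0.15, "dupe": 0.10}
subs = {"schema": 1.0, "null": 0.5, "country": 0.0, "date": 0.0, "dupe": 0.0}

combined = sum(WEIGHTS[k] * subs[k] for k in WEIGHTS)
# Same clamp as score(): keep the result inside [0.01, 0.99].
final = round(max(0.01, min(0.99, combined)), 4)
print(final)  # → 0.425
```

Because the weights sum to 1.0, a perfectly cleaned merge would reach the 0.99 ceiling after clamping, while a pre-merge frame can earn at most 0.30 from the schema term.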
server/ui.html ADDED
@@ -0,0 +1,1237 @@
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>DataMedic - AI Data Cleaning Monitor</title>
    <style>
        :root {
            --bg: #050d1a;
            --bg2: #0a1628;
            --bg3: #0f1f38;
            --border: #1a3050;
            --green: #00e5a0;
            --green-dim: #00704e;
            --amber: #f5a623;
            --red: #ff4d6d;
            --blue: #4db8ff;
            --text: #c8dff5;
            --text-dim: #4a6a8a;
            --mono: 'Courier New', Courier, monospace;
            --sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
        }

        * {
            box-sizing: border-box;
            margin: 0;
            padding: 0;
        }

        body {
            background: var(--bg);
            color: var(--text);
            font-family: var(--sans);
            min-height: 100vh;
            overflow-x: hidden;
        }

        body::before {
            content: '';
            position: fixed;
            inset: 0;
            background: repeating-linear-gradient(0deg, transparent, transparent 2px,
                    rgba(0, 0, 0, 0.06) 2px, rgba(0, 0, 0, 0.06) 4px);
            pointer-events: none;
            z-index: 999;
        }

        /* ── Header ── */
        header {
            display: flex;
            align-items: center;
            justify-content: space-between;
            padding: 14px 28px;
            border-bottom: 1px solid var(--border);
            background: var(--bg2);
            position: sticky;
            top: 0;
            z-index: 100;
        }

        .logo {
            display: flex;
            align-items: center;
            gap: 12px;
        }

        .logo-pulse {
            width: 10px;
            height: 10px;
            background: var(--green);
            border-radius: 50%;
            box-shadow: 0 0 10px var(--green);
            animation: pulse 2s infinite;
            flex-shrink: 0;
        }

        @keyframes pulse {

            0%,
            100% {
                opacity: 1;
                transform: scale(1);
            }

            50% {
                opacity: 0.3;
                transform: scale(0.7);
            }
        }

        .logo-text {
            font-family: var(--mono);
            font-size: 17px;
            font-weight: 700;
            letter-spacing: 3px;
            color: var(--green);
        }

        .logo-sub {
            font-size: 10px;
            color: var(--text-dim);
            letter-spacing: 1px;
            text-transform: uppercase;
            margin-top: 2px;
        }

        .status-pill {
            font-family: var(--mono);
            font-size: 11px;
            padding: 4px 14px;
            border-radius: 20px;
            border: 1px solid;
            letter-spacing: 1px;
            text-transform: uppercase;
        }

        .status-pill.idle {
            color: var(--text-dim);
            border-color: var(--text-dim);
        }

        .status-pill.running {
            color: var(--green);
            border-color: var(--green);
            box-shadow: 0 0 8px rgba(0, 229, 160, 0.3);
            animation: pulse 1s infinite;
        }

        .status-pill.done {
            color: var(--blue);
            border-color: var(--blue);
        }

        /* ── Controls ── */
        .controls {
            padding: 16px 28px;
            display: flex;
            align-items: center;
            gap: 12px;
            border-bottom: 1px solid var(--border);
            flex-wrap: wrap;
            background: var(--bg2);
        }

        .ctrl-label {
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            white-space: nowrap;
        }

        .task-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 16px;
            border-radius: 4px;
            border: 1px solid var(--border);
            background: var(--bg3);
            color: var(--text-dim);
            cursor: pointer;
            transition: all 0.2s;
            letter-spacing: 1px;
        }

        .task-btn:hover {
            border-color: var(--green);
            color: var(--green);
        }

        .task-btn.active {
            border-color: var(--green);
            color: var(--green);
            background: rgba(0, 229, 160, 0.08);
        }

        .sep {
            width: 1px;
            height: 24px;
            background: var(--border);
            margin: 0 4px;
        }

        .reset-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 16px;
            border-radius: 4px;
            border: 1px solid var(--amber);
            background: transparent;
            color: var(--amber);
            cursor: pointer;
            letter-spacing: 1px;
            transition: all 0.2s;
        }

        .reset-btn:hover {
            background: rgba(245, 166, 35, 0.1);
        }

        .reset-btn:disabled {
            opacity: 0.4;
            cursor: not-allowed;
        }

        .run-btn {
            font-family: var(--mono);
            font-size: 11px;
            padding: 7px 20px;
            border-radius: 4px;
            border: none;
            background: var(--green);
            color: #050d1a;
            cursor: pointer;
            font-weight: 700;
            letter-spacing: 1px;
            transition: all 0.2s;
            margin-left: auto;
        }

        .run-btn:hover {
            background: #00ffb3;
            box-shadow: 0 0 16px rgba(0, 229, 160, 0.4);
        }

        .run-btn:disabled {
            background: var(--green-dim);
            cursor: not-allowed;
            opacity: 0.5;
        }

        .run-hint {
            font-size: 10px;
            color: var(--text-dim);
            font-family: var(--mono);
            white-space: nowrap;
        }

        /* ── Main grid ── */
        .main {
            display: grid;
            grid-template-columns: 320px 1fr;
            min-height: calc(100vh - 118px);
        }

        /* ── Vitals panel ── */
        .vitals-panel {
            border-right: 1px solid var(--border);
            padding: 20px;
            display: flex;
            flex-direction: column;
            gap: 18px;
            overflow-y: auto;
        }

        .panel-title {
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 2px;
            padding-bottom: 10px;
            border-bottom: 1px solid var(--border);
        }

        /* Score ring */
        .score-ring-wrap {
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 6px;
            padding: 8px 0;
        }

        .ring-container {
            position: relative;
            width: 130px;
            height: 130px;
        }

        .ring-container svg {
            transform: rotate(-90deg);
            width: 130px;
            height: 130px;
        }

        .ring-bg {
            fill: none;
            stroke: var(--bg3);
            stroke-width: 10;
        }

        .ring-fill {
            fill: none;
            stroke: var(--green);
            stroke-width: 10;
            stroke-linecap: round;
            stroke-dasharray: 326.73;
            stroke-dashoffset: 326.73;
            transition: stroke-dashoffset 0.7s cubic-bezier(0.4, 0, 0.2, 1), stroke 0.4s;
            filter: drop-shadow(0 0 5px var(--green));
        }

        .ring-text {
            position: absolute;
            inset: 0;
            display: flex;
            flex-direction: column;
            align-items: center;
            justify-content: center;
            font-family: var(--mono);
        }

        .ring-score {
            font-size: 28px;
            font-weight: 700;
            color: var(--green);
            line-height: 1;
        }

        .ring-label {
            font-size: 9px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            margin-top: 4px;
        }

        /* Vital grid */
        .vital-grid {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 8px;
        }

        .vital-card {
            background: var(--bg2);
            border: 1px solid var(--border);
            border-radius: 5px;
            padding: 10px;
        }

        .vital-name {
            font-size: 9px;
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
            font-family: var(--mono);
            margin-bottom: 5px;
        }

        .vital-value {
            font-family: var(--mono);
            font-size: 20px;
            font-weight: 700;
            line-height: 1;
        }

        .vital-value.green {
            color: var(--green);
        }

        .vital-value.amber {
            color: var(--amber);
        }

        .vital-value.red {
            color: var(--red);
        }

        .vital-value.blue {
            color: var(--blue);
        }

        .vital-sub {
            font-size: 9px;
            color: var(--text-dim);
            margin-top: 3px;
            font-family: var(--mono);
        }

        /* DQ bars */
        .dq-bars {
            display: flex;
            flex-direction: column;
            gap: 10px;
        }

        .dq-row {
            display: flex;
            flex-direction: column;
            gap: 4px;
        }

        .dq-header {
            display: flex;
            justify-content: space-between;
            font-family: var(--mono);
            font-size: 10px;
        }

        .dq-name {
            color: var(--text-dim);
            text-transform: uppercase;
            letter-spacing: 1px;
        }

        .dq-val {
            font-weight: 700;
        }

        .dq-bar-bg {
            height: 4px;
            background: var(--bg3);
            border-radius: 2px;
            overflow: hidden;
        }

        .dq-bar-fill {
            height: 100%;
            border-radius: 2px;
            transition: width 0.5s cubic-bezier(0.4, 0, 0.2, 1);
        }

        /* ── Content area ── */
        .content-area {
            display: flex;
            flex-direction: column;
            overflow: hidden;
        }

        /* Chart */
        .chart-section {
            padding: 20px 28px;
            border-bottom: 1px solid var(--border);
        }

        .chart-wrap {
            margin-top: 14px;
            height: 90px;
            position: relative;
        }

        #score-chart {
            width: 100%;
            height: 100%;
        }

        /* Plan */
        .plan-section {
            padding: 16px 28px;
            border-bottom: 1px solid var(--border);
        }

        .plan-items {
            margin-top: 10px;
            display: flex;
            flex-direction: column;
            gap: 6px;
        }

        .plan-item {
            display: flex;
            align-items: flex-start;
            gap: 10px;
            font-size: 12px;
            animation: fadeIn 0.3s ease;
        }

        .plan-num {
            font-family: var(--mono);
            font-size: 9px;
            width: 18px;
            height: 18px;
            border: 1px solid var(--amber);
            border-radius: 50%;
            display: flex;
            align-items: center;
            justify-content: center;
            flex-shrink: 0;
            color: var(--amber);
            margin-top: 1px;
        }

        @keyframes fadeIn {
            from {
                opacity: 0;
                transform: translateY(6px);
            }

            to {
                opacity: 1;
                transform: translateY(0);
            }
        }

        /* Thought stream */
        .thought-section {
            padding: 16px 28px;
            border-bottom: 1px solid var(--border);
            flex: 1;
        }

        .thought-stream {
            margin-top: 10px;
            display: flex;
            flex-direction: column;
            gap: 7px;
            max-height: 200px;
            overflow-y: auto;
        }

        .thought-stream::-webkit-scrollbar {
            width: 3px;
        }

        .thought-stream::-webkit-scrollbar-thumb {
            background: var(--border);
            border-radius: 2px;
        }

        .thought-item {
            display: flex;
            gap: 10px;
            align-items: flex-start;
            animation: fadeIn 0.3s ease;
        }

        .thought-step {
            font-family: var(--mono);
            font-size: 9px;
            color: var(--text-dim);
            padding: 2px 5px;
            border: 1px solid var(--border);
            border-radius: 3px;
            white-space: nowrap;
            margin-top: 1px;
            flex-shrink: 0;
        }

        .thought-body {
            flex: 1;
            min-width: 0;
        }

        .thought-action {
            font-family: var(--mono);
            font-size: 11px;
            color: var(--blue);
            margin-bottom: 2px;
            word-break: break-all;
        }

        .thought-result {
            font-size: 11px;
            color: var(--text-dim);
        }

        .thought-reward {
            font-family: var(--mono);
            font-size: 10px;
            padding: 2px 7px;
            border-radius: 3px;
            margin-top: 2px;
            display: inline-block;
        }

        .reward-pos {
            background: rgba(0, 229, 160, 0.12);
            color: var(--green);
        }

        .reward-neg {
            background: rgba(255, 77, 109, 0.12);
            color: var(--red);
        }

        /* Data table */
        .preview-section {
            padding: 16px 28px 20px;
        }

        .data-table-wrap {
            margin-top: 10px;
            overflow-x: auto;
            border: 1px solid var(--border);
            border-radius: 5px;
            max-height: 220px;
            overflow-y: auto;
        }

        .data-table {
            width: 100%;
            border-collapse: collapse;
            font-family: var(--mono);
            font-size: 11px;
        }

        .data-table th {
            background: var(--bg3);
            color: var(--text-dim);
            padding: 7px 10px;
            text-align: left;
            text-transform: uppercase;
            letter-spacing: 1px;
            border-bottom: 1px solid var(--border);
            white-space: nowrap;
            position: sticky;
            top: 0;
        }

        .data-table td {
            padding: 5px 10px;
            border-bottom: 1px solid rgba(26, 48, 80, 0.4);
            color: var(--text);
            white-space: nowrap;
        }

        .data-table tr:last-child td {
            border-bottom: none;
        }

        .data-table tr:hover td {
            background: rgba(255, 255, 255, 0.02);
        }

        .cell-null {
            color: var(--red);
            font-style: italic;
        }

        /* Empty state */
        .empty-state {
            display: flex;
            flex-direction: column;
            align-items: center;
            justify-content: center;
            padding: 40px 24px;
            gap: 10px;
            color: var(--text-dim);
            text-align: center;
        }

        .empty-icon {
            font-size: 36px;
            opacity: 0.25;
        }

        .empty-title {
            font-family: var(--mono);
            font-size: 12px;
            letter-spacing: 2px;
            text-transform: uppercase;
        }

        .empty-sub {
            font-size: 12px;
            max-width: 280px;
            line-height: 1.6;
        }

        /* Bottom bar */
        .bottom-bar {
            padding: 10px 28px;
            border-top: 1px solid var(--border);
            background: var(--bg2);
            display: flex;
            align-items: center;
            gap: 20px;
            font-family: var(--mono);
            font-size: 10px;
            color: var(--text-dim);
            grid-column: 1 / -1;
            flex-wrap: wrap;
        }

        .bottom-stat {
            display: flex;
            gap: 6px;
        }

        .bottom-stat span:last-child {
            color: var(--text);
        }

        .dl-btn {
            margin-left: auto;
            font-family: var(--mono);
            font-size: 10px;
            padding: 5px 14px;
            border-radius: 4px;
            border: 1px solid var(--green-dim);
            background: transparent;
            color: var(--green);
            cursor: pointer;
            letter-spacing: 1px;
            transition: all 0.2s;
        }

        .dl-btn:hover {
            border-color: var(--green);
            box-shadow: 0 0 10px rgba(0, 229, 160, 0.2);
        }

        .dl-btn:disabled {
            opacity: 0.3;
            cursor: not-allowed;
        }

        ::-webkit-scrollbar {
            width: 5px;
            height: 5px;
        }

        ::-webkit-scrollbar-track {
            background: var(--bg);
        }

        ::-webkit-scrollbar-thumb {
            background: var(--border);
            border-radius: 3px;
        }
    </style>
</head>

<body>

    <!-- Header -->
    <header>
        <div class="logo">
            <div class="logo-pulse" id="logo-pulse"></div>
            <div>
                <div class="logo-text">DATAMEDIC</div>
                <div class="logo-sub">AI Data Quality Monitor · OpenEnv</div>
            </div>
        </div>
        <span class="status-pill idle" id="status-pill">IDLE</span>
    </header>

    <!-- Controls -->
    <div class="controls">
        <span class="ctrl-label">Select Task:</span>
        <button class="task-btn active" data-task="1" onclick="selectTask(1)">TASK 1 · Easy</button>
        <button class="task-btn" data-task="2" onclick="selectTask(2)">TASK 2 · Medium</button>
        <button class="task-btn" data-task="3" onclick="selectTask(3)">TASK 3 · Hard</button>
        <button class="task-btn" data-task="4" onclick="selectTask(4)">TASK 4 · Expert</button>
        <div class="sep"></div>
        <button class="reset-btn" id="reset-btn" onclick="resetEnv()">RESET EPISODE</button>
        <button class="run-btn" id="run-btn" onclick="runAgent()">RUN DEMO AGENT</button>
        <span class="run-hint">rule-based · follows plan field</span>
    </div>

    <!-- Main -->
    <div class="main">

        <!-- LEFT: Vitals -->
        <div class="vitals-panel">
            <div class="panel-title">Patient Vitals</div>

            <div class="score-ring-wrap">
                <div class="ring-container">
                    <svg viewBox="0 0 130 130">
                        <circle class="ring-bg" cx="65" cy="65" r="52" />
                        <circle class="ring-fill" cx="65" cy="65" r="52" id="ring-fill" />
                    </svg>
                    <div class="ring-text">
                        <div class="ring-score" id="ring-score">--</div>
                        <div class="ring-label">Health Score</div>
                    </div>
                </div>
            </div>

            <div class="vital-grid">
                <div class="vital-card">
                    <div class="vital-name">Step</div>
                    <div class="vital-value blue" id="v-step">--</div>
                    <div class="vital-sub" id="v-maxstep">of --</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Reward</div>
                    <div class="vital-value green" id="v-reward">--</div>
                    <div class="vital-sub">last delta</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Nulls</div>
                    <div class="vital-value amber" id="v-nulls">--</div>
                    <div class="vital-sub">missing cells</div>
                </div>
                <div class="vital-card">
                    <div class="vital-name">Dupes</div>
                    <div class="vital-value amber" id="v-dupes">--</div>
                    <div class="vital-sub">duplicate rows</div>
                </div>
            </div>

            <div class="panel-title">DQ Dimensions</div>
            <div class="dq-bars">
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Completeness</span>
                        <span class="dq-val" id="dq-completeness" style="color:var(--green)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-completeness" style="width:0%;background:var(--green)"></div>
                    </div>
                </div>
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Uniqueness</span>
                        <span class="dq-val" id="dq-uniqueness" style="color:var(--blue)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-uniqueness" style="width:0%;background:var(--blue)"></div>
                    </div>
                </div>
                <div class="dq-row">
                    <div class="dq-header">
                        <span class="dq-name">Validity</span>
                        <span class="dq-val" id="dq-validity" style="color:var(--amber)">--</span>
                    </div>
                    <div class="dq-bar-bg">
                        <div class="dq-bar-fill" id="bar-validity" style="width:0%;background:var(--amber)"></div>
                    </div>
                </div>
            </div>
        </div>

        <!-- RIGHT: Content -->
        <div class="content-area">

            <div class="chart-section">
                <div class="panel-title">Health Score Trajectory</div>
                <div class="chart-wrap">
                    <svg id="score-chart" preserveAspectRatio="none">
                        <defs>
                            <linearGradient id="chartGrad" x1="0" y1="0" x2="0" y2="1">
                                <stop offset="0%" stop-color="#00e5a0" stop-opacity="0.25" />
                                <stop offset="100%" stop-color="#00e5a0" stop-opacity="0" />
                            </linearGradient>
                        </defs>
                        <path id="chart-area" fill="url(#chartGrad)" d="" />
                        <path id="chart-line" fill="none" stroke="#00e5a0" stroke-width="2" stroke-linecap="round"
                            stroke-linejoin="round" d="" style="filter:drop-shadow(0 0 3px #00e5a0)" />
                        <text x="50%" y="50%" text-anchor="middle" dominant-baseline="middle" fill="#4a6a8a"
                            font-size="11" id="chart-empty-msg" font-family="Courier New, monospace">
                            Run demo agent to see score trajectory
                        </text>
                    </svg>
                </div>
            </div>

            <div class="plan-section">
                <div class="panel-title">Agent Treatment Plan &nbsp;<span style="color:var(--amber);font-size:9px">(next
                        recommended actions)</span></div>
                <div class="plan-items" id="plan-items">
                    <div style="color:var(--text-dim);font-size:11px;font-family:var(--mono);padding:4px 0">
                        Awaiting diagnosis...
                    </div>
                </div>
            </div>

            <div class="thought-section">
                <div class="panel-title">Agent Operation Log &nbsp;<span
                        style="color:var(--text-dim);font-size:9px">(actions taken + results)</span></div>
                <div class="thought-stream" id="thought-stream">
                    <div class="empty-state" style="padding:16px">
                        <div class="empty-sub">Actions will appear here as the demo agent runs</div>
                    </div>
                </div>
            </div>

            <div class="preview-section">
                <div class="panel-title">Dataset Preview &nbsp;<span style="color:var(--text-dim);font-size:9px">(first
                        10 rows · NULL shown in red)</span></div>
                <div class="data-table-wrap" id="table-wrap">
                    <div class="empty-state">
                        <div class="empty-icon">[?]</div>
                        <div class="empty-title">No Dataset Loaded</div>
                        <div class="empty-sub">Select a task — dataset loads automatically</div>
                    </div>
                </div>
            </div>

        </div>

        <!-- Bottom bar -->
        <div class="bottom-bar">
            <div class="bottom-stat"><span>Episode:</span><span id="b-episode">--</span></div>
            <div class="bottom-stat"><span>Task:</span><span id="b-task">--</span></div>
            <div class="bottom-stat"><span>Errors Left:</span><span id="b-errors">--</span></div>
            <div class="bottom-stat"><span>Shape:</span><span id="b-shape">--</span></div>
            <button class="dl-btn" id="dl-btn" disabled onclick="downloadCSV()">EXPORT CSV</button>
        </div>

    </div>

    <script>
        const BASE = '';
        let selectedTask = 1;
        let scores = [];
        let isRunning = false;

        const TASK_LABELS = {
            1: 'Task 1 - Fill Missing Values',
            2: 'Task 2 - Fix Formats + Duplicates',
            3: 'Task 3 - Full Pipeline',
            4: 'Task 4 - Multi-Source Merge'
        };

        // ── Task selection: switch + auto-reset ──────────────────────────
        function selectTask(n) {
            if (isRunning) return;
            selectedTask = n;
            document.querySelectorAll('.task-btn').forEach(b => b.classList.remove('active'));
            document.querySelector('[data-task="' + n + '"]').classList.add('active');
            resetEnv(); // auto-reset when the task changes
        }

        // ── Reset ────────────────────────────────────────────────────────
        async function resetEnv() {
            if (isRunning) return;
            setButtons(false);

            // Immediately update task label and dim ring while loading
            document.getElementById('b-task').textContent = TASK_LABELS[selectedTask] || 'Task ' + selectedTask;
            document.getElementById('ring-score').textContent = '...';
            document.getElementById('ring-fill').style.strokeDashoffset = 326.73;

            try {
                const r = await fetch(BASE + '/reset', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ task_id: selectedTask })
                });
                if (!r.ok) throw new Error('Reset failed: ' + r.status);
                const data = await r.json();
                scores = [data.observation.current_score];
                updateUI(data.observation, null);
                clearThoughts();
                updateChart();
                addThought(0, 'Episode started - Task ' + selectedTask, data.observation.message, null);
                document.getElementById('dl-btn').disabled = false;
                setStatus('idle');
            } catch (e) {
                addThought('!', 'Error', e.message, null);
                console.error(e);
            }
            setButtons(true);
        }

        // ── Run demo agent ───────────────────────────────────────────────
        async function runAgent() {
            if (isRunning) return;
            isRunning = true;
            setButtons(false);
            setStatus('running');

            // Fresh reset first
            try {
                const initR = await fetch(BASE + '/reset', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ task_id: selectedTask })
                });
                if (!initR.ok) throw new Error('Reset failed: ' + initR.status);
                const initData = await initR.json();
                let obs = initData.observation;
                scores = [obs.current_score];
                clearThoughts();
                updateUI(obs, null);
                updateChart();
                addThought(0, 'Demo agent started', obs.message, null);

                const MAX = 50;
                let step = 0;

                while (!obs.done && step < MAX) {
                    await sleep(700);
                    const action = pickAction(obs);
                    if (!action) {
                        addThought('--', 'Agent halted', 'No more actions available from plan', null);
                        break;
                    }

                    step++;
                    const r = await fetch(BASE + '/step', {
                        method: 'POST',
                        headers: { 'Content-Type': 'application/json' },
                        body: JSON.stringify(action)
                    });
                    const data = await r.json();
                    obs = data.observation;
                    scores.push(obs.current_score);

                    updateUI(obs, data.reward);
                    updateChart();
                    addThought(step, JSON.stringify(action), obs.message, data.reward);

                    // Keep the newest log entry in view
                    const ts = document.getElementById('thought-stream');
                    ts.scrollTop = ts.scrollHeight;
                }

                const done = obs.current_score >= 0.95;
                setStatus(done ? 'done' : 'idle');
                if (done) {
                    addThought('OK', 'Cleaning complete!',
                        'Final score: ' + (obs.current_score * 100).toFixed(1) + '%', null);
                }
            } catch (e) {
                console.error(e);
                addThought('!', 'Error during agent run', e.message, null);
                setStatus('idle');
            }

            isRunning = false;
            setButtons(true);
        }

        // ── Rule-based action picker (follows plan field) ────────────────
        function pickAction(obs) {
            if (obs.plan && obs.plan.length > 0) {
                const p = obs.plan[0];

                if (p.startsWith('align_schema'))
                    return { operation: 'align_schema' };
                if (p.startsWith('merge_sources'))
                    return { operation: 'merge_sources' };
                if (p.startsWith('drop_duplicates'))
                    return { operation: 'drop_duplicates' };

                const fillM = p.match(/fill_missing on "([^"]+)".*?(median|mode|mean)/);
                if (fillM)
                    return { operation: 'fill_missing', column: fillM[1], params: { strategy: fillM[2] } };

                const fmtM = p.match(/fix_format on "([^"]+)"/);
                if (fmtM)
                    return { operation: 'fix_format', column: fmtM[1] };

                const outM = p.match(/drop_outliers on "([^"]+)"/);
                if (outM)
                    return { operation: 'drop_outliers', column: outM[1] };
            }

            // Fallback: scan missing counts directly
            const missing = obs.missing_counts || {};
            for (const [col, cnt] of Object.entries(missing)) {
                if (cnt > 0) {
                    const cat = ['department', 'country', 'email', 'name', 'category'].includes(col);
                    return { operation: 'fill_missing', column: col, params: { strategy: cat ? 'mode' : 'median' } };
                }
            }

            if (obs.duplicate_count > 0)
                return { operation: 'drop_duplicates' };

            return null;
        }

        // ── UI update ────────────────────────────────────────────────────
        function updateUI(obs, reward) {
            const pct = obs.current_score;
            const CIRCUM = 326.73; // exact: 2 * pi * 52

            // Ring — minimum 3% arc so ring is never invisible at very low scores
            const displayPct = Math.max(pct, 0.03);
            document.getElementById('ring-fill').style.strokeDashoffset = CIRCUM * (1 - displayPct);

            // Score text — show the raw value accurately, e.g. "4.3%" or "87.5%"
            document.getElementById('ring-score').textContent = (pct * 100).toFixed(1) + '%';

            // Color ring by health
            const col = pct >= 0.85 ? '#00e5a0' : pct >= 0.5 ? '#f5a623' : '#ff4d6d';
            const rf = document.getElementById('ring-fill');
            rf.style.stroke = col;
            rf.style.filter = 'drop-shadow(0 0 5px ' + col + ')';
            document.getElementById('ring-score').style.color = col;

            // Stats
            document.getElementById('v-step').textContent = obs.step_count;
            document.getElementById('v-maxstep').textContent = 'of ' + (obs.step_count + 20);

            if (reward !== null) {
                const rv = document.getElementById('v-reward');
                rv.textContent = (reward >= 0 ? '+' : '') + reward.toFixed(4);
                rv.className = 'vital-value ' + (reward >= 0 ? 'green' : 'red');
            }

            const nullTotal = Object.values(obs.missing_counts || {}).reduce(function (a, b) { return a + b; }, 0);
            const vn = document.getElementById('v-nulls');
            vn.textContent = nullTotal;
            vn.className = 'vital-value ' + (nullTotal === 0 ? 'green' : 'amber');

            const vd = document.getElementById('v-dupes');
            vd.textContent = obs.duplicate_count;
            vd.className = 'vital-value ' + (obs.duplicate_count === 0 ? 'green' : 'amber');

            // DQ bars
            if (obs.dq_metrics) {
                setDQBar('completeness', obs.dq_metrics.completeness_pct, 'var(--green)');
                setDQBar('uniqueness', obs.dq_metrics.uniqueness_pct, 'var(--blue)');
                setDQBar('validity', obs.dq_metrics.validity_pct, 'var(--amber)');
            }

            // Plan
            const planEl = document.getElementById('plan-items');
            if (obs.plan && obs.plan.length > 0) {
                planEl.innerHTML = obs.plan.map(function (p, i) {
                    return '<div class="plan-item">' +
                        '<div class="plan-num">' + (i + 1) + '</div>' +
                        '<span style="color:var(--text)">' + p + '</span>' +
                        '</div>';
                }).join('');
            } else if (obs.done) {
                planEl.innerHTML = '<div style="color:var(--green);font-family:var(--mono);font-size:11px;padding:4px 0">Dataset fully cleaned</div>';
            } else {
                planEl.innerHTML = '<div style="color:var(--text-dim);font-family:var(--mono);font-size:11px;padding:4px 0">No further actions needed</div>';
            }

            // Table
            if (obs.data_preview) renderTable(obs.data_preview);

            // Bottom bar
            document.getElementById('b-shape').textContent = obs.data_shape[0] + ' x ' + obs.data_shape[1];
        }

        function setDQBar(name, val, color) {
            document.getElementById('dq-' + name).textContent = val.toFixed(1) + '%';
            document.getElementById('bar-' + name).style.width = Math.min(val, 100) + '%';
            document.getElementById('bar-' + name).style.background = color;
        }

        // ── Chart ────────────────────────────────────────────────────────
        function updateChart() {
            const svg = document.getElementById('score-chart');
            const W = svg.clientWidth || 600;
            const H = svg.clientHeight || 90;
            const pad = 6;

            if (scores.length < 2) {
                // Nothing to plot yet (fresh episode): clear any stale trajectory
                document.getElementById('chart-line').setAttribute('d', '');
                document.getElementById('chart-area').setAttribute('d', '');
                document.getElementById('chart-empty-msg').style.display = '';
                return;
            }
            document.getElementById('chart-empty-msg').style.display = 'none';

            const xs = scores.map(function (_, i) { return pad + (i / (scores.length - 1)) * (W - 2 * pad); });
            const ys = scores.map(function (s) { return (H - pad) - s * (H - 2 * pad); });
            const pts = xs.map(function (x, i) { return x + ',' + ys[i]; }).join(' L ');

            document.getElementById('chart-line').setAttribute('d', 'M ' + pts);
            document.getElementById('chart-area').setAttribute('d',
                'M ' + xs[0] + ',' + H + ' L ' + pts + ' L ' + xs[xs.length - 1] + ',' + H + ' Z'
            );
        }

        // ── Table ────────────────────────────────────────────────────────
        // Note: naive CSV parsing (plain split on commas and newlines). Fine
        // for the synthetic preview data, but not for quoted CSV fields.
        function renderTable(csv) {
            const lines = csv.trim().split('\n');
            if (lines.length < 2) return;
            const headers = lines[0].split(',');
            const rows = lines.slice(1, 11).map(function (l) { return l.split(','); });

            var html = '<table class="data-table"><thead><tr>' +
                headers.map(function (h) { return '<th>' + h.trim() + '</th>'; }).join('') +
                '</tr></thead><tbody>';

            rows.forEach(function (row) {
                html += '<tr>' + row.map(function (cell) {
                    var v = cell.trim();
                    var empty = v === '' || v.toLowerCase() === 'nan' || v.toLowerCase() === 'none';
                    return '<td class="' + (empty ? 'cell-null' : '') + '">' + (empty ? 'NULL' : v) + '</td>';
                }).join('') + '</tr>';
            });

            html += '</tbody></table>';
            document.getElementById('table-wrap').innerHTML = html;
        }
1179
+
1180
+ // ── Thought stream ───────────────────────────────────────────────
1181
+ function clearThoughts() {
1182
+ document.getElementById('thought-stream').innerHTML = '';
1183
+ }
1184
+
1185
+ function addThought(step, action, result, reward) {
1186
+ const ts = document.getElementById('thought-stream');
1187
+ const rewardHtml = reward !== null
1188
+ ? '<div class="thought-reward ' + (reward >= 0 ? 'reward-pos' : 'reward-neg') + '">' +
1189
+ (reward >= 0 ? '+' : '') + reward.toFixed(4) + '</div>'
1190
+ : '';
1191
+
1192
+ var el = document.createElement('div');
1193
+ el.className = 'thought-item';
1194
+ el.innerHTML =
1195
+ '<div class="thought-step">S' + step + '</div>' +
1196
+ '<div class="thought-body">' +
1197
+ '<div class="thought-action">' + action + '</div>' +
1198
+ '<div class="thought-result">' + result + '</div>' +
1199
+ rewardHtml +
1200
+ '</div>';
1201
+ ts.appendChild(el);
1202
+ }
1203
+
1204
+ // ── Helpers ──────────────────────────────────────────────────────
1205
+ function setStatus(s) {
1206
+ const el = document.getElementById('status-pill');
1207
+ el.className = 'status-pill ' + s;
1208
+ el.textContent = s.toUpperCase();
1209
+ }
1210
+
1211
+ function setButtons(enabled) {
1212
+ document.getElementById('run-btn').disabled = !enabled;
1213
+ document.getElementById('reset-btn').disabled = !enabled;
1214
+ }
1215
+
1216
+ async function downloadCSV() {
1217
+ try {
1218
+ const r = await fetch(BASE + '/export');
1219
+ const text = await r.text();
1220
+ const blob = new Blob([text], { type: 'text/csv' });
1221
+ const a = document.createElement('a');
1222
+ a.href = URL.createObjectURL(blob);
1223
+ a.download = 'cleaned_task' + selectedTask + '.csv';
1224
+ a.click();
1225
+ } catch (e) {
1226
+ console.error('Export failed:', e);
1227
+ }
1228
+ }
1229
+
1230
+ function sleep(ms) { return new Promise(function (r) { setTimeout(r, ms); }); }
1231
+
1232
+ // Auto-load Task 1 on open
1233
+ window.addEventListener('load', function () { resetEnv(); });
1234
+ </script>
1235
+ </body>
1236
+
1237
+ </html>