Taniieeee83 committed on
Commit f137938 · 1 Parent(s): 5ededc8

updated README with links

Files changed (1)
  1. README.md +223 -103
README.md CHANGED
@@ -12,70 +12,115 @@ tags:
12
  - data-cleaning
13
  ---
14
 
15
- # Data Cleaning OpenEnv
16
 
- A **real-world data cleaning environment** for AI agent training, built for the Scaler × OpenEnv hackathon.
18
 
19
- An agent interacts with a dirty DataFrame through a simple `reset() / step() / state()` API, learning to fix common data quality issues: missing values, duplicate rows, format inconsistencies, outliers, and dtype errors.
 
 
 
 
20
 
21
  ---
22
 
23
- ## Environment Description
 
 
24
 
25
- Real-world datasets are rarely clean. Data engineers spend a significant fraction of their time:
26
- - Filling missing values with appropriate strategies (median/mean/mode)
27
- - Removing duplicate records
28
- - Standardising inconsistent formats (phone numbers, dates, country names)
29
- - Detecting and removing statistical outliers
30
 
31
- This environment turns those tasks into a reinforcement learning challenge with deterministic, programmatic graders and a meaningful partial-progress reward signal.
 
 
 
32
 
33
  ---
34
 
35
  ## Action Space
36
 
37
- Actions are JSON objects sent to `POST /step`:
 
 
 
 
 
 
 
 
 
38
 
- | `operation` | `column` | `params` | Description |
- |------------------|------------|--------------------------------------------------|-------------------------------------|
- | `fill_missing` | required | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values |
- | `drop_duplicates`| — | — | Remove duplicate rows |
- | `fix_format` | required | — | Standardise phone/date/country col |
- | `replace_value` | required | `{"old": ..., "new": ...}` | Replace a specific value |
- | `drop_outliers` | required | — | Remove IQR outliers in numeric col |
- | `fix_dtype` | required | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
47
 
48
- **Example:**
 
 
 
 
 
 
49
  ```json
50
- {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
 
51
  {"operation": "drop_duplicates"}
52
- {"operation": "fix_format", "column": "signup_date"}
 
 
53
  ```
54
 
55
  ---
56
 
57
  ## Observation Space
58
 
59
- The `POST /step` and `POST /reset` responses return:
60
-
61
  ```json
62
  {
63
  "observation": {
64
  "done": false,
65
- "reward": 0.05,
66
  "data_preview": "name,age,salary,...\n...",
67
  "data_shape": [100, 5],
68
- "missing_counts": {"salary": 18, "age": 20},
69
  "duplicate_count": 0,
70
  "dtype_issues": {},
  "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
72
  "message": "Filled 20 missing values in 'age' using median.",
73
  "step_count": 1,
74
- "current_score": 0.25
75
  },
76
- "reward": 0.05,
77
- "done": false,
78
- "info": {}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
79
  }
80
  ```
81
 
@@ -83,119 +128,194 @@ The `POST /step` and `POST /reset` responses return:
83
 
84
  ## Tasks
85
 
- ### Task 1 — Fill Missing Values (Easy)
- - **Dataset:** 100-row employee records (name, age, salary, department, experience)
- - **Issues:** ~20 % NaN in `age`, `salary`, `department`
- - **Goal:** Fill all missing values
- - **Grader:** `1.0 - remaining_nulls / original_nulls`
- - **Max steps:** 20
- - **Expected baseline score:** ~0.95
-
- ### Task 2 — Fix Formats + Remove Duplicates (Medium)
- - **Dataset:** 200-row product catalog (product_id, price, phone, listed_date, …)
- - **Issues:** Mixed phone formats, mixed date formats, 15 duplicate rows
- - **Goal:** Standardise all formats and remove duplicates
- - **Grader:** `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score`
- - **Max steps:** 30
- - **Expected baseline score:** ~0.80
-
- ### Task 3 — Full Cleaning Pipeline (Hard)
- - **Dataset:** 300-row customer database (name, age, purchase_amount, country, email, signup_date)
- - **Issues:** Missing values (4 cols), 20 duplicates, outliers in `purchase_amount`, mixed country case, mixed date formats
- - **Goal:** Clean all issues end-to-end
- - **Grader:** `0.25 × null + 0.20 × dupe + 0.20 × outlier + 0.175 × country + 0.175 × date`
- - **Max steps:** 40
- - **Expected baseline score:** ~0.70
 
 
 
 
 
 
 
 
 
 
 
 
109
 
110
  ---
111
 
112
  ## Reward Function
113
 
- | Scenario | Reward |
- |----------------------------|------------------------------------|
- | Progress (score improves) | `new_score - old_score` (≥ 0) |
- | No effect | `-0.01` |
- | Invalid operation | `-0.05` |
- | Episode completion (≥0.95) | `delta + 0.20` terminal bonus |
120
 
121
- Rewards are bounded to `[-0.05, 1.2]`. Partial rewards are emitted every step.
122
 
123
  ---
124
 
125
  ## API Endpoints
126
 
- | Method | Path | Description |
- |--------|-----------|-----------------------------------|
- | GET | `/health` | Health check → `{"status":"ok"}` |
- | POST | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
- | POST | `/step` | Execute action. Body: action JSON |
- | POST | `/state` | Get episode state |
- | GET | `/docs` | Interactive Swagger UI |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
134
 
135
  ---
136
 
137
  ## Setup & Usage
138
 
139
- ### Local (Python)
 
 
 
 
 
140
  ```bash
 
 
 
141
  pip install -r requirements.txt
 
 
142
  uvicorn server.app:app --host 0.0.0.0 --port 8000
 
 
 
143
  ```
144
 
145
- ### Docker
146
  ```bash
147
  docker build -t data-cleaning-env .
148
  docker run -p 8000:8000 data-cleaning-env
149
  ```
150
 
151
- ### Run Baseline Inference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
152
  ```bash
153
  export API_BASE_URL="https://api.openai.com/v1"
154
  export MODEL_NAME="gpt-4o-mini"
155
- export HF_TOKEN="your-api-key"
156
  export ENV_URL="http://localhost:8000"
157
 
158
  python inference.py
159
  ```
160
 
161
- ---
162
-
163
- ## Baseline Scores
164
-
- | Task | Difficulty | Score |
- |------|------------|--------|
- | 1 | Easy | 1.000 |
- | 2 | Medium | 1.000 |
- | 3 | Hard | 1.000 |
- | avg | — | 1.000 |
171
 
172
- *(Scores produced by `google/gemma-3-27b-it` via NVIDIA NIM, temperature=0)*
173
 
174
- > Full agent step-by-step logs available in `inference_log.txt`
175
- ---
176
- ## Live Demo
- 🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
 
 
178
 
179
- - Health check: https://srishtichugh-openenv-hack.hf.space/health
180
- - Interactive API docs: https://srishtichugh-openenv-hack.hf.space/docs
181
  ---
182
 
183
  ## Project Structure
184
-
185
  ```
186
  openenv-data-cleaning/
- ├── server/
- │   ├── environment.py        # Core env: reset/step/state + action dispatcher
- │   ├── app.py                # FastAPI HTTP API
- │   ├── data_generator.py     # Synthetic dataset generation (fixed seed=42)
- │   └── tasks/
- │       ├── task1_missing.py  # Task 1: missing values dataset + grader
- │       ├── task2_format.py   # Task 2: format + duplicates dataset + grader
- │       └── task3_pipeline.py # Task 3: full pipeline dataset + grader
- ├── models.py                 # Pydantic models (Action, Observation, State)
- ├── inference.py              # Baseline inference script
- ├── openenv.yaml              # OpenEnv manifest
- ├── Dockerfile
- ├── requirements.txt
- └── README.md
 
201
  ```
 
 
 
 
 
 
 
 
 
 
12
  - data-cleaning
13
  ---
14
 
+ # 🧹 Data Cleaning OpenEnv
16
 
17
+ A **real-world data cleaning environment** for training and evaluating AI agents.
18
 
+ An agent interacts with a dirty pandas DataFrame through a standard `reset() / step() / state()` HTTP API, learning to fix common data quality problems — missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across three progressively harder tasks.
+
+ 🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
+ 📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
+ ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health
24
 
25
  ---
26
 
27
+ ## Environment Description & Motivation
28
+
+ Real-world datasets are almost never clean. Data engineers routinely spend 60–80 % of their time on data cleaning tasks: filling missing values with statistically appropriate strategies, removing duplicates, standardising inconsistent formats (phone numbers, dates, country names), and detecting extreme outliers.
30
 
31
+ This environment turns those tasks into a reinforcement learning challenge with:
 
 
 
 
32
 
+ - **Deterministic, programmatic graders** — ground-truth clean DataFrames are generated with a fixed seed; every reward signal is reproducible.
+ - **Meaningful partial rewards** — every step emits a delta reward proportional to how much of the dataset it cleaned, so the agent receives useful signal throughout the episode rather than only at the end.
+ - **Three difficulty levels** — easy, medium, hard — letting agents learn a curriculum from simple null-filling up to full multi-issue pipelines.
+ - **No external data downloads** — all datasets are generated synthetically via `numpy` + `Faker` with `seed=42`.
37
 
38
  ---
39
 
40
  ## Action Space
41
 
42
+ Actions are JSON objects sent to `POST /step`.
43
+
+ | `operation` | Required `column` | `params` | Description |
+ |---|---|---|---|
+ | `fill_missing` | ✅ | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column |
+ | `drop_duplicates` | ❌ | — | Remove all duplicate rows |
+ | `fix_format` | ✅ | — | Standardise phone/date/country format |
+ | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
+ | `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
+ | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
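
The `drop_outliers` row mentions IQR outliers without stating the fences. A minimal sketch of the usual rule, assuming the conventional 1.5 × IQR multiplier and linear-interpolated quartiles; the helper names `iqr_bounds` and `drop_iqr_outliers` are hypothetical and are not taken from the repository:

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences. k=1.5 is the conventional choice, assumed here."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


amounts = [10, 12, 11, 13, 12, 11, 300]
cleaned = drop_iqr_outliers(amounts)
print(cleaned)
```

The environment itself may compute quartiles differently (for example through pandas), so borderline values could be classified differently there.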
52
 
53
+ **Format rules enforced by `fix_format`:**
 
 
 
 
 
 
 
54
 
+ | Column | Target format |
+ |---|---|
+ | `phone` | `NNN-NNN-NNNN` |
+ | `listed_date` / `signup_date` | `YYYY-MM-DD` |
+ | `country` | Canonical name with standard capitalisation (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
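
To make the target formats concrete, here is a rough sketch of normalisers for the three column types. The accepted input layouts (10-digit numbers with an optional leading `1`, the entries in `DATE_FORMATS`) are assumptions for illustration, since the README does not list the dirty variants, and the function names are hypothetical:

```python
import re
from datetime import datetime

CANONICAL_COUNTRIES = {
    name.lower(): name for name in ("USA", "UK", "Canada", "Australia", "Germany")
}
# Assumed input layouts; ambiguous dates resolve to the first format that parses.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%B %d, %Y")


def normalise_phone(raw):
    """Reformat a 10-digit phone number as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop an assumed country-code prefix
    if len(digits) != 10:
        return raw  # leave values we cannot parse untouched
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalise_date(raw):
    """Reformat a date string as YYYY-MM-DD when one of the assumed layouts matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def normalise_country(raw):
    """Map any capitalisation of a known country to its canonical spelling."""
    return CANONICAL_COUNTRIES.get(raw.strip().lower(), raw)


print(normalise_phone("(555) 123-4567"), normalise_date("12-25-2023"), normalise_country("germany"))
```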
60
+
61
+ **Example actions:**
62
  ```json
63
+ {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
64
+ {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}}
65
  {"operation": "drop_duplicates"}
66
+ {"operation": "fix_format", "column": "phone"}
67
+ {"operation": "fix_format", "column": "signup_date"}
68
+ {"operation": "drop_outliers", "column": "purchase_amount"}
69
  ```
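
An agent can check an action locally before sending it, which avoids spending a step on a malformed request. The sketch below mirrors the operations table; treating the listed `params` keys as mandatory is an assumption (the server may supply defaults), and `validate_action` is a hypothetical helper:

```python
# Which operations need a target column, per the Action Space table.
REQUIRES_COLUMN = {
    "fill_missing": True,
    "drop_duplicates": False,
    "fix_format": True,
    "replace_value": True,
    "drop_outliers": True,
    "fix_dtype": True,
}
# Assumed mandatory params keys; the README only shows their shape.
REQUIRED_PARAMS = {
    "fill_missing": {"strategy"},
    "replace_value": {"old", "new"},
    "fix_dtype": {"dtype"},
}


def validate_action(action):
    """Return a list of problems; an empty list means the action looks well-formed."""
    op = action.get("operation")
    if op not in REQUIRES_COLUMN:
        return [f"unknown operation: {op!r}"]
    problems = []
    if REQUIRES_COLUMN[op] and not action.get("column"):
        problems.append(f"{op} requires a column")
    missing = REQUIRED_PARAMS.get(op, set()) - set(action.get("params") or {})
    if missing:
        problems.append(f"{op} is missing params: {sorted(missing)}")
    return problems


print(validate_action({"operation": "fix_format"}))
```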
70
 
71
  ---
72
 
73
  ## Observation Space
74
 
75
+ Every `POST /reset` and `POST /step` returns:
 
76
  ```json
77
  {
78
  "observation": {
79
  "done": false,
80
+ "reward": 0.40,
81
  "data_preview": "name,age,salary,...\n...",
82
  "data_shape": [100, 5],
+ "missing_counts": {"salary": 20, "department": 10},
  "duplicate_count": 0,
  "dtype_issues": {},
  "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
87
  "message": "Filled 20 missing values in 'age' using median.",
88
  "step_count": 1,
89
+ "current_score": 0.4000
90
  },
91
+ "reward": 0.40,
92
+ "done": false,
93
+ "info": {}
94
+ }
95
+ ```
96
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
+ | `reward` | float | Per-step delta reward (see Reward Function) |
+ | `data_preview` | string | First 10 rows of current DataFrame as CSV |
+ | `data_shape` | [int, int] | Current `[rows, cols]` |
+ | `missing_counts` | object | `{column: null_count}` for columns with NaN |
+ | `duplicate_count` | int | Number of duplicate rows |
+ | `dtype_issues` | object | `{column: issue_description}` for suspected dtype mismatches |
+ | `task_description` | string | Full task instructions with available operations |
+ | `message` | string | Human-readable result of the last action |
+ | `step_count` | int | Steps taken in this episode |
+ | `current_score` | float | Running grader score 0.0 – 1.0 |
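
As an illustration of how a client might consume this payload, the sketch below parses a response body into a small dataclass and restates the termination rule from the `done` row. `StepResult`, `parse_step_response` and `should_be_done` are hypothetical names, not part of the repository's `client.py`:

```python
import json
from dataclasses import dataclass


@dataclass
class StepResult:
    reward: float
    done: bool
    current_score: float
    step_count: int
    missing_counts: dict


def parse_step_response(body):
    """Extract the fields an agent typically needs from a /reset or /step response."""
    payload = json.loads(body)
    obs = payload["observation"]
    return StepResult(
        reward=payload["reward"],
        done=payload["done"],
        current_score=obs["current_score"],
        step_count=obs["step_count"],
        missing_counts=obs.get("missing_counts", {}),
    )


def should_be_done(score, step_count, max_steps, threshold=0.95):
    """Termination rule as described in the field table: score >= 0.95 or step budget spent."""
    return score >= threshold or step_count >= max_steps


sample = json.dumps({
    "observation": {
        "done": False,
        "reward": 0.40,
        "missing_counts": {"salary": 20, "department": 10},
        "step_count": 1,
        "current_score": 0.40,
    },
    "reward": 0.40,
    "done": False,
    "info": {},
})
result = parse_step_response(sample)
print(result.current_score, should_be_done(result.current_score, result.step_count, max_steps=20))
```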
110
+
111
+ ---
112
+
113
+ ## State Space
114
+
115
+ `GET /state` returns episode metadata (does not modify state):
116
+ ```json
117
+ {
118
+ "episode_id": "a8f026a9-...",
119
+ "task_id": 1,
120
+ "step_count": 2,
121
+ "max_steps": 20,
122
+ "total_errors": 50,
123
+ "errors_remaining": 30
124
  }
125
  ```
126
 
 
128
 
129
  ## Tasks
130
 
+ ### Task 1 — Fill Missing Values *(Easy)*
+
+ | Property | Value |
+ |---|---|
+ | Dataset | 100-row employee records (name, age, salary, department, experience) |
+ | Issues | ~20 % NaN in `age`, `salary`; ~10 % NaN in `department` |
+ | Goal | Fill all missing values |
+ | Valid operations | `fill_missing` |
+ | Grader | `1.0 − remaining_nulls / original_nulls` |
+ | Max steps | 20 |
+ | Optimal steps | 3 (one per affected column) |
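
The Task 1 grader formula can be worked through by hand. The sketch below uses the null counts implied by the table (20 + 20 + 10 = 50 on a 100-row dataset); `task1_score` is a hypothetical stand-in for the repository's grader, and the zero-null guard is an assumption:

```python
def task1_score(original_nulls, remaining_nulls):
    """Score = 1.0 - remaining_nulls / original_nulls."""
    if original_nulls == 0:
        return 1.0  # assumed: a dataset with nothing to fix counts as fully clean
    return 1.0 - remaining_nulls / original_nulls


original = {"age": 20, "salary": 20, "department": 10}
after_salary_fill = {"age": 20, "salary": 0, "department": 10}

score = task1_score(sum(original.values()), sum(after_salary_fill.values()))
print(round(score, 4))  # 1 - 30/50
```

Filling the remaining two columns brings the remaining count to zero and the score to 1.0, which is why three steps are optimal.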
142
+
+ ### Task 2 — Fix Formats + Remove Duplicates *(Medium)*
+
+ | Property | Value |
+ |---|---|
+ | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
+ | Issues | ~60 % phone numbers in mixed formats, ~60 % dates in mixed formats, 15 duplicate rows |
+ | Goal | Standardise all phone/date formats and remove duplicates |
+ | Valid operations | `fix_format`, `drop_duplicates` |
+ | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |
+ | Max steps | 30 |
+ | Optimal steps | 3 |
154
+
+ ### Task 3 — Full Cleaning Pipeline *(Hard)*
+
+ | Property | Value |
+ |---|---|
+ | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
+ | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount` (~3× normal), mixed country capitalisation, mixed date formats |
+ | Goal | Fix all issues end-to-end |
+ | Valid operations | All 6 operations |
+ | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
+ | Max steps | 40 |
+ | Optimal steps | 8 |
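
The Task 2 and Task 3 graders are weighted sums of per-issue component scores. A minimal sketch with the weights copied from the tables; how each component score is computed is not specified here, so the inputs below are made-up values in `[0, 1]` and `weighted_score` is a hypothetical helper:

```python
TASK2_WEIGHTS = {"phone": 0.35, "date": 0.35, "dupe": 0.30}
TASK3_WEIGHTS = {"null": 0.25, "dupe": 0.20, "outlier": 0.20, "country": 0.175, "date": 0.175}


def weighted_score(component_scores, weights):
    """Combine per-issue scores (each in [0, 1]) using the grader weights."""
    return sum(weights[name] * component_scores[name] for name in weights)


# Illustrative Task 3 state: everything fixed except the outliers.
components = {"null": 1.0, "dupe": 1.0, "outlier": 0.0, "country": 1.0, "date": 1.0}
task3 = weighted_score(components, TASK3_WEIGHTS)
print(round(task3, 3))
```

In that state the score is 0.8, below the 0.95 completion threshold, so an agent cannot finish Task 3 while skipping an entire issue category.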
166
 
167
  ---
168
 
169
  ## Reward Function
170
 
+ | Scenario | Reward |
+ |---|---|
+ | Score improves (delta > 0) | `new_score − old_score` (positive) |
+ | Operation had no effect | `−0.01` |
+ | Invalid operation / bad column | `−0.05` |
+ | Episode completed (score ≥ 0.95) | `delta + 0.20` terminal bonus |

+ Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every step, giving the agent dense signal throughout the episode.
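
The reward rules can be summarised in a few lines. This is a sketch under stated assumptions rather than the code in `server/environment.py`: `step_reward` is a hypothetical name, and treating a score decrease like a no-op is an assumption, since the table does not cover that case:

```python
def step_reward(old_score, new_score, valid=True, threshold=0.95, bonus=0.20):
    """Per-step reward following the table above."""
    if not valid:
        return -0.05  # invalid operation or bad column
    delta = new_score - old_score
    if delta <= 0:
        return -0.01  # no effect (assumed to also cover regressions)
    reward = delta
    if new_score >= threshold:
        reward += bonus  # terminal bonus on completion
    # Clamp to the documented bounds.
    return max(-0.05, min(1.2, reward))


print(step_reward(0.0, 0.4), step_reward(0.4, 0.4), step_reward(0.8, 1.0))
```

The upper bound of 1.2 corresponds to going from 0.0 to 1.0 in a single step and collecting the 0.20 bonus.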
179
 
180
  ---
181
 
182
  ## API Endpoints
183
 
+ | Method | Path | Description |
+ |---|---|---|
+ | `GET` | `/health` | Health check → `{"status": "healthy"}` |
+ | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
+ | `POST` | `/step` | Execute action. Body: action JSON |
+ | `POST` | `/state` | Get episode metadata |
+ | `GET` | `/metadata` | Environment name, version, task list |
+ | `GET` | `/schema` | Full action / observation / state JSON schemas |
+ | `GET` | `/docs` | Interactive Swagger UI |
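
For callers who prefer not to depend on the bundled client, the `/reset` and `/step` calls can be built with the standard library alone. The sketch below only constructs the requests and does not send them; `build_post` is a hypothetical helper and the base URL assumes the local server from the setup instructions:

```python
import json
from urllib.request import Request


def build_post(base_url, path, payload):
    """Build a JSON POST request for one of the environment endpoints."""
    return Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_post("http://localhost:8000", "/reset", {"task_id": 1})
step_req = build_post("http://localhost:8000/", "/step", {"operation": "drop_duplicates"})

# With a server running, send with urllib.request.urlopen(reset_req) and
# decode the JSON response body.
print(reset_req.get_method(), reset_req.full_url)
```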
193
+
194
+ ---
195
+
196
+ ## Baseline Scores
197
+
+ | Task | Difficulty | Score |
+ |---|---|---|
+ | 1 — Fill Missing Values | Easy | 1.000 |
+ | 2 — Fix Formats + Duplicates | Medium | 1.000 |
+ | 3 — Full Cleaning Pipeline | Hard | 1.000 |
+ | **Average** | — | **1.000** |
204
+
205
+ *Produced by `google/gemma-3-27b-it` via NVIDIA NIM, `temperature=0`. Full step-by-step agent logs: `inference_log.txt`.*
206
 
207
  ---
208
 
209
  ## Setup & Usage
210
 
211
+ ### Prerequisites
212
+
213
+ - Python 3.11+
214
+ - Docker (for containerised deployment)
215
+
+ ### Local — Python
217
  ```bash
218
+ # 1. Clone and install dependencies
219
+ git clone https://github.com/Tanvi51204/openEnv.git
220
+ cd openEnv
221
  pip install -r requirements.txt
222
+
223
+ # 2. Start the server
224
  uvicorn server.app:app --host 0.0.0.0 --port 8000
225
+
+ # 3. Open Swagger UI (`open` is macOS-only; use `xdg-open` on Linux)
227
+ open http://localhost:8000/docs
228
  ```
229
 
+ ### Local — Docker
231
  ```bash
232
  docker build -t data-cleaning-env .
233
  docker run -p 8000:8000 data-cleaning-env
234
  ```
235
 
236
+ ### Quick API test
237
+ ```bash
238
+ # Health
239
+ curl http://localhost:8000/health
240
+
241
+ # Start Task 1
242
+ curl -X POST http://localhost:8000/reset \
243
+ -H "Content-Type: application/json" \
244
+ -d '{"task_id": 1}'
245
+
246
+ # Fill missing values
247
+ curl -X POST http://localhost:8000/step \
248
+ -H "Content-Type: application/json" \
249
+ -d '{"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}'
250
+ ```
251
+
252
+ ### Python client
253
+ ```python
254
+ from client import DataCleaningEnvClient
255
+ from models import DataCleaningAction
256
+
257
+ with DataCleaningEnvClient("http://localhost:8000") as env:
258
+ result = env.reset(task_id=1)
259
+ print(result.observation.missing_counts) # {'age': 20, 'salary': 20, 'department': 10}
260
+
261
+ action = DataCleaningAction(
262
+ operation="fill_missing",
263
+ column="salary",
264
+ params={"strategy": "median"},
265
+ )
266
+ result = env.step(action)
267
+ print(result.observation.current_score) # 0.4
268
+ print(result.reward) # 0.4
269
+ ```
270
+
271
+ ### Run baseline inference
272
  ```bash
273
  export API_BASE_URL="https://api.openai.com/v1"
274
  export MODEL_NAME="gpt-4o-mini"
275
+ export HF_TOKEN="sk-..." # your API key
276
  export ENV_URL="http://localhost:8000"
277
 
278
  python inference.py
279
  ```
280
 
+ Prints `[START]` / `[STEP]` / `[END]` lines to stdout and writes `baseline_scores.json`.
 
 
 
 
 
 
 
 
 
282
 
283
+ ### Environment variables
284
 
+ | Variable | Default | Description |
+ |---|---|---|
+ | `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) |
+ | `MODEL_NAME` | `gpt-4o-mini` | Model identifier |
+ | `HF_TOKEN` | — | API key for LLM calls |
+ | `ENV_URL` | `http://localhost:8000` | Environment server URL |
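
One way a script could read these variables with the documented defaults is sketched below. `InferenceConfig` and `load_config` are hypothetical names, and failing fast when `HF_TOKEN` is missing is an assumption about desirable behaviour, not a description of `inference.py`:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_url: str


def load_config(environ=None):
    """Read settings from the environment, applying the defaults from the table."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN has no default, so fail early (assumed behaviour).
        raise RuntimeError("HF_TOKEN must be set; it has no default")
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://api.openai.com/v1"),
        model_name=env.get("MODEL_NAME", "gpt-4o-mini"),
        hf_token=token,
        env_url=env.get("ENV_URL", "http://localhost:8000"),
    )


config = load_config({"HF_TOKEN": "example-key", "MODEL_NAME": "custom-model"})
print(config.model_name, config.env_url)
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real process environment.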
291
 
 
 
292
  ---
293
 
294
  ## Project Structure
 
295
  ```
296
  openenv-data-cleaning/
+ ├── models.py              Pydantic contracts — Action / Observation / State
+ ├── client.py              Sync HTTP client (reset / step / state / health)
+ ├── inference.py           Baseline LLM agent with [START]/[STEP]/[END] logging
+ ├── openenv.yaml           OpenEnv manifest
+ ├── Dockerfile             python:3.11-slim, non-root user, HEALTHCHECK
+ ├── requirements.txt       pip dependencies
+ ├── pyproject.toml         Python package metadata + openenv-core dependency
+ └── server/
+     ├── app.py             FastAPI routes + /metadata + /schema
+     ├── environment.py     reset / step / state logic + 6 operations + rewards
+     ├── data_generator.py  Synthetic dataset generation (seed=42, reproducible)
+     └── tasks/
+         ├── task1_missing.py   Easy — fill NaN grader
+         ├── task2_format.py    Medium — format + duplicates grader
+         └── task3_pipeline.py  Hard — full pipeline grader
312
  ```
313
+
314
+ ---
315
+
316
+ ## Live Demo
317
+
+ 🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
319
+
320
+ - Health: https://srishtichugh-openenv-hack.hf.space/health
321
+ - Docs: https://srishtichugh-openenv-hack.hf.space/docs