Update README.md
README.md
CHANGED
|
@@ -1,5 +1,4 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
title: Dataops Env
|
| 4 |
emoji: 🧼
|
| 5 |
colorFrom: indigo
|
|
@@ -7,306 +6,372 @@ colorTo: gray
|
|
| 7 |
sdk: docker
|
| 8 |
app_port: 7860
|
| 9 |
pinned: false
|
| 10 |
-
|
| 11 |
---
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
### ⚡ The First Hallucination-Aware Data Cleaning Environment
|
| 16 |
-
|
| 17 |
-
> ❌ Most systems ask: *“Did you fix the data?”*
|
| 18 |
-
> ✅ We ask: *“Did you think before fixing?”*
|
| 19 |
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
# 🚨 THE PROBLEM
|
| 23 |
|
| 24 |
-
*
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
-
* hallucinate corrections
|
| 30 |
-
* ignore contradictions
|
| 31 |
-
* break real-world logic
|
| 32 |
-
|
| 33 |
-
---
|
| 34 |
|
| 35 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
2. Fixes data **only when confident**
|
| 45 |
-
3. Outputs **"cannot determine"** when uncertain
|
| 46 |
-
4. Maintains **cross-record consistency**
|
| 47 |
-
5. Learns through **reward-based feedback**
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
-
|
| 52 |
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
---
|
| 58 |
|
| 59 |
-
#
|
| 60 |
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
```json
|
| 64 |
-
{
|
| 65 |
-
"action_type": "detect_issue | fix_value | cannot_determine | skip",
|
| 66 |
-
"record_id": "string",
|
| 67 |
-
"field": "string",
|
| 68 |
-
"value": "string",
|
| 69 |
-
"confidence": 0.0
|
| 70 |
-
}
|
| 71 |
-
```
|
| 72 |
-
|
| 73 |
-
---
|
| 74 |
|
| 75 |
-
|
|
|
|
|
|
|
|
|
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
---
|
| 80 |
|
| 81 |
-
# 🧠
|
| 82 |
|
| 83 |
-
| Traditional
|
| 84 |
-
|
|
| 85 |
-
|
|
| 86 |
-
|
|
| 87 |
-
|
|
| 88 |
-
|
|
| 89 |
-
|
|
|
|
|
| 90 |
|
| 91 |
-
-
|
| 92 |
-
|
| 93 |
-
# 💰 REWARD SYSTEM
|
| 94 |
|
| 95 |
---
|
| 96 |
|
| 97 |
-
##
|
| 98 |
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 103 |
|
| 104 |
---
|
| 105 |
|
| 106 |
-
##
|
| 107 |
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
---
|
| 114 |
|
| 115 |
-
##
|
| 116 |
|
| 117 |
-
|
| 118 |
|
| 119 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 120 |
|
| 121 |
-
#
|
|
|
|
|
|
|
|
|
|
| 122 |
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
0.5 * normalized_record_score
|
| 126 |
-
+ 0.2 * (1 - hallucination_rate)
|
| 127 |
-
+ 0.15 * uncertainty_accuracy
|
| 128 |
-
+ 0.15 * consistency_score
|
| 129 |
```
|
| 130 |
|
|
|
|
|
|
|
| 131 |
---
|
| 132 |
|
| 133 |
-
#
|
| 134 |
|
| 135 |
-
|
| 136 |
-
| ----------------------- | ---------------------- |
|
| 137 |
-
| 🧠 Hallucination Rate | Wrong invented fixes |
|
| 138 |
-
| ⚖️ Uncertainty Accuracy | Correct abstentions |
|
| 139 |
-
| 🔗 Consistency Score | Cross-record reasoning |
|
| 140 |
|
| 141 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 142 |
|
| 143 |
-
|
| 144 |
-
> ⚡ Each task is carefully designed to evaluate **reasoning, restraint, and reliability** — not just accuracy.
|
| 145 |
|
| 146 |
---
|
| 147 |
|
| 148 |
-
##
|
| 149 |
|
| 150 |
-
|
| 151 |
-
<b>“Can the agent fix obvious issues without breaking anything?”</b>
|
| 152 |
-
</p>
|
| 153 |
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 157 |
|
| 158 |
---
|
| 159 |
|
| 160 |
-
##
|
| 161 |
|
| 162 |
-
|
| 163 |
-
<b>“Can the agent reason across records and handle uncertainty?”</b>
|
| 164 |
-
</p>
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
|
| 170 |
-
|
| 171 |
|
| 172 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 173 |
|
| 174 |
-
|
| 175 |
-
<b>“Can the agent survive contradictions, missing context, and unsolvable data?”</b>
|
| 176 |
-
</p>
|
| 177 |
|
| 178 |
-
*
|
| 179 |
-
|
| 180 |
-
|
| 181 |
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
| 🟡 Medium | Reasoning under ambiguity |
|
| 190 |
-
| 🔴 Hard | Decision-making under uncertainty |
|
| 191 |
|
| 192 |
---
|
| 193 |
|
| 194 |
-
#
|
| 195 |
-
|
| 196 |
-
```json
|
| 197 |
-
{
|
| 198 |
-
"record_id": "T3",
|
| 199 |
-
"error_type": "hallucination",
|
| 200 |
-
"details": "assigned value without evidence",
|
| 201 |
-
"confidence": 0.9
|
| 202 |
-
}
|
| 203 |
-
```
|
| 204 |
|
| 205 |
-
|
| 206 |
|
| 207 |
-
|
| 208 |
|
| 209 |
-
|
| 210 |
|
| 211 |
-
|
|
|
|
|
|
|
| 212 |
|
| 213 |
-
|
| 214 |
-
pip install -r requirements.txt
|
| 215 |
-
```
|
| 216 |
|
| 217 |
---
|
| 218 |
|
| 219 |
-
##
|
| 220 |
|
| 221 |
-
```
|
| 222 |
-
python -m server.app
|
| 223 |
-
```
|
| 224 |
|
| 225 |
-
|
| 226 |
|
| 227 |
-
|
|
|
|
|
|
|
|
|
|
| 228 |
|
| 229 |
-
``
|
| 230 |
-
python inference.py
|
| 231 |
-
```
|
| 232 |
|
| 233 |
---
|
| 234 |
|
| 235 |
-
##
|
| 236 |
|
| 237 |
-
```
|
| 238 |
-
easy → 0.73
|
| 239 |
-
medium → 0.55
|
| 240 |
-
hard → 0.38
|
| 241 |
-
```
|
| 242 |
-
|
| 243 |
-
> ⚠️ Replace with your actual results
|
| 244 |
|
| 245 |
-
--
|
| 246 |
|
| 247 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 248 |
|
| 249 |
-
|
| 250 |
-
| --------- | ----------------- |
|
| 251 |
-
| `/reset` | Start new episode |
|
| 252 |
-
| `/step` | Take action |
|
| 253 |
-
| `/state` | Get current state |
|
| 254 |
-
| `/health` | Health check |
|
| 255 |
|
| 256 |
---
|
| 257 |
|
| 258 |
-
#
|
| 259 |
|
| 260 |
-
``
|
| 261 |
-
docker build -t dataops-gym .
|
| 262 |
-
docker run -p 7860:7860 dataops-gym
|
| 263 |
-
```
|
| 264 |
|
| 265 |
-
|
| 266 |
|
| 267 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 268 |
|
| 269 |
-
|
| 270 |
-
2. Penalize confident mistakes
|
| 271 |
-
3. Avoid over-correction
|
| 272 |
-
4. Enforce cross-record consistency
|
| 273 |
-
5. Reward safe reasoning
|
| 274 |
|
| 275 |
---
|
| 276 |
|
| 277 |
-
#
|
| 278 |
|
| 279 |
-
|
| 280 |
-
| ------ | ----------- |
|
| 281 |
-
| Easy | 0.65 – 0.85 |
|
| 282 |
-
| Medium | 0.45 – 0.65 |
|
| 283 |
-
| Hard | 0.05 – 0.40 |
|
| 284 |
|
| 285 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 286 |
|
| 287 |
-
|
| 288 |
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
* healthcare record validation
|
| 293 |
-
* LLM safety benchmarking
|
| 294 |
|
| 295 |
---
|
| 296 |
|
| 297 |
-
#
|
| 298 |
|
| 299 |
-
|
| 300 |
-
> ⚡ **It’s about knowing when NOT to answer.**
|
| 301 |
|
| 302 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 303 |
|
| 304 |
-
# 🔥 TAGLINE
|
| 305 |
|
| 306 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 307 |
|
| 308 |
-
---
|
| 309 |
|
|
|
|
| 310 |
|
|
|
|
| 311 |
|
|
|
|
| 312 |
|
|
|
|
|
|
|
|
|
| 1 |
---
|
|
|
|
| 2 |
title: Dataops Env
|
| 3 |
emoji: 🧼
|
| 4 |
colorFrom: indigo
|
|
|
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
pinned: false
|
|
|
|
| 9 |
---
|
| 10 |
|
| 11 |
+
<div align="center">
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
+
# 🏋️ DataOps GYM
|
|
|
|
|
|
|
| 14 |
|
| 15 |
+
### *The Benchmark That Punishes Overconfidence — Not Just Wrong Answers*
|
| 16 |
|
| 17 |
+
**A semantic, step-based reinforcement learning environment for evaluating data-cleaning agents on tabular datasets**
|
| 18 |
|
| 19 |
+
<br/>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
|
| 21 |
+
[Python](https://python.org)
|
| 22 |
+
[FastAPI](https://fastapi.tiangolo.com)
|
| 23 |
+
[Pydantic](https://docs.pydantic.dev)
|
| 24 |
+
[Docker](https://docker.com)
|
| 25 |
+
[Hugging Face Spaces](https://huggingface.co/spaces)
|
| 26 |
|
| 27 |
+
<br/>
|
| 28 |
|
| 29 |
+
> **"Any model can clean data. Only a smart one knows when *not* to."**
|
| 30 |
+
>
|
| 31 |
+
> DataOps GYM is an interactive gym environment for training and benchmarking LLM-based data-cleaning agents —
|
| 32 |
+
> with dense per-step rewards, structured action protocols, and deliberate adversarial traps
|
| 33 |
+
> designed to expose hallucination, overcorrection, and overconfidence.
|
| 34 |
+
> **The first benchmark that penalizes an LLM for being too confident about dirty data — not just for being wrong.**
|
| 35 |
|
| 36 |
+
<br/>
|
| 37 |
|
| 38 |
+
</div>
|
|
|
|
|
|
|
|
|
|
|
|
|
| 39 |
|
| 40 |
---
|
| 41 |
|
| 42 |
+
## 📌 Table of Contents
|
| 43 |
|
| 44 |
+
- [Why DataOps GYM Exists](#-why-dataops-gym-exists)
|
| 45 |
+
- [Core Philosophy](#-core-philosophy)
|
| 46 |
+
- [Architecture Overview](#-architecture-overview)
|
| 47 |
+
- [Repository Layout](#-repository-layout)
|
| 48 |
+
- [The Environment Model](#-the-environment-model)
|
| 49 |
+
- [Action Protocol](#-action-protocol)
|
| 50 |
+
- [Task Difficulty Tiers](#-task-difficulty-tiers)
|
| 51 |
+
- [Scoring & Reward System](#-scoring--reward-system)
|
| 52 |
+
- [HTTP API Reference](#-http-api-reference)
|
| 53 |
|
| 54 |
---
|
| 55 |
|
| 56 |
+
## 🔍 Why DataOps GYM Exists
|
| 57 |
|
| 58 |
+
Real-world data pipelines fail silently. Automated cleaners and LLM agents frequently:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
+
- **Hallucinate corrections**: inventing plausible-sounding values with no evidentiary basis
|
| 61 |
+
- **Over-correct valid data**: mistaking unusual-but-correct formats for errors *(e.g., `q.xu+vip@example.com` is a valid plus-address and should be left alone)*
|
| 62 |
+
- **Flatten genuine ambiguity**: making irreversible decisions where `cannot_determine` was the right call
|
| 63 |
+
- **Ignore cross-record consistency**: fixing one row while silently creating a new constraint violation in another
|
| 64 |
|
| 65 |
+
**DataOps GYM was built to measure all of these failure modes simultaneously**, forcing agents to balance **precision, restraint, and consistency** — not just produce a tidy-looking output table.
|
| 66 |
|
| 67 |
---
|
| 68 |
|
| 69 |
+
## 🧠 Core Philosophy
|
| 70 |
|
| 71 |
+
| Traditional Benchmark | DataOps GYM |
|
| 72 |
+
|---|---|
|
| 73 |
+
| Compares final table to ground truth | Evaluates **every step** semantically |
|
| 74 |
+
| Rewards correct fixes | Also **penalizes hallucination** and **rewards appropriate abstention** |
|
| 75 |
+
| Single-pass evaluation | Multi-turn, stateful episode loop |
|
| 76 |
+
| No cross-record awareness | Tracks **consistency across related rows** |
|
| 77 |
+
| Ignores agent confidence | **Confidence calibration** affects reward directly |
|
| 78 |
+
| `cannot_determine` = failure | `cannot_determine` = **first-class correct action** |
|
| 79 |
|
| 80 |
+
> DataOps GYM is purpose-built around the insight that **knowing when not to act is as important as knowing how to act.**
|
|
|
|
|
|
|
| 81 |
|
| 82 |
---
|
| 83 |
|
| 84 |
+
## 🏗 Architecture Overview
|
| 85 |
|
| 86 |
+
```
|
| 87 |
+
┌──────────────────────────────────────────────────────────────────┐
|
| 88 |
+
│ DataOps GYM │
|
| 89 |
+
│ │
|
| 90 |
+
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
|
| 91 |
+
│ │ task.py │─────▶│ env.py │─────▶│ grader.py │ │
|
| 92 |
+
│ │ │ │ │ │ │ │
|
| 93 |
+
│ │ Task │ │ Episode │ │ Per-step │ │
|
| 94 |
+
│ │ Factory │ │ Lifecycle │ │ Reward + │ │
|
| 95 |
+
│ │ 3 tiers │ │ + State │ │ Final Score │ │
|
| 96 |
+
│ │ 2 vars │ │ Tracking │ │ │ │
|
| 97 |
+
│ └──────────┘ └──────────────┘ └──────────────┘ │
|
| 98 |
+
│ │ │
|
| 99 |
+
│ ▼ │
|
| 100 |
+
│ ┌──────────────┐ │
|
| 101 |
+
│ │ models.py │ │
|
| 102 |
+
│ │ Action / │ │
|
| 103 |
+
│ │ Observation │ │
|
| 104 |
+
│ │ (Pydantic) │ │
|
| 105 |
+
│ └──────────────┘ │
|
| 106 |
+
│ │ │
|
| 107 |
+
│ ▼ │
|
| 108 |
+
│ ┌──────────────┐ ┌──────────────────┐ │
|
| 109 |
+
│ │ server/ │◀─────│ inference.py │ │
|
| 110 |
+
│ │ app.py │ │ Reference Agent │ │
|
| 111 |
+
│ │ (FastAPI) │ │ / Evaluator │ │
|
| 112 |
+
│ └──────────────┘ └──────────────────┘ │
|
| 113 |
+
└──────────────────────────────────────────────────────────────────┘
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
Every layer is cleanly separated — the environment knows nothing about the HTTP layer; the grader knows nothing about environment internals. Each component is independently testable and swappable.
|
| 117 |
|
| 118 |
---
|
| 119 |
|
| 120 |
+
## 📁 Repository Layout
|
| 121 |
|
| 122 |
+
```
|
| 123 |
+
DataOps-GYM/
|
| 124 |
+
│
|
| 125 |
+
├── env.py # Core RL environment: reset / step / observe / metrics
|
| 126 |
+
├── task.py # Task factories: easy / medium / hard (2 variants each)
|
| 127 |
+
├── grader.py # Per-step reward math + final task score formula
|
| 128 |
+
├── models.py # Pydantic schemas: Action, Observation
|
| 129 |
+
├── inference.py # Reference baseline agent + evaluator script
|
| 130 |
+
│
|
| 131 |
+
├── server/
|
| 132 |
+
│ └── app.py # FastAPI HTTP server (/reset, /step, /state, /health)
|
| 133 |
+
│
|
| 134 |
+
├── utils/ # Shared helper utilities
|
| 135 |
+
├── .dataops_policy_cache.json # Cached policy artifacts
|
| 136 |
+
│
|
| 137 |
+
├── Dockerfile # Container definition (port 7860, HF Spaces-ready)
|
| 138 |
+
├── .dockerignore
|
| 139 |
+
├── openenv.yaml # HuggingFace Spaces metadata
|
| 140 |
+
├── pyproject.toml # Project metadata & build configuration
|
| 141 |
+
├── requirements.txt # Python dependencies
|
| 142 |
+
└── uv.lock # Reproducible lock file for uv package manager
|
| 143 |
+
```
|
| 144 |
|
| 145 |
---
|
| 146 |
|
| 147 |
+
## ⚙️ The Environment Model
|
| 148 |
|
| 149 |
+
### Episode Lifecycle
|
| 150 |
|
| 151 |
+
Every interaction follows the standard gym pattern:
|
| 152 |
+
|
| 153 |
+
```python
|
| 154 |
+
# 1. Initialize a task episode (easy / medium / hard, seeded for reproducibility)
|
| 155 |
+
obs = env.reset(task_name="hard", seed=42)
|
| 156 |
|
| 157 |
+
# 2. Agent acts step-by-step until done
|
| 158 |
+
done = False  # initialize the loop flag before the first step
while not done:
|
| 159 |
+
    action = agent.decide(obs)
|
| 160 |
+
    obs, reward, done, info = env.step(action)
|
| 161 |
|
| 162 |
+
# 3. Retrieve terminal score in range (0, 1)
|
| 163 |
+
final_score = info["final_task_score"]
|
|
|
|
|
|
|
|
|
|
|
|
|
| 164 |
```
|
| 165 |
|
| 166 |
+
When `task_name` is not fixed, the environment randomly samples a difficulty tier and variant (both seeded), making the benchmark resistant to test-set memorization.
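Seeded sampling of a tier and variant could look roughly like the sketch below. The scenario names come from the tier descriptions in this README; the `sample_task` helper and its use of a local `random.Random` are assumptions for illustration, not the code in `task.py`.

```python
import random

# Scenario names are taken from the tier sections of this README.
# The sampling logic itself is a hypothetical sketch.
SCENARIOS = {
    "easy": ["easy_customer_master", "easy_vendor_onboarding"],
    "medium": ["medium_customer_normalization", "medium_partner_directory"],
    "hard": ["hard_customer_conflicts", "hard_account_merges"],
}


def sample_task(seed, task_name=None):
    """Pick a (tier, scenario) pair deterministically from a seed."""
    rng = random.Random(seed)  # local RNG, global random state is untouched
    tier = task_name if task_name is not None else rng.choice(sorted(SCENARIOS))
    scenario = rng.choice(SCENARIOS[tier])
    return tier, scenario


# The same seed always yields the same tier and scenario.
print(sample_task(seed=42) == sample_task(seed=42))  # prints True
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps episodes reproducible even when other code draws random numbers in between.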
|
| 167 |
+
|
| 168 |
---
|
| 169 |
|
| 170 |
+
### What the Agent Sees — `Observation`
|
| 171 |
|
| 172 |
+
The observation gives the agent everything it needs to reason — without ever revealing the hidden answer key:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 173 |
|
| 174 |
+
| Field | Description |
|
| 175 |
+
|---|---|
|
| 176 |
+
| `dataset.original` | Immutable snapshot of the table at episode start |
|
| 177 |
+
| `dataset.modified` | Current working table reflecting all accepted fixes so far |
|
| 178 |
+
| `action_history` | Full sequence of all past actions taken this episode |
|
| 179 |
+
| `per_record_scores` | Cumulative score contribution per row ID |
|
| 180 |
+
| `current_iteration_score` | Score delta from the most recent step |
|
| 181 |
+
| `previous_iteration_score` | Score delta from the prior step (for trend awareness) |
|
| 182 |
+
| `steps_remaining` | Hard cap on remaining interactions |
|
| 183 |
|
| 184 |
+
> ⚠️ The agent **never** sees `hidden_issues`. All semantic evaluation is performed internally.
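As a rough illustration of the observation shape, the fields in the table can be mirrored with standard-library dataclasses. This is a sketch only: the real schema lives in `models.py` and uses Pydantic, and the row type (a list of dicts) and the `score_trend` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any

Row = dict[str, Any]  # assumed row shape; requires Python 3.9+


@dataclass
class Dataset:
    original: list[Row]  # snapshot at episode start
    modified: list[Row]  # working table after accepted fixes


@dataclass
class Observation:
    dataset: Dataset
    action_history: list[dict[str, Any]] = field(default_factory=list)
    per_record_scores: dict[str, float] = field(default_factory=dict)
    current_iteration_score: float = 0.0
    previous_iteration_score: float = 0.0
    steps_remaining: int = 0


def score_trend(obs):
    """Hypothetical helper: compare the two most recent score deltas."""
    delta = obs.current_iteration_score - obs.previous_iteration_score
    if delta > 0:
        return "improving"
    if delta < 0:
        return "declining"
    return "flat"
```

Note that the sketch has no `hidden_issues` attribute, matching the rule that the answer key never reaches the agent.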
|
|
|
|
| 185 |
|
| 186 |
---
|
| 187 |
|
| 188 |
+
### Hidden Issues — What's Lurking in the Data
|
| 189 |
|
| 190 |
+
Each task defines a set of typed hidden issues the agent must discover and resolve:
|
|
|
|
|
|
|
| 191 |
|
| 192 |
+
| Issue Type | Description | Fixable? |
|
| 193 |
+
|---|---|---|
|
| 194 |
+
| `duplicate` | Two rows represent the same real entity | ❌ Not by `fix_value` alone |
|
| 195 |
+
| `missing_value` | A required field is null | ✅ Yes |
|
| 196 |
+
| `invalid_format` | Email / phone / date doesn't match expected pattern | ✅ Yes |
|
| 197 |
+
| `inconsistent_casing` | Name or city uses wrong casing convention | ✅ Yes |
|
| 198 |
+
| `conflict` | Same customer has contradictory field values across rows | ❌ Irreconcilable |
|
| 199 |
+
| `constraint_violation` | Two distinct rows violate a uniqueness constraint (e.g., same email) | ❌ Requires judgment |
|
| 200 |
+
| `valid_trap` | Row looks suspicious but is actually correct — **do not touch** | N/A |
|
| 201 |
|
| 202 |
---
|
| 203 |
|
| 204 |
+
## 🎮 Action Protocol
|
| 205 |
|
| 206 |
+
Agents interact through a strict, typed JSON protocol validated by Pydantic:
|
|
|
|
|
|
|
| 207 |
|
| 208 |
+
```json
|
| 209 |
+
{
|
| 210 |
+
"action_type": "fix_value",
|
| 211 |
+
"record_id": "C201",
|
| 212 |
+
"field": "email",
|
| 213 |
+
"value": "evan.cole@example.com",
|
| 214 |
+
"confidence": 0.92
|
| 215 |
+
}
|
| 216 |
+
```
|
| 217 |
|
| 218 |
+
### Action Types
|
| 219 |
|
| 220 |
+
| Action | When to Use | Reward Signal |
|
| 221 |
+
|---|---|---|
|
| 222 |
+
| `detect_issue` | Flag a problem without yet resolving it | Low positive — passive identification only |
|
| 223 |
+
| `fix_value` | Apply a concrete correction to a specific field | High positive if correct; severe penalty if hallucinated |
|
| 224 |
+
| `cannot_determine` | Abstain when conflict is genuinely irreconcilable | Rewarded when `fixable: false`; penalized otherwise |
|
| 225 |
+
| `skip` | Explicitly pass on a record/field | Penalized if a real issue existed there |
|
| 226 |
|
| 227 |
+
### Protocol Validation Rules
|
|
|
|
|
|
|
| 228 |
|
| 229 |
+
- `value` is **required** for `fix_value` and **forbidden** for all other action types
|
| 230 |
+
- `record_id` and `field` must be non-empty strings
|
| 231 |
+
- `confidence` must be a float in `[0.0, 1.0]`
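The three rules above can be expressed as a small validator. The project validates actions with Pydantic in `models.py`; the version below is a standard-library stand-in written for illustration, so the class layout and error messages are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"detect_issue", "fix_value", "cannot_determine", "skip"}


@dataclass(frozen=True)
class Action:
    action_type: str
    record_id: str
    field: str
    confidence: float
    value: Optional[str] = None

    def __post_init__(self):
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        # record_id and field must be non-empty strings
        for name in ("record_id", "field"):
            text = getattr(self, name)
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"{name} must be a non-empty string")
        # confidence must lie in [0.0, 1.0]
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        # value is required for fix_value and forbidden otherwise
        if self.action_type == "fix_value" and self.value is None:
            raise ValueError("fix_value requires a value")
        if self.action_type != "fix_value" and self.value is not None:
            raise ValueError("value is only allowed for fix_value")


action = Action("fix_value", "C201", "email", 0.92, "evan.cole@example.com")
```

Rejecting a stray `value` on non-fix actions keeps an agent from smuggling a proposed correction into a `detect_issue` or `skip` step.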
|
| 232 |
|
| 233 |
+
### Behavioral Discipline
|
| 234 |
|
| 235 |
+
The environment enforces **follow-through discipline** across steps:
|
| 236 |
|
| 237 |
+
- After `detect_issue`, the agent must follow up on that same record/field before moving on — or receive a `passive_penalty`
|
| 238 |
+
- Handling a duplicate/conflict pair inconsistently (different strategies for related rows) triggers an `inconsistent_handling` penalty
|
| 239 |
+
- Re-flagging an already-detected issue yields a `repeated_detection` penalty
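One way to picture the follow-through rules is a tracker that remembers the last unresolved detection. This is a simplified sketch, not the logic in `env.py`: the `FollowThroughTracker` class is hypothetical, it tracks a single pending detection, and it leaves out the `inconsistent_handling` check, which needs knowledge of related row pairs.

```python
class FollowThroughTracker:
    """Hypothetical sketch of the detect-then-resolve discipline."""

    def __init__(self):
        self.pending = None    # (record_id, field) detected but not yet resolved
        self.detected = set()  # every (record_id, field) flagged so far

    def check(self, action_type, record_id, field):
        """Return the discipline penalties triggered by this action."""
        target = (record_id, field)
        penalties = []
        # Acting elsewhere while a detection is still open is penalized.
        if self.pending is not None and target != self.pending:
            penalties.append("passive_penalty")
        if action_type == "detect_issue":
            if target in self.detected:
                penalties.append("repeated_detection")
            else:
                self.detected.add(target)
                self.pending = target
        elif target == self.pending:
            # Any non-detect action on the pending target counts as follow-up.
            self.pending = None
        return penalties


tracker = FollowThroughTracker()
tracker.check("detect_issue", "C201", "email")           # no penalty
print(tracker.check("detect_issue", "C201", "email"))    # ['repeated_detection']
```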
|
|
|
|
|
|
|
| 240 |
|
| 241 |
---
|
| 242 |
|
| 243 |
+
## 📊 Task Difficulty Tiers
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 244 |
|
| 245 |
+
### 🟢 Easy — `easy_cleaning_task`
|
| 246 |
|
| 247 |
+
**Scenarios:** `easy_customer_master`, `easy_vendor_onboarding`
|
| 248 |
|
| 249 |
+
**Goal:** Foundational hygiene — deduplicate obvious duplicate rows and fill required missing values without deleting rows just because they are incomplete.
|
| 250 |
|
| 251 |
+
**Issues planted:**
|
| 252 |
+
- Exact duplicate rows (identical across all fields)
|
| 253 |
+
- Missing required values (`city`, `email`)
|
| 254 |
|
| 255 |
+
**Agent strategy:** Detect duplicates → deduplicate → fill missing fields. No traps. No ambiguity.
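A baseline agent for this tier needs little more than the two checks sketched below. The helper names and the sample rows are made up for illustration; only the required fields `city` and `email` come from the task description.

```python
def find_exact_duplicates(rows):
    """Group row indexes whose fields are all identical."""
    groups = {}
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))  # field order should not matter
        groups.setdefault(key, []).append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


def find_missing(rows, required=("city", "email")):
    """List (row index, field) pairs where a required value is absent."""
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name in required
        if row.get(name) in (None, "")
    ]


# Hypothetical sample data, not taken from the real task files.
rows = [
    {"name": "Ana Ruiz", "city": "Lima", "email": "ana.ruiz@example.com"},
    {"name": "Ben Ode", "city": None, "email": "ben.ode@example.com"},
    {"name": "Ana Ruiz", "city": "Lima", "email": "ana.ruiz@example.com"},
]
print(find_exact_duplicates(rows))  # [[0, 2]]
print(find_missing(rows))           # [(1, 'city')]
```

The incomplete row at index 1 is reported as needing a fill, not as a deletion candidate, in line with the goal of not dropping rows just because they are incomplete.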
|
|
|
|
|
|
|
| 256 |
|
| 257 |
---
|
| 258 |
|
| 259 |
+
### 🟡 Medium — `medium_normalization_task`
|
| 260 |
|
| 261 |
+
**Scenarios:** `medium_customer_normalization`, `medium_partner_directory`
|
|
|
|
|
|
|
| 262 |
|
| 263 |
+
**Goal:** Normalize — consistent casing, valid email shapes, deduplication where needed.
|
| 264 |
|
| 265 |
+
**Issues planted:**
|
| 266 |
+
- Duplicate rows
|
| 267 |
+
- Inconsistent casing on `name` and `city` (e.g., `"OMAR HASSAN"` → `"Omar Hassan"`)
|
| 268 |
+
- Invalid email tokens (e.g., `[at]` instead of `@`, missing `@` entirely)
|
| 269 |
|
| 270 |
+
**Agent strategy:** Normalize casing to `title_case`, repair malformed emails, deduplicate. Validators check format correctness, not just non-null values.
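The two normalization steps can be sketched as follows. The function names and the email pattern are assumptions; the validators in the actual tasks may apply stricter or different rules.

```python
import re

# Loose shape check: something@something.something, no spaces.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_casing(value):
    """Title-case each whitespace-separated part of a name or city."""
    return " ".join(part.capitalize() for part in value.split())


def repair_email(value):
    """Replace an '[at]' token; return None if the result is still malformed."""
    candidate = value.strip().replace("[at]", "@")
    return candidate if EMAIL_SHAPE.match(candidate) else None


print(normalize_casing("OMAR HASSAN"))              # Omar Hassan
print(repair_email("omar.hassan[at]example.com"))   # omar.hassan@example.com
print(repair_email("omar.hassanexample.com"))       # None
```

Returning `None` for an address with no `@` at all is deliberate: the split point between local part and domain cannot be recovered without guessing, which is exactly the kind of invented fix the benchmark penalizes.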
|
|
|
|
|
|
|
| 271 |
|
| 272 |
---
|
| 273 |
|
| 274 |
+
### 🔴 Hard — `hard_conflict_resolution_task`
|
| 275 |
|
| 276 |
+
**Scenarios:** `hard_customer_conflicts`, `hard_account_merges`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 277 |
|
| 278 |
+
**Goal:** Multi-way reasoning under adversarial traps: deduplicate, handle irreconcilable conflicts, enforce unique constraints, fix formats, and **leave unusual-looking but valid rows completely untouched**.
|
| 279 |
|
| 280 |
+
**Issues planted:**
|
| 281 |
+
- Exact duplicates
|
| 282 |
+
- **Irreconcilable conflicts** — same customer ID with contradictory `age` values (e.g., `250` vs `45`). Correct answer: `cannot_determine`
|
| 283 |
+
- Invalid email and phone formats
|
| 284 |
+
- **Unique constraint violations** — two distinct customers sharing the same email address
|
| 285 |
+
- **`valid_trap` rows** — rows that look suspicious but are correct:
|
| 286 |
+
- `q.xu+vip@example.com` — a valid RFC-compliant plus-address
|
| 287 |
+
- `A. J. Brown` — a valid abbreviated name
|
| 288 |
|
| 289 |
+
**Agent strategy:** Nuanced multi-step reasoning, cross-record constraint checking, confident abstention, and deliberate non-intervention on valid traps.
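The cross-record checks in this tier might be approached as in the sketch below. The field names `customer_id`, `email`, and `age`, the sample rows, and the email pattern are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Pattern that accepts '+' in the local part, so plus-addresses pass.
EMAIL_SHAPE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    return bool(EMAIL_SHAPE.match(value))


def conflicting_fields(rows, id_field="customer_id"):
    """Map each customer ID to the fields whose values disagree across rows."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row[id_field]].append(row)
    conflicts = {}
    for customer, group in by_customer.items():
        fields = sorted(
            name for name in group[0]
            if name != id_field and len({r.get(name) for r in group}) > 1
        )
        if fields:
            conflicts[customer] = fields
    return conflicts


def shared_emails(rows, id_field="customer_id"):
    """Find emails used by more than one distinct customer ID."""
    owners = defaultdict(set)
    for row in rows:
        owners[row["email"]].add(row[id_field])
    return {email: sorted(ids) for email, ids in owners.items() if len(ids) > 1}


# Hypothetical rows, not taken from the real task files.
rows = [
    {"customer_id": "C7", "email": "q.xu+vip@example.com", "age": 250},
    {"customer_id": "C7", "email": "q.xu+vip@example.com", "age": 45},
    {"customer_id": "C9", "email": "q.xu+vip@example.com", "age": 31},
]
print(is_valid_email("q.xu+vip@example.com"))  # True
print(conflicting_fields(rows))                # {'C7': ['age']}
```

A field that shows up in `conflicting_fields` is a candidate for `cannot_determine` rather than `fix_value`, since nothing in the table says which of the contradictory values is true.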
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 290 |
|
| 291 |
---
|
| 292 |
|
| 293 |
+
## 🏆 Scoring & Reward System
|
| 294 |
|
| 295 |
+
### Per-Step Reward — `grade_step_details`
|
|
|
|
|
|
|
|
|
|
| 296 |
|
| 297 |
+
Each step produces a composite scalar reward (no clamping — scores can go negative):
|
| 298 |
|
| 299 |
+
| Component | Condition | Δ Score |
|
| 300 |
+
|---|---|---|
|
| 301 |
+
| **Classification** | Correct action type for the situation | `+0.10` (`detect_issue`) / `+0.20` (`fix_value` or `cannot_determine`) |
|
| 302 |
+
| **Classification** | Wrong action type | `−0.20` |
|
| 303 |
+
| **Issue Detection** | Correctly identified real issue | `+0.05` (`detect_issue`) / `+0.15` (`fix_value` or `cannot_determine`) |
|
| 304 |
+
| **Issue Detection** | Missed a real issue | `−0.15` |
|
| 305 |
+
| **Issue Detection** | False positive (no issue there) | `−0.05` |
|
| 306 |
+
| **Decision** | Correct fix (passes `validate_fix`) | `+0.25` |
|
| 307 |
+
| **Decision** | Correct `cannot_determine` on non-fixable issue | `+0.25` |
|
| 308 |
+
| **Decision** | Hallucinated fix (no matching issue) | `−0.50` |
|
| 309 |
+
| **Decision** | Wrong fix (fails validation) | `−0.40` |
|
| 310 |
+
| **Decision** | Wrong `cannot_determine` (abstained when fixable) | `−0.20` |
|
| 311 |
+
| **Cross-record Consistency** | Consistent handling of related row pair | `+0.20` |
|
| 312 |
+
| **Cross-record Consistency** | Inconsistent handling of related row pair | `−0.30` |
|
| 313 |
+
| **Confidence Calibration** | confidence > 0.7 AND correct | `+0.05` |
|
| 314 |
+
| **Confidence Calibration** | confidence > 0.7 AND wrong | `−0.10` |
|
| 315 |
+
| **Confident Hallucination** | confidence > 0.8 AND hallucinated fix | `−0.20` (amplifier) |
|
| 316 |
+
| **Resolution Reward** | Previously detected issue now resolved | `+0.15` |
|
| 317 |
+
| **Passive Penalty** | Unresolved detection + off-topic action | `−0.05` |
|
| 318 |
+
| **Overcorrection** | Extra fields modified unintentionally | `−0.05 × N` |
|
| 319 |
+
| **Repeated Detection** | Same issue flagged again | `−0.10` |
|
| 320 |
|
| 321 |
+
> The returned step reward is also adjusted by **±0.1**, depending on whether the sum of `per_record_scores` improved over the previous iteration.
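To see how confidence amplifies a bad fix, here is a small worked example using three deltas from the table. It assumes the listed components simply add up, and it ignores every other row (classification, issue detection, the ±0.1 trend adjustment), so treat the totals as illustrative and not as the exact output of `grade_step_details`.

```python
# Deltas copied from the reward table; the summation rule is an assumption.
HALLUCINATED_FIX = -0.50
CONFIDENT_AND_WRONG = -0.10      # applies when confidence > 0.7 and wrong
CONFIDENT_HALLUCINATION = -0.20  # applies when confidence > 0.8 and hallucinated


def hallucination_penalty(confidence):
    """Sum the hallucination-related deltas for a given confidence."""
    total = HALLUCINATED_FIX
    if confidence > 0.7:
        total += CONFIDENT_AND_WRONG
    if confidence > 0.8:
        total += CONFIDENT_HALLUCINATION
    return round(total, 2)


print(hallucination_penalty(0.50))  # -0.5
print(hallucination_penalty(0.75))  # -0.6
print(hallucination_penalty(0.90))  # -0.8
```

Under these assumptions, the same invented value costs 60% more when submitted at confidence 0.9 than at 0.5.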
|
|
|
|
|
|
|
|
|
|
|
|
|
| 322 |
|
| 323 |
---
|
| 324 |
|
| 325 |
+
### Final Task Score — `grade_task_result`
|
| 326 |
|
| 327 |
+
The terminal score is a weighted composite, guaranteed to lie in the open interval **(0, 1)**:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 328 |
|
| 329 |
+
```
|
| 330 |
+
Final Score = 0.50 × normalized_record_score
|
| 331 |
+
+ 0.20 × (1 − hallucination_rate)
|
| 332 |
+
+ 0.15 × uncertainty_accuracy
|
| 333 |
+
+ 0.15 × consistency_score
|
| 334 |
+
```
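The weighted sum above translates directly into code. The sketch below assumes all four inputs are already normalized to `[0, 1]`; whatever step keeps the real result strictly inside `(0, 1)` is not described here, so it is left out.

```python
def final_task_score(
    normalized_record_score,
    hallucination_rate,
    uncertainty_accuracy,
    consistency_score,
):
    """Weighted composite from the formula; inputs assumed to be in [0, 1]."""
    return (
        0.50 * normalized_record_score
        + 0.20 * (1 - hallucination_rate)
        + 0.15 * uncertainty_accuracy
        + 0.15 * consistency_score
    )


# Hypothetical episode: decent fixes, some hallucination, mixed abstentions.
score = final_task_score(0.6, 0.25, 0.5, 1.0)
print(round(score, 3))  # 0.675
```

Because half of the weight sits outside `normalized_record_score`, an agent that fixes every record but hallucinates freely and never abstains correctly loses a large share of the available score.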
|
| 335 |
+
|
| 336 |
+
| Task | Difficulty | Score |
|
| 337 |
+
|---|---|---|
|
| 338 |
+
| `easy_vendor_onboarding` | 🟢 Easy | `0.73` |
|
| 339 |
+
| `medium_customer_normalization` | 🟡 Medium | `0.40` |
|
| 340 |
+
| `hard_customer_conflicts` | 🔴 Hard | `0.39` |
|
| 341 |
|
| 342 |
+
> Evaluated using `inference.py` with `Qwen/Qwen3-VL-30B-A3B-Instruct` via Novita.
|
| 343 |
|
| 344 |
+
### Failure Telemetry
|
| 345 |
+
|
| 346 |
+
The `task_failure_messages` function surfaces structured, human-readable failure logs from the episode — making it straightforward to diagnose specific agent failure modes during evaluation and iteration.
|
|
|
|
|
|
|
| 347 |
|
| 348 |
---
|
| 349 |
|
| 350 |
+
## 🌐 HTTP API Reference
|
| 351 |
|
| 352 |
+
The FastAPI server exposes a clean REST interface for agent integration:
|
|
|
|
| 353 |
|
| 354 |
+
| Endpoint | Method | Body / Params | Description |
|
| 355 |
+
|---|---|---|---|
|
| 356 |
+
| `/reset` | `POST` | `{ "seed": 42, "task_name": "hard" }` | Start a new episode |
|
| 357 |
+
| `/step` | `POST` | JSON matching `Action` schema | Submit one agent action |
|
| 358 |
+
| `/state` | `GET` | — | Full internal state snapshot (debugging) |
|
| 359 |
+
| `/health` | `GET` | — | Liveness probe |
|
| 360 |
+
| `/docs` | `GET` | — | Interactive Swagger UI |
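A minimal client for these endpoints can be written with the standard library alone. The base URL assumes the server is running locally on the documented port 7860, and the response format is not specified in this README, so the sketch just decodes whatever JSON comes back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server on the documented port


def build_post(path, payload):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post("/reset", {"seed": 42, "task_name": "hard"})
step_request = build_post("/step", {
    "action_type": "fix_value",
    "record_id": "C201",
    "field": "email",
    "value": "evan.cole@example.com",
    "confidence": 0.92,
})


def demo():
    """Run one reset and one step; requires a running server."""
    print(send(reset_request))
    print(send(step_request))
```

Call `demo()` once the server is up, for example after `docker run -p 7860:7860 dataops-gym`.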
|
| 361 |
|
|
|
|
| 362 |
|
| 363 |
+
<div align="center">
|
| 364 |
+
|
| 365 |
+
<br/>
|
| 366 |
+
|
| 367 |
+
**Built to make data-cleaning agents honest — not just accurate.**
|
| 368 |
|
|
|
|
| 369 |
|
| 370 |
+
<br/>
|
| 371 |
|
| 372 |
+
⭐ **Star this repo** if DataOps GYM helped your research or evaluation work!
|
| 373 |
|
| 374 |
+
<br/>
|
| 375 |
|
| 376 |
+
</div>
|
| 377 |
+
|