hellinferno committed
Commit c98afe9 · 1 Parent(s): 35a203e

chore: remove planning/reference files directory to reduce Docker image size
files/00-winning-plan.md DELETED
@@ -1,200 +0,0 @@
- # OpenEnv Hackathon — Winning Plan

  **Participant:** Ravi (Solo)
  **Deadline:** April 12, 2026, 11:59 PM IST
  **Goal:** Top 3,000 out of 20,000 teams → Finale April 25–26

  ---

  ## Chosen Domain: **SQL Query Optimizer Review**

  An environment where an AI agent reviews SQL queries for correctness, performance, and security issues, then suggests fixes. This scores high on real-world utility (30% weight), is novel in OpenEnv, has natural difficulty progression, and produces clear, measurable rewards.

  **Why this wins:**
  - Every engineering team at Meta deals with SQL/data pipelines daily — maximum relevance
  - Clear grading: each query has known issues, and the agent either finds them or doesn't → partial credit is natural
  - Difficulty scales cleanly: syntax errors (easy) → performance anti-patterns (medium) → subtle injection vulnerabilities + schema-aware optimization (hard)
  - Novel domain not seen in existing OpenEnv environments (creativity 10%)
  - Deterministic grading with score variance (agents that find more issues score higher)

  ---

  ## Timeline

  | When | What |
  |---|---|
  | **Apr 10, Morning** | Complete prep modules 1-4 on Colab, watch bootcamp recording |
  | **Apr 10, Afternoon** | Install prerequisites, study sample inference script, study echo env code |
  | **Apr 10, Evening** | Scaffold project with `openenv init`, define Pydantic models, implement core env logic |
  | **Apr 11, Morning** | Implement 3 tasks (easy/medium/hard) with graders and reward functions |
  | **Apr 11, Afternoon** | Write `inference.py`, test locally, iterate on reward shaping |
  | **Apr 11, Evening** | Dockerize, deploy to HF Spaces, run pre-validation script |
  | **Apr 12, Morning** | Write README, final testing, fix issues |
  | **Apr 12, Afternoon** | Final pre-validation, submit |
  | **Apr 12, Before 11:59 PM** | Verify HF Space is live and responding |

  ---

  ## Phase 0: Preparation (Today — First 3 Hours)

  ### Step 1: Complete Prep Course Modules
  - Module 1: Interface basics (`reset()`, `step()`, `state()`)
  - Module 2: Using existing environments, typed models
  - Module 3: Deployment to HF Spaces with `openenv push`
  - Module 4: **Building your own environment** — most critical, take detailed notes

  ### Step 2: Watch Bootcamp Recording
  - Note tips from Ben Burtenshaw (HF) and Pulkit Aneja about what judges look for

  ### Step 3: Install Prerequisites
  ```bash
  pip install openenv-core huggingface_hub openai pydantic
  pip install docker  # or ensure Docker Desktop is running
  huggingface-cli login
  ```

  ### Step 4: Study the Sample Inference Script
  - Memorize the `[START]`, `[STEP]`, `[END]` stdout format
  - Any deviation in field names/ordering = incorrect evaluation scoring

  ### Step 5: Study Existing Environments
  - Clone `https://github.com/meta-pytorch/OpenEnv`
  - Study the `envs/echo_env/` structure: models.py, client.py, server/environment.py, server/app.py, server/Dockerfile

  ---

  ## Phase 1: Build the Environment

  ### Project Structure
  ```
  sql-query-reviewer/
  ├── openenv.yaml
  ├── models.py              # Action, Observation, State Pydantic models
  ├── client.py              # EnvClient subclass
  ├── inference.py           # Baseline inference script (root!)
  ├── README.md
  ├── tasks/
  │   ├── easy_tasks.json    # Syntax error queries
  │   ├── medium_tasks.json  # Performance anti-pattern queries
  │   └── hard_tasks.json    # Security + schema-aware optimization queries
  └── server/
      ├── environment.py     # Core environment logic
      ├── grader.py          # Deterministic grading functions
      ├── app.py             # FastAPI server
      ├── Dockerfile
      └── requirements.txt
  ```

  ### Pydantic Models Design

  **Observation:**
  - `query`: The SQL query to review
  - `schema_info`: Table/column definitions (for medium/hard tasks)
  - `context`: What the query is supposed to do
  - `issues_found_so_far`: List of issues already identified
  - `remaining_actions`: How many review steps remain
  - `difficulty`: easy | medium | hard

  **Action:**
  - `action_type`: "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
  - `issue_category`: "syntax" | "performance" | "security" | "logic" | "style"
  - `issue_description`: Free-text description of the issue
  - `suggested_fix`: The corrected SQL (optional)
  - `confidence`: Float 0.0-1.0

  **Reward:** Float 0.0-1.0 with partial credit

  ### Three Tasks with Progressive Difficulty

  **Task 1 — Easy: Syntax & Basic Logic Errors**
  - Queries with missing keywords, wrong joins, typos in column names
  - Agent identifies each error → 0.2 reward per correct identification
  - Suggesting a valid fix → bonus 0.1 per fix
  - Expected baseline score: 0.7-0.9

  **Task 2 — Medium: Performance Anti-Patterns**
  - SELECT *, missing indexes, N+1 patterns, unnecessary subqueries, missing WHERE clauses on large tables
  - Requires understanding schema context
  - Agent identifies the anti-pattern + suggests an optimization → partial credit
  - Expected baseline score: 0.4-0.6

  **Task 3 — Hard: Security Vulnerabilities + Schema-Aware Optimization**
  - SQL injection vectors, privilege escalation, data leakage, plus complex optimization (query-plan awareness)
  - Requires multi-step reasoning about schema relationships
  - Expected baseline score: 0.2-0.4

  ### Reward Function Design
  - Per-step rewards (not just end-of-episode)
  - Correct issue identification: +0.2 (scaled by issue severity)
  - Valid fix suggestion: +0.1
  - False positive (flagging a non-issue): -0.1
  - Missing a critical issue at episode end: -0.15
  - Approving a query with unfound issues: -0.2
  - Smooth, informative signal throughout the trajectory

  ### Grader Design
  - Each task has a ground-truth list of issues with categories and severities
  - The grader compares the agent's identified issues against ground truth using fuzzy matching on descriptions
  - Score = (correctly_identified × severity_weight) / total_possible_score
  - Deterministic: the same agent output → the same score every time
  - Returns a float in [0.0, 1.0]
  - Never returns the same score for all inputs (the variety of queries ensures variance)

  ---

  ## Phase 2: Inference Script

  Key requirements:
  - Named `inference.py` in the root directory
  - Uses the OpenAI Client for all LLM calls
  - Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
  - Emits `[START]`, `[STEP]`, `[END]` logs exactly per spec
  - Completes in <20 minutes on 2 vCPU, 8 GB RAM
  - Reproducible scores

  ---

  ## Phase 3: Containerize & Deploy

  ```bash
  # Build and test locally
  docker build -t sql-query-reviewer ./server
  docker run -p 8000:8000 sql-query-reviewer

  # Verify endpoints
  curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'

  # Deploy to HF Spaces
  openenv push --repo-id ravi/sql-query-reviewer

  # Verify deployed version
  curl -X POST https://ravi-sql-query-reviewer.hf.space/reset
  ```

  ---

  ## Phase 4: Pre-Submission QA

  Run the pre-validation script:
  ```bash
  ./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
  ```

  Checklist:
  - [ ] HF Space deploys and responds to `/reset` with 200
  - [ ] `openenv validate` passes
  - [ ] Dockerfile builds cleanly
  - [ ] Inference script runs without errors, produces scores
  - [ ] 3+ tasks, each grader returns scores in the 0.0-1.0 range
  - [ ] Scores are reproducible across runs
  - [ ] README is compelling and complete

  ---

  ## Winning Differentiators

  1. **Real-world utility (30%)**: SQL review is something every data team needs — immediate value for the RL/agent community
  2. **Score variance**: Different agent capabilities produce meaningfully different scores — a basic agent catches syntax errors but misses security issues
  3. **Reward shaping**: Per-step partial-credit signals, not a binary end-of-episode score
  4. **Novelty**: No SQL review environment exists in OpenEnv yet
  5. **Spec compliance**: Bulletproof adherence to every technical requirement — this alone eliminates most competitors
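
The severity-weighted grader formula described in "Grader Design" can be sketched as follows. This is a minimal illustration, not the real `server/grader.py`: it assumes issue dicts with `description` and `severity` keys, and matches descriptions exactly rather than fuzzily.

```python
def severity_weighted_score(found, ground_truth):
    """Severity-weighted coverage: matched severity over total possible severity.

    `found` and `ground_truth` are lists of issue dicts with "description"
    and "severity" keys; an issue counts as found when its description was
    identified (the real grader would match fuzzily, this sketch is exact).
    """
    total = sum(issue["severity"] for issue in ground_truth)
    if total == 0:
        return 1.0  # nothing to find: a clean query scores perfectly
    found_descriptions = {issue["description"] for issue in found}
    matched = sum(
        issue["severity"]
        for issue in ground_truth
        if issue["description"] in found_descriptions
    )
    return matched / total
```

Because different queries carry different issue sets and severities, the same function naturally yields varied scores across the task bank, which is what the "score variance" requirement asks for.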
files/01-problem-statement.md DELETED
@@ -1,32 +0,0 @@
- # 01 — Problem Statement & Domain Selection

  ## Domain: SQL Query Review Environment

  ### The Real-World Problem
  Every software team reviews SQL queries — in code reviews, database migrations, ETL pipeline audits, and security assessments. This is a genuine, high-frequency task that requires:
  - Pattern recognition (anti-patterns, vulnerabilities)
  - Domain knowledge (schema relationships, indexing strategies)
  - Multi-step reasoning (understanding query intent before evaluating correctness)

  ### Why This Domain Wins

  | Evaluation Criteria | Weight | How We Score |
  |---|---|---|
  | Real-world utility | 30% | SQL review is universal — Meta runs millions of queries daily. Fills a real gap in agent evaluation. |
  | Task & grader quality | 25% | Clear ground truth per query, deterministic grading, natural difficulty progression |
  | Environment design | 20% | Clean state (per-query episode), rich observations, well-typed actions, per-step rewards |
  | Code quality & spec compliance | 15% | Full OpenEnv spec, clean project structure, Docker, typed models |
  | Creativity & novelty | 10% | No SQL review env exists in OpenEnv. Reward design uses severity-weighted partial credit. |

  ### What the Agent Does
  1. Receives a SQL query + optional schema context
  2. Reviews it step by step, identifying issues (syntax, performance, security, logic)
  3. Suggests fixes for each identified issue
  4. Decides when to approve or flag the query
  5. Gets rewarded for correctly identified issues and penalized for false positives

  ### Scope Boundaries
  - **In scope**: SELECT, INSERT, UPDATE, DELETE queries; joins; subqueries; CTEs; window functions
  - **Out of scope**: Stored procedures, database-specific dialect features, real database execution
  - **Episode length**: 3-8 steps depending on query complexity
  - **No external dependencies**: All query analysis is rule-based and deterministic
files/02-requirements.md DELETED
@@ -1,58 +0,0 @@
- # 02 — Requirements Specification

  ## Functional Requirements

  ### FR-1: Real-World Task Simulation
  - Simulates SQL query review — a task humans do daily in engineering teams
  - No games, no toys — a purely professional/practical domain

  ### FR-2: OpenEnv Spec Compliance
  - Typed Pydantic models for Observation, Action, State
  - `step(action)` → returns observation, reward, done, info
  - `reset()` → returns the initial observation
  - `state()` → returns the current internal state
  - Valid `openenv.yaml` with metadata
  - Passes `openenv validate`

  ### FR-3: Minimum 3 Tasks with Agent Graders
  - **Task 1 (Easy):** Syntax & basic logic errors — expected agent score 0.7-0.9
  - **Task 2 (Medium):** Performance anti-patterns — expected agent score 0.4-0.6
  - **Task 3 (Hard):** Security vulnerabilities + schema-aware optimization — expected agent score 0.2-0.4
  - Each grader: deterministic, returns a float in [0.0, 1.0], reproducible

  ### FR-4: Meaningful Reward Function
  - Per-step rewards (not just an end-of-episode binary)
  - Partial credit for partial issue identification
  - Penalties for false positives and missed critical issues
  - A smooth signal that guides learning

  ### FR-5: Baseline Inference Script
  - Named `inference.py` in the project root
  - Uses the OpenAI Client for LLM calls
  - Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
  - Emits `[START]`, `[STEP]`, `[END]` structured stdout logs
  - Produces reproducible baseline scores on all 3 tasks

  ## Non-Functional Requirements

  ### NFR-1: Deploys to a Hugging Face Space
  - Containerized HF Space tagged with `openenv`
  - Responds to a `/reset` POST with a 200

  ### NFR-2: Containerized Execution
  - Working Dockerfile
  - Builds with `docker build`, runs with `docker run`
  - Starts cleanly, responds to HTTP requests

  ### NFR-3: Infrastructure Constraints
  - Inference script runtime < 20 minutes
  - Runs on a 2 vCPU, 8 GB RAM machine

  ### NFR-4: Documentation
  - README with: environment description, motivation, action/observation space definitions, task descriptions with difficulty, setup instructions, baseline scores

  ## Disqualification Criteria (Must Avoid)
  - ❌ Environment does not deploy or respond
  - ❌ Plagiarized or trivially modified existing environments
  - ❌ Graders that always return the same score
  - ❌ No baseline inference script
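
FR-5's structured stdout contract can be wrapped in a tiny helper so every log line is formatted one way. The exact field names and ordering must come from the official sample inference script; the fields below are placeholders for illustration only.

```python
import json


def format_event(tag: str, **fields) -> str:
    """Render one structured log line such as "[STEP] {...}".

    `tag` is one of START/STEP/END; keys are serialized sorted so the
    output is stable across runs (the field names here are placeholders,
    not the real spec).
    """
    return f"[{tag}] {json.dumps(fields, sort_keys=True)}"


def log_event(tag: str, **fields) -> None:
    # Print to stdout (what the evaluator scrapes), flushing immediately
    # so lines are not lost if the run is cut off.
    print(format_event(tag, **fields), flush=True)
```

Centralizing the formatting makes "any deviation in field names/ordering" a single-point fix rather than a hunt through the script.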
files/03-information-architecture.md DELETED
@@ -1,66 +0,0 @@
- # 03 — Information Architecture

  ## Data Flow

  ```
  [Task JSON] → reset() → [Observation: query + schema + context]
                          ↓
                Agent decides action
                          ↓
          step(Action) → [Observation + Reward + Done]
                          ↓
           (repeat until done or max_steps)
                          ↓
          close() → Grader computes final score
  ```

  ## Task Data Structure

  Each task is a JSON object:
  ```json
  {
    "task_id": "easy_001",
    "difficulty": "easy",
    "query": "SELCT * FORM users WEHRE id = 1",
    "schema": {
      "users": {"id": "INT PRIMARY KEY", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}
    },
    "context": "Fetch user by ID for profile page",
    "ground_truth_issues": [
      {"category": "syntax", "description": "SELCT should be SELECT", "severity": 0.3, "fix": "SELECT"},
      {"category": "syntax", "description": "FORM should be FROM", "severity": 0.3, "fix": "FROM"},
      {"category": "syntax", "description": "WEHRE should be WHERE", "severity": 0.3, "fix": "WHERE"},
      {"category": "performance", "description": "SELECT * fetches unnecessary columns", "severity": 0.1, "fix": "SELECT id, name, email"}
    ],
    "max_steps": 5
  }
  ```

  ## State Management

  | Field | Type | Description |
  |---|---|---|
  | `task_id` | str | Current task identifier |
  | `query` | str | The SQL query under review |
  | `issues_identified` | list | Issues the agent has found so far |
  | `fixes_suggested` | list | Fixes the agent has proposed |
  | `step_count` | int | Current step number |
  | `total_reward` | float | Accumulated reward |
  | `done` | bool | Whether the episode is complete |
  | `approved` | bool | Whether the agent approved the query |

  ## Observation Space
  - `query`: The full SQL query text
  - `schema_info`: Dict of table → column definitions (empty for easy tasks)
  - `context`: Natural-language description of query intent
  - `issues_found_so_far`: List of previously identified issues in this episode
  - `remaining_actions`: Max steps minus current step
  - `difficulty`: "easy" | "medium" | "hard"
  - `feedback`: Result of the last action ("correct identification", "false positive", "already identified", etc.)

  ## Action Space
  - `action_type`: enum — "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
  - `issue_category`: enum — "syntax" | "performance" | "security" | "logic" | "style"
  - `issue_description`: str — what the agent thinks is wrong
  - `suggested_fix`: str (optional) — corrected SQL fragment
  - `confidence`: float 0.0-1.0
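
Since all tasks live in JSON files, a small load-time validator can catch malformed entries before they reach the grader. This is a sketch with field names taken from the task example above; the helper name is hypothetical.

```python
REQUIRED_TASK_FIELDS = {"task_id", "difficulty", "query", "ground_truth_issues", "max_steps"}


def validate_task(task: dict) -> list:
    """Return a list of problems with a task dict; an empty list means valid."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_TASK_FIELDS - set(task))]
    for issue in task.get("ground_truth_issues", []):
        severity = issue.get("severity")
        # Severities must be numeric and stay in [0.0, 1.0] so grader
        # scores stay bounded.
        if not isinstance(severity, (int, float)) or not 0.0 <= severity <= 1.0:
            problems.append(f"bad severity: {issue.get('description')}")
    return problems
```

Running this over all three task files at server startup keeps a bad JSON edit from surfacing as a confusing grading bug later.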
files/04-system-architecture.md DELETED
@@ -1,54 +0,0 @@
- # 04 — System Architecture

  ## Components

  ```
  ┌────────────────────────────────────────────────┐
  │                    HF Space                    │
  │  ┌──────────────────────────────────────────┐  │
  │  │  FastAPI Server                          │  │
  │  │  (app.py — Uvicorn)                      │  │
  │  │                                          │  │
  │  │  POST /reset → environment.reset()       │  │
  │  │  POST /step  → environment.step()        │  │
  │  │  GET  /state → environment.state()       │  │
  │  └──────────┬───────────────────────────────┘  │
  │             │                                  │
  │  ┌──────────▼───────────────────────────────┐  │
  │  │  SQLReviewEnvironment                    │  │
  │  │  - task_bank (easy/medium/hard JSON)     │  │
  │  │  - grader (deterministic scoring)        │  │
  │  │  - reward_fn (per-step signals)          │  │
  │  └──────────────────────────────────────────┘  │
  │                                                │
  │  Dockerfile (Python 3.10-slim + deps)          │
  └────────────────────────────────────────────────┘

  ┌────────────────────────────────────────────────┐
  │  inference.py (Client)                         │
  │  - OpenAI Client → LLM API                     │
  │  - SQLReviewEnvClient → HF Space               │
  │  - Structured stdout logging                   │
  └────────────────────────────────────────────────┘
  ```

  ## Technology Stack
  - **Runtime:** Python 3.10+
  - **Framework:** FastAPI + Uvicorn
  - **Models:** Pydantic v2
  - **Container:** Docker (python:3.10-slim base)
  - **Deployment:** Hugging Face Spaces (Docker SDK)
  - **LLM Client:** OpenAI Python SDK
  - **Environment SDK:** openenv-core

  ## Communication Protocol
  - WebSocket at `/ws` for persistent sessions (OpenEnv standard)
  - HTTP POST endpoints as fallback: `/reset`, `/step`
  - HTTP GET: `/state`
  - JSON request/response bodies matching the typed Pydantic models

  ## Episode Lifecycle
  1. Client calls `reset(task_id="easy_001")` → server loads the task, returns the initial observation
  2. Client calls `step(action)` → server validates the action, computes the reward, returns an observation
  3. Repeat until `done=True` (all issues found, agent approves, or max_steps reached)
  4. Client calls `close()` → server runs the grader, returns the final score
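
The four-step lifecycle above maps to a short client-side loop. In this sketch, `client` and `agent` are stand-ins assumed to follow the reset/step shapes described in this document, not the real SQLReviewEnvClient API.

```python
def run_episode(client, agent, task_id, max_steps=8):
    """Drive one episode: reset, then act until done or the step budget ends.

    Returns the accumulated per-step reward; the final graded score would
    come from the server-side grader at close().
    """
    observation = client.reset(task_id)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent(observation)
        observation, reward, done = client.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Keeping the loop this small makes it easy to reuse unchanged across the easy, medium, and hard task files.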
files/05-database-schema.md DELETED
@@ -1,52 +0,0 @@
- # 05 — Task Bank Schema

  ## Overview
  Tasks are stored as JSON files, not a database. Each difficulty level has its own file with 3-5 queries.

  ## Easy Tasks (`tasks/easy_tasks.json`)

  Queries with obvious syntax errors, wrong keywords, and basic logic mistakes. An LLM should score 0.7-0.9.

  Example queries:
  1. Misspelled keywords (SELCT, FORM, WEHRE)
  2. Missing FROM clause
  3. Wrong column names that don't exist in the schema
  4. Missing semicolons / unclosed quotes
  5. Using = NULL instead of IS NULL

  ## Medium Tasks (`tasks/medium_tasks.json`)

  Queries with performance anti-patterns. Requires understanding schema context. Target score: 0.4-0.6.

  Example queries:
  1. SELECT * on a 50-column table when only 2 columns are needed
  2. Missing index hint on a JOIN with a large table
  3. Correlated subquery that could be a JOIN
  4. Missing LIMIT on an unbounded query
  5. Redundant DISTINCT on a column with a UNIQUE constraint

  ## Hard Tasks (`tasks/hard_tasks.json`)

  Security vulnerabilities + complex optimization. Target score: 0.2-0.4.

  Example queries:
  1. String concatenation enabling SQL injection
  2. Privilege escalation via UNION with system tables
  3. Data leakage through an unfiltered JOIN exposing PII
  4. Query that could use window functions instead of a self-join (10x perf gain)
  5. Missing transaction isolation causing phantom reads

  ## Ground Truth Format

  Each issue in the ground truth:
  ```json
  {
    "category": "security",
    "description": "String concatenation in WHERE clause enables SQL injection",
    "severity": 1.0,
    "fix": "Use parameterized query with ? placeholder",
    "keywords": ["injection", "concatenation", "user input", "unsanitized"]
  }
  ```

  The `keywords` field is used by the grader for fuzzy matching against agent responses.
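
The `keywords` field lends itself to a simple overlap heuristic. The sketch below is one plausible reading of that matching; the real grader may weight, stem, or threshold tokens differently.

```python
def keyword_overlap(agent_description: str, keywords: list) -> float:
    """Fraction of ground-truth keywords found in the agent's description.

    Multi-word keywords (e.g. "user input") are checked as substrings;
    single words are checked against the token set.
    """
    text = agent_description.lower()
    words = set(text.split())
    hits = sum(1 for kw in keywords if (kw in words) or (" " in kw and kw in text))
    return hits / max(len(keywords), 1)
```

A fractional score like this gives the grader partial credit for descriptions that hit some, but not all, of an issue's vocabulary.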
files/06-api-contracts.md DELETED
@@ -1,96 +0,0 @@
- # 06 — API Contracts

  ## OpenEnv Standard Endpoints

  ### POST /reset
  **Request:**
  ```json
  {"task_id": "easy_001"}
  ```
  **Response (StepResult):**
  ```json
  {
    "observation": {
      "query": "SELCT * FORM users WEHRE id = 1",
      "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
      "context": "Fetch user by ID for profile page",
      "issues_found_so_far": [],
      "remaining_actions": 5,
      "difficulty": "easy",
      "feedback": "Review this SQL query and identify any issues."
    },
    "reward": 0.0,
    "done": false,
    "info": {}
  }
  ```

  ### POST /step
  **Request (Action):**
  ```json
  {
    "action_type": "identify_issue",
    "issue_category": "syntax",
    "issue_description": "SELCT is misspelled, should be SELECT",
    "suggested_fix": "SELECT",
    "confidence": 0.95
  }
  ```
  **Response (StepResult):**
  ```json
  {
    "observation": {
      "query": "SELCT * FORM users WEHRE id = 1",
      "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
      "context": "Fetch user by ID for profile page",
      "issues_found_so_far": [{"category": "syntax", "description": "SELCT should be SELECT"}],
      "remaining_actions": 4,
      "difficulty": "easy",
      "feedback": "Correct! SELCT is indeed a syntax error. 3 issues remaining."
    },
    "reward": 0.25,
    "done": false,
    "info": {"match_type": "exact", "severity": 0.3}
  }
  ```

  ### GET /state
  **Response (State):**
  ```json
  {
    "task_id": "easy_001",
    "step_count": 1,
    "issues_identified": [{"category": "syntax", "description": "SELCT should be SELECT"}],
    "total_reward": 0.25,
    "done": false,
    "approved": false
  }
  ```

  ## Pydantic Models

  ```python
  from typing import Dict, List, Literal, Optional

  # The Action, Observation, and State base classes are assumed to come
  # from the openenv-core SDK.

  class SQLReviewAction(Action):
      action_type: Literal["identify_issue", "suggest_fix", "approve", "request_more_context"]
      issue_category: Optional[Literal["syntax", "performance", "security", "logic", "style"]] = None
      issue_description: Optional[str] = None
      suggested_fix: Optional[str] = None
      confidence: float = 0.5

  class SQLReviewObservation(Observation):
      query: str
      schema_info: Dict[str, Dict[str, str]]
      context: str
      issues_found_so_far: List[Dict[str, str]]
      remaining_actions: int
      difficulty: str
      feedback: str

  class SQLReviewState(State):
      task_id: str
      step_count: int
      issues_identified: List[Dict[str, str]]
      total_reward: float
      done: bool
      approved: bool
  ```
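
The /reset contract above can be mirrored server-side with a small response builder. This sketch uses a dataclass instead of the real Pydantic base classes to stay dependency-free; the field names follow the example payloads, and `build_reset_result` is a hypothetical helper name.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class StepResult:
    """Response envelope shared by /reset and /step."""
    observation: dict
    reward: float = 0.0
    done: bool = False
    info: dict = field(default_factory=dict)


def build_reset_result(task: dict) -> dict:
    # Initial observation: no issues found yet, full step budget remaining.
    observation = {
        "query": task["query"],
        "schema_info": task.get("schema", {}),
        "context": task.get("context", ""),
        "issues_found_so_far": [],
        "remaining_actions": task["max_steps"],
        "difficulty": task["difficulty"],
        "feedback": "Review this SQL query and identify any issues.",
    }
    return asdict(StepResult(observation=observation))
```

Building every response through one envelope type keeps /reset and /step structurally identical, which is what the StepResult contract requires.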
files/07-monorepo-structure.md DELETED
@@ -1,65 +0,0 @@
- # 07 — Monorepo Structure

  ```
  sql-query-reviewer/
  │
  ├── openenv.yaml           # Environment metadata manifest
  ├── models.py              # Pydantic: SQLReviewAction, SQLReviewObservation, SQLReviewState
  ├── client.py              # EnvClient subclass for external consumers
  ├── inference.py           # MANDATORY: Baseline inference script (root directory!)
  ├── README.md              # Environment documentation
  ├── pyproject.toml         # Package config
  │
  ├── tasks/
  │   ├── easy_tasks.json    # 5 syntax/logic error queries
  │   ├── medium_tasks.json  # 5 performance anti-pattern queries
  │   └── hard_tasks.json    # 5 security + optimization queries
  │
  └── server/
      ├── __init__.py
      ├── environment.py     # SQLReviewEnvironment(Environment) — core logic
      ├── grader.py          # Deterministic grading: fuzzy match agent output vs ground truth
      ├── reward.py          # Per-step reward computation
      ├── app.py             # FastAPI server (create_app with routes)
      ├── Dockerfile         # Python 3.10-slim, install deps, expose port
      └── requirements.txt   # openenv-core, fastapi, uvicorn, pydantic
  ```

  ## Key Files Explained

  | File | Purpose | Critical? |
  |---|---|---|
  | `openenv.yaml` | Metadata: name, description, author, tasks list | Yes — validated by `openenv validate` |
  | `models.py` | Typed Action/Observation/State contracts | Yes — spec compliance |
  | `inference.py` | Baseline agent using OpenAI Client | Yes — DQ if missing |
  | `server/environment.py` | `reset()`, `step()`, `state()` implementation | Yes — core logic |
  | `server/grader.py` | Score computation per task | Yes — must return 0.0-1.0 |
  | `server/Dockerfile` | Container definition | Yes — must build cleanly |
  | `README.md` | Human-readable documentation | Yes — judges read this first |

  ## openenv.yaml

  ```yaml
  name: sql-query-reviewer
  description: "AI agent reviews SQL queries for correctness, performance, and security"
  author: ravi
  version: "1.0.0"
  tags:
    - openenv
    - sql
    - code-review
    - security
  tasks:
    - id: easy_syntax
      name: "Syntax Error Detection"
      difficulty: easy
      description: "Find and fix obvious SQL syntax errors"
    - id: medium_performance
      name: "Performance Anti-Pattern Review"
      difficulty: medium
      description: "Identify performance issues requiring schema awareness"
    - id: hard_security
      name: "Security & Optimization Audit"
      difficulty: hard
      description: "Find SQL injection vectors and complex optimization opportunities"
  ```
files/08-computation-engine-spec.md DELETED
@@ -1,86 +0,0 @@
- # 08 — Reward & Grading Engine Spec

  ## Per-Step Reward Function

  ```python
  def compute_reward(action, ground_truth_issues, already_found):
      if action.action_type == "identify_issue":
          # Pass the whole action so the matcher can use both the
          # description and the category.
          match = fuzzy_match(action, ground_truth_issues, already_found)
          if match:
              base = match["severity"]  # 0.1 - 1.0
              fix_bonus = 0.1 if action.suggested_fix and is_valid_fix(action.suggested_fix, match) else 0.0
              confidence_bonus = 0.05 * action.confidence
              return min(base + fix_bonus + confidence_bonus, 0.4)  # cap per-step
          else:
              return -0.1  # false positive penalty

      elif action.action_type == "approve":
          unfound = len(ground_truth_issues) - len(already_found)
          if unfound == 0:
              return 0.2  # correct approval
          else:
              return -0.15 * unfound  # penalty per missed issue

      elif action.action_type == "suggest_fix":
          if not already_found:
              return -0.05  # fixing without identifying first
          last_issue = already_found[-1]
          if is_valid_fix(action.suggested_fix, last_issue):
              return 0.1
          return 0.0

      elif action.action_type == "request_more_context":
          return 0.0  # neutral — no reward, no penalty

      return 0.0
  ```

  ## Fuzzy Matching Algorithm

  ```python
  def fuzzy_match(action, ground_truth_issues, already_found):
      """Match the agent's identified issue to a ground truth issue."""
      best_match = None
      best_score = 0.0

      agent_words = set(action.issue_description.lower().split())
      for issue in ground_truth_issues:
          if issue in already_found:
              continue
          # Keyword overlap score
          truth_words = set(issue["keywords"])
          overlap = len(agent_words & truth_words) / max(len(truth_words), 1)
          # Category match bonus
          category_bonus = 0.3 if action.issue_category == issue["category"] else 0.0
          score = overlap + category_bonus
          if score > best_score and score > 0.3:  # threshold
              best_score = score
              best_match = issue

      return best_match
  ```

  ## End-of-Episode Grader

  ```python
  def grade_episode(issues_found, ground_truth_issues, total_steps, max_steps):
      """Deterministic grader returning a float in [0.0, 1.0]."""
      if not ground_truth_issues:
          return 1.0 if not issues_found else 0.5

      # Issues that actually correspond to ground truth; the rest are
      # false positives.
      matched = [issue for issue in issues_found if issue in ground_truth_issues]
      total_severity = sum(i["severity"] for i in ground_truth_issues)
      found_severity = sum(i["severity"] for i in matched)

      coverage_score = found_severity / total_severity  # 0.0 - 1.0
      efficiency_bonus = max(0, 0.1 * (1 - total_steps / max_steps))  # reward fewer steps
      false_positive_penalty = 0.05 * (len(issues_found) - len(matched))

      score = coverage_score + efficiency_bonus - false_positive_penalty
      return max(0.0, min(1.0, score))
  ```

  ## Score Variance Guarantee
  - Easy tasks: 5 different queries with 2-5 issues each → scores range from 0.4 to 1.0
  - Medium tasks: different anti-patterns → scores range from 0.2 to 0.8
  - Hard tasks: varied security issues → scores range from 0.0 to 0.6
  - A grader that always returns the same score = instant DQ. Our design inherently prevents this because different queries have different ground truth issues.
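
To sanity-check the grader's bounds, the arithmetic (severity coverage plus an efficiency bonus, minus a false-positive penalty, clamped to [0, 1]) can be reworked into a dependency-free helper operating on raw severity lists. The function name here is hypothetical; it exists only to make the numbers checkable.

```python
def episode_score(found_severities, truth_severities, total_steps, max_steps,
                  false_positives=0):
    """Clamp(coverage + efficiency bonus - false-positive penalty) to [0, 1]."""
    total = sum(truth_severities)
    if total == 0:
        # No issues to find: an empty report is perfect, anything else is half credit.
        return 1.0 if not found_severities else 0.5
    coverage = sum(found_severities) / total
    efficiency_bonus = max(0.0, 0.1 * (1 - total_steps / max_steps))
    penalty = 0.05 * false_positives
    return max(0.0, min(1.0, coverage + efficiency_bonus - penalty))
```

Worked example: finding two 0.3-severity issues out of severities [0.3, 0.3, 0.3, 0.1] in 4 of 5 steps with one false positive gives 0.6 coverage + 0.02 bonus - 0.05 penalty = 0.57, comfortably inside the [0, 1] guarantee.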
files/09-engineering-scope-definition.md DELETED
@@ -1,39 +0,0 @@
- # 09 β€” Engineering Scope Definition
-
- ## In Scope (Must Build)
- 1. **Environment server** β€” `environment.py` with `reset()`, `step()`, `state()`
- 2. **Pydantic models** β€” `models.py` with typed Action, Observation, State
- 3. **Client** β€” `client.py` with EnvClient subclass
- 4. **Task bank** β€” 15 SQL queries (5 easy, 5 medium, 5 hard) with ground truth
- 5. **Grader** β€” Deterministic scoring function per task
- 6. **Reward function** β€” Per-step partial credit with penalties
- 7. **Inference script** β€” `inference.py` using OpenAI Client
- 8. **Dockerfile** β€” Working container that builds and runs
- 9. **HF Space deployment** β€” Live, tagged with `openenv`
- 10. **README** β€” Complete documentation
- 11. **openenv.yaml** β€” Valid metadata manifest
-
- ## Out of Scope (Don't Build)
- - Real database execution (all analysis is pattern-matching based)
- - Custom LLM fine-tuning
- - Web UI beyond OpenEnv's built-in web interface
- - Multiple SQL dialects (stick to standard SQL)
- - Integration tests against real databases
-
- ## Effort Estimates
-
- | Component | Hours | Priority |
- |---|---|---|
- | Prep course + bootcamp | 3.0 | P0 |
- | Task bank creation (15 queries + ground truth) | 2.5 | P0 |
- | Pydantic models | 0.5 | P0 |
- | Environment logic (reset/step/state) | 3.0 | P0 |
- | Grader + reward function | 2.0 | P0 |
- | Inference script | 1.5 | P0 |
- | Dockerfile + local testing | 1.0 | P0 |
- | HF Space deployment | 0.5 | P0 |
- | README | 1.0 | P0 |
- | Pre-validation + bug fixes | 2.0 | P0 |
- | **Total** | **~17 hours** | |
-
- Fits within the 2-day window with buffer for debugging.
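The typed Action from item 2 can be sketched without any third-party dependencies. The real `models.py` uses Pydantic; this dataclass only approximates the shape, with field names taken from the environment's documented action space:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SQLReviewAction:
    # One of: identify_issue, suggest_fix, approve, request_more_context
    action_type: str
    # One of: syntax, performance, security, logic, style (when relevant)
    issue_category: Optional[str] = None
    issue_description: Optional[str] = None
    suggested_fix: Optional[str] = None
    confidence: float = 0.5

action = SQLReviewAction(action_type="approve")
assert action.confidence == 0.5
```

Pydantic adds validation (e.g. rejecting an unknown `action_type`) on top of this shape, which is why it is the in-scope choice.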
files/10-development-phases.md DELETED
@@ -1,48 +0,0 @@
- # 10 β€” Development Phases
-
- ## Phase 1: Learn (Apr 10, 9 AM – 12 PM)
- - [ ] Complete Module 1: Interface basics
- - [ ] Complete Module 2: Using existing environments
- - [ ] Complete Module 3: Deployment to HF Spaces
- - [ ] Complete Module 4: Building your own environment
- - [ ] Watch bootcamp recording, note judge preferences
- - [ ] Study sample inference script format
-
- ## Phase 2: Scaffold (Apr 10, 12 PM – 2 PM)
- - [ ] `pip install openenv-core huggingface_hub openai`
- - [ ] `openenv init sql-query-reviewer`
- - [ ] Clone and study echo env for reference
- - [ ] Set up project structure per 07-monorepo-structure.md
-
- ## Phase 3: Core Build (Apr 10, 2 PM – Apr 11, 12 PM)
- - [ ] Write `models.py` β€” Action, Observation, State
- - [ ] Create task bank β€” 5 easy, 5 medium, 5 hard queries with ground truth
- - [ ] Implement `environment.py` β€” reset(), step(), state()
- - [ ] Implement `grader.py` β€” deterministic scoring
- - [ ] Implement `reward.py` β€” per-step reward computation
- - [ ] Implement fuzzy matching for issue identification
- - [ ] Write `app.py` β€” FastAPI routes
- - [ ] Local testing: `uv run server` β†’ test all endpoints manually
-
- ## Phase 4: Inference (Apr 11, 12 PM – 3 PM)
- - [ ] Write `inference.py` following sample script format exactly
- - [ ] System prompt design for SQL review agent
- - [ ] Test with free HF Inference API
- - [ ] Verify `[START]`, `[STEP]`, `[END]` output format
- - [ ] Run 3x to verify reproducible scores
-
- ## Phase 5: Containerize & Deploy (Apr 11, 3 PM – 6 PM)
- - [ ] Write Dockerfile (python:3.10-slim base)
- - [ ] `docker build -t sql-query-reviewer ./server`
- - [ ] `docker run -p 8000:8000 sql-query-reviewer`
- - [ ] Test `/reset`, `/step`, `/state` against running container
- - [ ] `openenv push --repo-id ravi/sql-query-reviewer`
- - [ ] Verify HF Space returns 200 on `/reset`
-
- ## Phase 6: Polish & Submit (Apr 11, 6 PM – Apr 12, 11:59 PM)
- - [ ] Write compelling README
- - [ ] Run `openenv validate`
- - [ ] Run `validate-submission.sh`
- - [ ] Fix any issues
- - [ ] Submit early, iterate if time permits
- - [ ] Final verification: HF Space live and responding
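The `[STEP]` format check in Phase 4 can be automated with a small regex. The pattern below is inferred from this plan's own notes on the log spec, not an official validator, so treat it as a sketch:

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=\d+ action=\S+ reward=-?\d+\.\d{2} done=(true|false) error=.+$"
)

line = "[STEP] step=1 action=identify_issue(syntax) reward=0.35 done=false error=null"
assert STEP_RE.match(line) is not None
assert STEP_RE.match("[STEP] bad line") is None
```

Piping `inference.py` output through a check like this catches formatting drift before the judges' parser does.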
files/11-environment-and-devops.md DELETED
@@ -1,77 +0,0 @@
- # 11 β€” Environment & DevOps
-
- ## Local Development Setup
-
- ```bash
- # Python environment
- python3.10 -m venv .venv
- source .venv/bin/activate
- pip install openenv-core fastapi uvicorn pydantic openai huggingface_hub
-
- # Run locally
- cd server && uvicorn app:app --reload --port 8000
-
- # Test endpoints
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id": "easy_001"}'
- ```
-
- ## Dockerfile
-
- ```dockerfile
- FROM python:3.10-slim
-
- WORKDIR /app
-
- COPY server/requirements.txt .
- RUN pip install --no-cache-dir -r requirements.txt
-
- COPY models.py .
- COPY tasks/ ./tasks/
- COPY server/ ./server/
- COPY openenv.yaml .
-
- EXPOSE 8000
-
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
- ```
-
- ## server/requirements.txt
-
- ```
- openenv-core>=0.1.0
- fastapi>=0.100.0
- uvicorn>=0.23.0
- pydantic>=2.0.0
- ```
-
- ## HF Space Deployment
-
- ```bash
- # Login
- huggingface-cli login
-
- # Deploy
- openenv push --repo-id ravi/sql-query-reviewer
-
- # Verify
- curl -s -o /dev/null -w "%{http_code}" -X POST https://ravi-sql-query-reviewer.hf.space/reset -H "Content-Type: application/json" -d '{}'
- # Expected: 200
- ```
-
- ## Environment Variables for Inference
-
- ```bash
- export API_BASE_URL="https://router.huggingface.co/v1"
- export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
- export HF_TOKEN="hf_xxxxxxxxxxxxx"
- export IMAGE_NAME="sql-query-reviewer"
- ```
-
- ## Pre-Validation
-
- ```bash
- chmod +x validate-submission.sh
- ./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
- ```
-
- Expected output: All 3/3 checks passed.
files/12-testing-strategy.md DELETED
@@ -1,52 +0,0 @@
- # 12 β€” Testing Strategy
-
- ## Level 1: Unit Tests (During Build)
- - **Models:** Validate that Pydantic models accept correct data and reject incorrect data
- - **Grader:** Test with known inputs β†’ known scores. Verify determinism (run 10x, same result).
- - **Reward function:** Test that each action type returns the expected reward range
- - **Fuzzy matcher:** Test exact-match, partial-match, no-match, and already-found cases
-
- ## Level 2: Integration Tests (Before Docker)
- - Run `uv run server` locally
- - POST `/reset` with each task ID β†’ verify a valid observation is returned
- - POST `/step` with a valid action β†’ verify reward, done, observation
- - POST `/step` with an invalid action β†’ verify graceful error handling
- - GET `/state` β†’ verify state matches expectations
- - Run a full episode: reset β†’ steps β†’ done β†’ verify final grader score
-
- ## Level 3: Container Tests (Before Deploy)
- ```bash
- docker build -t sql-query-reviewer ./server
- CID=$(docker run -d -p 8000:8000 sql-query-reviewer)
- # Wait for startup
- sleep 5
- # Test reset
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}' | python -m json.tool
- # Test step
- curl -X POST http://localhost:8000/step -H "Content-Type: application/json" -d '{"action_type":"identify_issue","issue_category":"syntax","issue_description":"test"}' | python -m json.tool
- docker stop "$CID"
- ```
-
- ## Level 4: Validation Tests (Before Submit)
- - `openenv validate` β€” must pass
- - `validate-submission.sh <url> .` β€” all 3 checks must pass
- - Run `inference.py` 3 times β†’ verify scores are consistent
- - Verify stdout format matches `[START]`, `[STEP]`, `[END]` exactly
- - Check memory usage stays under 8 GB
- - Check runtime stays under 20 minutes
-
- ## Level 5: Score Variance Check
- - Run inference on all 3 tasks β†’ verify different scores
- - Confirm no grader returns the same score for different inputs
- - Verify easy > medium > hard in terms of baseline agent performance
-
- ## DQ Prevention Checklist
- - [ ] HF Space returns 200 on POST /reset
- - [ ] openenv.yaml is valid
- - [ ] Typed models work
- - [ ] Dockerfile builds
- - [ ] 3+ tasks with graders returning 0.0-1.0
- - [ ] Graders DON'T always return the same score
- - [ ] inference.py exists in root
- - [ ] Baseline produces reproducible scores
- - [ ] Not plagiarized from existing environments
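The Level 1 determinism requirement ("run 10x, same result") can be wrapped in a tiny reusable helper. This is a hedged sketch; the helper name is invented and a real test would call the actual grader:

```python
def assert_deterministic(fn, args, runs=10):
    """Call fn(*args) repeatedly and fail if the results ever differ."""
    results = {fn(*args) for _ in range(runs)}
    assert len(results) == 1, f"non-deterministic results: {results}"

# A pure function passes trivially; the grader should pass the same way.
assert_deterministic(lambda a, b: round(a / b, 2), (1.0, 3.0))
```

Dropping this into the pytest suite makes the determinism checklist item a one-liner per graded function.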
files/CHANGES.md DELETED
@@ -1,72 +0,0 @@
- # Changes to Apply β€” Priority Order
-
- ## 🚨 CRITICAL FIX (Do this first β€” DQ risk)
-
- ### 1. Replace `inference.py`
- **File:** `inference.py` (root directory)
- **Problem:** The current stdout format outputs JSON like `[START] {"difficulty": "easy", ...}` instead of the required `[START] task=easy_001 env=sql-query-reviewer model=Qwen/...` format.
- **Impact:** The hackathon dashboard explicitly states: "Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring."
- **Fix:** Replace with the provided `inference.py` that uses `log_start()`, `log_step()`, `log_end()` matching the exact spec format.
-
- **Key changes in the new inference.py:**
- - `[START] task=<task_name> env=<benchmark> model=<model_name>` β€” flat key=value, not JSON
- - `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>` β€” reward formatted to 2 decimal places
- - `[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>` β€” comma-separated rewards list
- - Uses `API_BASE_URL` defaulting to the HF router (not openai.com)
- - Uses `HF_TOKEN` as the primary API key env var
- - Accumulates the rewards list and computes the success boolean
- - try/finally ensures `[END]` is always emitted, even on exception
-
- ---
-
- ## ⚠️ HIGH PRIORITY
-
- ### 2. Replace `openenv.yaml`
- **Problem:** Task IDs in the yaml (`easy_syntax`, `medium_performance`, `hard_security`) don't match the actual task IDs in the JSON files (`easy_001`–`easy_005`, `medium_001`–`medium_005`, `hard_001`–`hard_005`).
- **Impact:** If `openenv validate` checks task ID alignment, validation fails.
- **Fix:** Replace with the provided `openenv.yaml` listing all 15 actual task IDs.
-
- ### 3. Replace `Dockerfile`
- **Problem:** No HEALTHCHECK instruction and no `curl` installed.
- **Fix:** Added `apt-get install curl` and a `HEALTHCHECK` directive.
-
- ### 4. Replace `README.md`
- **Problem:** Functional but not compelling for human reviewers (30% weight on real-world utility).
- **Fix:** Added a "Why This Matters" narrative, a baseline score table, and a cleaner structure.
-
- ---
-
- ## 🟑 MEDIUM PRIORITY (before deadline if time permits)
-
- ### 5. Merge PR #1 on GitHub
- The fix/package-server-and-inference-imports branch is already deployed to HF Spaces but is still a draft PR on GitHub. Merge it so the `main` branch CI passes.
-
- ### 6. Verify the `openenv` tag on the HF Space
- Go to the Space settings on Hugging Face and confirm the `openenv` tag is applied. The README has it in the YAML front matter tags, but double-check that it appears in the Space metadata.
-
- ### 7. Run pre-validation
- ```bash
- ./validate-submission.sh https://hellinferno-sql-query-reviewer.hf.space .
- ```
-
- ---
-
- ## How to apply these changes
-
- ```bash
- # From your local repo directory:
- cp /path/to/fixes/inference.py ./inference.py
- cp /path/to/fixes/openenv.yaml ./openenv.yaml
- cp /path/to/fixes/Dockerfile ./Dockerfile
- cp /path/to/fixes/README.md ./README.md
-
- # Test locally
- uvicorn server.app:app --port 8000 &
- python inference.py  # verify [START]/[STEP]/[END] format
-
- # Push to HF Spaces
- git add -A
- git commit -m "fix: correct inference stdout format and align openenv.yaml task IDs"
- git push origin main
- git push hf main
- ```
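To sanity-check the `[END]` line format described in fix 1, a small parser sketch can be used (the sample line is invented for illustration):

```python
line = "[END] success=true steps=4 score=0.62 rewards=0.35,0.10,0.02,0.15"

# Split the flat key=value fields, then the comma-separated rewards list.
fields = dict(part.split("=", 1) for part in line.removeprefix("[END] ").split(" "))
rewards = [float(r) for r in fields["rewards"].split(",")]

assert fields["success"] == "true"
assert int(fields["steps"]) == len(rewards) == 4
```

If the evaluator parses the line this way, any extra field, reordered key, or JSON payload would break it, which is exactly why the flat key=value format is mandatory.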
files/Dockerfile DELETED
@@ -1,24 +0,0 @@
- FROM python:3.11-slim
-
- ENV PYTHONDONTWRITEBYTECODE=1 \
-     PYTHONUNBUFFERED=1 \
-     PORT=8000
-
- RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
-
- WORKDIR /app
-
- COPY pyproject.toml README.md models.py client.py openenv.yaml inference.py ./
- COPY sql_query_reviewer ./sql_query_reviewer
- COPY server ./server
- COPY tasks ./tasks
-
- RUN pip install --no-cache-dir --upgrade pip && \
-     pip install --no-cache-dir .
-
- EXPOSE 8000
-
- HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
-     CMD curl -f http://localhost:8000/health || exit 1
-
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
files/README.md DELETED
@@ -1,162 +0,0 @@
- ---
- title: SQL Query Reviewer
- colorFrom: blue
- colorTo: green
- sdk: docker
- app_port: 8000
- pinned: false
- tags:
-   - openenv
- ---
-
- # SQL Query Reviewer
-
- An OpenEnv environment where an AI agent reviews SQL queries for correctness, performance, and security β€” the same task thousands of engineers perform every day in code reviews, migration scripts, and ETL audits.
-
- ## Why This Matters
-
- SQL bugs are among the most common and costly defects in production systems. A misplaced keyword breaks an API, a missing index degrades latency by 100x, and an unsanitized input opens a door to data exfiltration. Today these defects are caught by human reviewers who spend hours on repetitive pattern matching. This environment provides a standardized benchmark to train and evaluate AI agents that can automate this critical workflow β€” directly useful for developer tools, IDE integrations, and automated code review systems.
-
- ## What The Environment Does
-
- Each episode gives the agent:
-
- - a SQL query (with realistic bugs drawn from production patterns)
- - schema context when it matters (table definitions, column types, constraints)
- - a short explanation of the query's intended purpose
-
- The agent responds step by step with one of four actions:
-
- | Action | Description |
- |---|---|
- | `identify_issue` | Flag a correctness, performance, or security problem |
- | `suggest_fix` | Propose corrected SQL for a previously identified issue |
- | `approve` | Mark the query as acceptable (ends episode) |
- | `request_more_context` | Ask for additional schema information |
-
- ## Reward Design
-
- Rewards are deterministic and shaped for partial progress throughout the trajectory:
-
- - **Correct issue identification**: +0.10 to +0.35, scaled by issue severity
- - **Valid fix suggestion**: +0.08 to +0.10 bonus
- - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
- - **False positive**: βˆ’0.10 penalty
- - **Duplicate identification**: βˆ’0.02 penalty
- - **Approving with missed issues**: βˆ’0.15 per missed issue
- - **Complete correct approval**: +0.20
-
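The identification rewards above can be combined into a per-step function. This is an illustrative sketch, not the environment's actual `reward.py`; in particular, the mapping from a 1-3 severity scale onto the +0.10 to +0.35 range is assumed:

```python
# Assumed mapping from a 1-3 severity scale onto the reward range above.
SEVERITY_REWARD = {1: 0.10, 2: 0.20, 3: 0.35}

def identification_reward(matched: bool, severity: int, confidence: float,
                          duplicate: bool = False) -> float:
    if duplicate:
        return -0.02               # duplicate identification penalty
    if not matched:
        return -0.10               # false positive penalty
    bonus = 0.05 * confidence      # confidence in [0, 1] keeps this within +0.05
    return SEVERITY_REWARD[severity] + bonus

assert round(identification_reward(True, 3, 1.0), 2) == 0.40
assert identification_reward(False, 1, 0.9) == -0.10
```

The shaping keeps every step's reward small relative to the episode score, so the end-of-episode grader remains the dominant signal.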
- ## Task Bank
-
- The environment ships with **15 tasks** across three difficulty levels:
-
- | Difficulty | Count | Examples | Expected Baseline Score |
- |---|---|---|---|
- | Easy | 5 | Misspelled keywords, missing FROM, = NULL vs IS NULL | ~0.75–0.90 |
- | Medium | 5 | SELECT *, missing indexes, correlated subqueries, unbounded queries | ~0.40–0.60 |
- | Hard | 5 | SQL injection, privilege escalation, PII leakage, self-join optimization | ~0.20–0.40 |
-
- Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
-
- ## Action & Observation Spaces
-
- **Action** (`SQLReviewAction`):
- - `action_type`: identify_issue | suggest_fix | approve | request_more_context
- - `issue_category`: syntax | performance | security | logic | style
- - `issue_description`: concise statement of the problem
- - `suggested_fix`: corrected SQL fragment
- - `confidence`: float 0.0–1.0
-
- **Observation** (`SQLReviewObservation`):
- - `query`: the full SQL query text
- - `schema_info`: dict of table β†’ column definitions
- - `context`: natural language description of query intent
- - `issues_found_so_far`: previously identified issues this episode
- - `remaining_actions`: steps left before episode ends
- - `difficulty`: easy | medium | hard
- - `feedback`: result of last action
-
- ## Repository Layout
-
- ```
- .
- β”œβ”€β”€ openenv.yaml
- β”œβ”€β”€ models.py
- β”œβ”€β”€ client.py
- β”œβ”€β”€ inference.py          ← baseline agent (root directory)
- β”œβ”€β”€ Dockerfile
- β”œβ”€β”€ sql_query_reviewer/   ← typed models and client package
- β”œβ”€β”€ server/               ← FastAPI environment server
- β”‚   β”œβ”€β”€ environment.py    ← reset(), step(), state()
- β”‚   β”œβ”€β”€ grader.py         ← deterministic scoring
- β”‚   β”œβ”€β”€ reward.py         ← per-step reward computation
- β”‚   └── app.py            ← HTTP routes
- β”œβ”€β”€ tasks/                ← 15 SQL query tasks (JSON)
- └── tests/                ← pytest suite
- ```
-
- ## Local Development
-
- ```bash
- python -m venv .venv
- source .venv/bin/activate  # or .venv\Scripts\activate on Windows
- pip install -e ".[dev]"
- uvicorn server.app:app --reload --port 8000
- ```
-
- Test the API:
- ```bash
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"easy_001"}'
- curl http://localhost:8000/state
- pytest
- ```
-
- ## Docker
-
- ```bash
- docker build -t sql-query-reviewer .
- docker run -p 8000:8000 sql-query-reviewer
- ```
-
- ## Inference
-
- ```bash
- export ENV_BASE_URL=http://localhost:8000
- export API_BASE_URL=https://router.huggingface.co/v1
- export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
- export HF_TOKEN=hf_xxx
- python inference.py
- ```
-
- The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
-
- ## Hugging Face Spaces
-
- This repo is Space-ready: HF YAML front matter in README, root Dockerfile, API on port 8000. Deploy with:
-
- ```bash
- git remote add hf https://huggingface.co/spaces/<username>/sql-query-reviewer
- git push hf main
- ```
-
- ## Usage Example
-
- ```python
- from sql_query_reviewer import SQLReviewAction, SQLReviewEnv
-
- with SQLReviewEnv(base_url="https://hellinferno-sql-query-reviewer.hf.space").sync() as env:
-     result = env.reset(task_id="easy_001")
-     result = env.step(SQLReviewAction(
-         action_type="identify_issue",
-         issue_category="syntax",
-         issue_description="SELCT is misspelled and should be SELECT",
-         suggested_fix="SELECT * FROM users WHERE id = 1;",
-         confidence=0.98,
-     ))
-     print(result.reward)
-     print(result.observation.feedback)
- ```
-
- ## Author
-
- **Hellinferno** β€” Solo participant, Meta PyTorch OpenEnv Hackathon 2026
files/architecture-diagram.md DELETED
@@ -1,61 +0,0 @@
- # Architecture Diagram
-
- ## High-Level Flow
-
- ```
- β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
- β”‚              β”‚          β”‚  HF Space (Docker)                 β”‚
- β”‚ inference.py β”‚          β”‚                                    β”‚
- β”‚   (Agent)    β”‚          β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
- β”‚              β”‚   WS     β”‚  β”‚ FastAPI Server           β”‚    β”‚
- β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”   β”œβ”€β”€β”€β”€β–Ίβ”‚   β”‚  β”‚ (app.py)                 β”‚    β”‚
- β”‚ β”‚ OpenAI β”‚   β”‚          β”‚  β”‚                          β”‚    β”‚
- β”‚ β”‚ Client β”‚   β”‚          β”‚  β”‚ /reset β†’ load task       β”‚    β”‚
- β”‚ β”‚   ↕    β”‚   │◄─────    β”‚  β”‚ /step  β†’ grade action    β”‚    β”‚
- β”‚ β”‚  LLM   β”‚   β”‚          β”‚  β”‚ /state β†’ return state    β”‚    β”‚
- β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚          β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
- β”‚              β”‚          β”‚             β”‚                      β”‚
- β”‚ stdout:      β”‚          β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
- β”‚ [START]      β”‚          β”‚  β”‚ SQLReviewEnvironment     β”‚    β”‚
- β”‚ [STEP]       β”‚          β”‚  β”‚ - task_bank (JSON)       β”‚    β”‚
- β”‚ [END]        β”‚          β”‚  β”‚ - fuzzy_matcher          β”‚    β”‚
- β”‚              β”‚          β”‚  β”‚ - reward_fn              β”‚    β”‚
- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚  β”‚ - grader                 β”‚    β”‚
-                           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
-                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
- ```
-
- ## Episode Sequence
-
- ```
- Agent                           Environment
-   β”‚                                 β”‚
-   │──── reset(task_id) ──────────►│  Load task from JSON
-   │◄─── observation ──────────────│  Return query + schema + context
-   β”‚                                 β”‚
-   │──── step(identify_issue) ────►│  Fuzzy match vs ground truth
-   │◄─── obs + reward + done ──────│  Return feedback + reward
-   β”‚                                 β”‚
-   │──── step(suggest_fix) ───────►│  Validate fix
-   │◄─── obs + reward + done ──────│  Return feedback + reward
-   β”‚                                 β”‚
-   │──── step(approve) ───────────►│  Check remaining issues
-   │◄─── obs + reward + done=true──│  Episode ends
-   β”‚                                 β”‚
-   │──── close() ─────────────────►│  Run grader β†’ final score
-   │◄─── final_score ──────────────│
-   β”‚                                 β”‚
- ```
-
- ## Evaluation Pipeline (Hackathon Judges)
-
- ```
- Phase 1: Automated Validation
- └─ HF Space responds? β†’ openenv validate? β†’ Docker builds? β†’ inference.py runs? β†’ 3+ tasks?
-
- Phase 2: Agentic Evaluation
- └─ Run Nemotron 3 Super against all envs β†’ check score variance
-
- Phase 3: Human Review
- └─ Meta + HF engineers review for utility, creativity, exploit checks
- ```
files/inference.py DELETED
@@ -1,227 +0,0 @@
- """
- Inference Script β€” SQL Query Reviewer
- ======================================
- MANDATORY environment variables:
-     API_BASE_URL    The API endpoint for the LLM.
-     MODEL_NAME      The model identifier to use for inference.
-     HF_TOKEN        Your Hugging Face / API key.
-
- STDOUT FORMAT:
-     [START] task=<task_name> env=<benchmark> model=<model_name>
-     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
-     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
- """
-
- from __future__ import annotations
-
- import json
- import os
- from typing import Any, List, Optional
-
- from openai import OpenAI
-
- from sql_query_reviewer.client import SyncSQLReviewEnv
- from sql_query_reviewer.models import SQLReviewAction, SQLReviewObservation
-
- # ---------------------------------------------------------------------------
- # Configuration
- # ---------------------------------------------------------------------------
-
- DEFAULT_TASK_IDS = ("easy_001", "medium_001", "hard_001")
- BENCHMARK = "sql-query-reviewer"
- SUCCESS_SCORE_THRESHOLD = 0.1
-
- ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
- API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
- MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
- API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
-
- SYSTEM_PROMPT = """You are reviewing a SQL query for correctness, performance, and security.
- Return exactly one JSON object with these keys:
- - action_type: identify_issue, suggest_fix, approve, or request_more_context
- - issue_category: syntax, performance, security, logic, or style when relevant
- - issue_description: concise issue statement when relevant
- - suggested_fix: corrected SQL or corrected fragment when relevant
- - confidence: float between 0.0 and 1.0
-
- Guidelines:
- - Prefer identify_issue until you have high confidence all important issues are covered.
- - Use approve only when the query looks acceptable or all issues have already been identified.
- - Keep the JSON valid and do not wrap it in prose.
- """
-
- # ---------------------------------------------------------------------------
- # Structured stdout logging β€” MUST match the hackathon spec exactly
- # ---------------------------------------------------------------------------
-
-
- def log_start(task: str, env: str, model: str) -> None:
-     print(f"[START] task={task} env={env} model={model}", flush=True)
-
-
- def log_step(
-     step: int, action: str, reward: float, done: bool, error: Optional[str]
- ) -> None:
-     done_str = str(done).lower()
-     error_str = error if error else "null"
-     print(
-         f"[STEP] step={step} action={action} reward={reward:.2f} "
-         f"done={done_str} error={error_str}",
-         flush=True,
-     )
-
-
- def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
-     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
-     print(
-         f"[END] success={str(success).lower()} steps={steps} "
-         f"score={score:.2f} rewards={rewards_str}",
-         flush=True,
-     )
-
-
- # ---------------------------------------------------------------------------
- # LLM interaction
- # ---------------------------------------------------------------------------
-
-
- def build_user_prompt(observation: SQLReviewObservation) -> str:
-     payload = {
-         "query": observation.query,
-         "schema_info": observation.schema_info,
-         "context": observation.context,
-         "issues_found_so_far": [
-             issue.model_dump() for issue in observation.issues_found_so_far
-         ],
-         "remaining_actions": observation.remaining_actions,
-         "difficulty": observation.difficulty,
-         "feedback": observation.feedback,
-     }
-     return json.dumps(payload, indent=2)
-
-
- def extract_json(content: str) -> dict[str, Any]:
-     stripped = content.strip()
-     if stripped.startswith("```"):
-         lines = [line for line in stripped.splitlines() if not line.startswith("```")]
-         stripped = "\n".join(lines).strip()
-     start = stripped.find("{")
-     end = stripped.rfind("}")
-     if start == -1 or end == -1 or end <= start:
-         raise ValueError(f"Could not find JSON object in model response: {content!r}")
-     return json.loads(stripped[start : end + 1])
-
-
- def choose_action(
-     llm_client: OpenAI, model_name: str, observation: SQLReviewObservation
- ) -> SQLReviewAction:
-     try:
-         response = llm_client.chat.completions.create(
-             model=model_name,
-             temperature=0,
-             max_tokens=300,
-             messages=[
-                 {"role": "system", "content": SYSTEM_PROMPT},
-                 {"role": "user", "content": build_user_prompt(observation)},
-             ],
-         )
-         content = response.choices[0].message.content or ""
-         return SQLReviewAction.model_validate(extract_json(content))
-     except Exception as exc:
-         print(f"[DEBUG] Model request failed: {exc}", flush=True)
-         # Fallback: approve to end the episode gracefully
-         return SQLReviewAction(action_type="approve", confidence=0.1)
-
-
- # ---------------------------------------------------------------------------
- # Episode runner
- # ---------------------------------------------------------------------------
-
-
- def run_episode(
-     env: SyncSQLReviewEnv, llm_client: OpenAI, model_name: str, task_id: str
- ) -> None:
-     rewards: List[float] = []
-     steps_taken = 0
-     score = 0.0
-     success = False
-     last_error: Optional[str] = None
-
-     log_start(task=task_id, env=BENCHMARK, model=model_name)
-
-     try:
-         result = env.reset(task_id=task_id)
-
-         step = 0
-         while not result.done:
-             step += 1
-             action = choose_action(
-                 llm_client=llm_client,
-                 model_name=model_name,
-                 observation=result.observation,
-             )
-
-             action_str = action.action_type
-             if action.issue_description:
-                 # Keep the action string short and readable
-                 action_str = f"{action.action_type}({action.issue_category})"
-
-             result = env.step(action)
-
-             reward = result.reward
-             rewards.append(reward)
-             steps_taken = step
-             last_error = result.info.get("error") if result.info else None
-
-             log_step(
-                 step=step,
-                 action=action_str,
-                 reward=reward,
-                 done=result.done,
-                 error=last_error,
-             )
-
-         # Get the final score from state
-         state = env.state()
-         score = state.final_score if state.final_score is not None else 0.0
-         success = score >= SUCCESS_SCORE_THRESHOLD
-
-     except Exception as exc:
-         print(f"[DEBUG] Episode error: {exc}", flush=True)
-         last_error = str(exc)
-
-     finally:
-         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
-
-
- # ---------------------------------------------------------------------------
- # Main
- # ---------------------------------------------------------------------------
-
-
- def main() -> int:
-     if not API_KEY:
-         raise SystemExit("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")
-
-     task_ids = tuple(
-         tid.strip()
-         for tid in os.getenv("TASK_IDS", ",".join(DEFAULT_TASK_IDS)).split(",")
-         if tid.strip()
-     )
-
-     llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
-
-     with SyncSQLReviewEnv(base_url=ENV_BASE_URL) as env:
-         for task_id in task_ids:
-             run_episode(
-                 env=env,
-                 llm_client=llm_client,
-                 model_name=MODEL_NAME,
-                 task_id=task_id,
-             )
-
-     return 0
-
-
- if __name__ == "__main__":
-     raise SystemExit(main())
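The fenced-JSON extraction logic in `extract_json` above is easy to verify standalone. The snippet below reproduces its core so it runs without the package imports; it is a copy for testing, not a change to the script:

```python
import json

def extract_json(content: str) -> dict:
    stripped = content.strip()
    # Drop markdown code-fence lines if the model wrapped its JSON in one.
    if stripped.startswith("```"):
        lines = [ln for ln in stripped.splitlines() if not ln.startswith("```")]
        stripped = "\n".join(lines).strip()
    start = stripped.find("{")
    end = stripped.rfind("}")
    if start == -1 or end == -1 or end <= start:
        raise ValueError("no JSON object in model response")
    return json.loads(stripped[start : end + 1])

assert extract_json('```json\n{"action_type": "approve"}\n```') == {"action_type": "approve"}
```

Handling both bare JSON and fenced JSON matters because instruction-tuned models frequently wrap structured output in a code fence despite the system prompt.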
files/openenv.yaml DELETED
@@ -1,70 +0,0 @@
- name: sql-query-reviewer
- description: "AI agent reviews SQL queries for correctness, performance, and security."
- author: Hellinferno
- version: "0.1.0"
- tags:
-   - openenv
-   - sql
-   - code-review
-   - security
- tasks:
-   - id: easy_001
-     name: Syntax Keyword Typos
-     difficulty: easy
-     description: "Detect misspelled SQL keywords (SELCT, FORM, WEHRE) and unnecessary SELECT *."
-   - id: easy_002
-     name: Missing FROM Clause
-     difficulty: easy
-     description: "Find missing FROM keyword before table name."
-   - id: easy_003
-     name: NULL Comparison Logic
-     difficulty: easy
-     description: "Detect = NULL instead of IS NULL."
-   - id: easy_004
-     name: Unclosed String Literal
-     difficulty: easy
-     description: "Find unterminated quote in WHERE clause."
-   - id: easy_005
-     name: Unknown Column Name
-     difficulty: easy
-     description: "Detect column name typo (statuz vs status)."
-   - id: medium_001
-     name: Performance Anti-Pattern Review
-     difficulty: medium
-     description: "Identify schema-aware performance problems like SELECT *, missing indexes, correlated subqueries."
-   - id: medium_002
-     name: Unbounded Query Detection
-     difficulty: medium
-     description: "Find queries missing LIMIT on large tables."
-   - id: medium_003
-     name: Redundant Operations
-     difficulty: medium
-     description: "Detect unnecessary DISTINCT on unique columns."
-   - id: medium_004
-     name: Correlated Subquery Optimization
-     difficulty: medium
-     description: "Find correlated subqueries that could be JOINs."
-   - id: medium_005
-     name: Join Performance Issues
-     difficulty: medium
-     description: "Identify missing index hints and inefficient joins."
-   - id: hard_001
-     name: SQL Injection Detection
-     difficulty: hard
-     description: "Find string concatenation enabling SQL injection vectors."
-   - id: hard_002
-     name: Privilege Escalation via UNION
-     difficulty: hard
-     description: "Detect UNION with system tables exposing sensitive data."
-   - id: hard_003
-     name: PII Data Leakage
-     difficulty: hard
-     description: "Find unfiltered JOINs exposing personally identifiable information."
-   - id: hard_004
-     name: Self-Join Optimization
-     difficulty: hard
-     description: "Detect self-joins replaceable with window functions for 10x improvement."
-   - id: hard_005
-     name: Transaction Isolation Issues
-     difficulty: hard
-     description: "Find missing transaction isolation causing phantom reads."
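Once a manifest like the one above is parsed (e.g. with PyYAML's `yaml.safe_load`), the task bank can be bucketed by difficulty for episode scheduling. A minimal, dependency-free sketch operating on the already-parsed list of task dicts:

```python
from collections import defaultdict


def group_by_difficulty(tasks: list[dict]) -> dict[str, list[str]]:
    """Map each difficulty tier to the ordered list of task IDs in it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for task in tasks:
        groups[task["difficulty"]].append(task["id"])
    return dict(groups)
```

Run against the removed manifest, this would produce three tiers of five task IDs each.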
files/project-design.md DELETED
@@ -1,40 +0,0 @@
- # Project Design
- 
- ## Design Principles
- 
- 1. **Spec compliance first, creativity second.** Most teams will fail on automated validation. Perfect adherence to the OpenEnv spec is the highest-ROI activity.
- 
- 2. **Reward shaping is the differentiator.** Binary end-of-episode rewards are common. Per-step, severity-weighted, partial-credit rewards are what separate top submissions.
- 
- 3. **Score variance is mandatory.** The environment must produce different scores for different agent capabilities. Our design inherently ensures this: different queries have different issues, so no two episodes produce identical scores.
- 
- 4. **Domain authenticity wins the 30%.** Real-world utility is the highest-weighted criterion. SQL review is a task every Meta engineer knows and values. The task bank should contain queries that feel like real code review findings, not synthetic puzzles.
- 
- ## Key Design Decisions
- 
- | Decision | Choice | Rationale |
- |---|---|---|
- | Domain | SQL Query Review | Universal relevance, clear grading, natural difficulty progression |
- | Task count | 15 queries (5/5/5) | Well above the minimum of 3; shows depth |
- | Matching | Fuzzy keyword matching | Robust to LLM phrasing variation while staying deterministic |
- | Reward | Per-step partial credit | Provides a learning signal throughout the trajectory |
- | Episode length | 3-8 steps | Short enough for the 20-min inference limit across all tasks |
- | Grader | Severity-weighted coverage | Rewards finding critical issues more than trivial ones |
- 
- ## Risk Mitigation
- 
- | Risk | Mitigation |
- |---|---|
- | Fuzzy matching too loose → inflated scores | Require 30% keyword overlap threshold plus category match |
- | Fuzzy matching too strict → no agent can score | Include broad keyword lists; test with actual LLM output |
- | Inference timeout | 15 queries × 5-8 steps × ~3 s per LLM call ≈ 6 min, well under 20 min |
- | Docker build fails on HF | Use minimal dependencies; test the Dockerfile locally first |
- | Grader returns the same score for every agent | Unlikely with varied queries, but verify during testing |
- 
- ## What Judges Will See
- 
- 1. **README** -- clear and compelling; explains why SQL review matters and how the env works
- 2. **HF Space** -- live, responds instantly to `/reset`
- 3. **Code** -- clean, well-structured, typed models, deterministic graders
- 4. **Scores** -- meaningful variance: easy ~0.8, medium ~0.5, hard ~0.3
- 5. **Novelty** -- no existing SQL review env in the OpenEnv ecosystem
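The fuzzy-matching rule from the design tables (30% keyword overlap plus an exact category match) can be sketched as follows. The `expected` dict shape and the keyword lists are illustrative assumptions, not the deleted grader's actual schema:

```python
def issue_matches(agent_category: str, agent_text: str, expected: dict) -> bool:
    """Deterministic fuzzy match: the category must agree exactly, and at
    least 30% of the expected keywords must appear in the agent's wording."""
    if agent_category != expected["category"]:
        return False
    words = set(agent_text.lower().split())
    keywords = [kw.lower() for kw in expected["keywords"]]
    hits = sum(1 for kw in keywords if kw in words)
    return hits / len(keywords) >= 0.30
```

Because matching is word-set based, it tolerates phrasing variation ("SELCT is a typo of SELECT" vs "the keyword SELCT should be SELECT") while remaining fully deterministic.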
files/project-readme.md DELETED
@@ -1,91 +0,0 @@
- # SQL Query Reviewer -- OpenEnv Environment
- 
- An AI agent environment for reviewing SQL queries for correctness, performance, and security issues.
- 
- ## Why This Matters
- 
- Every engineering team reviews SQL queries daily -- in code reviews, migration scripts, ETL pipelines, and security audits. This environment lets you train and evaluate AI agents on a task that directly maps to real engineering workflows. Unlike toy benchmarks, the queries here reflect genuine patterns found in production codebases: misspelled keywords, N+1 anti-patterns, missing indexes, SQL injection vectors, and schema-aware optimization opportunities.
- 
- ## Environment Overview
- 
- The agent receives a SQL query (plus optional schema context) and must identify issues through a multi-step review process. It earns rewards for correctly flagging problems and suggesting fixes, and is penalized for false positives or for approving buggy queries.
- 
- ## Action Space
- 
- | Action Type | Description |
- |---|---|
- | `identify_issue` | Flag a specific issue with category and description |
- | `suggest_fix` | Propose corrected SQL for a previously identified issue |
- | `approve` | Mark the query as acceptable (ends the episode) |
- | `request_more_context` | Ask for additional schema information |
- 
- **Fields:** `action_type`, `issue_category` (syntax/performance/security/logic/style), `issue_description`, `suggested_fix`, `confidence` (0.0-1.0)
- 
- ## Observation Space
- 
- | Field | Type | Description |
- |---|---|---|
- | `query` | str | The SQL query under review |
- | `schema_info` | dict | Table/column definitions (richer for harder tasks) |
- | `context` | str | What the query is supposed to do |
- | `issues_found_so_far` | list | Previously identified issues this episode |
- | `remaining_actions` | int | Steps left before the episode ends |
- | `difficulty` | str | easy, medium, or hard |
- | `feedback` | str | Result of the last action |
- 
- ## Tasks
- 
- ### Task 1: Syntax Error Detection (Easy)
- Queries with obvious typos, missing keywords, and wrong column names. A baseline agent should score **0.7-0.9**.
- 
- ### Task 2: Performance Anti-Pattern Review (Medium)
- Queries with SELECT *, missing indexes, correlated subqueries, and unbounded scans. Requires schema awareness. Expected score: **0.4-0.6**.
- 
- ### Task 3: Security & Optimization Audit (Hard)
- SQL injection vectors, privilege escalation, data leakage, and complex optimization. Requires multi-step reasoning. Expected score: **0.2-0.4**.
- 
- ## Reward Design
- - Per-step partial credit (not binary end-of-episode)
- - Correct issue identification: +0.1 to +0.4 (scaled by severity)
- - Valid fix suggestion: +0.1 bonus
- - False positive: -0.1 penalty
- - Approving a query that still has undetected issues: -0.15 per missed issue
- - Correct approval of a clean query: +0.2
- 
- ## Setup
- 
- ```bash
- # Install
- pip install openenv-core
- pip install git+https://huggingface.co/spaces/ravi/sql-query-reviewer
- ```
- 
- ```python
- # Use
- from sql_query_reviewer import SQLReviewEnv, SQLReviewAction
- 
- with SQLReviewEnv(base_url="https://ravi-sql-query-reviewer.hf.space").sync() as env:
-     result = env.reset()
-     result = env.step(SQLReviewAction(
-         action_type="identify_issue",
-         issue_category="syntax",
-         issue_description="SELCT should be SELECT",
-     ))
-     print(result.observation.feedback)
- ```
- 
- ## Docker
- 
- ```bash
- docker build -t sql-query-reviewer ./server
- docker run -p 8000:8000 sql-query-reviewer
- ```
- 
- ## Baseline Scores
- 
- | Task | Difficulty | Baseline Score |
- |---|---|---|
- | Syntax Error Detection | Easy | ~0.82 |
- | Performance Anti-Pattern Review | Medium | ~0.51 |
- | Security & Optimization Audit | Hard | ~0.29 |
- 
- ## Author
- **Ravi** -- Solo participant, Meta PyTorch OpenEnv Hackathon 2026
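The reward schedule in the removed README maps naturally to a single dispatch function. The event names and the severity tiers below are assumptions for illustration; only the numeric values come from the documented design:

```python
# Hypothetical severity tiers spanning the +0.1 to +0.4 range from the README.
SEVERITY_REWARD = {"low": 0.1, "medium": 0.25, "critical": 0.4}


def step_reward(event: str, severity: str = "low", missed_issues: int = 0) -> float:
    """Per-step partial credit mirroring the documented reward design."""
    if event == "identify_correct":
        return SEVERITY_REWARD[severity]   # +0.1 to +0.4, scaled by severity
    if event == "suggest_fix_valid":
        return 0.1                          # bonus for a valid fix
    if event == "false_positive":
        return -0.1                         # penalty for a spurious finding
    if event == "approve_clean":
        return 0.2                          # correctly approving a clean query
    if event == "approve_with_missed":
        return -0.15 * missed_issues        # -0.15 per issue left unfound
    return 0.0
```

Keeping every reward inside one function makes the schedule auditable and easy to tune during reward-shaping iterations.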