Commit: c98afe9
Parent(s): 35a203e
chore: remove planning/reference files directory to reduce Docker image size
- files/00-winning-plan.md +0 -200
- files/01-problem-statement.md +0 -32
- files/02-requirements.md +0 -58
- files/03-information-architecture.md +0 -66
- files/04-system-architecture.md +0 -54
- files/05-database-schema.md +0 -52
- files/06-api-contracts.md +0 -96
- files/07-monorepo-structure.md +0 -65
- files/08-computation-engine-spec.md +0 -86
- files/09-engineering-scope-definition.md +0 -39
- files/10-development-phases.md +0 -48
- files/11-environment-and-devops.md +0 -77
- files/12-testing-strategy.md +0 -52
- files/CHANGES.md +0 -72
- files/Dockerfile +0 -24
- files/README.md +0 -162
- files/architecture-diagram.md +0 -61
- files/inference.py +0 -227
- files/openenv.yaml +0 -70
- files/project-design.md +0 -40
- files/project-readme.md +0 -91
files/00-winning-plan.md
DELETED
@@ -1,200 +0,0 @@

# OpenEnv Hackathon – Winning Plan

**Participant:** Ravi (Solo)
**Deadline:** April 12, 2026, 11:59 PM IST
**Goal:** Top 3,000 out of 20,000 teams → Finale April 25–26

---

## Chosen Domain: **SQL Query Optimizer Review**

An environment where an AI agent reviews SQL queries for correctness, performance, and security issues, then suggests fixes. This scores high on real-world utility (30% weight), is novel in OpenEnv, has natural difficulty progression, and produces clear, measurable rewards.

**Why this wins:**

- Every engineering team at Meta deals with SQL/data pipelines daily → maximum relevance
- Clear grading: each query has known issues, and the agent either finds them or doesn't → partial credit is natural
- Difficulty scales cleanly: syntax errors (easy) → performance anti-patterns (medium) → subtle injection vulnerabilities + schema-aware optimization (hard)
- Novel domain not seen in existing OpenEnv environments (creativity 10%)
- Deterministic grading with score variance (agents that find more issues score higher)

---
## Timeline

| When | What |
|---|---|
| **Apr 10, Morning** | Complete prep modules 1-4 on Colab, watch bootcamp recording |
| **Apr 10, Afternoon** | Install prerequisites, study sample inference script, study echo env code |
| **Apr 10, Evening** | Scaffold project with `openenv init`, define Pydantic models, implement core env logic |
| **Apr 11, Morning** | Implement 3 tasks (easy/medium/hard) with graders and reward functions |
| **Apr 11, Afternoon** | Write `inference.py`, test locally, iterate on reward shaping |
| **Apr 11, Evening** | Dockerize, deploy to HF Spaces, run pre-validation script |
| **Apr 12, Morning** | Write README, final testing, fix issues |
| **Apr 12, Afternoon** | Final pre-validation, submit |
| **Apr 12, Before 11:59 PM** | Verify HF Space is live and responding |

---
## Phase 0: Preparation (Today – First 3 Hours)

### Step 1: Complete Prep Course Modules
- Module 1: Interface basics (`reset()`, `step()`, `state()`)
- Module 2: Using existing environments, typed models
- Module 3: Deployment to HF Spaces with `openenv push`
- Module 4: **Building your own environment** – most critical, take detailed notes

### Step 2: Watch Bootcamp Recording
- Note tips from Ben Burtenshaw (HF) and Pulkit Aneja about what judges look for

### Step 3: Install Prerequisites
```bash
pip install openenv-core huggingface_hub openai pydantic
pip install docker  # or ensure Docker Desktop is running
huggingface-cli login
```

### Step 4: Study the Sample Inference Script
- Memorize the `[START]`, `[STEP]`, `[END]` stdout format
- Any deviation in field names/ordering = incorrect evaluation scoring

### Step 5: Study Existing Environments
- Clone `https://github.com/meta-pytorch/OpenEnv`
- Study the `envs/echo_env/` structure: models.py, client.py, server/environment.py, server/app.py, server/Dockerfile

---
## Phase 1: Build the Environment

### Project Structure
```
sql-query-reviewer/
├── openenv.yaml
├── models.py              # Action, Observation, State Pydantic models
├── client.py              # EnvClient subclass
├── inference.py           # Baseline inference script (root!)
├── README.md
├── tasks/
│   ├── easy_tasks.json    # Syntax error queries
│   ├── medium_tasks.json  # Performance anti-pattern queries
│   └── hard_tasks.json    # Security + schema-aware optimization queries
└── server/
    ├── environment.py     # Core environment logic
    ├── grader.py          # Deterministic grading functions
    ├── app.py             # FastAPI server
    ├── Dockerfile
    └── requirements.txt
```
### Pydantic Models Design

**Observation:**
- `query`: The SQL query to review
- `schema_info`: Table/column definitions (for medium/hard tasks)
- `context`: What the query is supposed to do
- `issues_found_so_far`: List of issues already identified
- `remaining_actions`: How many review steps remain
- `difficulty`: easy | medium | hard

**Action:**
- `action_type`: "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
- `issue_category`: "syntax" | "performance" | "security" | "logic" | "style"
- `issue_description`: Free-text description of the issue
- `suggested_fix`: The corrected SQL (optional)
- `confidence`: Float 0.0-1.0

**Reward:** Float 0.0-1.0 with partial credit
### Three Tasks with Progressive Difficulty

**Task 1 – Easy: Syntax & Basic Logic Errors**
- Queries with missing keywords, wrong joins, typos in column names
- Agent identifies each error → 0.2 reward per correct identification
- Suggesting a valid fix → bonus 0.1 per fix
- Expected baseline score: 0.7-0.9

**Task 2 – Medium: Performance Anti-Patterns**
- SELECT *, missing indexes, N+1 patterns, unnecessary subqueries, missing WHERE clauses on large tables
- Requires understanding schema context
- Agent identifies the anti-pattern + suggests an optimization → partial credit
- Expected baseline score: 0.4-0.6

**Task 3 – Hard: Security Vulnerabilities + Schema-Aware Optimization**
- SQL injection vectors, privilege escalation, data leakage, plus complex optimization (query-plan awareness)
- Requires multi-step reasoning about schema relationships
- Expected baseline score: 0.2-0.4
### Reward Function Design
- Per-step rewards (not just end-of-episode)
- Correct issue identification: +0.2 (scaled by issue severity)
- Valid fix suggestion: +0.1
- False positive (flagging a non-issue): -0.1
- Missing a critical issue at episode end: -0.15
- Approving a query with unfound issues: -0.2
- Smooth, informative signal throughout the trajectory
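As a concrete sketch of this shaping scheme (constants and function names here are illustrative assumptions, not the final server code):

```python
# Illustrative sketch of the reward shaping above; constants and names
# are assumptions, not the shipped implementation.
IDENTIFY_BASE = 0.2      # correct identification, scaled by issue severity
FIX_BONUS = 0.1          # valid fix suggestion
FALSE_POSITIVE = -0.1    # flagged a non-issue
MISSED_CRITICAL = -0.15  # per critical issue still unfound at episode end
BAD_APPROVAL = -0.2      # approved a query that still has unfound issues

def step_reward(correct: bool, severity: float = 0.0, valid_fix: bool = False) -> float:
    """Reward for a single identify_issue/suggest_fix step."""
    if not correct:
        return FALSE_POSITIVE
    return IDENTIFY_BASE * severity + (FIX_BONUS if valid_fix else 0.0)

def end_of_episode_penalty(unfound_critical: int, approved: bool) -> float:
    """Terminal penalty applied once the episode is done."""
    penalty = MISSED_CRITICAL * unfound_critical
    if approved and unfound_critical > 0:
        penalty += BAD_APPROVAL
    return penalty
```

Keeping the per-step and terminal components separate makes the signal easy to tune without touching episode logic.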
### Grader Design
- Each task has a ground-truth list of issues with categories and severity
- Grader compares the agent's identified issues against ground truth using fuzzy matching on descriptions
- Score = (correctly_identified × severity_weight) / total_possible_score
- Deterministic: same agent output → same score every time
- Returns a float in [0.0, 1.0]
- Never returns the same score for all inputs (variety of queries ensures variance)
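A minimal sketch of that severity-weighted formula (the helper name is hypothetical; the real logic lives in server/grader.py):

```python
# Hypothetical sketch of the severity-weighted score formula above.
def grade(found_indices, ground_truth):
    """found_indices: indices into ground_truth the agent correctly identified.
    Returns a deterministic float in [0.0, 1.0]."""
    total = sum(issue["severity"] for issue in ground_truth)
    if total == 0:
        return 0.0
    # set() makes repeated identifications of the same issue count once
    earned = sum(ground_truth[i]["severity"] for i in set(found_indices))
    return min(1.0, earned / total)
```

Because the score depends only on which ground-truth issues were matched, the same agent output always maps to the same score.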
---
## Phase 2: Inference Script

Key requirements:
- Named `inference.py`, in the root directory
- Uses the OpenAI Client for all LLM calls
- Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
- Emits `[START]`, `[STEP]`, `[END]` logs exactly per spec
- Completes in under 20 minutes on 2 vCPU, 8 GB RAM
- Reproducible scores
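The exact field names and ordering come from the official sample script; as a purely illustrative sketch of emitting such structured lines (the payload keys below are placeholders, not the spec):

```python
import json
import sys

def log_event(tag: str, payload: dict) -> None:
    """Emit one '[TAG] {...}' line and flush so the evaluator sees it immediately."""
    sys.stdout.write(f"[{tag}] {json.dumps(payload)}\n")
    sys.stdout.flush()

# Placeholder usage; copy the real field names from the sample script:
# log_event("START", {"task_id": "easy_001"})
# log_event("STEP", {"step": 1, "reward": 0.25})
# log_event("END", {"total_reward": 0.85})
```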
---
## Phase 3: Containerize & Deploy

```bash
# Build and test locally
docker build -t sql-query-reviewer ./server
docker run -p 8000:8000 sql-query-reviewer

# Verify endpoints
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'

# Deploy to HF Spaces
openenv push --repo-id ravi/sql-query-reviewer

# Verify the deployed version
curl -X POST https://ravi-sql-query-reviewer.hf.space/reset
```

---
## Phase 4: Pre-Submission QA

Run the pre-validation script:
```bash
./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
```

Checklist:
- [ ] HF Space deploys and responds to `/reset` with 200
- [ ] `openenv validate` passes
- [ ] Dockerfile builds cleanly
- [ ] Inference script runs without errors, produces scores
- [ ] 3+ tasks, each grader returns scores in the 0.0-1.0 range
- [ ] Scores are reproducible across runs
- [ ] README is compelling and complete

---
## Winning Differentiators

1. **Real-world utility (30%)**: SQL review is something every data team needs – immediate value for the RL/agent community
2. **Score variance**: Different agent capabilities produce meaningfully different scores – a basic agent catches syntax errors but misses security issues
3. **Reward shaping**: Per-step partial-credit signals, not binary end-of-episode
4. **Novelty**: No SQL review environment exists in OpenEnv yet
5. **Spec compliance**: Bulletproof adherence to every technical requirement – this alone eliminates most competitors
files/01-problem-statement.md
DELETED
@@ -1,32 +0,0 @@

# 01 – Problem Statement & Domain Selection

## Domain: SQL Query Review Environment

### The Real-World Problem
Every software team reviews SQL queries – in code reviews, database migrations, ETL pipeline audits, and security assessments. This is a genuine, high-frequency task that requires:
- Pattern recognition (anti-patterns, vulnerabilities)
- Domain knowledge (schema relationships, indexing strategies)
- Multi-step reasoning (understanding query intent before evaluating correctness)

### Why This Domain Wins

| Evaluation Criteria | Weight | How We Score |
|---|---|---|
| Real-world utility | 30% | SQL review is universal – Meta runs millions of queries daily. Fills a real gap in agent evaluation. |
| Task & grader quality | 25% | Clear ground truth per query, deterministic grading, natural difficulty progression |
| Environment design | 20% | Clean state (per-query episode), rich observations, well-typed actions, per-step rewards |
| Code quality & spec compliance | 15% | Full OpenEnv spec, clean project structure, Docker, typed models |
| Creativity & novelty | 10% | No SQL review env exists in OpenEnv. Reward design uses severity-weighted partial credit. |

### What the Agent Does
1. Receives a SQL query + optional schema context
2. Reviews it step by step, identifying issues (syntax, performance, security, logic)
3. Suggests fixes for each identified issue
4. Decides when to approve or flag the query
5. Gets rewarded for correctly identified issues and penalized for false positives

### Scope Boundaries
- **In scope**: SELECT, INSERT, UPDATE, DELETE queries; joins; subqueries; CTEs; window functions
- **Out of scope**: Stored procedures, database-specific dialect features, real database execution
- **Episode length**: 3-8 steps depending on query complexity
- **No external dependencies**: All query analysis is rule-based and deterministic
files/02-requirements.md
DELETED
@@ -1,58 +0,0 @@

# 02 – Requirements Specification

## Functional Requirements

### FR-1: Real-World Task Simulation
- Simulates SQL query review – a task humans do daily in engineering teams
- No games, no toys – purely professional/practical domain

### FR-2: OpenEnv Spec Compliance
- Typed Pydantic models for Observation, Action, State
- `step(action)` → returns observation, reward, done, info
- `reset()` → returns initial observation
- `state()` → returns current internal state
- Valid `openenv.yaml` with metadata
- Passes `openenv validate`
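A minimal plain-Python skeleton of that contract (dicts stand in for the typed models, and the openenv-core base class is omitted; both are assumptions of this sketch):

```python
# Minimal sketch of the FR-2 contract; the real class would subclass the
# openenv-core Environment base and return typed Pydantic models.
class SQLReviewEnvironment:
    def __init__(self, task: dict):
        self.task = task
        self.step_count = 0
        self.done = False

    def reset(self) -> dict:
        """Start a fresh episode and return the initial observation."""
        self.step_count = 0
        self.done = False
        return {"query": self.task["query"], "remaining_actions": self.task["max_steps"]}

    def step(self, action: dict):
        """Apply one review action; return (observation, reward, done, info)."""
        self.step_count += 1
        self.done = self.step_count >= self.task["max_steps"]
        obs = {"query": self.task["query"],
               "remaining_actions": self.task["max_steps"] - self.step_count}
        return obs, 0.0, self.done, {}

    def state(self) -> dict:
        """Expose the current internal state."""
        return {"step_count": self.step_count, "done": self.done}
```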
### FR-3: Minimum 3 Tasks with Agent Graders
- **Task 1 (Easy):** Syntax & basic logic errors → expected agent score 0.7-0.9
- **Task 2 (Medium):** Performance anti-patterns → expected agent score 0.4-0.6
- **Task 3 (Hard):** Security vulnerabilities + schema-aware optimization → expected agent score 0.2-0.4
- Each grader: deterministic, returns a float in [0.0, 1.0], reproducible

### FR-4: Meaningful Reward Function
- Per-step rewards (not just end-of-episode binary)
- Partial credit for partial issue identification
- Penalties for false positives and missed critical issues
- Smooth signal that guides learning

### FR-5: Baseline Inference Script
- Named `inference.py`, in the project root
- Uses the OpenAI Client for LLM calls
- Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
- Emits `[START]`, `[STEP]`, `[END]` structured stdout logs
- Produces reproducible baseline scores on all 3 tasks

## Non-Functional Requirements

### NFR-1: Deploys to a Hugging Face Space
- Containerized HF Space tagged with `openenv`
- Returns 200 and responds to `/reset` POST

### NFR-2: Containerized Execution
- Working Dockerfile
- Builds with `docker build`, runs with `docker run`
- Starts cleanly, responds to HTTP requests

### NFR-3: Infrastructure Constraints
- Inference script runtime < 20 minutes
- Runs on a 2 vCPU, 8 GB RAM machine

### NFR-4: Documentation
- README with: environment description, motivation, action/observation space definitions, task descriptions with difficulty, setup instructions, baseline scores

## Disqualification Criteria (Must Avoid)
- ❌ Environment does not deploy or respond
- ❌ Plagiarized or trivially modified existing environments
- ❌ Graders that always return the same score
- ❌ No baseline inference script
files/03-information-architecture.md
DELETED
@@ -1,66 +0,0 @@

# 03 – Information Architecture

## Data Flow

```
[Task JSON] → reset() → [Observation: query + schema + context]
                     ↓
           Agent decides action
                     ↓
      step(Action) → [Observation + Reward + Done]
                     ↓
      (repeat until done or max_steps)
                     ↓
      close() → Grader computes final score
```
## Task Data Structure

Each task is a JSON object:
```json
{
  "task_id": "easy_001",
  "difficulty": "easy",
  "query": "SELCT * FORM users WEHRE id = 1",
  "schema": {
    "users": {"id": "INT PRIMARY KEY", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}
  },
  "context": "Fetch user by ID for profile page",
  "ground_truth_issues": [
    {"category": "syntax", "description": "SELCT should be SELECT", "severity": 0.3, "fix": "SELECT"},
    {"category": "syntax", "description": "FORM should be FROM", "severity": 0.3, "fix": "FROM"},
    {"category": "syntax", "description": "WEHRE should be WHERE", "severity": 0.3, "fix": "WHERE"},
    {"category": "performance", "description": "SELECT * fetches unnecessary columns", "severity": 0.1, "fix": "SELECT id, name, email"}
  ],
  "max_steps": 5
}
```
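Loading such files into an id-indexed task bank could look like this (the helper name and layout are assumptions of this sketch):

```python
import json

def load_task_bank(paths):
    """Read task JSON files (each a list of task objects like the one above)
    into a single dict keyed by task_id."""
    bank = {}
    for path in paths:
        with open(path) as f:
            for task in json.load(f):
                bank[task["task_id"]] = task
    return bank
```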
## State Management

| Field | Type | Description |
|---|---|---|
| `task_id` | str | Current task identifier |
| `query` | str | The SQL query under review |
| `issues_identified` | list | Issues the agent has found so far |
| `fixes_suggested` | list | Fixes the agent has proposed |
| `step_count` | int | Current step number |
| `total_reward` | float | Accumulated reward |
| `done` | bool | Whether the episode is complete |
| `approved` | bool | Whether the agent approved the query |

## Observation Space
- `query`: The full SQL query text
- `schema_info`: Dict of table → column definitions (empty for easy tasks)
- `context`: Natural-language description of query intent
- `issues_found_so_far`: List of previously identified issues in this episode
- `remaining_actions`: Max steps minus current step
- `difficulty`: "easy" | "medium" | "hard"
- `feedback`: Result of the last action ("correct identification", "false positive", "already identified", etc.)

## Action Space
- `action_type`: enum – "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
- `issue_category`: enum – "syntax" | "performance" | "security" | "logic" | "style"
- `issue_description`: str – what the agent thinks is wrong
- `suggested_fix`: str (optional) – corrected SQL fragment
- `confidence`: float 0.0-1.0
files/04-system-architecture.md
DELETED
@@ -1,54 +0,0 @@

# 04 – System Architecture

## Components

```
┌─────────────────────────────────────────────┐
│                  HF Space                   │
│  ┌───────────────────────────────────────┐  │
│  │           FastAPI Server              │  │
│  │          (app.py → Uvicorn)           │  │
│  │                                       │  │
│  │  POST /reset → environment.reset()    │  │
│  │  POST /step  → environment.step()     │  │
│  │  GET  /state → environment.state()    │  │
│  └──────────┬────────────────────────────┘  │
│             │                               │
│  ┌──────────▼────────────────────────────┐  │
│  │        SQLReviewEnvironment           │  │
│  │  - task_bank (easy/medium/hard JSON)  │  │
│  │  - grader (deterministic scoring)     │  │
│  │  - reward_fn (per-step signals)       │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  Dockerfile (Python 3.10-slim + deps)       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│           inference.py (Client)             │
│  - OpenAI Client → LLM API                  │
│  - SQLReviewEnvClient → HF Space            │
│  - Structured stdout logging                │
└─────────────────────────────────────────────┘
```

## Technology Stack
- **Runtime:** Python 3.10+
- **Framework:** FastAPI + Uvicorn
- **Models:** Pydantic v2
- **Container:** Docker (python:3.10-slim base)
- **Deployment:** Hugging Face Spaces (Docker SDK)
- **LLM Client:** OpenAI Python SDK
- **Environment SDK:** openenv-core

## Communication Protocol
- WebSocket at `/ws` for persistent sessions (OpenEnv standard)
- HTTP POST endpoints as fallback: `/reset`, `/step`
- HTTP GET: `/state`
- JSON request/response bodies matching the typed Pydantic models

## Episode Lifecycle
1. Client calls `reset(task_id="easy_001")` → server loads the task, returns the initial observation
2. Client calls `step(action)` → server validates the action, computes the reward, returns an observation
3. Repeat until `done=True` (all issues found, agent approves, or max_steps reached)
4. Client calls `close()` → server runs the grader, returns the final score
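From the client side this lifecycle is a plain loop; sketched generically below (the `policy` callable and the env interface are stand-ins for the real client and LLM agent, not the actual API):

```python
def run_episode(env, policy, max_steps: int = 8) -> float:
    """Reset, act until done or the step budget runs out, return total reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)              # e.g. an LLM call in inference.py
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```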
files/05-database-schema.md
DELETED
@@ -1,52 +0,0 @@

# 05 – Task Bank Schema

## Overview
Tasks are stored as JSON files, not a database. Each difficulty level has its own file with 3-5 queries.

## Easy Tasks (`tasks/easy_tasks.json`)

Queries with obvious syntax errors, wrong keywords, and basic logic mistakes. An LLM should score 0.7-0.9.

Example queries:
1. Misspelled keywords (SELCT, FORM, WEHRE)
2. Missing FROM clause
3. Wrong column names that don't exist in the schema
4. Missing semicolons / unclosed quotes
5. Using `= NULL` instead of `IS NULL`

## Medium Tasks (`tasks/medium_tasks.json`)

Queries with performance anti-patterns. Requires understanding schema context. Target score: 0.4-0.6.

Example queries:
1. SELECT * on a 50-column table when only 2 columns are needed
2. Missing index hint on a JOIN with a large table
3. Correlated subquery that could be a JOIN
4. Missing LIMIT on an unbounded query
5. Redundant DISTINCT on a column with a UNIQUE constraint

## Hard Tasks (`tasks/hard_tasks.json`)

Security vulnerabilities + complex optimization. Target score: 0.2-0.4.

Example queries:
1. String concatenation enabling SQL injection
2. Privilege escalation via UNION with system tables
3. Data leakage through an unfiltered JOIN exposing PII
4. A query that could use window functions instead of a self-join (10x perf gain)
5. Missing transaction isolation causing phantom reads

## Ground Truth Format

Each issue in the ground truth:
```json
{
  "category": "security",
  "description": "String concatenation in WHERE clause enables SQL injection",
  "severity": 1.0,
  "fix": "Use parameterized query with ? placeholder",
  "keywords": ["injection", "concatenation", "user input", "unsanitized"]
}
```

The `keywords` field is used by the grader for fuzzy matching against agent responses.
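That matching can be as simple as a deterministic keyword-containment check; a sketch (the threshold and normalization are assumptions, not the final grader):

```python
def matches_issue(description: str, keywords: list, min_hits: int = 1) -> bool:
    """True when the agent's free-text description contains at least
    min_hits of the issue's ground-truth keywords (case-insensitive)."""
    text = description.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits >= min_hits
```

Substring containment keeps the grader deterministic while tolerating varied agent phrasings.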
files/06-api-contracts.md
DELETED
@@ -1,96 +0,0 @@

# 06 – API Contracts

## OpenEnv Standard Endpoints

### POST /reset
**Request:**
```json
{"task_id": "easy_001"}
```
**Response (StepResult):**
```json
{
  "observation": {
    "query": "SELCT * FORM users WEHRE id = 1",
    "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
    "context": "Fetch user by ID for profile page",
    "issues_found_so_far": [],
    "remaining_actions": 5,
    "difficulty": "easy",
    "feedback": "Review this SQL query and identify any issues."
  },
  "reward": 0.0,
  "done": false,
  "info": {}
}
```

### POST /step
**Request (Action):**
```json
{
  "action_type": "identify_issue",
  "issue_category": "syntax",
  "issue_description": "SELCT is misspelled, should be SELECT",
  "suggested_fix": "SELECT",
  "confidence": 0.95
}
```
**Response (StepResult):**
```json
{
  "observation": {
    "query": "SELCT * FORM users WEHRE id = 1",
    "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
    "context": "Fetch user by ID for profile page",
    "issues_found_so_far": [{"category": "syntax", "description": "SELCT should be SELECT"}],
    "remaining_actions": 4,
    "difficulty": "easy",
    "feedback": "Correct! SELCT is indeed a syntax error. 3 issues remaining."
  },
  "reward": 0.25,
  "done": false,
  "info": {"match_type": "exact", "severity": 0.3}
}
```

### GET /state
**Response (State):**
```json
{
  "task_id": "easy_001",
  "step_count": 1,
  "issues_identified": [{"category": "syntax", "description": "SELCT should be SELECT"}],
  "total_reward": 0.25,
  "done": false,
  "approved": false
}
```
## Pydantic Models

```python
from typing import Dict, List, Literal, Optional

# Action, Observation, and State base classes come from openenv-core.

class SQLReviewAction(Action):
    action_type: Literal["identify_issue", "suggest_fix", "approve", "request_more_context"]
    issue_category: Optional[Literal["syntax", "performance", "security", "logic", "style"]] = None
    issue_description: Optional[str] = None
    suggested_fix: Optional[str] = None
    confidence: float = 0.5

class SQLReviewObservation(Observation):
    query: str
    schema_info: Dict[str, Dict[str, str]]
    context: str
    issues_found_so_far: List[Dict[str, str]]
    remaining_actions: int
    difficulty: str
    feedback: str

class SQLReviewState(State):
    task_id: str
    step_count: int
    issues_identified: List[Dict[str, str]]
    total_reward: float
    done: bool
    approved: bool
```
files/07-monorepo-structure.md
DELETED
@@ -1,65 +0,0 @@

# 07 – Monorepo Structure

```
sql-query-reviewer/
│
├── openenv.yaml           # Environment metadata manifest
├── models.py              # Pydantic: SQLReviewAction, SQLReviewObservation, SQLReviewState
├── client.py              # EnvClient subclass for external consumers
├── inference.py           # MANDATORY: Baseline inference script (root directory!)
├── README.md              # Environment documentation
├── pyproject.toml         # Package config
│
├── tasks/
│   ├── easy_tasks.json    # 5 syntax/logic error queries
│   ├── medium_tasks.json  # 5 performance anti-pattern queries
│   └── hard_tasks.json    # 5 security + optimization queries
│
└── server/
    ├── __init__.py
    ├── environment.py     # SQLReviewEnvironment(Environment) – core logic
    ├── grader.py          # Deterministic grading: fuzzy-match agent output vs ground truth
    ├── reward.py          # Per-step reward computation
    ├── app.py             # FastAPI server (create_app with routes)
    ├── Dockerfile         # Python 3.10-slim, install deps, expose port
    └── requirements.txt   # openenv-core, fastapi, uvicorn, pydantic
```

## Key Files Explained

| File | Purpose | Critical? |
|---|---|---|
| `openenv.yaml` | Metadata: name, description, author, tasks list | Yes – validated by `openenv validate` |
| `models.py` | Typed Action/Observation/State contracts | Yes – spec compliance |
| `inference.py` | Baseline agent using the OpenAI Client | Yes – DQ if missing |
| `server/environment.py` | `reset()`, `step()`, `state()` implementation | Yes – core logic |
| `server/grader.py` | Score computation per task | Yes – must return 0.0-1.0 |
| `server/Dockerfile` | Container definition | Yes – must build cleanly |
| `README.md` | Human-readable documentation | Yes – judges read this first |

## openenv.yaml

```yaml
name: sql-query-reviewer
description: "AI agent reviews SQL queries for correctness, performance, and security"
author: ravi
version: "1.0.0"
tags:
  - openenv
  - sql
  - code-review
  - security
tasks:
  - id: easy_syntax
    name: "Syntax Error Detection"
    difficulty: easy
    description: "Find and fix obvious SQL syntax errors"
  - id: medium_performance
    name: "Performance Anti-Pattern Review"
    difficulty: medium
    description: "Identify performance issues requiring schema awareness"
  - id: hard_security
    name: "Security & Optimization Audit"
    difficulty: hard
    description: "Find SQL injection vectors and complex optimization opportunities"
```
|
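Several files in the table above carry disqualification risk if absent, so a tiny pre-flight check is cheap insurance. The helper below is illustrative (the required-file list is taken from the table; the function name is an assumption):

```python
REQUIRED_FILES = [
    "openenv.yaml", "models.py", "inference.py", "README.md",
    "server/environment.py", "server/grader.py", "server/Dockerfile",
]

def missing_required(present, required=tuple(REQUIRED_FILES)):
    """Return the critical files not found in `present` (a collection of paths)."""
    return [f for f in required if f not in set(present)]

# A tree that forgot inference.py is flagged immediately:
print(missing_required([
    "openenv.yaml", "models.py", "README.md",
    "server/environment.py", "server/grader.py", "server/Dockerfile",
]))  # ['inference.py']
```

In a real repo the `present` argument would come from `pathlib.Path.rglob("*")` at the project root.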
files/08-computation-engine-spec.md
DELETED
@@ -1,86 +0,0 @@

# 08 – Reward & Grading Engine Spec

## Per-Step Reward Function

```python
def compute_reward(action, ground_truth_issues, already_found):
    if action.action_type == "identify_issue":
        match = fuzzy_match(
            action.issue_description, action.issue_category,
            ground_truth_issues, already_found,
        )
        if match:
            base = match["severity"]  # 0.1 - 1.0
            fix_bonus = 0.1 if action.suggested_fix and is_valid_fix(action.suggested_fix, match) else 0.0
            confidence_bonus = 0.05 * action.confidence  # match is guaranteed here
            return min(base + fix_bonus + confidence_bonus, 0.4)  # cap per-step reward
        else:
            return -0.1  # false-positive penalty

    elif action.action_type == "approve":
        unfound = len(ground_truth_issues) - len(already_found)
        if unfound == 0:
            return 0.2  # correct approval
        else:
            return -0.15 * unfound  # penalty per missed issue

    elif action.action_type == "suggest_fix":
        if not already_found:
            return -0.05  # fixing without identifying first
        last_issue = already_found[-1]
        if is_valid_fix(action.suggested_fix, last_issue):
            return 0.1
        return 0.0

    elif action.action_type == "request_more_context":
        return 0.0  # neutral - no reward, no penalty

    return 0.0
```

## Fuzzy Matching Algorithm

```python
def fuzzy_match(agent_description, agent_category, ground_truth_issues, already_found):
    """Match the agent's issue description to an unclaimed ground-truth issue."""
    best_match = None
    best_score = 0.0

    for issue in ground_truth_issues:
        if issue in already_found:
            continue
        # Keyword overlap score
        agent_words = set(agent_description.lower().split())
        truth_words = set(issue["keywords"])
        overlap = len(agent_words & truth_words) / max(len(truth_words), 1)
        # Category match bonus
        category_bonus = 0.3 if agent_category == issue["category"] else 0.0
        score = overlap + category_bonus
        if score > best_score and score > 0.3:  # acceptance threshold
            best_score = score
            best_match = issue

    return best_match
```

## End-of-Episode Grader

```python
def grade_episode(issues_found, ground_truth_issues, total_steps, max_steps):
    """Deterministic grader returning a float in [0.0, 1.0].

    `issues_found` holds the ground-truth issues the agent successfully matched;
    false positives are tracked separately by the environment.
    """
    if not ground_truth_issues:
        return 1.0 if not issues_found else 0.5

    total_severity = sum(i["severity"] for i in ground_truth_issues)
    found_severity = sum(i["severity"] for i in issues_found if i in ground_truth_issues)

    coverage_score = found_severity / total_severity  # 0.0 - 1.0
    efficiency_bonus = max(0, 0.1 * (1 - total_steps / max_steps))  # reward fewer steps
    false_positive_penalty = 0.05 * count_false_positives(issues_found, ground_truth_issues)

    score = coverage_score + efficiency_bonus - false_positive_penalty
    return max(0.0, min(1.0, score))
```

## Score Variance Guarantee

- Easy tasks: 5 different queries with 2-5 issues each – scores range from 0.4 to 1.0
- Medium tasks: different anti-patterns – scores range from 0.2 to 0.8
- Hard tasks: varied security issues – scores range from 0.0 to 0.6
- A grader that always returns the same score means an instant DQ. This design inherently prevents that, because different queries have different ground-truth issues.
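The variance claim above can be spot-checked with a simplified stand-in for the episode grader. The weights match the spec (coverage plus an efficiency bonus minus a false-positive penalty, clamped to [0, 1]); the harness itself is illustrative:

```python
def grade(found_severity, total_severity, total_steps, max_steps, false_positives):
    # Simplified grade_episode: coverage + efficiency bonus - FP penalty, clamped.
    coverage = found_severity / total_severity
    efficiency = max(0.0, 0.1 * (1 - total_steps / max_steps))
    penalty = 0.05 * false_positives
    return max(0.0, min(1.0, coverage + efficiency - penalty))

# Different coverage levels must yield different scores:
scores = [round(grade(s, 2.0, 5, 10, 0), 2) for s in (0.5, 1.0, 2.0)]
print(scores)  # [0.3, 0.55, 1.0]
```

Three distinct inputs produce three distinct scores, which is exactly the property the automated evaluation phase checks for.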
files/09-engineering-scope-definition.md
DELETED
@@ -1,39 +0,0 @@

# 09 – Engineering Scope Definition

## In Scope (Must Build)
1. **Environment server** – `environment.py` with `reset()`, `step()`, `state()`
2. **Pydantic models** – `models.py` with typed Action, Observation, State
3. **Client** – `client.py` with EnvClient subclass
4. **Task bank** – 15 SQL queries (5 easy, 5 medium, 5 hard) with ground truth
5. **Grader** – deterministic scoring function per task
6. **Reward function** – per-step partial credit with penalties
7. **Inference script** – `inference.py` using OpenAI Client
8. **Dockerfile** – working container that builds and runs
9. **HF Space deployment** – live, tagged with `openenv`
10. **README** – complete documentation
11. **openenv.yaml** – valid metadata manifest

## Out of Scope (Don't Build)
- Real database execution (all analysis is pattern-matching based)
- Custom LLM fine-tuning
- Web UI beyond OpenEnv's built-in web interface
- Multiple SQL dialects (stick to standard SQL)
- Integration tests against real databases

## Effort Estimates

| Component | Hours | Priority |
|---|---|---|
| Prep course + bootcamp | 3.0 | P0 |
| Task bank creation (15 queries + ground truth) | 2.5 | P0 |
| Pydantic models | 0.5 | P0 |
| Environment logic (reset/step/state) | 3.0 | P0 |
| Grader + reward function | 2.0 | P0 |
| Inference script | 1.5 | P0 |
| Dockerfile + local testing | 1.0 | P0 |
| HF Space deployment | 0.5 | P0 |
| README | 1.0 | P0 |
| Pre-validation + bug fixes | 2.0 | P0 |
| **Total** | **~17 hours** | |

This fits within the two-day window with buffer for debugging.
files/10-development-phases.md
DELETED
@@ -1,48 +0,0 @@

# 10 – Development Phases

## Phase 1: Learn (Apr 10, 9 AM – 12 PM)
- [ ] Complete Module 1: Interface basics
- [ ] Complete Module 2: Using existing environments
- [ ] Complete Module 3: Deployment to HF Spaces
- [ ] Complete Module 4: Building your own environment
- [ ] Watch bootcamp recording, note judge preferences
- [ ] Study sample inference script format

## Phase 2: Scaffold (Apr 10, 12 PM – 2 PM)
- [ ] `pip install openenv-core huggingface_hub openai`
- [ ] `openenv init sql-query-reviewer`
- [ ] Clone and study the echo env for reference
- [ ] Set up project structure per 07-monorepo-structure.md

## Phase 3: Core Build (Apr 10, 2 PM – Apr 11, 12 PM)
- [ ] Write `models.py` – Action, Observation, State
- [ ] Create task bank – 5 easy, 5 medium, 5 hard queries with ground truth
- [ ] Implement `environment.py` – reset(), step(), state()
- [ ] Implement `grader.py` – deterministic scoring
- [ ] Implement `reward.py` – per-step reward computation
- [ ] Implement fuzzy matching for issue identification
- [ ] Write `app.py` – FastAPI routes
- [ ] Local testing: `uv run server` – test all endpoints manually

## Phase 4: Inference (Apr 11, 12 PM – 3 PM)
- [ ] Write `inference.py` following the sample script format exactly
- [ ] Design the system prompt for the SQL review agent
- [ ] Test with the free HF Inference API
- [ ] Verify `[START]`, `[STEP]`, `[END]` output format
- [ ] Run 3x to verify reproducible scores

## Phase 5: Containerize & Deploy (Apr 11, 3 PM – 6 PM)
- [ ] Write Dockerfile (python:3.10-slim base)
- [ ] `docker build -t sql-query-reviewer ./server`
- [ ] `docker run -p 8000:8000 sql-query-reviewer`
- [ ] Test `/reset`, `/step`, `/state` against the running container
- [ ] `openenv push --repo-id ravi/sql-query-reviewer`
- [ ] Verify the HF Space returns 200 on `/reset`

## Phase 6: Polish & Submit (Apr 11, 6 PM – Apr 12, 11:59 PM)
- [ ] Write a compelling README
- [ ] Run `openenv validate`
- [ ] Run `validate-submission.sh`
- [ ] Fix any issues
- [ ] Submit early, iterate if time permits
- [ ] Final verification: HF Space live and responding
files/11-environment-and-devops.md
DELETED
@@ -1,77 +0,0 @@

# 11 – Environment & DevOps

## Local Development Setup

```bash
# Python environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install openenv-core fastapi uvicorn pydantic openai huggingface_hub

# Run locally
cd server && uvicorn app:app --reload --port 8000

# Test endpoints
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id": "easy_001"}'
```

## Dockerfile

```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY server/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY models.py .
COPY tasks/ ./tasks/
COPY server/ ./server/
COPY openenv.yaml .

EXPOSE 8000

CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

## server/requirements.txt

```
openenv-core>=0.1.0
fastapi>=0.100.0
uvicorn>=0.23.0
pydantic>=2.0.0
```

## HF Space Deployment

```bash
# Login
huggingface-cli login

# Deploy
openenv push --repo-id ravi/sql-query-reviewer

# Verify
curl -s -o /dev/null -w "%{http_code}" -X POST https://ravi-sql-query-reviewer.hf.space/reset -H "Content-Type: application/json" -d '{}'
# Expected: 200
```

## Environment Variables for Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_xxxxxxxxxxxxx"
export IMAGE_NAME="sql-query-reviewer"
```

## Pre-Validation

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
```

Expected output: all 3/3 checks passed.
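The environment variables above can be read with sensible defaults inside the inference script. This loader is a sketch; the defaults mirror the documented exports, and the function name is an assumption:

```python
import os

def load_inference_config(env=None):
    # Defaults mirror the exports above; HF_TOKEN has no safe default.
    env = os.environ if env is None else env
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": env.get("HF_TOKEN", ""),
        "image_name": env.get("IMAGE_NAME", "sql-query-reviewer"),
    }

cfg = load_inference_config({"MODEL_NAME": "my-org/my-model"})
print(cfg["model_name"])    # my-org/my-model
print(cfg["api_base_url"])  # https://router.huggingface.co/v1
```

Accepting the mapping as a parameter (instead of reading `os.environ` directly) keeps the loader unit-testable.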
files/12-testing-strategy.md
DELETED
@@ -1,52 +0,0 @@

# 12 – Testing Strategy

## Level 1: Unit Tests (During Build)
- **Models:** validate that Pydantic models accept correct data and reject incorrect data
- **Grader:** test with known inputs → known scores. Verify determinism (run 10x, same result).
- **Reward function:** test that each action type returns a reward in the expected range
- **Fuzzy matcher:** test exact-match, partial-match, no-match, and already-found cases

## Level 2: Integration Tests (Before Docker)
- Run `uv run server` locally
- POST `/reset` with each task ID → verify a valid observation is returned
- POST `/step` with a valid action → verify reward, done, observation
- POST `/step` with an invalid action → verify graceful error handling
- GET `/state` → verify state matches expectations
- Run a full episode: reset → steps → done → verify the final grader score

## Level 3: Container Tests (Before Deploy)
```bash
docker build -t sql-query-reviewer ./server
docker run -d -p 8000:8000 sql-query-reviewer
# Wait for startup
sleep 5
# Test reset
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}' | python -m json.tool
# Test step
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" -d '{"action_type":"identify_issue","issue_category":"syntax","issue_description":"test"}' | python -m json.tool
docker stop $(docker ps -q)
```

## Level 4: Validation Tests (Before Submit)
- `openenv validate` – must pass
- `validate-submission.sh <url> .` – all 3 checks must pass
- Run `inference.py` 3 times – verify scores are consistent
- Verify stdout format matches `[START]`, `[STEP]`, `[END]` exactly
- Check memory usage stays under 8 GB
- Check runtime stays under 20 minutes

## Level 5: Score Variance Check
- Run inference on all 3 tasks – verify different scores
- Confirm no grader returns the same score for different inputs
- Verify easy > medium > hard in terms of baseline agent performance

## DQ Prevention Checklist
- [ ] HF Space returns 200 on POST /reset
- [ ] openenv.yaml is valid
- [ ] Typed models work
- [ ] Dockerfile builds
- [ ] 3+ tasks with graders returning 0.0–1.0
- [ ] Graders DON'T always return the same score
- [ ] inference.py exists in the repo root
- [ ] Baseline produces reproducible scores
- [ ] Not plagiarized from existing environments
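The Level 1 determinism requirement ("run 10x, same result") can be expressed as a tiny harness. Here `toy_grader` is an illustrative stand-in for the real function in `server/grader.py`:

```python
def toy_grader(issues_found):
    # Stand-in for the real grader: deterministic score in [0.0, 1.0].
    return round(min(1.0, 0.2 * len(issues_found)), 2)

# Run 10x on the same input - a deterministic grader yields exactly one score.
results = {toy_grader(["missing_where", "select_star"]) for _ in range(10)}
print(results)  # {0.4}
```

Collecting outputs in a set makes non-determinism visible immediately: any stray randomness produces a set with more than one element.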
files/CHANGES.md
DELETED
@@ -1,72 +0,0 @@

# Changes to Apply – Priority Order

## 🚨 CRITICAL FIX (Do this first – DQ risk)

### 1. Replace `inference.py`
**File:** `inference.py` (root directory)
**Problem:** The current stdout format outputs JSON like `[START] {"difficulty": "easy", ...}` instead of the required `[START] task=easy_001 env=sql-query-reviewer model=Qwen/...` format.
**Impact:** The hackathon dashboard explicitly states: "Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring."
**Fix:** Replace with the provided `inference.py`, which uses `log_start()`, `log_step()`, and `log_end()` matching the exact spec format.

**Key changes in the new inference.py:**
- `[START] task=<task_name> env=<benchmark> model=<model_name>` – flat key=value, not JSON
- `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>` – reward formatted to 2 decimal places
- `[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>` – comma-separated rewards list
- Uses `API_BASE_URL` defaulting to the HF router (not openai.com)
- Uses `HF_TOKEN` as the primary API key env var
- Accumulates the rewards list and computes the success boolean
- try/finally ensures `[END]` is always emitted, even on exception

---

## ⚠️ HIGH PRIORITY

### 2. Replace `openenv.yaml`
**Problem:** Task IDs in the yaml (`easy_syntax`, `medium_performance`, `hard_security`) don't match the actual task IDs in the JSON files (`easy_001`-`easy_005`, `medium_001`-`medium_005`, `hard_001`-`hard_005`).
**Impact:** If `openenv validate` checks task ID alignment, validation fails.
**Fix:** Replace with the provided `openenv.yaml` listing all 15 actual task IDs.

### 3. Replace `Dockerfile`
**Problem:** No HEALTHCHECK instruction and no `curl` installed.
**Fix:** Added `apt-get install curl` and a `HEALTHCHECK` directive.

### 4. Replace `README.md`
**Problem:** Functional but not compelling for human reviewers (30% weight on real-world utility).
**Fix:** Added a "Why This Matters" narrative, a baseline score table, and a cleaner structure.

---

## 🟡 MEDIUM PRIORITY (before deadline if time permits)

### 5. Merge PR #1 on GitHub
The fix/package-server-and-inference-imports branch is already deployed to HF Spaces but is still a draft PR on GitHub. Merge it so the `main` branch CI passes.

### 6. Verify the `openenv` tag on the HF Space
Go to the Space settings on Hugging Face and confirm the `openenv` tag is applied. The README has it in the YAML front matter tags, but double-check that it appears in the Space metadata.

### 7. Run pre-validation
```bash
./validate-submission.sh https://hellinferno-sql-query-reviewer.hf.space .
```

---

## How to apply these changes

```bash
# From your local repo directory:
cp /path/to/fixes/inference.py ./inference.py
cp /path/to/fixes/openenv.yaml ./openenv.yaml
cp /path/to/fixes/Dockerfile ./Dockerfile
cp /path/to/fixes/README.md ./README.md

# Test locally
uvicorn server.app:app --port 8000 &
python inference.py  # verify [START]/[STEP]/[END] format

# Push to HF Spaces
git add -A
git commit -m "fix: correct inference stdout format and align openenv.yaml task IDs"
git push origin main
git push hf main
```
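A quick pre-submit sanity check for the stdout format can be written with regexes derived from the three line templates above. The regexes themselves are illustrative, not part of the spec:

```python
import re

# Patterns derived from the [START]/[STEP]/[END] templates above.
START_RE = re.compile(r"^\[START\] task=\S+ env=\S+ model=\S+$")
STEP_RE = re.compile(
    r"^\[STEP\] step=\d+ action=\S+ reward=-?\d+\.\d{2} done=(true|false) error=.*$")
END_RE = re.compile(
    r"^\[END\] success=(true|false) steps=\d+ score=-?\d+(\.\d+)?"
    r" rewards=-?\d+(\.\d+)?(,-?\d+(\.\d+)?)*$")

ok = bool(START_RE.match(
    "[START] task=easy_001 env=sql-query-reviewer model=Qwen/Qwen2.5-72B-Instruct"))
bad = bool(START_RE.match('[START] {"difficulty": "easy"}'))
print(ok, bad)  # True False
```

Piping the captured stdout of a local `python inference.py` run through these patterns catches the JSON-format regression described in the critical fix before the judges do.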
files/Dockerfile
DELETED
@@ -1,24 +0,0 @@

```dockerfile
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PORT=8000

RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY pyproject.toml README.md models.py client.py openenv.yaml inference.py ./
COPY sql_query_reviewer ./sql_query_reviewer
COPY server ./server
COPY tasks ./tasks

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir .

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```
files/README.md
DELETED
@@ -1,162 +0,0 @@

---
title: SQL Query Reviewer
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
---

# SQL Query Reviewer

An OpenEnv environment where an AI agent reviews SQL queries for correctness, performance, and security – the same task thousands of engineers perform every day in code reviews, migration scripts, and ETL audits.

## Why This Matters

SQL bugs are among the most common and costly defects in production systems. A misplaced keyword breaks an API, a missing index degrades latency by 100x, and an unsanitized input opens a door to data exfiltration. Today these defects are caught by human reviewers who spend hours on repetitive pattern matching. This environment provides a standardized benchmark to train and evaluate AI agents that can automate this critical workflow – directly useful for developer tools, IDE integrations, and automated code review systems.

## What The Environment Does

Each episode gives the agent:

- a SQL query (with realistic bugs drawn from production patterns)
- schema context when it matters (table definitions, column types, constraints)
- a short explanation of the query's intended purpose

The agent responds step by step with one of four actions:

| Action | Description |
|---|---|
| `identify_issue` | Flag a correctness, performance, or security problem |
| `suggest_fix` | Propose corrected SQL for a previously identified issue |
| `approve` | Mark the query as acceptable (ends episode) |
| `request_more_context` | Ask for additional schema information |

## Reward Design

Rewards are deterministic and shaped for partial progress throughout the trajectory:

- **Correct issue identification**: +0.10 to +0.35, scaled by issue severity
- **Valid fix suggestion**: +0.08 to +0.10 bonus
- **Confidence bonus**: up to +0.05 for high-confidence correct identifications
- **False positive**: −0.10 penalty
- **Duplicate identification**: −0.02 penalty
- **Approving with missed issues**: −0.15 per missed issue
- **Complete correct approval**: +0.20

## Task Bank

The environment ships with **15 tasks** across three difficulty levels:

| Difficulty | Count | Examples | Expected Baseline Score |
|---|---|---|---|
| Easy | 5 | Misspelled keywords, missing FROM, `= NULL` vs `IS NULL` | ~0.75–0.90 |
| Medium | 5 | `SELECT *`, missing indexes, correlated subqueries, unbounded queries | ~0.40–0.60 |
| Hard | 5 | SQL injection, privilege escalation, PII leakage, self-join optimization | ~0.20–0.40 |

Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`

## Action & Observation Spaces

**Action** (`SQLReviewAction`):
- `action_type`: identify_issue | suggest_fix | approve | request_more_context
- `issue_category`: syntax | performance | security | logic | style
- `issue_description`: concise statement of the problem
- `suggested_fix`: corrected SQL fragment
- `confidence`: float 0.0–1.0

**Observation** (`SQLReviewObservation`):
- `query`: the full SQL query text
- `schema_info`: dict of table → column definitions
- `context`: natural-language description of query intent
- `issues_found_so_far`: previously identified issues this episode
- `remaining_actions`: steps left before the episode ends
- `difficulty`: easy | medium | hard
- `feedback`: result of the last action

## Repository Layout

```
.
├── openenv.yaml
├── models.py
├── client.py
├── inference.py          <- baseline agent (root directory)
├── Dockerfile
├── sql_query_reviewer/   <- typed models and client package
├── server/               <- FastAPI environment server
│   ├── environment.py    <- reset(), step(), state()
│   ├── grader.py         <- deterministic scoring
│   ├── reward.py         <- per-step reward computation
│   └── app.py            <- HTTP routes
├── tasks/                <- 15 SQL query tasks (JSON)
└── tests/                <- pytest suite
```

## Local Development

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -e .[dev]
uvicorn server.app:app --reload --port 8000
```

Test the API:
```bash
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"easy_001"}'
curl http://localhost:8000/state
pytest
```

## Docker

```bash
docker build -t sql-query-reviewer .
docker run -p 8000:8000 sql-query-reviewer
```

## Inference

```bash
export ENV_BASE_URL=http://localhost:8000
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_xxx
python inference.py
```

The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.

## Hugging Face Spaces

This repo is Space-ready: HF YAML front matter in the README, a root Dockerfile, and the API on port 8000. Deploy with:

```bash
git remote add hf https://huggingface.co/spaces/<username>/sql-query-reviewer
git push hf main
```

## Usage Example

```python
from sql_query_reviewer import SQLReviewAction, SQLReviewEnv

with SQLReviewEnv(base_url="https://hellinferno-sql-query-reviewer.hf.space").sync() as env:
    result = env.reset(task_id="easy_001")
    result = env.step(SQLReviewAction(
        action_type="identify_issue",
        issue_category="syntax",
        issue_description="SELCT is misspelled and should be SELECT",
        suggested_fix="SELECT * FROM users WHERE id = 1;",
        confidence=0.98,
    ))
    print(result.reward)
    print(result.observation.feedback)
```

## Author

**Hellinferno** – Solo participant, Meta PyTorch OpenEnv Hackathon 2026
files/architecture-diagram.md
DELETED
|
@@ -1,61 +0,0 @@
# Architecture Diagram

## High-Level Flow

```
+---------------+        +-------------------------------------+
|               |        |  HF Space (Docker)                  |
| inference.py  |        |                                     |
|    (Agent)    |        |  +-------------------------------+  |
|               |   WS   |  | FastAPI Server (app.py)       |  |
|  +---------+  |------->|  |                               |  |
|  | OpenAI  |  |        |  | /reset -> load task           |  |
|  | Client  |  |<-------|  | /step  -> grade action        |  |
|  |  LLM    |  |        |  | /state -> return state        |  |
|  +---------+  |        |  +---------------+---------------+  |
|               |        |                  |                  |
|  stdout:      |        |  +---------------v---------------+  |
|  [START]      |        |  | SQLReviewEnvironment          |  |
|  [STEP]       |        |  | - task_bank (JSON)            |  |
|  [END]        |        |  | - fuzzy_matcher               |  |
|               |        |  | - reward_fn                   |  |
+---------------+        |  | - grader                      |  |
                         |  +-------------------------------+  |
                         +-------------------------------------+
```

## Episode Sequence

```
Agent                                  Environment
  |                                        |
  |--- reset(task_id) -------------------->|  Load task from JSON
  |<-- observation ------------------------|  Return query + schema + context
  |                                        |
  |--- step(identify_issue) -------------->|  Fuzzy match vs ground truth
  |<-- obs + reward + done ----------------|  Return feedback + reward
  |                                        |
  |--- step(suggest_fix) ----------------->|  Validate fix
  |<-- obs + reward + done ----------------|  Return feedback + reward
  |                                        |
  |--- step(approve) --------------------->|  Check remaining issues
  |<-- obs + reward + done=true -----------|  Episode ends
  |                                        |
  |--- close() --------------------------->|  Run grader -> final score
  |<-- final_score ------------------------|
  |                                        |
```

## Evaluation Pipeline (Hackathon Judges)

```
Phase 1: Automated Validation
  +- HF Space responds? -> openenv validate? -> Docker builds? -> inference.py runs? -> 3+ tasks?

Phase 2: Agentic Evaluation
  +- Run Nemotron 3 Super against all envs -> check score variance

Phase 3: Human Review
  +- Meta + HF engineers review for utility, creativity, exploit checks
```
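The episode sequence above can be sketched as a plain client loop. `StubEnv` below is a hypothetical in-process stand-in for the HTTP environment (the real one sits behind `/reset`, `/step`, and `/state`); only the reset -> step -> close shape is taken from the diagram, and the fixed per-step reward is an illustrative placeholder.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class StubEnv:
    """Hypothetical in-process stand-in for the HTTP environment."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, task_id: str) -> StepResult:
        self.steps = 0
        return StepResult(f"query + schema for {task_id}", 0.0, False)

    def step(self, action: str) -> StepResult:
        self.steps += 1
        done = action == "approve" or self.steps >= self.max_steps
        return StepResult(f"feedback for {action}", 0.1, done)

    def close(self) -> float:
        # Stand-in for the grader's final score.
        return 0.1 * self.steps


env = StubEnv()
result = env.reset("easy_001")
rewards = []
for action in ("identify_issue", "suggest_fix", "approve"):
    result = env.step(action)
    rewards.append(result.reward)
    if result.done:
        break
final_score = env.close()
print(len(rewards), result.done, round(final_score, 2))  # -> 3 True 0.3
```

The loop terminates either when the agent approves or when the step budget runs out, which mirrors how the real client drives an episode.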
files/inference.py
DELETED
|
@@ -1,227 +0,0 @@
"""
Inference Script - SQL Query Reviewer
======================================
MANDATORY environment variables:
    API_BASE_URL    The API endpoint for the LLM.
    MODEL_NAME      The model identifier to use for inference.
    HF_TOKEN        Your Hugging Face / API key.

STDOUT FORMAT:
    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
"""

from __future__ import annotations

import json
import os
from typing import Any, List, Optional

from openai import OpenAI

from sql_query_reviewer.client import SyncSQLReviewEnv
from sql_query_reviewer.models import SQLReviewAction, SQLReviewObservation

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

DEFAULT_TASK_IDS = ("easy_001", "medium_001", "hard_001")
BENCHMARK = "sql-query-reviewer"
SUCCESS_SCORE_THRESHOLD = 0.1

ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")

SYSTEM_PROMPT = """You are reviewing a SQL query for correctness, performance, and security.
Return exactly one JSON object with these keys:
- action_type: identify_issue, suggest_fix, approve, or request_more_context
- issue_category: syntax, performance, security, logic, or style when relevant
- issue_description: concise issue statement when relevant
- suggested_fix: corrected SQL or corrected fragment when relevant
- confidence: float between 0.0 and 1.0

Guidelines:
- Prefer identify_issue until you have high confidence all important issues are covered.
- Use approve only when the query looks acceptable or all issues have already been identified.
- Keep the JSON valid and do not wrap it in prose.
"""

# ---------------------------------------------------------------------------
# Structured stdout logging - MUST match the hackathon spec exactly
# ---------------------------------------------------------------------------


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(
    step: int, action: str, reward: float, done: bool, error: Optional[str]
) -> None:
    done_str = str(done).lower()
    error_str = error if error else "null"
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={done_str} error={error_str}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_str}",
        flush=True,
    )


# ---------------------------------------------------------------------------
# LLM interaction
# ---------------------------------------------------------------------------


def build_user_prompt(observation: SQLReviewObservation) -> str:
    payload = {
        "query": observation.query,
        "schema_info": observation.schema_info,
        "context": observation.context,
        "issues_found_so_far": [
            issue.model_dump() for issue in observation.issues_found_so_far
        ],
        "remaining_actions": observation.remaining_actions,
        "difficulty": observation.difficulty,
        "feedback": observation.feedback,
    }
    return json.dumps(payload, indent=2)


def extract_json(content: str) -> dict[str, Any]:
    stripped = content.strip()
    if stripped.startswith("```"):
        lines = [line for line in stripped.splitlines() if not line.startswith("```")]
        stripped = "\n".join(lines).strip()
    start = stripped.find("{")
    end = stripped.rfind("}")
    if start == -1 or end == -1 or end <= start:
        raise ValueError(f"Could not find JSON object in model response: {content!r}")
    return json.loads(stripped[start : end + 1])


def choose_action(
    llm_client: OpenAI, model_name: str, observation: SQLReviewObservation
) -> SQLReviewAction:
    try:
        response = llm_client.chat.completions.create(
            model=model_name,
            temperature=0,
            max_tokens=300,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": build_user_prompt(observation)},
            ],
        )
        content = response.choices[0].message.content or ""
        return SQLReviewAction.model_validate(extract_json(content))
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        # Fallback: approve to end the episode gracefully
        return SQLReviewAction(action_type="approve", confidence=0.1)


# ---------------------------------------------------------------------------
# Episode runner
# ---------------------------------------------------------------------------


def run_episode(
    env: SyncSQLReviewEnv, llm_client: OpenAI, model_name: str, task_id: str
) -> None:
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False
    last_error: Optional[str] = None

    log_start(task=task_id, env=BENCHMARK, model=model_name)

    try:
        result = env.reset(task_id=task_id)

        step = 0
        while not result.done:
            step += 1
            action = choose_action(
                llm_client=llm_client,
                model_name=model_name,
                observation=result.observation,
            )

            action_str = action.action_type
            if action.issue_description:
                # Keep action string short and readable
                action_str = f"{action.action_type}({action.issue_category})"

            result = env.step(action)

            reward = result.reward
            rewards.append(reward)
            steps_taken = step
            last_error = result.info.get("error") if result.info else None

            log_step(
                step=step,
                action=action_str,
                reward=reward,
                done=result.done,
                error=last_error,
            )

        # Get final score from state
        state = env.state()
        score = state.final_score if state.final_score is not None else 0.0
        success = score >= SUCCESS_SCORE_THRESHOLD

    except Exception as exc:
        print(f"[DEBUG] Episode error: {exc}", flush=True)
        last_error = str(exc)

    finally:
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main() -> int:
    if not API_KEY:
        raise SystemExit("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")

    task_ids = tuple(
        tid.strip()
        for tid in os.getenv("TASK_IDS", ",".join(DEFAULT_TASK_IDS)).split(",")
        if tid.strip()
    )

    llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

    with SyncSQLReviewEnv(base_url=ENV_BASE_URL) as env:
        for task_id in task_ids:
            run_episode(
                env=env,
                llm_client=llm_client,
                model_name=MODEL_NAME,
                task_id=task_id,
            )

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
files/openenv.yaml
DELETED
|
@@ -1,70 +0,0 @@
name: sql-query-reviewer
description: "AI agent reviews SQL queries for correctness, performance, and security."
author: Hellinferno
version: "0.1.0"
tags:
  - openenv
  - sql
  - code-review
  - security
tasks:
  - id: easy_001
    name: Syntax Keyword Typos
    difficulty: easy
    description: "Detect misspelled SQL keywords (SELCT, FORM, WEHRE) and unnecessary SELECT *."
  - id: easy_002
    name: Missing FROM Clause
    difficulty: easy
    description: "Find missing FROM keyword before table name."
  - id: easy_003
    name: NULL Comparison Logic
    difficulty: easy
    description: "Detect = NULL instead of IS NULL."
  - id: easy_004
    name: Unclosed String Literal
    difficulty: easy
    description: "Find unterminated quote in WHERE clause."
  - id: easy_005
    name: Unknown Column Name
    difficulty: easy
    description: "Detect column name typo (statuz vs status)."
  - id: medium_001
    name: Performance Anti-Pattern Review
    difficulty: medium
    description: "Identify schema-aware performance problems like SELECT *, missing indexes, correlated subqueries."
  - id: medium_002
    name: Unbounded Query Detection
    difficulty: medium
    description: "Find queries missing LIMIT on large tables."
  - id: medium_003
    name: Redundant Operations
    difficulty: medium
    description: "Detect unnecessary DISTINCT on unique columns."
  - id: medium_004
    name: Correlated Subquery Optimization
    difficulty: medium
    description: "Find correlated subqueries that could be JOINs."
  - id: medium_005
    name: Join Performance Issues
    difficulty: medium
    description: "Identify missing index hints and inefficient joins."
  - id: hard_001
    name: SQL Injection Detection
    difficulty: hard
    description: "Find string concatenation enabling SQL injection vectors."
  - id: hard_002
    name: Privilege Escalation via UNION
    difficulty: hard
    description: "Detect UNION with system tables exposing sensitive data."
  - id: hard_003
    name: PII Data Leakage
    difficulty: hard
    description: "Find unfiltered JOINs exposing personally identifiable information."
  - id: hard_004
    name: Self-Join Optimization
    difficulty: hard
    description: "Detect self-joins replaceable with window functions for 10x improvement."
  - id: hard_005
    name: Transaction Isolation Issues
    difficulty: hard
    description: "Find missing transaction isolation causing phantom reads."
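The manifest above follows a simple shape: top-level metadata plus a task list with `id`, `name`, `difficulty`, and `description`. A structural check of that shape can be sketched in pure Python. The dict literal is a hand-copied subset of the manifest (a real loader would use a YAML parser), and the validation rules are illustrative assumptions, not part of the OpenEnv spec:

```python
# Structural check for the manifest shape shown above.
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}

manifest = {
    "name": "sql-query-reviewer",
    "version": "0.1.0",
    "tasks": [
        {"id": "easy_001", "name": "Syntax Keyword Typos", "difficulty": "easy"},
        {"id": "medium_001", "name": "Performance Anti-Pattern Review", "difficulty": "medium"},
        {"id": "hard_001", "name": "SQL Injection Detection", "difficulty": "hard"},
    ],
}


def validate_manifest(m: dict) -> list:
    """Return a list of problems; an empty list means the manifest looks well-formed."""
    problems = []
    for key in ("name", "version", "tasks"):
        if key not in m:
            problems.append(f"missing top-level key: {key}")
    seen_ids = set()
    for task in m.get("tasks", []):
        if task.get("difficulty") not in ALLOWED_DIFFICULTIES:
            problems.append(f"{task.get('id')}: bad difficulty {task.get('difficulty')!r}")
        if task.get("id") in seen_ids:
            problems.append(f"duplicate task id: {task.get('id')}")
        seen_ids.add(task.get("id"))
    return problems


print(validate_manifest(manifest))  # -> []
```

A check like this catches duplicate task ids and unknown difficulty labels before the environment ever serves a task.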
files/project-design.md
DELETED
|
@@ -1,40 +0,0 @@
# Project Design

## Design Principles

1. **Spec compliance first, creativity second.** Most teams will fail on automated validation. Perfect adherence to the OpenEnv spec is the highest-ROI activity.

2. **Reward shaping is the differentiator.** Binary end-of-episode rewards are common. Per-step, severity-weighted, partial-credit rewards are what separate top submissions.

3. **Score variance is mandatory.** The environment must produce different scores for agents of different capability. Our design inherently ensures this: different queries have different issues, so no two episodes produce identical scores.

4. **Domain authenticity wins the 30%.** Real-world utility is the highest-weighted criterion. SQL review is a task every Meta engineer knows and values. The task bank should contain queries that feel like real code-review findings, not synthetic puzzles.

## Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Domain | SQL Query Review | Universal relevance, clear grading, natural difficulty progression |
| Task count | 15 queries (5/5/5) | Well above the minimum of 3, shows depth |
| Matching | Fuzzy keyword matching | Robust to LLM phrasing variation while staying deterministic |
| Reward | Per-step partial credit | Provides a learning signal throughout the trajectory |
| Episode length | 3-8 steps | Short enough for the 20-min inference limit across all tasks |
| Grader | Severity-weighted coverage | Rewards finding critical issues more than trivial ones |

## Risk Mitigation

| Risk | Mitigation |
|---|---|
| Fuzzy matching too loose -> inflated scores | Require a 30% keyword-overlap threshold plus a category match |
| Fuzzy matching too strict -> no agent can score | Include a broad keyword list; test against actual LLM output |
| Inference timeout | 15 queries x 5-8 steps x ~3 s per LLM call = ~6 min, well under the 20-min limit |
| Docker build fails on HF | Use minimal dependencies; test the Dockerfile locally first |
| Grader returns the same score | Unlikely with varied queries, but verify during testing |

## What Judges Will See

1. **README** - Clear, compelling, explains why SQL review matters and how the env works
2. **HF Space** - Live, responds instantly to `/reset`
3. **Code** - Clean, well-structured, typed models, deterministic graders
4. **Scores** - Meaningful variance: easy ~0.8, medium ~0.5, hard ~0.3
5. **Novelty** - No existing SQL review env in the OpenEnv ecosystem
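The 30% keyword-overlap rule from the risk table can be sketched as a small deterministic matcher. The function name, fields, and keyword lists here are illustrative, not the project's actual API:

```python
# Sketch of the fuzzy matcher from the risk table: an agent's description
# matches a ground-truth issue only when the categories agree and at least
# 30% of the ground-truth keywords appear in the description.
def fuzzy_match(description: str, category: str,
                truth_keywords: list, truth_category: str,
                threshold: float = 0.3) -> bool:
    if category != truth_category:
        return False
    words = set(description.lower().split())
    hits = sum(1 for kw in truth_keywords if kw.lower() in words)
    return hits / len(truth_keywords) >= threshold


print(fuzzy_match(
    "SELCT is misspelled and should be SELECT",
    "syntax",
    truth_keywords=["selct", "misspelled", "select", "keyword", "typo"],
    truth_category="syntax",
))  # -> True (3 of 5 keywords hit, above the 30% threshold)
```

Requiring the category to match keeps the keyword threshold from being too loose, which is exactly the trade-off the risk table describes.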
files/project-readme.md
DELETED
|
@@ -1,91 +0,0 @@
# SQL Query Reviewer - OpenEnv Environment

An AI agent environment for reviewing SQL queries for correctness, performance, and security issues.

## Why This Matters

Every engineering team reviews SQL queries daily - in code reviews, migration scripts, ETL pipelines, and security audits. This environment lets you train and evaluate AI agents on a task that maps directly to real engineering workflows. Unlike toy benchmarks, the queries here reflect genuine patterns found in production codebases: misspelled keywords, N+1 anti-patterns, missing indexes, SQL injection vectors, and schema-aware optimization opportunities.

## Environment Overview

The agent receives a SQL query (plus optional schema context) and must identify issues through a multi-step review process. It earns rewards for correctly flagging problems and suggesting fixes, and is penalized for false positives or for approving buggy queries.

## Action Space

| Action Type | Description |
|---|---|
| `identify_issue` | Flag a specific issue with category and description |
| `suggest_fix` | Propose corrected SQL for a previously identified issue |
| `approve` | Mark the query as acceptable (ends the episode) |
| `request_more_context` | Ask for additional schema information |

**Fields:** `action_type`, `issue_category` (syntax/performance/security/logic/style), `issue_description`, `suggested_fix`, `confidence` (0.0-1.0)

## Observation Space

| Field | Type | Description |
|---|---|---|
| `query` | str | The SQL query under review |
| `schema_info` | dict | Table/column definitions (richer for harder tasks) |
| `context` | str | What the query is supposed to do |
| `issues_found_so_far` | list | Previously identified issues this episode |
| `remaining_actions` | int | Steps left before the episode ends |
| `difficulty` | str | easy, medium, or hard |
| `feedback` | str | Result of the last action |

## Tasks

### Task 1: Syntax Error Detection (Easy)
Queries with obvious typos, missing keywords, and wrong column names. A baseline agent should score **0.7-0.9**.

### Task 2: Performance Anti-Pattern Review (Medium)
Queries with SELECT *, missing indexes, correlated subqueries, and unbounded queries. Requires schema awareness. Expected score: **0.4-0.6**.

### Task 3: Security & Optimization Audit (Hard)
SQL injection vectors, privilege escalation, data leakage, and complex optimization. Requires multi-step reasoning. Expected score: **0.2-0.4**.

## Reward Design

- Per-step partial credit (not binary end-of-episode)
- Correct issue identification: +0.1 to +0.4 (scaled by severity)
- Valid fix suggestion: +0.1 bonus
- False positive: -0.1 penalty
- Approving a query with unfound issues: -0.15 per missed issue
- Correct approval of a clean query: +0.2
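The reward schedule above can be collapsed into a single per-step function. The severity buckets below are an assumed mapping onto the stated +0.1 to +0.4 range; the other values come straight from the list:

```python
# Numeric sketch of the reward design. SEVERITY_REWARD is an illustrative
# assumption; the per-step constants mirror the README's reward list.
SEVERITY_REWARD = {"low": 0.1, "medium": 0.25, "critical": 0.4}


def step_reward(action_type: str, *, correct: bool = False,
                severity: str = "low", missed_issues: int = 0) -> float:
    if action_type == "identify_issue":
        return SEVERITY_REWARD[severity] if correct else -0.1  # false-positive penalty
    if action_type == "suggest_fix":
        return 0.1 if correct else 0.0  # valid-fix bonus
    if action_type == "approve":
        # +0.2 for a clean approval, -0.15 per issue the agent missed
        return 0.2 if missed_issues == 0 else -0.15 * missed_issues
    return 0.0  # e.g. request_more_context carries no reward


print(step_reward("identify_issue", correct=True, severity="critical"))  # -> 0.4
print(step_reward("approve", missed_issues=2))  # -> -0.3
```

Because approval is scored against the ground-truth issue list, an agent cannot farm reward by approving early.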
## Setup

```bash
# Install
pip install openenv-core
pip install git+https://huggingface.co/spaces/ravi/sql-query-reviewer
```

```python
# Use
from sql_query_reviewer import SQLReviewEnv, SQLReviewAction

with SQLReviewEnv(base_url="https://ravi-sql-query-reviewer.hf.space").sync() as env:
    result = env.reset()
    result = env.step(SQLReviewAction(
        action_type="identify_issue",
        issue_category="syntax",
        issue_description="SELCT should be SELECT",
    ))
    print(result.observation.feedback)
```

## Docker

```bash
docker build -t sql-query-reviewer ./server
docker run -p 8000:8000 sql-query-reviewer
```

## Baseline Scores

| Task | Difficulty | Baseline Score |
|---|---|---|
| Syntax Error Detection | Easy | ~0.82 |
| Performance Anti-Pattern Review | Medium | ~0.51 |
| Security & Optimization Audit | Hard | ~0.29 |

## Author

**Ravi** - Solo participant, Meta PyTorch OpenEnv Hackathon 2026