hellinferno committed
Commit c98afe9 · 1 Parent(s): 35a203e

chore: remove planning/reference files directory to reduce Docker image size
files/00-winning-plan.md DELETED
@@ -1,200 +0,0 @@
- # OpenEnv Hackathon — Winning Plan

  **Participant:** Ravi (Solo)
  **Deadline:** April 12, 2026, 11:59 PM IST
  **Goal:** Top 3,000 out of 20,000 teams → Finale April 25–26

  ---

  ## Chosen Domain: **SQL Query Optimizer Review**

  An environment where an AI agent reviews SQL queries for correctness, performance, and security issues, then suggests fixes. This scores high on real-world utility (30% weight), is novel in OpenEnv, has natural difficulty progression, and produces clear, measurable rewards.

  **Why this wins:**
  - Every engineering team at Meta deals with SQL/data pipelines daily — maximum relevance
  - Clear grading: each query has known issues, and the agent either finds them or doesn't → partial credit is natural
  - Difficulty scales cleanly: syntax errors (easy) → performance anti-patterns (medium) → subtle injection vulnerabilities + schema-aware optimization (hard)
  - Novel domain not seen in existing OpenEnv environments (creativity 10%)
  - Deterministic grading with score variance (agents that find more issues score higher)

  ---

  ## Timeline

  | When | What |
  |---|---|
  | **Apr 10, Morning** | Complete prep modules 1-4 on Colab, watch bootcamp recording |
  | **Apr 10, Afternoon** | Install prerequisites, study sample inference script, study echo env code |
  | **Apr 10, Evening** | Scaffold project with `openenv init`, define Pydantic models, implement core env logic |
  | **Apr 11, Morning** | Implement 3 tasks (easy/medium/hard) with graders and reward functions |
  | **Apr 11, Afternoon** | Write `inference.py`, test locally, iterate on reward shaping |
  | **Apr 11, Evening** | Dockerize, deploy to HF Spaces, run pre-validation script |
  | **Apr 12, Morning** | Write README, final testing, fix issues |
  | **Apr 12, Afternoon** | Final pre-validation, submit |
  | **Apr 12, Before 11:59 PM** | Verify HF Space is live and responding |

  ---

  ## Phase 0: Preparation (Today — First 3 Hours)

  ### Step 1: Complete Prep Course Modules
  - Module 1: Interface basics (`reset()`, `step()`, `state()`)
  - Module 2: Using existing environments, typed models
  - Module 3: Deployment to HF Spaces with `openenv push`
  - Module 4: **Building your own environment** — most critical, take detailed notes

  ### Step 2: Watch Bootcamp Recording
  - Note tips from Ben Burtenshaw (HF) and Pulkit Aneja about what judges look for

  ### Step 3: Install Prerequisites
  ```bash
  pip install openenv-core huggingface_hub openai pydantic
  pip install docker  # or ensure Docker Desktop is running
  huggingface-cli login
  ```

  ### Step 4: Study the Sample Inference Script
  - Memorize the `[START]`, `[STEP]`, `[END]` stdout format
  - Any deviation in field names/ordering = incorrect evaluation scoring

  ### Step 5: Study Existing Environments
  - Clone `https://github.com/meta-pytorch/OpenEnv`
  - Study the `envs/echo_env/` structure: models.py, client.py, server/environment.py, server/app.py, server/Dockerfile

  ---

  ## Phase 1: Build the Environment

  ### Project Structure
  ```
  sql-query-reviewer/
  ├── openenv.yaml
  ├── models.py              # Action, Observation, State Pydantic models
  ├── client.py              # EnvClient subclass
  ├── inference.py           # Baseline inference script (root!)
  ├── README.md
  ├── tasks/
  │   ├── easy_tasks.json    # Syntax error queries
  │   ├── medium_tasks.json  # Performance anti-pattern queries
  │   └── hard_tasks.json    # Security + schema-aware optimization queries
  └── server/
      ├── environment.py     # Core environment logic
      ├── grader.py          # Deterministic grading functions
      ├── app.py             # FastAPI server
      ├── Dockerfile
      └── requirements.txt
  ```

  ### Pydantic Models Design

  **Observation:**
  - `query`: The SQL query to review
  - `schema_info`: Table/column definitions (for medium/hard tasks)
  - `context`: What the query is supposed to do
  - `issues_found_so_far`: List of issues already identified
  - `remaining_actions`: How many review steps remain
  - `difficulty`: easy | medium | hard

  **Action:**
  - `action_type`: "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
  - `issue_category`: "syntax" | "performance" | "security" | "logic" | "style"
  - `issue_description`: Free-text description of the issue
  - `suggested_fix`: The corrected SQL (optional)
  - `confidence`: Float 0.0-1.0

  **Reward:** Float 0.0-1.0 with partial credit

  ### Three Tasks with Progressive Difficulty

  **Task 1 — Easy: Syntax & Basic Logic Errors**
  - Queries with missing keywords, wrong joins, typos in column names
  - Agent identifies each error → 0.2 reward per correct identification
  - Suggesting a valid fix → bonus 0.1 per fix
  - Expected baseline score: 0.7-0.9

  **Task 2 — Medium: Performance Anti-Patterns**
  - SELECT *, missing indexes, N+1 patterns, unnecessary subqueries, missing WHERE clauses on large tables
  - Requires understanding schema context
  - Agent identifies the anti-pattern + suggests an optimization → partial credit
  - Expected baseline score: 0.4-0.6

  **Task 3 — Hard: Security Vulnerabilities + Schema-Aware Optimization**
  - SQL injection vectors, privilege escalation, data leakage, plus complex optimization (query-plan awareness)
  - Requires multi-step reasoning about schema relationships
  - Expected baseline score: 0.2-0.4

  ### Reward Function Design
  - Per-step rewards (not just end-of-episode)
  - Correct issue identification: +0.2 (scaled by issue severity)
  - Valid fix suggestion: +0.1
  - False positive (flagging a non-issue): -0.1
  - Missing a critical issue at episode end: -0.15
  - Approving a query with unfound issues: -0.2
  - Smooth, informative signal throughout the trajectory

  ### Grader Design
  - Each task has a ground-truth list of issues with categories and severities
  - The grader compares the agent's identified issues against ground truth using fuzzy matching on descriptions
  - Score = (correctly_identified × severity_weight) / total_possible_score
  - Deterministic: the same agent output → the same score every time
  - Returns a float in [0.0, 1.0]
  - Never returns the same score for all inputs (the variety of queries ensures variance)

  ---

  ## Phase 2: Inference Script

  Key requirements:
  - Named `inference.py` in the root directory
  - Uses the OpenAI Client for all LLM calls
  - Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
  - Emits `[START]`, `[STEP]`, `[END]` logs exactly per spec
  - Completes in <20 minutes on 2 vCPU, 8 GB RAM
  - Reproducible scores

  ---

  ## Phase 3: Containerize & Deploy

  ```bash
  # Build and test locally
  docker build -t sql-query-reviewer ./server
  docker run -p 8000:8000 sql-query-reviewer

  # Verify endpoints
  curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'

  # Deploy to HF Spaces
  openenv push --repo-id ravi/sql-query-reviewer

  # Verify deployed version
  curl -X POST https://ravi-sql-query-reviewer.hf.space/reset
  ```

  ---

  ## Phase 4: Pre-Submission QA

  Run the pre-validation script:
  ```bash
  ./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
  ```

  Checklist:
  - [ ] HF Space deploys and responds to `/reset` with 200
  - [ ] `openenv validate` passes
  - [ ] Dockerfile builds cleanly
  - [ ] Inference script runs without errors, produces scores
  - [ ] 3+ tasks, each grader returns scores in the 0.0-1.0 range
  - [ ] Scores are reproducible across runs
  - [ ] README is compelling and complete

  ---

  ## Winning Differentiators

  1. **Real-world utility (30%)**: SQL review is something every data team needs — immediate value for the RL/agent community
  2. **Score variance**: Different agent capabilities produce meaningfully different scores — a basic agent catches syntax errors but misses security issues
  3. **Reward shaping**: Per-step partial-credit signals, not a binary end-of-episode score
  4. **Novelty**: No SQL review environment exists in OpenEnv yet
  5. **Spec compliance**: Bulletproof adherence to every technical requirement — this alone eliminates most competitors
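
The severity-weighted grader formula described in "Grader Design" can be sketched as follows. This is a minimal illustration, not the real `server/grader.py`: it assumes issue dicts with `description` and `severity` keys, and matches descriptions exactly rather than fuzzily.

```python
def severity_weighted_score(found, ground_truth):
    """Severity-weighted coverage: matched severity over total possible severity.

    `found` and `ground_truth` are lists of issue dicts with "description"
    and "severity" keys; an issue counts as found when its description was
    identified (the real grader would match fuzzily, this sketch is exact).
    """
    total = sum(issue["severity"] for issue in ground_truth)
    if total == 0:
        return 1.0  # nothing to find: a clean query scores perfectly
    found_descriptions = {issue["description"] for issue in found}
    matched = sum(
        issue["severity"]
        for issue in ground_truth
        if issue["description"] in found_descriptions
    )
    return matched / total
```

Because different queries carry different issue sets and severities, the same function naturally yields varied scores across the task bank, which is what the "score variance" requirement asks for.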
files/01-problem-statement.md DELETED
@@ -1,32 +0,0 @@
- # 01 — Problem Statement & Domain Selection

  ## Domain: SQL Query Review Environment

  ### The Real-World Problem
  Every software team reviews SQL queries — in code reviews, database migrations, ETL pipeline audits, and security assessments. This is a genuine, high-frequency task that requires:
  - Pattern recognition (anti-patterns, vulnerabilities)
  - Domain knowledge (schema relationships, indexing strategies)
  - Multi-step reasoning (understanding query intent before evaluating correctness)

  ### Why This Domain Wins

  | Evaluation Criteria | Weight | How We Score |
  |---|---|---|
  | Real-world utility | 30% | SQL review is universal — Meta runs millions of queries daily. Fills a real gap in agent evaluation. |
  | Task & grader quality | 25% | Clear ground truth per query, deterministic grading, natural difficulty progression |
  | Environment design | 20% | Clean state (per-query episode), rich observations, well-typed actions, per-step rewards |
  | Code quality & spec compliance | 15% | Full OpenEnv spec, clean project structure, Docker, typed models |
  | Creativity & novelty | 10% | No SQL review env exists in OpenEnv. Reward design uses severity-weighted partial credit. |

  ### What the Agent Does
  1. Receives a SQL query + optional schema context
  2. Reviews it step by step, identifying issues (syntax, performance, security, logic)
  3. Suggests fixes for each identified issue
  4. Decides when to approve or flag the query
  5. Gets rewarded for correctly identified issues and penalized for false positives

  ### Scope Boundaries
  - **In scope**: SELECT, INSERT, UPDATE, DELETE queries; joins; subqueries; CTEs; window functions
  - **Out of scope**: Stored procedures, database-specific dialect features, real database execution
  - **Episode length**: 3-8 steps depending on query complexity
  - **No external dependencies**: All query analysis is rule-based and deterministic
files/02-requirements.md DELETED
@@ -1,58 +0,0 @@
- # 02 — Requirements Specification

  ## Functional Requirements

  ### FR-1: Real-World Task Simulation
  - Simulates SQL query review — a task humans do daily in engineering teams
  - No games, no toys — a purely professional/practical domain

  ### FR-2: OpenEnv Spec Compliance
  - Typed Pydantic models for Observation, Action, State
  - `step(action)` → returns observation, reward, done, info
  - `reset()` → returns the initial observation
  - `state()` → returns the current internal state
  - Valid `openenv.yaml` with metadata
  - Passes `openenv validate`

  ### FR-3: Minimum 3 Tasks with Agent Graders
  - **Task 1 (Easy):** Syntax & basic logic errors — expected agent score 0.7-0.9
  - **Task 2 (Medium):** Performance anti-patterns — expected agent score 0.4-0.6
  - **Task 3 (Hard):** Security vulnerabilities + schema-aware optimization — expected agent score 0.2-0.4
  - Each grader: deterministic, returns a float in [0.0, 1.0], reproducible

  ### FR-4: Meaningful Reward Function
  - Per-step rewards (not just an end-of-episode binary)
  - Partial credit for partial issue identification
  - Penalties for false positives and missed critical issues
  - A smooth signal that guides learning

  ### FR-5: Baseline Inference Script
  - Named `inference.py` in the project root
  - Uses the OpenAI Client for LLM calls
  - Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
  - Emits `[START]`, `[STEP]`, `[END]` structured stdout logs
  - Produces reproducible baseline scores on all 3 tasks

  ## Non-Functional Requirements

  ### NFR-1: Deploys to a Hugging Face Space
  - Containerized HF Space tagged with `openenv`
  - Responds to a `/reset` POST with a 200

  ### NFR-2: Containerized Execution
  - Working Dockerfile
  - Builds with `docker build`, runs with `docker run`
  - Starts cleanly, responds to HTTP requests

  ### NFR-3: Infrastructure Constraints
  - Inference script runtime < 20 minutes
  - Runs on a 2 vCPU, 8 GB RAM machine

  ### NFR-4: Documentation
  - README with: environment description, motivation, action/observation space definitions, task descriptions with difficulty, setup instructions, baseline scores

  ## Disqualification Criteria (Must Avoid)
  - ❌ Environment does not deploy or respond
  - ❌ Plagiarized or trivially modified existing environments
  - ❌ Graders that always return the same score
  - ❌ No baseline inference script
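
FR-5's structured stdout contract can be wrapped in a tiny helper so every log line is formatted one way. The exact field names and ordering must come from the official sample inference script; the fields below are placeholders for illustration only.

```python
import json


def format_event(tag: str, **fields) -> str:
    """Render one structured log line such as "[STEP] {...}".

    `tag` is one of START/STEP/END; keys are serialized sorted so the
    output is stable across runs (the field names here are placeholders,
    not the real spec).
    """
    return f"[{tag}] {json.dumps(fields, sort_keys=True)}"


def log_event(tag: str, **fields) -> None:
    # Print to stdout (what the evaluator scrapes), flushing immediately
    # so lines are not lost if the run is cut off.
    print(format_event(tag, **fields), flush=True)
```

Centralizing the formatting makes "any deviation in field names/ordering" a single-point fix rather than a hunt through the script.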
files/03-information-architecture.md DELETED
@@ -1,66 +0,0 @@
- # 03 — Information Architecture

  ## Data Flow

  ```
  [Task JSON] → reset() → [Observation: query + schema + context]
                          ↓
                Agent decides action
                          ↓
          step(Action) → [Observation + Reward + Done]
                          ↓
           (repeat until done or max_steps)
                          ↓
          close() → Grader computes final score
  ```

  ## Task Data Structure

  Each task is a JSON object:
  ```json
  {
    "task_id": "easy_001",
    "difficulty": "easy",
    "query": "SELCT * FORM users WEHRE id = 1",
    "schema": {
      "users": {"id": "INT PRIMARY KEY", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}
    },
    "context": "Fetch user by ID for profile page",
    "ground_truth_issues": [
      {"category": "syntax", "description": "SELCT should be SELECT", "severity": 0.3, "fix": "SELECT"},
      {"category": "syntax", "description": "FORM should be FROM", "severity": 0.3, "fix": "FROM"},
      {"category": "syntax", "description": "WEHRE should be WHERE", "severity": 0.3, "fix": "WHERE"},
      {"category": "performance", "description": "SELECT * fetches unnecessary columns", "severity": 0.1, "fix": "SELECT id, name, email"}
    ],
    "max_steps": 5
  }
  ```

  ## State Management

  | Field | Type | Description |
  |---|---|---|
  | `task_id` | str | Current task identifier |
  | `query` | str | The SQL query under review |
  | `issues_identified` | list | Issues the agent has found so far |
  | `fixes_suggested` | list | Fixes the agent has proposed |
  | `step_count` | int | Current step number |
  | `total_reward` | float | Accumulated reward |
  | `done` | bool | Whether the episode is complete |
  | `approved` | bool | Whether the agent approved the query |

  ## Observation Space
  - `query`: The full SQL query text
  - `schema_info`: Dict of table → column definitions (empty for easy tasks)
  - `context`: Natural-language description of query intent
  - `issues_found_so_far`: List of previously identified issues in this episode
  - `remaining_actions`: Max steps minus current step
  - `difficulty`: "easy" | "medium" | "hard"
  - `feedback`: Result of the last action ("correct identification", "false positive", "already identified", etc.)

  ## Action Space
  - `action_type`: enum — "identify_issue" | "suggest_fix" | "approve" | "request_more_context"
  - `issue_category`: enum — "syntax" | "performance" | "security" | "logic" | "style"
  - `issue_description`: str — what the agent thinks is wrong
  - `suggested_fix`: str (optional) — corrected SQL fragment
  - `confidence`: float 0.0-1.0
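
Since all tasks live in JSON files, a small load-time validator can catch malformed entries before they reach the grader. This is a sketch with field names taken from the task example above; the helper name is hypothetical.

```python
REQUIRED_TASK_FIELDS = {"task_id", "difficulty", "query", "ground_truth_issues", "max_steps"}


def validate_task(task: dict) -> list:
    """Return a list of problems with a task dict; an empty list means valid."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_TASK_FIELDS - set(task))]
    for issue in task.get("ground_truth_issues", []):
        severity = issue.get("severity")
        # Severities must be numeric and stay in [0.0, 1.0] so grader
        # scores stay bounded.
        if not isinstance(severity, (int, float)) or not 0.0 <= severity <= 1.0:
            problems.append(f"bad severity: {issue.get('description')}")
    return problems
```

Running this over all three task files at server startup keeps a bad JSON edit from surfacing as a confusing grading bug later.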
files/04-system-architecture.md DELETED
@@ -1,54 +0,0 @@
- # 04 — System Architecture

  ## Components

  ```
  ┌────────────────────────────────────────────────┐
  │                    HF Space                    │
  │  ┌──────────────────────────────────────────┐  │
  │  │  FastAPI Server                          │  │
  │  │  (app.py — Uvicorn)                      │  │
  │  │                                          │  │
  │  │  POST /reset → environment.reset()       │  │
  │  │  POST /step  → environment.step()        │  │
  │  │  GET  /state → environment.state()       │  │
  │  └──────────┬───────────────────────────────┘  │
  │             │                                  │
  │  ┌──────────▼───────────────────────────────┐  │
  │  │  SQLReviewEnvironment                    │  │
  │  │  - task_bank (easy/medium/hard JSON)     │  │
  │  │  - grader (deterministic scoring)        │  │
  │  │  - reward_fn (per-step signals)          │  │
  │  └──────────────────────────────────────────┘  │
  │                                                │
  │  Dockerfile (Python 3.10-slim + deps)          │
  └────────────────────────────────────────────────┘

  ┌────────────────────────────────────────────────┐
  │  inference.py (Client)                         │
  │  - OpenAI Client → LLM API                     │
  │  - SQLReviewEnvClient → HF Space               │
  │  - Structured stdout logging                   │
  └────────────────────────────────────────────────┘
  ```

  ## Technology Stack
  - **Runtime:** Python 3.10+
  - **Framework:** FastAPI + Uvicorn
  - **Models:** Pydantic v2
  - **Container:** Docker (python:3.10-slim base)
  - **Deployment:** Hugging Face Spaces (Docker SDK)
  - **LLM Client:** OpenAI Python SDK
  - **Environment SDK:** openenv-core

  ## Communication Protocol
  - WebSocket at `/ws` for persistent sessions (OpenEnv standard)
  - HTTP POST endpoints as fallback: `/reset`, `/step`
  - HTTP GET: `/state`
  - JSON request/response bodies matching the typed Pydantic models

  ## Episode Lifecycle
  1. Client calls `reset(task_id="easy_001")` → server loads the task, returns the initial observation
  2. Client calls `step(action)` → server validates the action, computes the reward, returns an observation
  3. Repeat until `done=True` (all issues found, agent approves, or max_steps reached)
  4. Client calls `close()` → server runs the grader, returns the final score
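
The four-step lifecycle above maps to a short client-side loop. In this sketch, `client` and `agent` are stand-ins assumed to follow the reset/step shapes described in this document, not the real SQLReviewEnvClient API.

```python
def run_episode(client, agent, task_id, max_steps=8):
    """Drive one episode: reset, then act until done or the step budget ends.

    Returns the accumulated per-step reward; the final graded score would
    come from the server-side grader at close().
    """
    observation = client.reset(task_id)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent(observation)
        observation, reward, done = client.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Keeping the loop this small makes it easy to reuse unchanged across the easy, medium, and hard task files.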
files/05-database-schema.md DELETED
@@ -1,52 +0,0 @@
- # 05 — Task Bank Schema

  ## Overview
  Tasks are stored as JSON files, not a database. Each difficulty level has its own file with 3-5 queries.

  ## Easy Tasks (`tasks/easy_tasks.json`)

  Queries with obvious syntax errors, wrong keywords, and basic logic mistakes. An LLM should score 0.7-0.9.

  Example queries:
  1. Misspelled keywords (SELCT, FORM, WEHRE)
  2. Missing FROM clause
  3. Wrong column names that don't exist in the schema
  4. Missing semicolons / unclosed quotes
  5. Using = NULL instead of IS NULL

  ## Medium Tasks (`tasks/medium_tasks.json`)

  Queries with performance anti-patterns. Requires understanding schema context. Target score: 0.4-0.6.

  Example queries:
  1. SELECT * on a 50-column table when only 2 columns are needed
  2. Missing index hint on a JOIN with a large table
  3. Correlated subquery that could be a JOIN
  4. Missing LIMIT on an unbounded query
  5. Redundant DISTINCT on a column with a UNIQUE constraint

  ## Hard Tasks (`tasks/hard_tasks.json`)

  Security vulnerabilities + complex optimization. Target score: 0.2-0.4.

  Example queries:
  1. String concatenation enabling SQL injection
  2. Privilege escalation via UNION with system tables
  3. Data leakage through an unfiltered JOIN exposing PII
  4. Query that could use window functions instead of a self-join (10x perf gain)
  5. Missing transaction isolation causing phantom reads

  ## Ground Truth Format

  Each issue in the ground truth:
  ```json
  {
    "category": "security",
    "description": "String concatenation in WHERE clause enables SQL injection",
    "severity": 1.0,
    "fix": "Use parameterized query with ? placeholder",
    "keywords": ["injection", "concatenation", "user input", "unsanitized"]
  }
  ```

  The `keywords` field is used by the grader for fuzzy matching against agent responses.
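
The `keywords` field lends itself to a simple overlap heuristic. The sketch below is one plausible reading of that matching; the real grader may weight, stem, or threshold tokens differently.

```python
def keyword_overlap(agent_description: str, keywords: list) -> float:
    """Fraction of ground-truth keywords found in the agent's description.

    Multi-word keywords (e.g. "user input") are checked as substrings;
    single words are checked against the token set.
    """
    text = agent_description.lower()
    words = set(text.split())
    hits = sum(1 for kw in keywords if (kw in words) or (" " in kw and kw in text))
    return hits / max(len(keywords), 1)
```

A fractional score like this gives the grader partial credit for descriptions that hit some, but not all, of an issue's vocabulary.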
files/06-api-contracts.md DELETED
@@ -1,96 +0,0 @@
- # 06 — API Contracts

  ## OpenEnv Standard Endpoints

  ### POST /reset
  **Request:**
  ```json
  {"task_id": "easy_001"}
  ```
  **Response (StepResult):**
  ```json
  {
    "observation": {
      "query": "SELCT * FORM users WEHRE id = 1",
      "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
      "context": "Fetch user by ID for profile page",
      "issues_found_so_far": [],
      "remaining_actions": 5,
      "difficulty": "easy",
      "feedback": "Review this SQL query and identify any issues."
    },
    "reward": 0.0,
    "done": false,
    "info": {}
  }
  ```

  ### POST /step
  **Request (Action):**
  ```json
  {
    "action_type": "identify_issue",
    "issue_category": "syntax",
    "issue_description": "SELCT is misspelled, should be SELECT",
    "suggested_fix": "SELECT",
    "confidence": 0.95
  }
  ```
  **Response (StepResult):**
  ```json
  {
    "observation": {
      "query": "SELCT * FORM users WEHRE id = 1",
      "schema_info": {"users": {"id": "INT PK", "name": "VARCHAR(255)", "email": "VARCHAR(255)"}},
      "context": "Fetch user by ID for profile page",
      "issues_found_so_far": [{"category": "syntax", "description": "SELCT should be SELECT"}],
      "remaining_actions": 4,
      "difficulty": "easy",
      "feedback": "Correct! SELCT is indeed a syntax error. 3 issues remaining."
    },
    "reward": 0.25,
    "done": false,
    "info": {"match_type": "exact", "severity": 0.3}
  }
  ```

  ### GET /state
  **Response (State):**
  ```json
  {
    "task_id": "easy_001",
    "step_count": 1,
    "issues_identified": [{"category": "syntax", "description": "SELCT should be SELECT"}],
    "total_reward": 0.25,
    "done": false,
    "approved": false
  }
  ```

  ## Pydantic Models

  ```python
  from typing import Dict, List, Literal, Optional

  # The Action, Observation, and State base classes are assumed to come
  # from the openenv-core SDK.

  class SQLReviewAction(Action):
      action_type: Literal["identify_issue", "suggest_fix", "approve", "request_more_context"]
      issue_category: Optional[Literal["syntax", "performance", "security", "logic", "style"]] = None
      issue_description: Optional[str] = None
      suggested_fix: Optional[str] = None
      confidence: float = 0.5

  class SQLReviewObservation(Observation):
      query: str
      schema_info: Dict[str, Dict[str, str]]
      context: str
      issues_found_so_far: List[Dict[str, str]]
      remaining_actions: int
      difficulty: str
      feedback: str

  class SQLReviewState(State):
      task_id: str
      step_count: int
      issues_identified: List[Dict[str, str]]
      total_reward: float
      done: bool
      approved: bool
  ```
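
The /reset contract above can be mirrored server-side with a small response builder. This sketch uses a dataclass instead of the real Pydantic base classes to stay dependency-free; the field names follow the example payloads, and `build_reset_result` is a hypothetical helper name.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class StepResult:
    """Response envelope shared by /reset and /step."""
    observation: dict
    reward: float = 0.0
    done: bool = False
    info: dict = field(default_factory=dict)


def build_reset_result(task: dict) -> dict:
    # Initial observation: no issues found yet, full step budget remaining.
    observation = {
        "query": task["query"],
        "schema_info": task.get("schema", {}),
        "context": task.get("context", ""),
        "issues_found_so_far": [],
        "remaining_actions": task["max_steps"],
        "difficulty": task["difficulty"],
        "feedback": "Review this SQL query and identify any issues.",
    }
    return asdict(StepResult(observation=observation))
```

Building every response through one envelope type keeps /reset and /step structurally identical, which is what the StepResult contract requires.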
files/07-monorepo-structure.md DELETED
@@ -1,65 +0,0 @@
- # 07 — Monorepo Structure

  ```
  sql-query-reviewer/
  │
  ├── openenv.yaml           # Environment metadata manifest
  ├── models.py              # Pydantic: SQLReviewAction, SQLReviewObservation, SQLReviewState
  ├── client.py              # EnvClient subclass for external consumers
  ├── inference.py           # MANDATORY: Baseline inference script (root directory!)
  ├── README.md              # Environment documentation
  ├── pyproject.toml         # Package config
  │
  ├── tasks/
  │   ├── easy_tasks.json    # 5 syntax/logic error queries
  │   ├── medium_tasks.json  # 5 performance anti-pattern queries
  │   └── hard_tasks.json    # 5 security + optimization queries
  │
  └── server/
      ├── __init__.py
      ├── environment.py     # SQLReviewEnvironment(Environment) — core logic
      ├── grader.py          # Deterministic grading: fuzzy match agent output vs ground truth
      ├── reward.py          # Per-step reward computation
      ├── app.py             # FastAPI server (create_app with routes)
      ├── Dockerfile         # Python 3.10-slim, install deps, expose port
      └── requirements.txt   # openenv-core, fastapi, uvicorn, pydantic
  ```

  ## Key Files Explained

  | File | Purpose | Critical? |
  |---|---|---|
  | `openenv.yaml` | Metadata: name, description, author, tasks list | Yes — validated by `openenv validate` |
  | `models.py` | Typed Action/Observation/State contracts | Yes — spec compliance |
  | `inference.py` | Baseline agent using OpenAI Client | Yes — DQ if missing |
  | `server/environment.py` | `reset()`, `step()`, `state()` implementation | Yes — core logic |
  | `server/grader.py` | Score computation per task | Yes — must return 0.0-1.0 |
  | `server/Dockerfile` | Container definition | Yes — must build cleanly |
  | `README.md` | Human-readable documentation | Yes — judges read this first |

  ## openenv.yaml

  ```yaml
  name: sql-query-reviewer
  description: "AI agent reviews SQL queries for correctness, performance, and security"
  author: ravi
  version: "1.0.0"
  tags:
    - openenv
    - sql
    - code-review
    - security
  tasks:
    - id: easy_syntax
      name: "Syntax Error Detection"
      difficulty: easy
      description: "Find and fix obvious SQL syntax errors"
    - id: medium_performance
      name: "Performance Anti-Pattern Review"
      difficulty: medium
      description: "Identify performance issues requiring schema awareness"
    - id: hard_security
      name: "Security & Optimization Audit"
      difficulty: hard
      description: "Find SQL injection vectors and complex optimization opportunities"
  ```
files/08-computation-engine-spec.md DELETED
@@ -1,86 +0,0 @@
- # 08 — Reward & Grading Engine Spec

  ## Per-Step Reward Function

  ```python
  def compute_reward(action, ground_truth_issues, already_found):
      if action.action_type == "identify_issue":
          # Pass the whole action so the matcher can use both the
          # description and the category.
          match = fuzzy_match(action, ground_truth_issues, already_found)
          if match:
              base = match["severity"]  # 0.1 - 1.0
              fix_bonus = 0.1 if action.suggested_fix and is_valid_fix(action.suggested_fix, match) else 0.0
              confidence_bonus = 0.05 * action.confidence
              return min(base + fix_bonus + confidence_bonus, 0.4)  # cap per-step
          else:
              return -0.1  # false positive penalty

      elif action.action_type == "approve":
          unfound = len(ground_truth_issues) - len(already_found)
          if unfound == 0:
              return 0.2  # correct approval
          else:
              return -0.15 * unfound  # penalty per missed issue

      elif action.action_type == "suggest_fix":
          if not already_found:
              return -0.05  # fixing without identifying first
          last_issue = already_found[-1]
          if is_valid_fix(action.suggested_fix, last_issue):
              return 0.1
          return 0.0

      elif action.action_type == "request_more_context":
          return 0.0  # neutral — no reward, no penalty

      return 0.0
  ```

  ## Fuzzy Matching Algorithm

  ```python
  def fuzzy_match(action, ground_truth_issues, already_found):
      """Match the agent's identified issue to a ground truth issue."""
      best_match = None
      best_score = 0.0

      agent_words = set(action.issue_description.lower().split())
      for issue in ground_truth_issues:
          if issue in already_found:
              continue
          # Keyword overlap score
          truth_words = set(issue["keywords"])
          overlap = len(agent_words & truth_words) / max(len(truth_words), 1)
          # Category match bonus
          category_bonus = 0.3 if action.issue_category == issue["category"] else 0.0
          score = overlap + category_bonus
          if score > best_score and score > 0.3:  # threshold
              best_score = score
              best_match = issue

      return best_match
  ```

  ## End-of-Episode Grader

  ```python
  def grade_episode(issues_found, ground_truth_issues, total_steps, max_steps):
      """Deterministic grader returning a float in [0.0, 1.0]."""
      if not ground_truth_issues:
          return 1.0 if not issues_found else 0.5

      # Issues that actually correspond to ground truth; the rest are
      # false positives.
      matched = [issue for issue in issues_found if issue in ground_truth_issues]
      total_severity = sum(i["severity"] for i in ground_truth_issues)
      found_severity = sum(i["severity"] for i in matched)

      coverage_score = found_severity / total_severity  # 0.0 - 1.0
      efficiency_bonus = max(0, 0.1 * (1 - total_steps / max_steps))  # reward fewer steps
      false_positive_penalty = 0.05 * (len(issues_found) - len(matched))

      score = coverage_score + efficiency_bonus - false_positive_penalty
      return max(0.0, min(1.0, score))
  ```

  ## Score Variance Guarantee
  - Easy tasks: 5 different queries with 2-5 issues each → scores range from 0.4 to 1.0
  - Medium tasks: different anti-patterns → scores range from 0.2 to 0.8
  - Hard tasks: varied security issues → scores range from 0.0 to 0.6
  - A grader that always returns the same score = instant DQ. Our design inherently prevents this because different queries have different ground truth issues.
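
To sanity-check the grader's bounds, the arithmetic (severity coverage plus an efficiency bonus, minus a false-positive penalty, clamped to [0, 1]) can be reworked into a dependency-free helper operating on raw severity lists. The function name here is hypothetical; it exists only to make the numbers checkable.

```python
def episode_score(found_severities, truth_severities, total_steps, max_steps,
                  false_positives=0):
    """Clamp(coverage + efficiency bonus - false-positive penalty) to [0, 1]."""
    total = sum(truth_severities)
    if total == 0:
        # No issues to find: an empty report is perfect, anything else is half credit.
        return 1.0 if not found_severities else 0.5
    coverage = sum(found_severities) / total
    efficiency_bonus = max(0.0, 0.1 * (1 - total_steps / max_steps))
    penalty = 0.05 * false_positives
    return max(0.0, min(1.0, coverage + efficiency_bonus - penalty))
```

Worked example: finding two 0.3-severity issues out of severities [0.3, 0.3, 0.3, 0.1] in 4 of 5 steps with one false positive gives 0.6 coverage + 0.02 bonus - 0.05 penalty = 0.57, comfortably inside the [0, 1] guarantee.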
files/09-engineering-scope-definition.md DELETED
@@ -1,39 +0,0 @@
- # 09 β€” Engineering Scope Definition
-
- ## In Scope (Must Build)
- 1. **Environment server** β€” `environment.py` with `reset()`, `step()`, `state()`
- 2. **Pydantic models** β€” `models.py` with typed Action, Observation, State
- 3. **Client** β€” `client.py` with EnvClient subclass
- 4. **Task bank** β€” 15 SQL queries (5 easy, 5 medium, 5 hard) with ground truth
- 5. **Grader** β€” Deterministic scoring function per task
- 6. **Reward function** β€” Per-step partial credit with penalties
- 7. **Inference script** β€” `inference.py` using OpenAI Client
- 8. **Dockerfile** β€” Working container that builds and runs
- 9. **HF Space deployment** β€” Live, tagged with `openenv`
- 10. **README** β€” Complete documentation
- 11. **openenv.yaml** β€” Valid metadata manifest
-
- ## Out of Scope (Don't Build)
- - Real database execution (all analysis is pattern-matching based)
- - Custom LLM fine-tuning
- - Web UI beyond OpenEnv's built-in web interface
- - Multiple SQL dialects (stick to standard SQL)
- - Integration tests against real databases
-
- ## Effort Estimates
-
- | Component | Hours | Priority |
- |---|---|---|
- | Prep course + bootcamp | 3.0 | P0 |
- | Task bank creation (15 queries + ground truth) | 2.5 | P0 |
- | Pydantic models | 0.5 | P0 |
- | Environment logic (reset/step/state) | 3.0 | P0 |
- | Grader + reward function | 2.0 | P0 |
- | Inference script | 1.5 | P0 |
- | Dockerfile + local testing | 1.0 | P0 |
- | HF Space deployment | 0.5 | P0 |
- | README | 1.0 | P0 |
- | Pre-validation + bug fixes | 2.0 | P0 |
- | **Total** | **~17 hours** | |
-
- Fits within the 2-day window with buffer for debugging.
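The typed Action from item 2 can be sketched without any third-party dependencies. The real `models.py` uses Pydantic; this dataclass only approximates the shape, with field names taken from the environment's documented action space:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SQLReviewAction:
    # One of: identify_issue, suggest_fix, approve, request_more_context
    action_type: str
    # One of: syntax, performance, security, logic, style (when relevant)
    issue_category: Optional[str] = None
    issue_description: Optional[str] = None
    suggested_fix: Optional[str] = None
    confidence: float = 0.5

action = SQLReviewAction(action_type="approve")
assert action.confidence == 0.5
```

Pydantic adds validation (e.g. rejecting an unknown `action_type`) on top of this shape, which is why it is the in-scope choice.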
files/10-development-phases.md DELETED
@@ -1,48 +0,0 @@
- # 10 β€” Development Phases
-
- ## Phase 1: Learn (Apr 10, 9 AM – 12 PM)
- - [ ] Complete Module 1: Interface basics
- - [ ] Complete Module 2: Using existing environments
- - [ ] Complete Module 3: Deployment to HF Spaces
- - [ ] Complete Module 4: Building your own environment
- - [ ] Watch bootcamp recording, note judge preferences
- - [ ] Study sample inference script format
-
- ## Phase 2: Scaffold (Apr 10, 12 PM – 2 PM)
- - [ ] `pip install openenv-core huggingface_hub openai`
- - [ ] `openenv init sql-query-reviewer`
- - [ ] Clone and study echo env for reference
- - [ ] Set up project structure per 07-monorepo-structure.md
-
- ## Phase 3: Core Build (Apr 10, 2 PM – Apr 11, 12 PM)
- - [ ] Write `models.py` β€” Action, Observation, State
- - [ ] Create task bank β€” 5 easy, 5 medium, 5 hard queries with ground truth
- - [ ] Implement `environment.py` β€” reset(), step(), state()
- - [ ] Implement `grader.py` β€” deterministic scoring
- - [ ] Implement `reward.py` β€” per-step reward computation
- - [ ] Implement fuzzy matching for issue identification
- - [ ] Write `app.py` β€” FastAPI routes
- - [ ] Local testing: `uv run server` β†’ test all endpoints manually
-
- ## Phase 4: Inference (Apr 11, 12 PM – 3 PM)
- - [ ] Write `inference.py` following sample script format exactly
- - [ ] System prompt design for SQL review agent
- - [ ] Test with free HF Inference API
- - [ ] Verify `[START]`, `[STEP]`, `[END]` output format
- - [ ] Run 3x to verify reproducible scores
-
- ## Phase 5: Containerize & Deploy (Apr 11, 3 PM – 6 PM)
- - [ ] Write Dockerfile (python:3.10-slim base)
- - [ ] `docker build -t sql-query-reviewer ./server`
- - [ ] `docker run -p 8000:8000 sql-query-reviewer`
- - [ ] Test `/reset`, `/step`, `/state` against running container
- - [ ] `openenv push --repo-id ravi/sql-query-reviewer`
- - [ ] Verify HF Space returns 200 on `/reset`
-
- ## Phase 6: Polish & Submit (Apr 11, 6 PM – Apr 12, 11:59 PM)
- - [ ] Write compelling README
- - [ ] Run `openenv validate`
- - [ ] Run `validate-submission.sh`
- - [ ] Fix any issues
- - [ ] Submit early, iterate if time permits
- - [ ] Final verification: HF Space live and responding
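The `[STEP]` format check in Phase 4 can be automated with a small regex. The pattern below is inferred from this plan's own notes on the log spec, not an official validator, so treat it as a sketch:

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=\d+ action=\S+ reward=-?\d+\.\d{2} done=(true|false) error=.+$"
)

line = "[STEP] step=1 action=identify_issue(syntax) reward=0.35 done=false error=null"
assert STEP_RE.match(line) is not None
assert STEP_RE.match("[STEP] bad line") is None
```

Piping `inference.py` output through a check like this catches formatting drift before the judges' parser does.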
files/11-environment-and-devops.md DELETED
@@ -1,77 +0,0 @@
- # 11 β€” Environment & DevOps
-
- ## Local Development Setup
-
- ```bash
- # Python environment
- python3.10 -m venv .venv
- source .venv/bin/activate
- pip install openenv-core fastapi uvicorn pydantic openai huggingface_hub
-
- # Run locally
- cd server && uvicorn app:app --reload --port 8000
-
- # Test endpoints
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id": "easy_001"}'
- ```
-
- ## Dockerfile
-
- ```dockerfile
- FROM python:3.10-slim
-
- WORKDIR /app
-
- COPY server/requirements.txt .
- RUN pip install --no-cache-dir -r requirements.txt
-
- COPY models.py .
- COPY tasks/ ./tasks/
- COPY server/ ./server/
- COPY openenv.yaml .
-
- EXPOSE 8000
-
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
- ```
-
- ## server/requirements.txt
-
- ```
- openenv-core>=0.1.0
- fastapi>=0.100.0
- uvicorn>=0.23.0
- pydantic>=2.0.0
- ```
-
- ## HF Space Deployment
-
- ```bash
- # Login
- huggingface-cli login
-
- # Deploy
- openenv push --repo-id ravi/sql-query-reviewer
-
- # Verify
- curl -s -o /dev/null -w "%{http_code}" -X POST https://ravi-sql-query-reviewer.hf.space/reset -H "Content-Type: application/json" -d '{}'
- # Expected: 200
- ```
-
- ## Environment Variables for Inference
-
- ```bash
- export API_BASE_URL="https://router.huggingface.co/v1"
- export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
- export HF_TOKEN="hf_xxxxxxxxxxxxx"
- export IMAGE_NAME="sql-query-reviewer"
- ```
-
- ## Pre-Validation
-
- ```bash
- chmod +x validate-submission.sh
- ./validate-submission.sh https://ravi-sql-query-reviewer.hf.space .
- ```
-
- Expected output: All 3/3 checks passed.
files/12-testing-strategy.md DELETED
@@ -1,52 +0,0 @@
- # 12 β€” Testing Strategy
-
- ## Level 1: Unit Tests (During Build)
- - **Models:** Validate that Pydantic models accept correct data and reject incorrect data
- - **Grader:** Test with known inputs β†’ known scores. Verify determinism (run 10x, same result).
- - **Reward function:** Test that each action type returns the expected reward range
- - **Fuzzy matcher:** Test exact-match, partial-match, no-match, and already-found cases
-
- ## Level 2: Integration Tests (Before Docker)
- - Run `uv run server` locally
- - POST `/reset` with each task ID β†’ verify a valid observation is returned
- - POST `/step` with a valid action β†’ verify reward, done, observation
- - POST `/step` with an invalid action β†’ verify graceful error handling
- - GET `/state` β†’ verify state matches expectations
- - Run a full episode: reset β†’ steps β†’ done β†’ verify final grader score
-
- ## Level 3: Container Tests (Before Deploy)
- ```bash
- docker build -t sql-query-reviewer ./server
- CID=$(docker run -d -p 8000:8000 sql-query-reviewer)
- # Wait for startup
- sleep 5
- # Test reset
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}' | python -m json.tool
- # Test step
- curl -X POST http://localhost:8000/step -H "Content-Type: application/json" -d '{"action_type":"identify_issue","issue_category":"syntax","issue_description":"test"}' | python -m json.tool
- docker stop "$CID"
- ```
-
- ## Level 4: Validation Tests (Before Submit)
- - `openenv validate` β€” must pass
- - `validate-submission.sh <url> .` β€” all 3 checks must pass
- - Run `inference.py` 3 times β†’ verify scores are consistent
- - Verify stdout format matches `[START]`, `[STEP]`, `[END]` exactly
- - Check memory usage stays under 8 GB
- - Check runtime stays under 20 minutes
-
- ## Level 5: Score Variance Check
- - Run inference on all 3 tasks β†’ verify different scores
- - Confirm no grader returns the same score for different inputs
- - Verify easy > medium > hard in terms of baseline agent performance
-
- ## DQ Prevention Checklist
- - [ ] HF Space returns 200 on POST /reset
- - [ ] openenv.yaml is valid
- - [ ] Typed models work
- - [ ] Dockerfile builds
- - [ ] 3+ tasks with graders returning 0.0-1.0
- - [ ] Graders DON'T always return the same score
- - [ ] inference.py exists in root
- - [ ] Baseline produces reproducible scores
- - [ ] Not plagiarized from existing environments
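The Level 1 determinism requirement ("run 10x, same result") can be wrapped in a tiny reusable helper. This is a hedged sketch; the helper name is invented and a real test would call the actual grader:

```python
def assert_deterministic(fn, args, runs=10):
    """Call fn(*args) repeatedly and fail if the results ever differ."""
    results = {fn(*args) for _ in range(runs)}
    assert len(results) == 1, f"non-deterministic results: {results}"

# A pure function passes trivially; the grader should pass the same way.
assert_deterministic(lambda a, b: round(a / b, 2), (1.0, 3.0))
```

Dropping this into the pytest suite makes the determinism checklist item a one-liner per graded function.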
files/CHANGES.md DELETED
@@ -1,72 +0,0 @@
- # Changes to Apply β€” Priority Order
-
- ## 🚨 CRITICAL FIX (Do this first β€” DQ risk)
-
- ### 1. Replace `inference.py`
- **File:** `inference.py` (root directory)
- **Problem:** The current stdout format outputs JSON like `[START] {"difficulty": "easy", ...}` instead of the required `[START] task=easy_001 env=sql-query-reviewer model=Qwen/...` format.
- **Impact:** The hackathon dashboard explicitly states: "Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring."
- **Fix:** Replace with the provided `inference.py` that uses `log_start()`, `log_step()`, `log_end()` matching the exact spec format.
-
- **Key changes in the new inference.py:**
- - `[START] task=<task_name> env=<benchmark> model=<model_name>` β€” flat key=value, not JSON
- - `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>` β€” reward formatted to 2 decimal places
- - `[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>` β€” comma-separated rewards list
- - Uses `API_BASE_URL` defaulting to the HF router (not openai.com)
- - Uses `HF_TOKEN` as the primary API key env var
- - Accumulates the rewards list and computes the success boolean
- - try/finally ensures `[END]` is always emitted, even on exception
-
- ---
-
- ## ⚠️ HIGH PRIORITY
-
- ### 2. Replace `openenv.yaml`
- **Problem:** Task IDs in the yaml (`easy_syntax`, `medium_performance`, `hard_security`) don't match the actual task IDs in the JSON files (`easy_001`–`easy_005`, `medium_001`–`medium_005`, `hard_001`–`hard_005`).
- **Impact:** If `openenv validate` checks task ID alignment, validation fails.
- **Fix:** Replace with the provided `openenv.yaml` listing all 15 actual task IDs.
-
- ### 3. Replace `Dockerfile`
- **Problem:** No HEALTHCHECK instruction and no `curl` installed.
- **Fix:** Added `apt-get install curl` and a `HEALTHCHECK` directive.
-
- ### 4. Replace `README.md`
- **Problem:** Functional but not compelling for human reviewers (30% weight on real-world utility).
- **Fix:** Added a "Why This Matters" narrative, a baseline score table, and a cleaner structure.
-
- ---
-
- ## 🟑 MEDIUM PRIORITY (before deadline if time permits)
-
- ### 5. Merge PR #1 on GitHub
- The fix/package-server-and-inference-imports branch is already deployed to HF Spaces but is still a draft PR on GitHub. Merge it so the `main` branch CI passes.
-
- ### 6. Verify the `openenv` tag on the HF Space
- Go to the Space settings on Hugging Face and confirm the `openenv` tag is applied. The README has it in the YAML front matter tags, but double-check that it appears in the Space metadata.
-
- ### 7. Run pre-validation
- ```bash
- ./validate-submission.sh https://hellinferno-sql-query-reviewer.hf.space .
- ```
-
- ---
-
- ## How to apply these changes
-
- ```bash
- # From your local repo directory:
- cp /path/to/fixes/inference.py ./inference.py
- cp /path/to/fixes/openenv.yaml ./openenv.yaml
- cp /path/to/fixes/Dockerfile ./Dockerfile
- cp /path/to/fixes/README.md ./README.md
-
- # Test locally
- uvicorn server.app:app --port 8000 &
- python inference.py  # verify [START]/[STEP]/[END] format
-
- # Push to HF Spaces
- git add -A
- git commit -m "fix: correct inference stdout format and align openenv.yaml task IDs"
- git push origin main
- git push hf main
- ```
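To sanity-check the `[END]` line format described in fix 1, a small parser sketch can be used (the sample line is invented for illustration):

```python
line = "[END] success=true steps=4 score=0.62 rewards=0.35,0.10,0.02,0.15"

# Split the flat key=value fields, then the comma-separated rewards list.
fields = dict(part.split("=", 1) for part in line.removeprefix("[END] ").split(" "))
rewards = [float(r) for r in fields["rewards"].split(",")]

assert fields["success"] == "true"
assert int(fields["steps"]) == len(rewards) == 4
```

If the evaluator parses the line this way, any extra field, reordered key, or JSON payload would break it, which is exactly why the flat key=value format is mandatory.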
files/Dockerfile DELETED
@@ -1,24 +0,0 @@
- FROM python:3.11-slim
-
- ENV PYTHONDONTWRITEBYTECODE=1 \
-     PYTHONUNBUFFERED=1 \
-     PORT=8000
-
- RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
-
- WORKDIR /app
-
- COPY pyproject.toml README.md models.py client.py openenv.yaml inference.py ./
- COPY sql_query_reviewer ./sql_query_reviewer
- COPY server ./server
- COPY tasks ./tasks
-
- RUN pip install --no-cache-dir --upgrade pip && \
-     pip install --no-cache-dir .
-
- EXPOSE 8000
-
- HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
-     CMD curl -f http://localhost:8000/health || exit 1
-
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
files/README.md DELETED
@@ -1,162 +0,0 @@
- ---
- title: SQL Query Reviewer
- colorFrom: blue
- colorTo: green
- sdk: docker
- app_port: 8000
- pinned: false
- tags:
-   - openenv
- ---
-
- # SQL Query Reviewer
-
- An OpenEnv environment where an AI agent reviews SQL queries for correctness, performance, and security β€” the same task thousands of engineers perform every day in code reviews, migration scripts, and ETL audits.
-
- ## Why This Matters
-
- SQL bugs are among the most common and costly defects in production systems. A misplaced keyword breaks an API, a missing index degrades latency by 100x, and an unsanitized input opens a door to data exfiltration. Today these defects are caught by human reviewers who spend hours on repetitive pattern matching. This environment provides a standardized benchmark to train and evaluate AI agents that can automate this critical workflow β€” directly useful for developer tools, IDE integrations, and automated code review systems.
-
- ## What The Environment Does
-
- Each episode gives the agent:
-
- - a SQL query (with realistic bugs drawn from production patterns)
- - schema context when it matters (table definitions, column types, constraints)
- - a short explanation of the query's intended purpose
-
- The agent responds step by step with one of four actions:
-
- | Action | Description |
- |---|---|
- | `identify_issue` | Flag a correctness, performance, or security problem |
- | `suggest_fix` | Propose corrected SQL for a previously identified issue |
- | `approve` | Mark the query as acceptable (ends episode) |
- | `request_more_context` | Ask for additional schema information |
-
- ## Reward Design
-
- Rewards are deterministic and shaped for partial progress throughout the trajectory:
-
- - **Correct issue identification**: +0.10 to +0.35, scaled by issue severity
- - **Valid fix suggestion**: +0.08 to +0.10 bonus
- - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
- - **False positive**: βˆ’0.10 penalty
- - **Duplicate identification**: βˆ’0.02 penalty
- - **Approving with missed issues**: βˆ’0.15 per missed issue
- - **Complete correct approval**: +0.20
-
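The identification rewards above can be combined into a per-step function. This is an illustrative sketch, not the environment's actual `reward.py`; in particular, the mapping from a 1-3 severity scale onto the +0.10 to +0.35 range is assumed:

```python
# Assumed mapping from a 1-3 severity scale onto the reward range above.
SEVERITY_REWARD = {1: 0.10, 2: 0.20, 3: 0.35}

def identification_reward(matched: bool, severity: int, confidence: float,
                          duplicate: bool = False) -> float:
    if duplicate:
        return -0.02               # duplicate identification penalty
    if not matched:
        return -0.10               # false positive penalty
    bonus = 0.05 * confidence      # confidence in [0, 1] keeps this within +0.05
    return SEVERITY_REWARD[severity] + bonus

assert round(identification_reward(True, 3, 1.0), 2) == 0.40
assert identification_reward(False, 1, 0.9) == -0.10
```

The shaping keeps every step's reward small relative to the episode score, so the end-of-episode grader remains the dominant signal.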
- ## Task Bank
-
- The environment ships with **15 tasks** across three difficulty levels:
-
- | Difficulty | Count | Examples | Expected Baseline Score |
- |---|---|---|---|
- | Easy | 5 | Misspelled keywords, missing FROM, = NULL vs IS NULL | ~0.75–0.90 |
- | Medium | 5 | SELECT *, missing indexes, correlated subqueries, unbounded queries | ~0.40–0.60 |
- | Hard | 5 | SQL injection, privilege escalation, PII leakage, self-join optimization | ~0.20–0.40 |
-
- Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
-
- ## Action & Observation Spaces
-
- **Action** (`SQLReviewAction`):
- - `action_type`: identify_issue | suggest_fix | approve | request_more_context
- - `issue_category`: syntax | performance | security | logic | style
- - `issue_description`: concise statement of the problem
- - `suggested_fix`: corrected SQL fragment
- - `confidence`: float 0.0–1.0
-
- **Observation** (`SQLReviewObservation`):
- - `query`: the full SQL query text
- - `schema_info`: dict of table β†’ column definitions
- - `context`: natural language description of query intent
- - `issues_found_so_far`: previously identified issues this episode
- - `remaining_actions`: steps left before episode ends
- - `difficulty`: easy | medium | hard
- - `feedback`: result of last action
-
- ## Repository Layout
-
- ```
- .
- β”œβ”€β”€ openenv.yaml
- β”œβ”€β”€ models.py
- β”œβ”€β”€ client.py
- β”œβ”€β”€ inference.py          ← baseline agent (root directory)
- β”œβ”€β”€ Dockerfile
- β”œβ”€β”€ sql_query_reviewer/   ← typed models and client package
- β”œβ”€β”€ server/               ← FastAPI environment server
- β”‚   β”œβ”€β”€ environment.py    ← reset(), step(), state()
- β”‚   β”œβ”€β”€ grader.py         ← deterministic scoring
- β”‚   β”œβ”€β”€ reward.py         ← per-step reward computation
- β”‚   └── app.py            ← HTTP routes
- β”œβ”€β”€ tasks/                ← 15 SQL query tasks (JSON)
- └── tests/                ← pytest suite
- ```
-
- ## Local Development
-
- ```bash
- python -m venv .venv
- source .venv/bin/activate  # or .venv\Scripts\activate on Windows
- pip install -e ".[dev]"
- uvicorn server.app:app --reload --port 8000
- ```
-
- Test the API:
- ```bash
- curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"easy_001"}'
- curl http://localhost:8000/state
- pytest
- ```
-
- ## Docker
-
- ```bash
- docker build -t sql-query-reviewer .
- docker run -p 8000:8000 sql-query-reviewer
- ```
-
- ## Inference
-
- ```bash
- export ENV_BASE_URL=http://localhost:8000
- export API_BASE_URL=https://router.huggingface.co/v1
- export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
- export HF_TOKEN=hf_xxx
- python inference.py
- ```
-
- The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
-
- ## Hugging Face Spaces
-
- This repo is Space-ready: HF YAML front matter in README, root Dockerfile, API on port 8000. Deploy with:
-
- ```bash
- git remote add hf https://huggingface.co/spaces/<username>/sql-query-reviewer
- git push hf main
- ```
-
- ## Usage Example
-
- ```python
- from sql_query_reviewer import SQLReviewAction, SQLReviewEnv
-
- with SQLReviewEnv(base_url="https://hellinferno-sql-query-reviewer.hf.space").sync() as env:
-     result = env.reset(task_id="easy_001")
-     result = env.step(SQLReviewAction(
-         action_type="identify_issue",
-         issue_category="syntax",
-         issue_description="SELCT is misspelled and should be SELECT",
-         suggested_fix="SELECT * FROM users WHERE id = 1;",
-         confidence=0.98,
-     ))
-     print(result.reward)
-     print(result.observation.feedback)
- ```
-
- ## Author
-
- **Hellinferno** β€” Solo participant, Meta PyTorch OpenEnv Hackathon 2026
files/architecture-diagram.md DELETED
@@ -1,61 +0,0 @@
- # Architecture Diagram
-
- ## High-Level Flow
-
- ```
- β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
- β”‚              β”‚          β”‚  HF Space (Docker)                 β”‚
- β”‚ inference.py β”‚          β”‚                                    β”‚
- β”‚   (Agent)    β”‚          β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
- β”‚              β”‚   WS     β”‚  β”‚ FastAPI Server           β”‚    β”‚
- β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”   β”œβ”€β”€β”€β”€β–Ίβ”‚   β”‚  β”‚ (app.py)                 β”‚    β”‚
- β”‚ β”‚ OpenAI β”‚   β”‚          β”‚  β”‚                          β”‚    β”‚
- β”‚ β”‚ Client β”‚   β”‚          β”‚  β”‚ /reset β†’ load task       β”‚    β”‚
- β”‚ β”‚   ↕    β”‚   │◄─────    β”‚  β”‚ /step  β†’ grade action    β”‚    β”‚
- β”‚ β”‚  LLM   β”‚   β”‚          β”‚  β”‚ /state β†’ return state    β”‚    β”‚
- β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚          β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
- β”‚              β”‚          β”‚             β”‚                      β”‚
- β”‚ stdout:      β”‚          β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
- β”‚ [START]      β”‚          β”‚  β”‚ SQLReviewEnvironment     β”‚    β”‚
- β”‚ [STEP]       β”‚          β”‚  β”‚ - task_bank (JSON)       β”‚    β”‚
- β”‚ [END]        β”‚          β”‚  β”‚ - fuzzy_matcher          β”‚    β”‚
- β”‚              β”‚          β”‚  β”‚ - reward_fn              β”‚    β”‚
- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚  β”‚ - grader                 β”‚    β”‚
-                           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
-                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
- ```
-
- ## Episode Sequence
-
- ```
- Agent                           Environment
-   β”‚                                 β”‚
-   │──── reset(task_id) ──────────►│  Load task from JSON
-   │◄─── observation ──────────────│  Return query + schema + context
-   β”‚                                 β”‚
-   │──── step(identify_issue) ────►│  Fuzzy match vs ground truth
-   │◄─── obs + reward + done ──────│  Return feedback + reward
-   β”‚                                 β”‚
-   │──── step(suggest_fix) ───────►│  Validate fix
-   │◄─── obs + reward + done ──────│  Return feedback + reward
-   β”‚                                 β”‚
-   │──── step(approve) ───────────►│  Check remaining issues
-   │◄─── obs + reward + done=true──│  Episode ends
-   β”‚                                 β”‚
-   │──── close() ─────────────────►│  Run grader β†’ final score
-   │◄─── final_score ──────────────│
-   β”‚                                 β”‚
- ```
-
- ## Evaluation Pipeline (Hackathon Judges)
-
- ```
- Phase 1: Automated Validation
- └─ HF Space responds? β†’ openenv validate? β†’ Docker builds? β†’ inference.py runs? β†’ 3+ tasks?
-
- Phase 2: Agentic Evaluation
- └─ Run Nemotron 3 Super against all envs β†’ check score variance
-
- Phase 3: Human Review
- └─ Meta + HF engineers review for utility, creativity, exploit checks
- ```
files/inference.py DELETED
@@ -1,227 +0,0 @@
- """
- Inference Script β€” SQL Query Reviewer
- ======================================
- MANDATORY environment variables:
-     API_BASE_URL    The API endpoint for the LLM.
-     MODEL_NAME      The model identifier to use for inference.
-     HF_TOKEN        Your Hugging Face / API key.
-
- STDOUT FORMAT:
-     [START] task=<task_name> env=<benchmark> model=<model_name>
-     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
-     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
- """
-
- from __future__ import annotations
-
- import json
- import os
- from typing import Any, List, Optional
-
- from openai import OpenAI
-
- from sql_query_reviewer.client import SyncSQLReviewEnv
- from sql_query_reviewer.models import SQLReviewAction, SQLReviewObservation
-
- # ---------------------------------------------------------------------------
- # Configuration
- # ---------------------------------------------------------------------------
-
- DEFAULT_TASK_IDS = ("easy_001", "medium_001", "hard_001")
- BENCHMARK = "sql-query-reviewer"
- SUCCESS_SCORE_THRESHOLD = 0.1
-
- ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
- API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
- MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
- API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
-
- SYSTEM_PROMPT = """You are reviewing a SQL query for correctness, performance, and security.
- Return exactly one JSON object with these keys:
- - action_type: identify_issue, suggest_fix, approve, or request_more_context
- - issue_category: syntax, performance, security, logic, or style when relevant
- - issue_description: concise issue statement when relevant
- - suggested_fix: corrected SQL or corrected fragment when relevant
- - confidence: float between 0.0 and 1.0
-
- Guidelines:
- - Prefer identify_issue until you have high confidence all important issues are covered.
- - Use approve only when the query looks acceptable or all issues have already been identified.
- - Keep the JSON valid and do not wrap it in prose.
- """
-
- # ---------------------------------------------------------------------------
- # Structured stdout logging β€” MUST match the hackathon spec exactly
- # ---------------------------------------------------------------------------
-
-
- def log_start(task: str, env: str, model: str) -> None:
-     print(f"[START] task={task} env={env} model={model}", flush=True)
-
-
- def log_step(
-     step: int, action: str, reward: float, done: bool, error: Optional[str]
- ) -> None:
-     done_str = str(done).lower()
-     error_str = error if error else "null"
-     print(
-         f"[STEP] step={step} action={action} reward={reward:.2f} "
-         f"done={done_str} error={error_str}",
-         flush=True,
-     )
-
-
- def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
-     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
-     print(
-         f"[END] success={str(success).lower()} steps={steps} "
-         f"score={score:.2f} rewards={rewards_str}",
-         flush=True,
-     )
-
-
- # ---------------------------------------------------------------------------
- # LLM interaction
- # ---------------------------------------------------------------------------
-
-
- def build_user_prompt(observation: SQLReviewObservation) -> str:
-     payload = {
-         "query": observation.query,
-         "schema_info": observation.schema_info,
-         "context": observation.context,
-         "issues_found_so_far": [
-             issue.model_dump() for issue in observation.issues_found_so_far
-         ],
-         "remaining_actions": observation.remaining_actions,
-         "difficulty": observation.difficulty,
-         "feedback": observation.feedback,
-     }
-     return json.dumps(payload, indent=2)
-
-
- def extract_json(content: str) -> dict[str, Any]:
-     stripped = content.strip()
-     if stripped.startswith("```"):
-         lines = [line for line in stripped.splitlines() if not line.startswith("```")]
-         stripped = "\n".join(lines).strip()
-     start = stripped.find("{")
-     end = stripped.rfind("}")
-     if start == -1 or end == -1 or end <= start:
-         raise ValueError(f"Could not find JSON object in model response: {content!r}")
-     return json.loads(stripped[start : end + 1])
-
-
- def choose_action(
-     llm_client: OpenAI, model_name: str, observation: SQLReviewObservation
- ) -> SQLReviewAction:
-     try:
-         response = llm_client.chat.completions.create(
-             model=model_name,
-             temperature=0,
-             max_tokens=300,
-             messages=[
-                 {"role": "system", "content": SYSTEM_PROMPT},
-                 {"role": "user", "content": build_user_prompt(observation)},
-             ],
-         )
-         content = response.choices[0].message.content or ""
-         return SQLReviewAction.model_validate(extract_json(content))
-     except Exception as exc:
-         print(f"[DEBUG] Model request failed: {exc}", flush=True)
-         # Fallback: approve to end the episode gracefully
-         return SQLReviewAction(action_type="approve", confidence=0.1)
-
-
- # ---------------------------------------------------------------------------
- # Episode runner
- # ---------------------------------------------------------------------------
-
-
- def run_episode(
-     env: SyncSQLReviewEnv, llm_client: OpenAI, model_name: str, task_id: str
- ) -> None:
-     rewards: List[float] = []
-     steps_taken = 0
-     score = 0.0
-     success = False
-     last_error: Optional[str] = None
-
-     log_start(task=task_id, env=BENCHMARK, model=model_name)
-
-     try:
-         result = env.reset(task_id=task_id)
-
-         step = 0
-         while not result.done:
-             step += 1
-             action = choose_action(
-                 llm_client=llm_client,
-                 model_name=model_name,
-                 observation=result.observation,
-             )
-
-             action_str = action.action_type
-             if action.issue_description:
-                 # Keep the action string short and readable
-                 action_str = f"{action.action_type}({action.issue_category})"
-
-             result = env.step(action)
-
-             reward = result.reward
-             rewards.append(reward)
-             steps_taken = step
-             last_error = result.info.get("error") if result.info else None
-
-             log_step(
-                 step=step,
-                 action=action_str,
-                 reward=reward,
-                 done=result.done,
-                 error=last_error,
-             )
-
-         # Get the final score from state
-         state = env.state()
-         score = state.final_score if state.final_score is not None else 0.0
-         success = score >= SUCCESS_SCORE_THRESHOLD
-
-     except Exception as exc:
-         print(f"[DEBUG] Episode error: {exc}", flush=True)
-         last_error = str(exc)
-
-     finally:
-         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
-
-
- # ---------------------------------------------------------------------------
- # Main
- # ---------------------------------------------------------------------------
-
-
- def main() -> int:
-     if not API_KEY:
-         raise SystemExit("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")
-
-     task_ids = tuple(
-         tid.strip()
-         for tid in os.getenv("TASK_IDS", ",".join(DEFAULT_TASK_IDS)).split(",")
-         if tid.strip()
-     )
-
-     llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
-
-     with SyncSQLReviewEnv(base_url=ENV_BASE_URL) as env:
-         for task_id in task_ids:
-             run_episode(
-                 env=env,
-                 llm_client=llm_client,
-                 model_name=MODEL_NAME,
-                 task_id=task_id,
-             )
-
-     return 0
-
-
- if __name__ == "__main__":
-     raise SystemExit(main())
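The fenced-JSON extraction logic in `extract_json` above is easy to verify standalone. The snippet below reproduces its core so it runs without the package imports; it is a copy for testing, not a change to the script:

```python
import json

def extract_json(content: str) -> dict:
    stripped = content.strip()
    # Drop markdown code-fence lines if the model wrapped its JSON in one.
    if stripped.startswith("```"):
        lines = [ln for ln in stripped.splitlines() if not ln.startswith("```")]
        stripped = "\n".join(lines).strip()
    start = stripped.find("{")
    end = stripped.rfind("}")
    if start == -1 or end == -1 or end <= start:
        raise ValueError("no JSON object in model response")
    return json.loads(stripped[start : end + 1])

assert extract_json('```json\n{"action_type": "approve"}\n```') == {"action_type": "approve"}
```

Handling both bare JSON and fenced JSON matters because instruction-tuned models frequently wrap structured output in a code fence despite the system prompt.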
files/openenv.yaml DELETED
@@ -1,70 +0,0 @@
- name: sql-query-reviewer
- description: "AI agent reviews SQL queries for correctness, performance, and security."
- author: Hellinferno
- version: "0.1.0"
- tags:
-   - openenv
-   - sql
-   - code-review
-   - security
- tasks:
-   - id: easy_001
-     name: Syntax Keyword Typos
-     difficulty: easy
-     description: "Detect misspelled SQL keywords (SELCT, FORM, WEHRE) and unnecessary SELECT *."
-   - id: easy_002
-     name: Missing FROM Clause
-     difficulty: easy
-     description: "Find missing FROM keyword before table name."
-   - id: easy_003
-     name: NULL Comparison Logic
-     difficulty: easy
-     description: "Detect = NULL instead of IS NULL."
-   - id: easy_004
-     name: Unclosed String Literal
-     difficulty: easy
-     description: "Find unterminated quote in WHERE clause."
-   - id: easy_005
-     name: Unknown Column Name
-     difficulty: easy
-     description: "Detect column name typo (statuz vs status)."
-   - id: medium_001
-     name: Performance Anti-Pattern Review
-     difficulty: medium
-     description: "Identify schema-aware performance problems like SELECT *, missing indexes, correlated subqueries."
-   - id: medium_002
-     name: Unbounded Query Detection
-     difficulty: medium
-     description: "Find queries missing LIMIT on large tables."
-   - id: medium_003
-     name: Redundant Operations
-     difficulty: medium
-     description: "Detect unnecessary DISTINCT on unique columns."
-   - id: medium_004
-     name: Correlated Subquery Optimization
-     difficulty: medium
-     description: "Find correlated subqueries that could be JOINs."
-   - id: medium_005
-     name: Join Performance Issues
-     difficulty: medium
-     description: "Identify missing index hints and inefficient joins."
-   - id: hard_001
-     name: SQL Injection Detection
-     difficulty: hard
-     description: "Find string concatenation enabling SQL injection vectors."
-   - id: hard_002
-     name: Privilege Escalation via UNION
-     difficulty: hard
-     description: "Detect UNION with system tables exposing sensitive data."
-   - id: hard_003
-     name: PII Data Leakage
-     difficulty: hard
-     description: "Find unfiltered JOINs exposing personally identifiable information."
-   - id: hard_004
-     name: Self-Join Optimization
-     difficulty: hard
-     description: "Detect self-joins replaceable with window functions for 10x improvement."
-   - id: hard_005
-     name: Transaction Isolation Issues
-     difficulty: hard
-     description: "Find missing transaction isolation causing phantom reads."
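Once a manifest like the one above is parsed (e.g. with PyYAML's `yaml.safe_load`), the task bank can be bucketed by difficulty for episode scheduling. A minimal, dependency-free sketch operating on the already-parsed list of task dicts:

```python
from collections import defaultdict


def group_by_difficulty(tasks: list[dict]) -> dict[str, list[str]]:
    """Map each difficulty tier to the ordered list of task IDs in it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for task in tasks:
        groups[task["difficulty"]].append(task["id"])
    return dict(groups)
```

Run against the removed manifest, this would produce three tiers of five task IDs each.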
files/project-design.md DELETED
@@ -1,40 +0,0 @@
- # Project Design
- 
- ## Design Principles
- 
- 1. **Spec compliance first, creativity second.** Most teams will fail on automated validation. Perfect adherence to the OpenEnv spec is the highest-ROI activity.
- 
- 2. **Reward shaping is the differentiator.** Binary end-of-episode rewards are common. Per-step, severity-weighted, partial-credit rewards are what separate top submissions.
- 
- 3. **Score variance is mandatory.** The environment must produce different scores for different agent capabilities. Our design inherently ensures this: different queries have different issues, so no two episodes produce identical scores.
- 
- 4. **Domain authenticity wins the 30%.** Real-world utility is the highest-weighted criterion. SQL review is a task every Meta engineer knows and values. The task bank should contain queries that feel like real code review findings, not synthetic puzzles.
- 
- ## Key Design Decisions
- 
- | Decision | Choice | Rationale |
- |---|---|---|
- | Domain | SQL Query Review | Universal relevance, clear grading, natural difficulty progression |
- | Task count | 15 queries (5/5/5) | Well above the minimum of 3; shows depth |
- | Matching | Fuzzy keyword matching | Robust to LLM phrasing variation while staying deterministic |
- | Reward | Per-step partial credit | Provides a learning signal throughout the trajectory |
- | Episode length | 3-8 steps | Short enough for the 20-min inference limit across all tasks |
- | Grader | Severity-weighted coverage | Rewards finding critical issues more than trivial ones |
- 
- ## Risk Mitigation
- 
- | Risk | Mitigation |
- |---|---|
- | Fuzzy matching too loose → inflated scores | Require 30% keyword overlap threshold plus category match |
- | Fuzzy matching too strict → no agent can score | Include broad keyword lists; test with actual LLM output |
- | Inference timeout | 15 queries × 5-8 steps × ~3 s per LLM call ≈ 6 min, well under 20 min |
- | Docker build fails on HF | Use minimal dependencies; test the Dockerfile locally first |
- | Grader returns the same score for every agent | Unlikely with varied queries, but verify during testing |
- 
- ## What Judges Will See
- 
- 1. **README** -- clear and compelling; explains why SQL review matters and how the env works
- 2. **HF Space** -- live, responds instantly to `/reset`
- 3. **Code** -- clean, well-structured, typed models, deterministic graders
- 4. **Scores** -- meaningful variance: easy ~0.8, medium ~0.5, hard ~0.3
- 5. **Novelty** -- no existing SQL review env in the OpenEnv ecosystem
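The fuzzy-matching rule from the design tables (30% keyword overlap plus an exact category match) can be sketched as follows. The `expected` dict shape and the keyword lists are illustrative assumptions, not the deleted grader's actual schema:

```python
def issue_matches(agent_category: str, agent_text: str, expected: dict) -> bool:
    """Deterministic fuzzy match: the category must agree exactly, and at
    least 30% of the expected keywords must appear in the agent's wording."""
    if agent_category != expected["category"]:
        return False
    words = set(agent_text.lower().split())
    keywords = [kw.lower() for kw in expected["keywords"]]
    hits = sum(1 for kw in keywords if kw in words)
    return hits / len(keywords) >= 0.30
```

Because matching is word-set based, it tolerates phrasing variation ("SELCT is a typo of SELECT" vs "the keyword SELCT should be SELECT") while remaining fully deterministic.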
files/project-readme.md DELETED
@@ -1,91 +0,0 @@
- # SQL Query Reviewer -- OpenEnv Environment
- 
- An AI agent environment for reviewing SQL queries for correctness, performance, and security issues.
- 
- ## Why This Matters
- 
- Every engineering team reviews SQL queries daily -- in code reviews, migration scripts, ETL pipelines, and security audits. This environment lets you train and evaluate AI agents on a task that directly maps to real engineering workflows. Unlike toy benchmarks, the queries here reflect genuine patterns found in production codebases: misspelled keywords, N+1 anti-patterns, missing indexes, SQL injection vectors, and schema-aware optimization opportunities.
- 
- ## Environment Overview
- 
- The agent receives a SQL query (plus optional schema context) and must identify issues through a multi-step review process. It earns rewards for correctly flagging problems and suggesting fixes, and is penalized for false positives or for approving buggy queries.
- 
- ## Action Space
- 
- | Action Type | Description |
- |---|---|
- | `identify_issue` | Flag a specific issue with category and description |
- | `suggest_fix` | Propose corrected SQL for a previously identified issue |
- | `approve` | Mark the query as acceptable (ends the episode) |
- | `request_more_context` | Ask for additional schema information |
- 
- **Fields:** `action_type`, `issue_category` (syntax/performance/security/logic/style), `issue_description`, `suggested_fix`, `confidence` (0.0-1.0)
- 
- ## Observation Space
- 
- | Field | Type | Description |
- |---|---|---|
- | `query` | str | The SQL query under review |
- | `schema_info` | dict | Table/column definitions (richer for harder tasks) |
- | `context` | str | What the query is supposed to do |
- | `issues_found_so_far` | list | Previously identified issues this episode |
- | `remaining_actions` | int | Steps left before the episode ends |
- | `difficulty` | str | easy, medium, or hard |
- | `feedback` | str | Result of the last action |
- 
- ## Tasks
- 
- ### Task 1: Syntax Error Detection (Easy)
- Queries with obvious typos, missing keywords, and wrong column names. A baseline agent should score **0.7-0.9**.
- 
- ### Task 2: Performance Anti-Pattern Review (Medium)
- Queries with SELECT *, missing indexes, correlated subqueries, and unbounded scans. Requires schema awareness. Expected score: **0.4-0.6**.
- 
- ### Task 3: Security & Optimization Audit (Hard)
- SQL injection vectors, privilege escalation, data leakage, and complex optimization. Requires multi-step reasoning. Expected score: **0.2-0.4**.
- 
- ## Reward Design
- - Per-step partial credit (not binary end-of-episode)
- - Correct issue identification: +0.1 to +0.4 (scaled by severity)
- - Valid fix suggestion: +0.1 bonus
- - False positive: -0.1 penalty
- - Approving a query that still has undetected issues: -0.15 per missed issue
- - Correct approval of a clean query: +0.2
- 
- ## Setup
- 
- ```bash
- # Install
- pip install openenv-core
- pip install git+https://huggingface.co/spaces/ravi/sql-query-reviewer
- ```
- 
- ```python
- # Use
- from sql_query_reviewer import SQLReviewEnv, SQLReviewAction
- 
- with SQLReviewEnv(base_url="https://ravi-sql-query-reviewer.hf.space").sync() as env:
-     result = env.reset()
-     result = env.step(SQLReviewAction(
-         action_type="identify_issue",
-         issue_category="syntax",
-         issue_description="SELCT should be SELECT",
-     ))
-     print(result.observation.feedback)
- ```
- 
- ## Docker
- 
- ```bash
- docker build -t sql-query-reviewer ./server
- docker run -p 8000:8000 sql-query-reviewer
- ```
- 
- ## Baseline Scores
- 
- | Task | Difficulty | Baseline Score |
- |---|---|---|
- | Syntax Error Detection | Easy | ~0.82 |
- | Performance Anti-Pattern Review | Medium | ~0.51 |
- | Security & Optimization Audit | Hard | ~0.29 |
- 
- ## Author
- **Ravi** -- Solo participant, Meta PyTorch OpenEnv Hackathon 2026
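The reward schedule in the removed README maps naturally to a single dispatch function. The event names and the severity tiers below are assumptions for illustration; only the numeric values come from the documented design:

```python
# Hypothetical severity tiers spanning the +0.1 to +0.4 range from the README.
SEVERITY_REWARD = {"low": 0.1, "medium": 0.25, "critical": 0.4}


def step_reward(event: str, severity: str = "low", missed_issues: int = 0) -> float:
    """Per-step partial credit mirroring the documented reward design."""
    if event == "identify_correct":
        return SEVERITY_REWARD[severity]   # +0.1 to +0.4, scaled by severity
    if event == "suggest_fix_valid":
        return 0.1                          # bonus for a valid fix
    if event == "false_positive":
        return -0.1                         # penalty for a spurious finding
    if event == "approve_clean":
        return 0.2                          # correctly approving a clean query
    if event == "approve_with_missed":
        return -0.15 * missed_issues        # -0.15 per issue left unfound
    return 0.0
```

Keeping every reward inside one function makes the schedule auditable and easy to tune during reward-shaping iterations.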