ajaxwin committed
Commit 08c19c7 · 0 Parent(s):

Initial Commit
.gitignore ADDED
@@ -0,0 +1,13 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ .env
+ .venv
+ venv/
+ *.egg-info/
+ dist/
+ build/
+ .DS_Store
+ baseline_scores.json
+ *.log
+ .pytest_cache/
Dockerfile ADDED
@@ -0,0 +1,35 @@
+ # ---------------------------------------------------------------------------
+ # Smart Contract Audit RL Environment
+ # Hugging Face Space — Docker runtime
+ # ---------------------------------------------------------------------------
+
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # System deps
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python deps first (layer cache)
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy project
+ COPY . .
+
+ # Create empty __init__ files if missing (safety)
+ RUN touch env/__init__.py tasks/__init__.py tasks/task1/__init__.py \
+     tasks/task2/__init__.py tasks/task3/__init__.py \
+     data/__init__.py utils/__init__.py
+
+ # HF Spaces requires port 7860
+ EXPOSE 7860
+
+ # Healthcheck
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Launch FastAPI
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
README.md ADDED
@@ -0,0 +1,301 @@
+ # Smart Contract Audit RL Environment
+
+ > **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
+ > Agents learn to audit real-world Solidity contracts — finding vulnerabilities, discovering properties, and checking rule compliance — tasks that professional auditors perform daily.
+
+ [![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-1.0-blue)](openenv.yaml)
+ [![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces)
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)](https://python.org)
+
+ ---
+
+ ## Motivation
+
+ Smart contract auditing is a $500M+ industry where human auditors painstakingly review Solidity code for security flaws. This environment lets agents practice exactly that workflow — exploring contract code through targeted queries and submitting findings — providing a challenging, real-world benchmark for reasoning and code-understanding agents.
+
+ Data is sourced from **Certora-audited DeFi projects**, giving agents contracts with the same vulnerability patterns found in production exploits (reentrancy, integer overflow, access control bypasses, etc.).
+
+ ---
+
+ ## Environment Description
+
+ The environment hosts **3 tasks** of varying difficulty:
+
+ | Task | Name | Difficulty | Status |
+ |------|------|------------|--------|
+ | 1 | Targeted Vulnerability Detection | Medium | ✅ Active |
+ | 2 | Property Discovery | Hard | ⏳ Placeholder |
+ | 3 | Rule Checker | Easy | ⏳ Placeholder |
+
+ ### Task 1 — Targeted Vulnerability Detection *(Medium)*
+
+ **Setup:** The agent is shown a Solidity contract (4–6 functions). One function contains a critical vulnerability.
+
+ **Objective:** Identify the vulnerable function and describe the vulnerability type in 2–3 words.
+
+ **Episode lifecycle:**
+ 1. `reset()` — randomly selects one of 8 vulnerable (contract, function) pairs from the dataset
+ 2. Agent receives the contract name and description
+ 3. Agent explores using the action API (each action has a small cost)
+ 4. Agent calls `submit(function_name, vulnerability_type)` to end the episode
+ 5. Grader assigns a 0.0–1.0 score
+
+ **Vulnerability types in the dataset:**
+ - Reentrancy
+ - Missing access control
+ - Integer overflow (Solidity <0.8)
+ - tx.origin authentication
+ - Front-running
+ - Timestamp dependence
+ - Denial of service (unbounded loop)
+ - Unchecked ERC-20 return value
+
+ ---
+
+ ### Task 2 — Property Discovery *(Hard)* [Placeholder]
+
+ Given a single Solidity function, the agent must discover its natural-language correctness property. Grading uses semantic similarity to the ground-truth property. *Implementation coming soon.*
+
+ ---
+
+ ### Task 3 — Rule Checker *(Easy)* [Placeholder]
+
+ Given a natural-language property and a contract, the agent must identify which function violates that property. *Implementation coming soon.*
+
+ ---
+
+ ## Action Space
+
+ All actions are described below. **Repeated identical queries cost −0.40.**
+
+ | Action | Key Params | Reward |
+ |--------|-----------|--------|
+ | `list_functions` | — | −0.05 |
+ | `get_function_code` | `function_name` | +0.05 (target) / −0.10 (other) |
+ | `get_function_summary` | `function_name` | +0.03 (target) / −0.05 (other) |
+ | `get_file_metadata` | — | −0.04 |
+ | `get_state_variable` | `variable_name` (opt.) | −0.05 |
+ | `get_call_graph` | — | −0.08 |
+ | `submit` | `function_name`, `vulnerability_type` | +5.0 / +1.0 / −1.5 |
+
+ **Submit scoring:**
+ - **+5.0** — correct function AND correct vulnerability keyword → grader score = 1.0
+ - **+1.0** — correct function, unrecognised vulnerability type → grader score = 0.5
+ - **−1.5** — wrong function → grader score = 0.0
+
+ ---
+
+ ## Observation Space
+
+ Every `step()` and `reset()` returns an `Observation` object:
+
+ ```json
+ {
+   "task_id": "task1_vuln_detection",
+   "contract_name": "SimpleVault",
+   "contract_description": "An ETH vault that allows users to deposit and withdraw...",
+   "available_actions": ["list_functions", "get_function_code", ...],
+   "last_action": "get_function_code",
+   "last_action_result": "// withdraw\nfunction withdraw(uint256 amount) ...",
+   "step_count": 3,
+   "cumulative_reward": -0.05,
+   "done": false,
+   "extra": {
+     "solidity_version": "0.8.0",
+     "hint": "Identify the vulnerable function and its issue."
+   }
+ }
+ ```
+
+ ---
+
+ ## Project Structure
+
+ ```
+ smart-contract-env/
+ ├── data/
+ │   ├── contracts.json        # 4 contracts, 8 vulnerabilities
+ │   └── data_loader.py        # JSON parsing and episode sampling
+ ├── env/
+ │   ├── base_env.py           # Abstract OpenEnv base class
+ │   └── schemas.py            # Pydantic models (Observation, Action, Reward…)
+ ├── tasks/
+ │   ├── task1/
+ │   │   ├── environment.py    # Full Task 1 RL environment
+ │   │   └── grader.py         # Deterministic 0.0–1.0 grader
+ │   ├── task2/                # TODO: Property Discovery
+ │   └── task3/                # TODO: Rule Checker
+ ├── utils/
+ ├── app.py                    # FastAPI server (OpenEnv HTTP interface)
+ ├── inference.py              # Baseline inference script (OpenAI client)
+ ├── openenv.yaml              # OpenEnv spec metadata
+ ├── Dockerfile
+ ├── requirements.txt
+ └── README.md
+ ```
+
+ ---
+
+ ## Setup & Usage
+
+ ### Option A — Run locally
+
+ ```bash
+ # 1. Clone and install
+ git clone <repo>
+ cd smart-contract-env
+ pip install -r requirements.txt
+
+ # 2. Start the server
+ python app.py
+ # → http://localhost:7860
+ ```
+
+ ### Option B — Docker
+
+ ```bash
+ docker build -t sc-audit-env .
+ docker run -p 7860:7860 sc-audit-env
+ ```
+
+ ### Option C — Python (no server)
+
+ ```python
+ from tasks.task1.environment import Task1Environment
+ from env.schemas import Action, ActionType
+
+ env = Task1Environment()
+ result = env.reset(seed=42)
+ print(result.observation.contract_name)
+
+ action = Action(action_type=ActionType.LIST_FUNCTIONS)
+ step = env.step(action)
+ print(step.observation.last_action_result)
+ ```
+
+ ---
+
+ ## HTTP API
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | `GET` | `/health` | Liveness probe |
+ | `GET` | `/tasks` | List all tasks |
+ | `POST` | `/reset` | Start a new episode |
+ | `POST` | `/step` | Take one action |
+ | `GET` | `/state` | Debug: internal state |
+ | `GET` | `/action_space` | Action space definition |
+ | `GET` | `/observation_space` | Observation space definition |
+
+ **Example session:**
+
+ ```bash
+ # Reset
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "task1_vuln_detection", "seed": 42}'
+
+ # List functions
+ curl -X POST "http://localhost:7860/step" \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "list_functions", "params": {}}'
+
+ # Submit answer
+ curl -X POST "http://localhost:7860/step" \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
+ ```
+
+ ---
+
+ ## Running the Baseline
+
+ ```bash
+ export API_BASE_URL="https://api.openai.com/v1"
+ export MODEL_NAME="gpt-4o-mini"
+ export HF_TOKEN="sk-..."
+
+ python inference.py
+ ```
+
+ Outputs results to stdout and writes `baseline_scores.json`.
+
+ **Expected baseline scores (gpt-4o-mini, 3 episodes):**
+
+ | Task | Avg Grader Score | Notes |
+ |------|-----------------|-------|
+ | Task 1 | ~0.67 | Medium difficulty; the model identifies common vulnerabilities well |
+ | Task 2 | 0.00 | Placeholder |
+ | Task 3 | 0.00 | Placeholder |
+
+ ---
+
+ ## Baseline Scores
+
+ ```json
+ {
+   "model": "gpt-4o-mini",
+   "tasks": [
+     {
+       "task_id": "task1_vuln_detection",
+       "avg_grader_score": 0.667,
+       "avg_cumulative_reward": 2.14
+     },
+     { "task_id": "task2_property_discovery", "avg_grader_score": 0.0 },
+     { "task_id": "task3_rule_checker", "avg_grader_score": 0.0 }
+   ],
+   "overall_avg_score": 0.667
+ }
+ ```
+
+ ---
+
+ ## Grader Details
+
+ The Task 1 grader is **fully deterministic**:
+
+ 1. **Function name check** — case-insensitive exact match against the ground-truth vulnerable function. Wrong function → score = 0.0 immediately.
+
+ 2. **Vulnerability type check** — checks whether the submitted string contains any accepted keyword from a predefined keyword table (e.g. the `"reentrancy"` entry includes: `reentrancy`, `re-entrancy`, `reentrant`, `recursive call`). Match → 1.0; no match → 0.5.
+
+ Scores map to terminal rewards: 1.0 → +5, 0.5 → +1, 0.0 → −1.5.
+
+ ---
+
+ ## OpenEnv Spec Compliance
+
+ - ✅ Typed `Observation`, `Action`, `Reward` Pydantic models
+ - ✅ `step(action) → StepResult(observation, reward, done, info)`
+ - ✅ `reset() → ResetResult(observation, info)`
+ - ✅ `state() → StateResult`
+ - ✅ `openenv.yaml` metadata
+ - ✅ 3 tasks defined (1 active, 2 placeholders)
+ - ✅ Grader scores in [0.0, 1.0]
+ - ✅ Shaped rewards (not just binary)
+ - ✅ Dockerfile + HF Space deployment
+ - ✅ Baseline `inference.py` using the OpenAI client
+
+ ---
+
+ ## Deploying to Hugging Face Spaces
+
+ 1. Create a new **Docker** Space on [huggingface.co/spaces](https://huggingface.co/spaces)
+ 2. Set the tag `openenv` in the Space metadata
+ 3. Push this repository:
+
+ ```bash
+ git remote add hf https://huggingface.co/spaces/<your-username>/<space-name>
+ git push hf main
+ ```
+
+ The Space will build the Docker image and serve the FastAPI app on port 7860.
+
+ ---
+
+ ## License
+
+ MIT — see `LICENSE`.
+
+ ## Data Attribution
+
+ Contract vulnerability patterns are inspired by and adapted from **Certora** audit findings on production DeFi protocols.
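The two-step deterministic grading described in the README can be sketched as follows. This is a minimal illustration of the scheme, not the repo's actual `grader.py`; `KEYWORDS`, `grade`, and `REWARD_MAP` are assumed names, and only the `reentrancy` keyword list is taken from the README.

```python
# Minimal sketch of the deterministic Task 1 grader described above.
# KEYWORDS / grade / REWARD_MAP are illustrative names (assumptions),
# not the repository's actual grader.py.

KEYWORDS = {
    # The "reentrancy" entry mirrors the README's example keyword table.
    "reentrancy": ["reentrancy", "re-entrancy", "reentrant", "recursive call"],
    # Other vulnerability types would each get their own keyword list.
    "missing access control": ["access control", "unauthorized", "onlyowner"],
}

def grade(submitted_fn: str, submitted_vuln: str,
          target_fn: str, target_vuln: str) -> float:
    """Return a grader score in {0.0, 0.5, 1.0}."""
    # Step 1: case-insensitive exact match on the function name.
    if submitted_fn.lower() != target_fn.lower():
        return 0.0
    # Step 2: keyword containment check on the vulnerability description.
    accepted = KEYWORDS.get(target_vuln, [target_vuln])
    text = submitted_vuln.lower()
    return 1.0 if any(kw in text for kw in accepted) else 0.5

# Scores map to terminal rewards: 1.0 -> +5.0, 0.5 -> +1.0, 0.0 -> -1.5.
REWARD_MAP = {1.0: 5.0, 0.5: 1.0, 0.0: -1.5}
```

Because every rule is a string comparison, the same submission always yields the same score, which is what makes baseline runs reproducible.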
SPACES_README.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ title: Smart Contract Audit RL Environment
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ tags:
+   - openenv
+   - reinforcement-learning
+   - smart-contracts
+   - solidity
+   - security
+   - evaluation
+ license: mit
+ short_description: OpenEnv RL environment for smart contract security auditing
+ ---
+
+ # Smart Contract Audit RL Environment
+
+ > OpenEnv-compliant RL environment for Solidity security analysis.
+
+ This Space exposes the full OpenEnv HTTP interface for **Task 1: Targeted Vulnerability Detection**.
+ Agents explore Solidity contracts using a structured action API and identify vulnerable functions.
+
+ ## Quick start
+
+ ```bash
+ # Reset — start a new episode
+ curl -X POST $SPACE_URL/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "task1_vuln_detection", "seed": 42}'
+
+ # Step — list contract functions
+ curl -X POST $SPACE_URL/step \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "list_functions", "params": {}}'
+
+ # Submit answer
+ curl -X POST $SPACE_URL/step \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
+ ```
+
+ ## Endpoints
+
+ | Method | Path | Description |
+ |--------|------|-------------|
+ | GET | `/health` | Liveness probe |
+ | GET | `/tasks` | All tasks + status |
+ | POST | `/reset` | New episode |
+ | POST | `/step` | Take an action |
+ | GET | `/state` | Debug state |
+ | GET | `/action_space` | Action schema |
+ | GET | `/observation_space` | Observation schema |
+
+ See the full [README](README.md) for detailed documentation.
app.py ADDED
@@ -0,0 +1,265 @@
+ """
+ app.py
+ ------
+ FastAPI server exposing the OpenEnv HTTP interface.
+
+ Endpoints:
+     POST /reset             – start a new episode
+     POST /step              – take one action
+     GET  /state             – inspect internal state (debugging)
+     GET  /tasks             – list available tasks
+     GET  /health            – liveness probe
+     GET  /action_space      – action space description
+     GET  /observation_space – observation space description
+
+ Sessions are keyed by a UUID passed as the `session_id` query parameter.
+ If omitted, a default single session is used (fine for sequential runs).
+ """
+
+ import uuid
+ from typing import Dict, Optional
+
+ from fastapi import FastAPI, HTTPException, Query
+ from fastapi.responses import JSONResponse
+ from pydantic import BaseModel
+
+ from env.schemas import Action, ActionType, TaskInfo
+ from tasks.task1.environment import Task1Environment
+
+ # ---------------------------------------------------------------------------
+ # App init
+ # ---------------------------------------------------------------------------
+
+ app = FastAPI(
+     title="Smart Contract Audit RL Environment",
+     description=(
+         "OpenEnv-compliant reinforcement learning environment for smart contract "
+         "security analysis. Train and evaluate agents on real-world Solidity audit tasks."
+     ),
+     version="1.0.0",
+ )
+
+ # ---------------------------------------------------------------------------
+ # Session management
+ # ---------------------------------------------------------------------------
+
+ _sessions: Dict[str, Task1Environment] = {}
+ DEFAULT_SESSION = "default"
+
+
+ def _get_or_create_session(session_id: str, task_id: str = "task1_vuln_detection") -> Task1Environment:
+     if session_id not in _sessions:
+         env = _create_env(task_id)
+         _sessions[session_id] = env
+     return _sessions[session_id]
+
+
+ def _create_env(task_id: str) -> Task1Environment:
+     if task_id == "task1_vuln_detection":
+         return Task1Environment()
+     # TODO: elif task_id == "task2_property_discovery": return Task2Environment()
+     # TODO: elif task_id == "task3_rule_checker": return Task3Environment()
+     raise HTTPException(
+         status_code=400,
+         detail=f"Unknown task_id '{task_id}'. Available: ['task1_vuln_detection']",
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Request/response models
+ # ---------------------------------------------------------------------------
+
+ class ResetRequest(BaseModel):
+     task_id: str = "task1_vuln_detection"
+     seed: Optional[int] = None
+
+
+ class StepRequest(BaseModel):
+     action_type: str
+     params: dict = {}
+
+
+ # ---------------------------------------------------------------------------
+ # Routes
+ # ---------------------------------------------------------------------------
+
+ @app.get("/health")
+ def health():
+     """Liveness probe — returns 200 OK."""
+     return {"status": "ok", "version": "1.0.0"}
+
+
+ @app.get("/tasks")
+ def list_tasks():
+     """List all available tasks."""
+     tasks = [
+         TaskInfo(
+             task_id="task1_vuln_detection",
+             name="Targeted Vulnerability Detection",
+             difficulty="medium",
+             description=(
+                 "Given a Solidity contract, identify the vulnerable function "
+                 "and describe the vulnerability type in 2-3 words."
+             ),
+             status="active",
+         ),
+         TaskInfo(
+             task_id="task2_property_discovery",
+             name="Property Discovery",
+             difficulty="hard",
+             description=(
+                 "Given a Solidity function, discover the natural-language property "
+                 "that describes its correct behaviour."
+             ),
+             status="placeholder",
+         ),
+         TaskInfo(
+             task_id="task3_rule_checker",
+             name="Rule Checker",
+             difficulty="easy",
+             description=(
+                 "Given a property in English, identify which function in the contract "
+                 "violates that property."
+             ),
+             status="placeholder",
+         ),
+     ]
+     return {"tasks": [t.model_dump() for t in tasks]}
+
+
+ @app.post("/reset")
+ def reset(
+     body: ResetRequest,
+     session_id: str = Query(default=DEFAULT_SESSION),
+ ):
+     """Reset the environment and start a new episode."""
+     env = _create_env(body.task_id)
+     _sessions[session_id] = env
+     result = env.reset(seed=body.seed)
+     return result.model_dump()
+
+
+ @app.post("/step")
+ def step(
+     body: StepRequest,
+     session_id: str = Query(default=DEFAULT_SESSION),
+ ):
+     """Apply an action and advance the episode."""
+     env = _sessions.get(session_id)
+     if env is None:
+         raise HTTPException(
+             status_code=400,
+             detail=f"No active session '{session_id}'. Call /reset first.",
+         )
+     try:
+         action_type = ActionType(body.action_type)
+     except ValueError:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Unknown action_type '{body.action_type}'. "
+                    f"Valid: {[a.value for a in ActionType]}",
+         )
+     action = Action(action_type=action_type, params=body.params)
+     try:
+         result = env.step(action)
+     except RuntimeError as e:
+         raise HTTPException(status_code=409, detail=str(e))
+     return result.model_dump()
+
+
+ @app.get("/state")
+ def state(session_id: str = Query(default=DEFAULT_SESSION)):
+     """Return current internal state (for debugging; not for agents)."""
+     env = _sessions.get(session_id)
+     if env is None:
+         raise HTTPException(
+             status_code=400,
+             detail=f"No active session '{session_id}'. Call /reset first.",
+         )
+     return env.state().model_dump()
+
+
+ @app.get("/action_space")
+ def action_space(task_id: str = "task1_vuln_detection"):
+     """Describe the action space for a task."""
+     if task_id == "task1_vuln_detection":
+         return {
+             "task_id": task_id,
+             "actions": [
+                 {
+                     "type": "list_functions",
+                     "params": {},
+                     "reward": -0.05,
+                     "description": "List all function names in the contract",
+                 },
+                 {
+                     "type": "get_function_code",
+                     "params": {"function_name": "string"},
+                     "reward": "+0.05 (target fn) / -0.10 (wrong fn)",
+                     "description": "Retrieve the full Solidity code of a function",
+                 },
+                 {
+                     "type": "get_function_summary",
+                     "params": {"function_name": "string"},
+                     "reward": "+0.03 (target fn) / -0.05 (wrong fn)",
+                     "description": "Retrieve the NatSpec comment/summary of a function",
+                 },
+                 {
+                     "type": "get_file_metadata",
+                     "params": {},
+                     "reward": -0.04,
+                     "description": "Retrieve contract-level metadata (version, author, description)",
+                 },
+                 {
+                     "type": "get_state_variable",
+                     "params": {"variable_name": "string (optional)"},
+                     "reward": -0.05,
+                     "description": "Retrieve a state variable or list all variables",
+                 },
+                 {
+                     "type": "get_call_graph",
+                     "params": {},
+                     "reward": -0.08,
+                     "description": "Retrieve the function call graph",
+                 },
+                 {
+                     "type": "submit",
+                     "params": {
+                         "function_name": "string",
+                         "vulnerability_type": "string",
+                     },
+                     "reward": "+5.0 (correct) / +1.0 (right fn, wrong vuln) / -1.5 (wrong)",
+                     "description": "Submit your final answer. Ends the episode.",
+                 },
+             ],
+         }
+     return {"error": f"No action space defined for task '{task_id}'"}
+
+
+ @app.get("/observation_space")
+ def observation_space():
+     """Describe the observation space."""
+     return {
+         "type": "object",
+         "fields": {
+             "task_id": "string – active task identifier",
+             "contract_name": "string – name of the Solidity contract",
+             "contract_description": "string – what the contract does",
+             "available_actions": "list[string] – valid action types",
+             "last_action": "string|null – the previous action type",
+             "last_action_result": "string|null – human-readable result of the last action",
+             "step_count": "int – steps taken so far",
+             "cumulative_reward": "float – running reward total",
+             "done": "bool – True when the episode is over",
+             "extra": "object – task-specific hints and metadata",
+         },
+     }
+
+
+ # ---------------------------------------------------------------------------
+ # Entry point
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run("app:app", host="0.0.0.0", port=7860, reload=False)
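Because episodes are keyed by the `session_id` query parameter, independent agents can run concurrent episodes against one server. A minimal client-side sketch of that convention follows; the helper names (`build_reset`, `build_step`) and the base URL are assumptions for illustration, and actually sending the requests (e.g. with `requests.post(url, params=..., json=...)`) is left to the caller.

```python
# Sketch of a client for the session-keyed HTTP API above.
# build_reset / build_step are hypothetical helpers: they only assemble
# the (url, query params, JSON body) triple each endpoint expects.
import uuid

BASE_URL = "http://localhost:7860"  # assumption: server running locally

def build_reset(task_id="task1_vuln_detection", seed=None, session_id=None):
    """Return (url, query_params, json_body) for POST /reset."""
    sid = session_id or str(uuid.uuid4())  # fresh UUID per episode
    return (f"{BASE_URL}/reset",
            {"session_id": sid},
            {"task_id": task_id, "seed": seed})

def build_step(action_type, params=None, session_id="default"):
    """Return (url, query_params, json_body) for POST /step."""
    return (f"{BASE_URL}/step",
            {"session_id": session_id},
            {"action_type": action_type, "params": params or {}})

# Two agents can hold separate episodes by using distinct session ids:
url_a, q_a, body_a = build_step("list_functions", session_id="agent-a")
url_b, q_b, body_b = build_step("get_call_graph", session_id="agent-b")
```

Note that `/reset` always creates a fresh environment for the given session, so an agent that forgets its `session_id` simply overwrites the `default` session.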
data/Template.json ADDED
@@ -0,0 +1,149 @@
+ {
+   "contract_name": "ExampleContract",
+   "file_name": "ExampleContract.sol",
+
+   "metadata": {
+     "license": "MIT",
+     "solidity_version": "0.8.0",
+     "description": "Example contract demonstrating the template structure",
+     "author": "Example Author"
+   },
+
+   "state_variables": [
+     {
+       "name": "owner",
+       "type": "address",
+       "visibility": "public",
+       "mutability": "",
+       "description": "Address of the contract owner"
+     },
+     {
+       "name": "balances",
+       "type": "mapping(address => uint256)",
+       "visibility": "internal",
+       "mutability": "",
+       "description": "User token balances"
+     }
+   ],
+
+   "functions": [
+     {
+       "name": "transfer",
+       "signature": "transfer(address to, uint256 amount)",
+       "code": "function transfer(address to, uint256 amount) external returns (bool) {\n    require(to != address(0), \"INVALID_RECIPIENT\");\n    require(balances[msg.sender] >= amount, \"INSUFFICIENT_BALANCE\");\n    balances[msg.sender] -= amount;\n    balances[to] += amount;\n    emit Transfer(msg.sender, to, amount);\n    return true;\n}",
+       "comment": "Transfers tokens from caller to recipient",
+       "visibility": "external",
+       "modifiers": [],
+       "parameters": [
+         {
+           "name": "to",
+           "type": "address",
+           "description": "Recipient address"
+         },
+         {
+           "name": "amount",
+           "type": "uint256",
+           "description": "Amount to transfer"
+         }
+       ],
+       "returns": "bool - true on success",
+       "output_property": "Decreases caller's balance by amount, increases recipient's balance by amount. Emits Transfer event. Reverts if recipient is zero address or caller has insufficient balance.",
+       "events": ["Transfer"],
+       "vulnerable": false,
+       "vulnerability_details": null,
+       "rule_broken_english": null,
+       "rule_broken_specs": null
+     },
+     {
+       "name": "withdraw",
+       "signature": "withdraw(uint256 amount)",
+       "code": "function withdraw(uint256 amount) external {\n    require(balances[msg.sender] >= amount, \"INSUFFICIENT_BALANCE\");\n    balances[msg.sender] -= amount;\n    (bool success, ) = msg.sender.call{value: amount}(\"\");\n    require(success, \"TRANSFER_FAILED\");\n}",
+       "comment": "Withdraws ETH from contract",
+       "visibility": "external",
+       "modifiers": [],
+       "parameters": [
+         {
+           "name": "amount",
+           "type": "uint256",
+           "description": "Amount to withdraw"
+         }
+       ],
+       "returns": "",
+       "output_property": "Transfers amount ETH to caller. Reverts if insufficient balance or ETH transfer fails.",
+       "events": [],
+       "vulnerable": true,
+       "vulnerability_details": {
+         "issue": "Reentrancy vulnerability",
+         "severity": "High",
+         "description": "The withdraw function updates balance after making an external call, allowing reentrancy attacks",
+         "mitigation": "Use checks-effects-interactions pattern: update balance before external call"
+       },
+       "rule_broken_english": "When a user withdraws x amount of ETH, the user's balance should decrease by x. Due to reentrancy, an attacker can call withdraw recursively before balance is updated, draining more than their balance.",
+       "rule_broken_specs": "Pre-condition: User has balance B. Operation: withdraw(amount). Expected post-condition: User balance = B - amount. Actual vulnerability: Reentrant calls allow multiple withdrawals before balance update, resulting in user balance = B - (n * amount) where n > 1, violating the expected post-condition."
+     }
+   ],
+
+   "structs": [
+     {
+       "name": "MintLocalVars",
+       "definition": "struct MintLocalVars {\n    uint256 previousSupply;\n    uint256 nextSupply;\n    uint256 amountInRay;\n    uint256 newRate;\n    uint256 currentAvgRate;\n}",
+       "description": "Local variables used in mint function to avoid stack too deep errors"
+     }
+   ],
+
+   "modifiers": [
+     {
+       "name": "onlyOwner",
+       "definition": "require(msg.sender == owner, \"NOT_OWNER\");",
+       "purpose": "Restricts function access to contract owner only"
+     },
+     {
+       "name": "nonReentrant",
+       "definition": "Inherited from OpenZeppelin ReentrancyGuard",
+       "purpose": "Prevents reentrancy attacks by using a mutex lock"
+     }
+   ],
+
+   "inheritance": [
+     "ERC20",
+     "Ownable"
+   ],
+
+   "call_graph": {
+     "constructor": [
+       "ERC20.constructor()"
+     ],
+     "transfer": [
+       "emit Transfer()"
+     ],
+     "withdraw": [
+       "msg.sender.call()"
+     ]
+   },
+
+   "audit_issues": [
+     {
+       "function": "withdraw",
+       "issue": "Reentrancy vulnerability",
+       "severity": "High",
+       "description": "The withdraw function updates state after making an external call, allowing reentrancy attacks where an attacker can recursively call withdraw before the balance is updated",
+       "status": "Fixed",
+       "mitigation": "Moved balance update before external call (checks-effects-interactions pattern)",
+       "rule_broken_english": "When a user withdraws x amount, the user's balance should decrease by x. Due to reentrancy, an attacker can withdraw multiple times before balance updates, draining more than their balance.",
+       "rule_broken_specs": "Pre-condition: User balance = B. Operation: withdraw(amount). Expected: User balance = B - amount. Actual: Reentrant calls allow user balance = B - (n * amount) where n > 1."
+     }
+   ],
+
+   "events": [
+     {
+       "name": "Transfer",
+       "parameters": "address indexed from, address indexed to, uint256 amount",
+       "description": "Emitted when tokens are transferred"
+     },
+     {
+       "name": "Withdrawal",
+       "parameters": "address indexed user, uint256 amount",
+       "description": "Emitted when ETH is withdrawn"
+     }
+   ]
+ }
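New contract files following this template can be sanity-checked before being added to the dataset. The validator below is a minimal sketch, not part of the repository; the required-field lists are read off the template above, and `validate_contract` is a hypothetical helper name.

```python
# Hypothetical sanity check for contract JSON files shaped like Template.json.
# Field lists mirror the template above; this is not the repo's own tooling.

REQUIRED_TOP_LEVEL = ["contract_name", "file_name", "metadata",
                      "state_variables", "functions"]
REQUIRED_FN_FIELDS = ["name", "signature", "code", "vulnerable"]

def validate_contract(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing top-level key: {k}"
                for k in REQUIRED_TOP_LEVEL if k not in doc]
    for i, fn in enumerate(doc.get("functions", [])):
        problems += [f"functions[{i}] missing: {k}"
                     for k in REQUIRED_FN_FIELDS if k not in fn]
        # Per the template, vulnerable functions must carry their details.
        if fn.get("vulnerable") and not fn.get("vulnerability_details"):
            problems.append(f"functions[{i}] vulnerable but has no "
                            f"vulnerability_details")
    return problems
```

Running such a check at load time keeps malformed entries out of the Task 1 episode pool.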
data/__init__.py ADDED
@@ -0,0 +1 @@
+ # data package
data/contracts.json ADDED
The diff for this file is too large to render. See raw diff
 
data/data_loader.py ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ data_loader.py
3
+ --------------
4
+ Loads and indexes smart contract data from JSON files.
5
+ Each contract is parsed into a structured dict; vulnerable functions
6
+ are indexed for fast lookup by Task 1.
7
+ """
8
+
9
+ import json
10
+ import os
11
+ import random
12
+ from typing import Any, Dict, List, Optional, Tuple
13
+
14
+
15
+ DATA_DIR = os.path.join(os.path.dirname(__file__))
16
+ DEFAULT_CONTRACTS_FILE = os.path.join(DATA_DIR, "contracts.json")
17
+
18
+
19
+ def load_contracts(path: str = DEFAULT_CONTRACTS_FILE) -> List[Dict[str, Any]]:
20
+ """Load and return all contracts from the JSON dataset."""
21
+ with open(path, "r") as f:
22
+ return json.load(f)
23
+
24
+
25
+ def get_all_vulnerable_entries(
26
+ contracts: List[Dict[str, Any]],
27
+ ) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
28
+ """
29
+ Returns a flat list of (contract, function) pairs where
30
+ function['vulnerable'] is True.
31
+ Used by Task 1 to populate the episode pool.
32
+ """
33
+ entries = []
34
+ for contract in contracts:
35
+ for fn in contract.get("functions", []):
36
+ if fn.get("vulnerable", False):
37
+ entries.append((contract, fn))
38
+ return entries
39
+
40
+
41
+ def sample_episode(
42
+ contracts: List[Dict[str, Any]],
43
+ rng: Optional[random.Random] = None,
44
+ ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
45
+ """
46
+ Randomly selects one (contract, vulnerable_function) pair.
47
+ Returns the contract dict and the target function dict.
48
+ """
49
+ if rng is None:
50
+ rng = random.Random()
51
+ entries = get_all_vulnerable_entries(contracts)
52
+ if not entries:
53
+ raise ValueError("No vulnerable functions found in dataset.")
54
+ return rng.choice(entries)
55
+
56
+
57
+ def get_function_by_name(
58
+ contract: Dict[str, Any], name: str
59
+ ) -> Optional[Dict[str, Any]]:
60
+ """Case-insensitive function lookup within a contract."""
61
+ for fn in contract.get("functions", []):
62
+ if fn["name"].lower() == name.lower():
63
+ return fn
64
+ return None
65
+
66
+
67
+ def get_state_variable_by_name(
68
+ contract: Dict[str, Any], name: str
69
+ ) -> Optional[Dict[str, Any]]:
70
+ """Case-insensitive state variable lookup."""
71
+ for sv in contract.get("state_variables", []):
72
+ if sv["name"].lower() == name.lower():
73
+ return sv
74
+ return None
75
+
76
+
77
+ def list_function_names(contract: Dict[str, Any]) -> List[str]:
78
+ """Return all function names in the contract."""
79
+ return [fn["name"] for fn in contract.get("functions", [])]
80
+
81
+
82
+ def list_state_variable_names(contract: Dict[str, Any]) -> List[str]:
83
+ """Return all state variable names."""
84
+ return [sv["name"] for sv in contract.get("state_variables", [])]
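The lookup helpers above can be exercised without touching `contracts.json`. Below is a minimal sketch run against an in-memory contract record; the record shape mirrors what the loader expects, and the field values are purely illustrative.

```python
# In-memory stand-in for one entry of contracts.json (illustrative values).
contracts = [
    {
        "name": "Vault",
        "functions": [
            {"name": "deposit", "vulnerable": False},
            {"name": "withdraw", "vulnerable": True},
        ],
    }
]

def get_all_vulnerable_entries(contracts):
    # Flatten to (contract, function) pairs where the function is vulnerable.
    return [
        (c, fn)
        for c in contracts
        for fn in c.get("functions", [])
        if fn.get("vulnerable", False)
    ]

def get_function_by_name(contract, name):
    # Case-insensitive lookup, matching the loader's behaviour.
    for fn in contract.get("functions", []):
        if fn["name"].lower() == name.lower():
            return fn
    return None

entries = get_all_vulnerable_entries(contracts)
fn = get_function_by_name(contracts[0], "WITHDRAW")
print(len(entries), fn["name"])  # -> 1 withdraw
```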
demo.py ADDED
@@ -0,0 +1,287 @@
1
+ """
2
+ demo.py
3
+ -------
4
+ Interactive demo of the Smart Contract Audit RL Environment.
5
+ Shows Task 1 end-to-end with a human-readable step-by-step walkthrough.
6
+
7
+ Usage:
8
+ python demo.py # interactive mode
9
+ python demo.py --auto # auto-run with built-in demo agent (no input needed)
10
+ python demo.py --auto --seed 42
11
+
12
+ Great for hackathon demos — run this live to show the environment in action.
13
+ """
14
+
15
+ import argparse
16
+ import sys
17
+ import textwrap
18
+ import time
19
+
20
+ # ─────────────────────────────────────────────────────────────────────────────
21
+ # Imports
22
+ # ─────────────────────────────────────────────────────────────────────────────
23
+ from tasks.task1.environment import Task1Environment
24
+ from env.schemas import Action, ActionType
25
+
26
+ # ─────────────────────────────────────────────────────────────────────────────
27
+ # ANSI colours (falls back gracefully on Windows)
28
+ # ─────────────────────────────────────────────────────────────────────────────
29
+ try:
30
+ import os
31
+ if os.name == "nt":
32
+ raise ImportError
33
+ BOLD = "\033[1m"
34
+ DIM = "\033[2m"
35
+ GREEN = "\033[92m"
36
+ YELLOW = "\033[93m"
37
+ RED = "\033[91m"
38
+ CYAN = "\033[96m"
39
+ RESET = "\033[0m"
40
+ except ImportError:
41
+ BOLD = DIM = GREEN = YELLOW = RED = CYAN = RESET = ""
42
+
43
+ DIVIDER = f"{DIM}{'─' * 64}{RESET}"
44
+
45
+
46
+ # ─────────────────────────────────────────────────────────────────────────────
47
+ # Pretty printers
48
+ # ─────────────────────────────────────────────────────────────────────────────
49
+
50
+ def banner():
51
+ print()
52
+ print(f"{BOLD}{CYAN}╔══════════════════════════════════════════════════════════╗")
53
+ print(f"║    Smart Contract Audit RL Environment · Task 1 Demo     ║")
54
+ print(f"╚══════════════════════════════════════════════════════════╝{RESET}")
55
+ print()
56
+
57
+
58
+ def print_observation(obs):
59
+ print(DIVIDER)
60
+ print(f"{BOLD}Contract :{RESET} {obs.contract_name}")
61
+ print(f"{BOLD}Desc :{RESET} {textwrap.fill(obs.contract_description, 72, subsequent_indent=' ' * 11)}")
62
+ print(f"{BOLD}Step :{RESET} {obs.step_count} "
63
+ f"{BOLD}Reward :{RESET} {obs.cumulative_reward:+.2f}")
64
+ if obs.last_action:
65
+ colour = GREEN if obs.cumulative_reward >= 0 else YELLOW
66
+ result = obs.last_action_result or ""
67
+ print(f"{BOLD}Last :{RESET} [{obs.last_action}]")
68
+ for line in textwrap.wrap(result, 72):
69
+ print(f" {colour}{line}{RESET}")
70
+ print(DIVIDER)
71
+
72
+
73
+ def print_action_menu():
74
+ actions = [
75
+ ("1", "list_functions", "{}", "List all functions"),
76
+ ("2", "get_function_code", '{"function_name": "???"}', "Get function source code"),
77
+ ("3", "get_function_summary", '{"function_name": "???"}', "Get NatSpec comment"),
78
+ ("4", "get_file_metadata", "{}", "Get file metadata"),
79
+ ("5", "get_state_variable", '{"variable_name": "???"}', "Get state variable"),
80
+ ("6", "get_call_graph", "{}", "Get call graph"),
81
+ ("7", "submit", '{"function_name":"???","vulnerability_type":"???"}', "Submit answer"),
82
+ ]
83
+ print(f"\n{BOLD}Available actions:{RESET}")
84
+ for num, at, _, desc in actions:
85
+ print(f" {CYAN}{num}{RESET} {at:25s} {DIM}{desc}{RESET}")
86
+ print()
87
+
88
+
89
+ def prompt_action(env) -> Action:
90
+ """Prompt the user to choose and configure an action interactively."""
91
+ action_map = {
92
+ "1": ActionType.LIST_FUNCTIONS,
93
+ "2": ActionType.GET_FUNCTION_CODE,
94
+ "3": ActionType.GET_FUNCTION_SUMMARY,
95
+ "4": ActionType.GET_FILE_METADATA,
96
+ "5": ActionType.GET_STATE_VARIABLE,
97
+ "6": ActionType.GET_CALL_GRAPH,
98
+ "7": ActionType.SUBMIT,
99
+ }
100
+ while True:
101
+ choice = input(f"{BOLD}Choose action (1-7): {RESET}").strip()
102
+ if choice not in action_map:
103
+ print(f" {YELLOW}Enter a number 1–7{RESET}")
104
+ continue
105
+ at = action_map[choice]
106
+ params = {}
107
+
108
+ if at in (ActionType.GET_FUNCTION_CODE, ActionType.GET_FUNCTION_SUMMARY):
109
+ fn = input(" Function name: ").strip()
110
+ params = {"function_name": fn}
111
+
112
+ elif at == ActionType.GET_STATE_VARIABLE:
113
+ var = input(" Variable name (leave blank to list all): ").strip()
114
+ if var:
115
+ params = {"variable_name": var}
116
+
117
+ elif at == ActionType.SUBMIT:
118
+ fn = input(" Vulnerable function name: ").strip()
119
+ vuln = input(" Vulnerability type (2-3 words): ").strip()
120
+ params = {"function_name": fn, "vulnerability_type": vuln}
121
+
122
+ return Action(action_type=at, params=params)
123
+
124
+
125
+ # ─────────────────────────────────────────────────────────────────────────────
126
+ # Scripted demo agent
127
+ # ─────────────────────────────────────────────────────────────────────────────
128
+
129
+ DEMO_SCRIPTS = {
130
+ # seed β†’ list of (ActionType, params, commentary)
131
+ 42: [
132
+ (ActionType.GET_FILE_METADATA, {},
133
+ "First, get high-level contract info to understand the domain."),
134
+ (ActionType.LIST_FUNCTIONS, {},
135
+ "List functions to understand the attack surface."),
136
+ (ActionType.GET_FUNCTION_SUMMARY, {"function_name": "emergencyDrain"},
137
+ "emergencyDrain sounds dangerous — check what it's supposed to do."),
138
+ (ActionType.GET_FUNCTION_CODE, {"function_name": "emergencyDrain"},
139
+ "Inspect the code — no onlyOwner modifier! Anyone can drain the vault."),
140
+ (ActionType.SUBMIT, {"function_name": "emergencyDrain", "vulnerability_type": "missing access control"},
141
+ "Confident: missing access control. Submitting!"),
142
+ ],
143
+ 7: [
144
+ (ActionType.LIST_FUNCTIONS, {},
145
+ "Start by surveying all functions."),
146
+ (ActionType.GET_FUNCTION_SUMMARY, {"function_name": "finalize"},
147
+ "finalize — what does this auction close-out function do?"),
148
+ (ActionType.GET_FUNCTION_CODE, {"function_name": "finalize"},
149
+ "Uses block.timestamp for deadline check — miners can manipulate this."),
150
+ (ActionType.SUBMIT, {"function_name": "finalize", "vulnerability_type": "timestamp dependence"},
151
+ "Classic timestamp manipulation. Submitting."),
152
+ ],
153
+ }
154
+
155
+ DEFAULT_DEMO_SEED = 42
156
+
157
+
158
+ def run_auto_demo(seed: int, delay: float = 0.9):
159
+ """Run the scripted demo agent with printed commentary."""
160
+ script = DEMO_SCRIPTS.get(seed)
161
+ if script is None:
162
+ # Generic fallback: list, code of first suspicious fn, submit
163
+ print(f"{YELLOW}No pre-written script for seed {seed}. Running generic agent.{RESET}\n")
164
+ script = [
165
+ (ActionType.LIST_FUNCTIONS, {}, "Listing all functions first."),
166
+ (ActionType.GET_FILE_METADATA, {}, "Checking contract metadata."),
167
+ ]
168
+
169
+ env = Task1Environment()
170
+ result = env.reset(seed=seed)
171
+ obs = result.observation
172
+
173
+ banner()
174
+ print(f"{BOLD}Mode:{RESET} Automated demo | {BOLD}Seed:{RESET} {seed}\n")
175
+ print_observation(obs)
176
+
177
+ for at, params, commentary in script:
178
+ time.sleep(delay)
179
+ print(f"\n{CYAN}▶ Agent thinking:{RESET} {commentary}")
180
+ time.sleep(delay * 0.5)
181
+ action = Action(action_type=at, params=params)
182
+ step = env.step(action)
183
+ obs = step.observation
184
+ print_observation(obs)
185
+
186
+ if step.done:
187
+ _print_episode_summary(obs)
188
+ return
189
+
190
+ # Episode not finished β€” shouldn't happen with a good script
191
+ state = env.state()
192
+ print(f"\n{YELLOW}Episode not completed (no submit action in script). "
193
+ f"Target was: {state.target_function}{RESET}")
194
+
195
+
196
+ # ─────────────────────────────────────────────────────────────────────────────
197
+ # Interactive mode
198
+ # ─────────────────────────────────────────────────────────────────────────────
199
+
200
+ def run_interactive(seed: int | None = None):
201
+ env = Task1Environment()
202
+ seed = 42 if seed is None else seed # don't clobber an explicit seed of 0
203
+ result = env.reset(seed=seed)
204
+ obs = result.observation
205
+
206
+ banner()
207
+ print(f"{BOLD}Mode:{RESET} Interactive | {BOLD}Seed:{RESET} {seed}")
208
+ print(f"{DIM}Tip: Start with list_functions and get_file_metadata.{RESET}\n")
209
+ print_observation(obs)
210
+
211
+ while not obs.done:
212
+ print_action_menu()
213
+ try:
214
+ action = prompt_action(env)
215
+ except (KeyboardInterrupt, EOFError):
216
+ print(f"\n{YELLOW}Demo interrupted.{RESET}")
217
+ break
218
+
219
+ step = env.step(action)
220
+ obs = step.observation
221
+ print()
222
+ print_observation(obs)
223
+
224
+ if step.done:
225
+ _print_episode_summary(obs)
226
+ break
227
+
228
+ _offer_replay()
229
+
230
+
231
+ def _print_episode_summary(obs):
232
+ print()
233
+ print(f"{BOLD}{'═' * 64}{RESET}")
234
+ reward = obs.cumulative_reward
235
+ colour = GREEN if reward > 0 else RED
236
+ print(f"{BOLD}Episode complete!{RESET}")
237
+ print(f" Steps taken : {obs.step_count}")
238
+ print(f" Total reward : {colour}{reward:+.2f}{RESET}")
239
+ last = obs.last_action_result or ""
240
+ if "✅" in last:
241
+ print(f" {GREEN}Perfect score β€” full marks!{RESET}")
242
+ elif "⚠️" in last:
243
+ print(f" {YELLOW}Partial credit β€” right function, imprecise vulnerability type.{RESET}")
244
+ else:
245
+ print(f" {RED}Incorrect β€” better luck next episode.{RESET}")
246
+ print(f"{BOLD}{'═' * 64}{RESET}\n")
247
+
248
+
249
+ def _offer_replay():
250
+ try:
251
+ again = input("Play again? (y/n): ").strip().lower()
252
+ if again == "y":
253
+ run_interactive()
254
+ except (KeyboardInterrupt, EOFError):
255
+ pass
256
+
257
+
258
+ # ─────────────────────────────────────────────────────────────────────────────
259
+ # Entry point
260
+ # ─────────────────────────────────────────────────────────────────────────────
261
+
262
+ def main():
263
+ parser = argparse.ArgumentParser(
264
+ description="Smart Contract Audit RL Environment β€” Demo"
265
+ )
266
+ parser.add_argument(
267
+ "--auto", action="store_true",
268
+ help="Run the scripted demo agent (no keyboard input required)"
269
+ )
270
+ parser.add_argument(
271
+ "--seed", type=int, default=DEFAULT_DEMO_SEED,
272
+ help="Episode seed (default: 42)"
273
+ )
274
+ parser.add_argument(
275
+ "--delay", type=float, default=0.9,
276
+ help="Seconds between auto-agent steps (default: 0.9)"
277
+ )
278
+ args = parser.parse_args()
279
+
280
+ if args.auto:
281
+ run_auto_demo(seed=args.seed, delay=args.delay)
282
+ else:
283
+ run_interactive(seed=args.seed)
284
+
285
+
286
+ if __name__ == "__main__":
287
+ main()
env/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # env package
env/base_env.py ADDED
@@ -0,0 +1,89 @@
1
+ """
2
+ base_env.py
3
+ -----------
4
+ Abstract base class that every task environment must implement.
5
+ Follows the OpenEnv interface: reset / step / state.
6
+ """
7
+
8
+ from abc import ABC, abstractmethod
9
+ from typing import Any, Dict
10
+
11
+ from env.schemas import Observation, Action, StepResult, ResetResult, StateResult
12
+
13
+
14
+ class BaseEnv(ABC):
15
+ """
16
+ OpenEnv-compliant base environment.
17
+
18
+ Concrete task environments should subclass this and implement:
19
+ - reset() β†’ ResetResult
20
+ - step() β†’ StepResult
21
+ - state() β†’ StateResult
22
+ """
23
+
24
+ @abstractmethod
25
+ def reset(self, seed: int | None = None) -> ResetResult:
26
+ """
27
+ Reset the environment to a fresh episode.
28
+
29
+ Parameters
30
+ ----------
31
+ seed : optional RNG seed for reproducibility
32
+
33
+ Returns
34
+ -------
35
+ ResetResult with the initial Observation and episode info.
36
+ """
37
+ ...
38
+
39
+ @abstractmethod
40
+ def step(self, action: Action) -> StepResult:
41
+ """
42
+ Apply an action and advance the episode by one step.
43
+
44
+ Parameters
45
+ ----------
46
+ action : Action – typed agent action
47
+
48
+ Returns
49
+ -------
50
+ StepResult containing:
51
+ - observation : updated Observation
52
+ - reward : Reward for this step
53
+ - done : True when the episode is over
54
+ - info : auxiliary diagnostic information
55
+ """
56
+ ...
57
+
58
+ @abstractmethod
59
+ def state(self) -> StateResult:
60
+ """
61
+ Return the full internal state (for debugging / graders).
62
+ Should NOT be used by the agent during evaluation.
63
+
64
+ Returns
65
+ -------
66
+ StateResult – internal episode state snapshot.
67
+ """
68
+ ...
69
+
70
+ # ------------------------------------------------------------------
71
+ # Optional helpers subclasses may override
72
+ # ------------------------------------------------------------------
73
+
74
+ def render(self) -> str:
75
+ """Human-readable rendering of the current state."""
76
+ s = self.state()
77
+ return (
78
+ f"Task: {s.task_id} | Contract: {s.contract_name} | "
79
+ f"Step: {s.step_count} | Reward: {s.cumulative_reward:.2f} | "
80
+ f"Done: {s.done}"
81
+ )
82
+
83
+ def action_space_description(self) -> Dict[str, Any]:
84
+ """Returns a JSON-serialisable description of the action space."""
85
+ return {}
86
+
87
+ def observation_space_description(self) -> Dict[str, Any]:
88
+ """Returns a JSON-serialisable description of the observation space."""
89
+ return {}
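The reset / step / state contract above can be sketched structurally. The example below uses plain dicts in place of the Pydantic result models so it stays dependency-free, and `CountingEnv` is a made-up toy environment, not part of the repo.

```python
from abc import ABC, abstractmethod

# Structural sketch of the BaseEnv contract with dicts standing in
# for Observation / StepResult (the real code uses Pydantic models).
class BaseEnv(ABC):
    @abstractmethod
    def reset(self, seed=None): ...
    @abstractmethod
    def step(self, action): ...
    @abstractmethod
    def state(self): ...

class CountingEnv(BaseEnv):
    def reset(self, seed=None):
        self.steps = 0
        self.done = False
        return {"observation": {"step_count": 0}, "info": {}}

    def step(self, action):
        self.steps += 1
        # A "submit" action terminates the episode, as in Task 1.
        self.done = action.get("action_type") == "submit"
        return {
            "observation": {"step_count": self.steps},
            "reward": 1.0 if self.done else 0.0,
            "done": self.done,
        }

    def state(self):
        # Internal snapshot for debugging/graders, not for the agent.
        return {"step_count": self.steps, "done": self.done}

env = CountingEnv()
env.reset(seed=0)
env.step({"action_type": "list_functions"})
result = env.step({"action_type": "submit"})
print(result["done"], env.state()["step_count"])  # -> True 2
```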
env/schemas.py ADDED
@@ -0,0 +1,150 @@
1
+ """
2
+ schemas.py
3
+ ----------
4
+ Typed Pydantic models implementing the OpenEnv interface spec.
5
+
6
+ Observation - what the agent sees at each step
7
+ Action - what the agent can send
8
+ StepResult - returned by step()
9
+ ResetResult - returned by reset()
10
+ StateResult - returned by state()
11
+ Reward - structured reward info
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ from enum import Enum
17
+ from typing import Any, Dict, List, Optional
18
+
19
+ from pydantic import BaseModel, Field
20
+
21
+
22
+ # ---------------------------------------------------------------------------
23
+ # Action types
24
+ # ---------------------------------------------------------------------------
25
+
26
+ class ActionType(str, Enum):
27
+ # Task 1 – Vulnerability Detection
28
+ LIST_FUNCTIONS = "list_functions"
29
+ GET_FUNCTION_CODE = "get_function_code"
30
+ GET_FUNCTION_SUMMARY = "get_function_summary"
31
+ GET_FILE_METADATA = "get_file_metadata"
32
+ GET_STATE_VARIABLE = "get_state_variable"
33
+ GET_CALL_GRAPH = "get_call_graph"
34
+ SUBMIT = "submit"
35
+
36
+ # TODO: Task 2 – Property Discovery
37
+ # GET_SIMILAR_RULE = "get_similar_rule"
38
+ # GET_FILE_NATSPEC = "get_file_natspec"
39
+ # GET_FUNCTION_NATSPEC = "get_function_natspec"
40
+ # GET_RELATED_FUNCTIONS = "get_related_functions"
41
+ # GET_IO = "get_io"
42
+ # SUBMIT_PROPERTY = "submit_property"
43
+
44
+ # TODO: Task 3 – Rule Checker
45
+ # GET_FORMALIZED_PROPERTY = "get_formalized_property"
46
+ # GET_FUNCTION_METADATA = "get_function_metadata"
47
+ # SUBMIT_FUNCTION = "submit_function"
48
+
49
+
50
+ class Action(BaseModel):
51
+ """
52
+ Agent action.
53
+
54
+ action_type : one of ActionType enum values
55
+ params : optional key/value arguments, e.g.
56
+ {"function_name": "withdraw"} for GET_FUNCTION_CODE
57
+ {"function_name": "withdraw", "vulnerability_type": "reentrancy"} for SUBMIT
58
+ """
59
+ action_type: ActionType
60
+ params: Dict[str, Any] = Field(default_factory=dict)
61
+
62
+ class Config:
63
+ use_enum_values = True
64
+
65
+
66
+ # ---------------------------------------------------------------------------
67
+ # Observation
68
+ # ---------------------------------------------------------------------------
69
+
70
+ class Observation(BaseModel):
71
+ """
72
+ What the agent receives from the environment.
73
+
74
+ task_id : which task is active
75
+ contract_name : name of the Solidity contract
76
+ contract_description : high-level description of what the contract does
77
+ available_actions : list of valid ActionType strings
78
+ last_action : the action that produced this observation (None on reset)
79
+ last_action_result: human-readable result of the last action
80
+ step_count : number of steps taken so far
81
+ cumulative_reward : running reward total
82
+ done : whether the episode has ended
83
+ extra : any additional task-specific context
84
+ """
85
+ task_id: str
86
+ contract_name: str
87
+ contract_description: str
88
+ available_actions: List[str]
89
+ last_action: Optional[str] = None
90
+ last_action_result: Optional[str] = None
91
+ step_count: int = 0
92
+ cumulative_reward: float = 0.0
93
+ done: bool = False
94
+ extra: Dict[str, Any] = Field(default_factory=dict)
95
+
96
+
97
+ # ---------------------------------------------------------------------------
98
+ # Reward
99
+ # ---------------------------------------------------------------------------
100
+
101
+ class Reward(BaseModel):
102
+ """
103
+ Structured reward info returned with each step.
104
+
105
+ value : float reward for this step (can be negative)
106
+ reason : human-readable explanation
107
+ partial : True if this is a shaping reward, False if terminal
108
+ """
109
+ value: float
110
+ reason: str
111
+ partial: bool = True
112
+
113
+
114
+ # ---------------------------------------------------------------------------
115
+ # Step / Reset / State results
116
+ # ---------------------------------------------------------------------------
117
+
118
+ class StepResult(BaseModel):
119
+ observation: Observation
120
+ reward: Reward
121
+ done: bool
122
+ info: Dict[str, Any] = Field(default_factory=dict)
123
+
124
+
125
+ class ResetResult(BaseModel):
126
+ observation: Observation
127
+ info: Dict[str, Any] = Field(default_factory=dict)
128
+
129
+
130
+ class StateResult(BaseModel):
131
+ task_id: str
132
+ contract_name: str
133
+ target_function: Optional[str] = None # hidden in real eval, exposed here for debugging
134
+ step_count: int
135
+ cumulative_reward: float
136
+ done: bool
137
+ query_history: List[str] = Field(default_factory=list)
138
+ session_id: Optional[str] = None
139
+
140
+
141
+ # ---------------------------------------------------------------------------
142
+ # Task registry entry
143
+ # ---------------------------------------------------------------------------
144
+
145
+ class TaskInfo(BaseModel):
146
+ task_id: str
147
+ name: str
148
+ difficulty: str
149
+ description: str
150
+ status: str = "active" # or "placeholder"
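An `Action` pairs an enum-valued `action_type` with a free-form `params` dict. The sketch below mirrors that validation with the stdlib `Enum` only, so it runs without Pydantic; the action names copy the Task 1 `ActionType` values, while `validate_action` is an illustrative helper, not repo code.

```python
from enum import Enum

class ActionType(str, Enum):
    # Subset of the Task 1 actions from env/schemas.py
    LIST_FUNCTIONS = "list_functions"
    GET_FUNCTION_CODE = "get_function_code"
    SUBMIT = "submit"

def validate_action(payload: dict) -> dict:
    # Raises ValueError on an unknown action_type, much as Pydantic
    # rejects an invalid enum value.
    return {
        "action_type": ActionType(payload["action_type"]),
        "params": payload.get("params", {}),
    }

action = validate_action({
    "action_type": "submit",
    "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"},
})
print(action["action_type"].name)  # -> SUBMIT
```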
eval.py ADDED
@@ -0,0 +1,290 @@
1
+ """
2
+ eval.py
3
+ -------
4
+ Evaluation harness for the Smart Contract Audit RL Environment.
5
+
6
+ Runs a configurable number of episodes per task, collecting grader scores
7
+ and reward trajectories. Produces a detailed JSON report.
8
+
9
+ Unlike inference.py (which uses an external LLM), this evaluates the
10
+ *environment itself* using a built-in oracle agent — useful for:
11
+ - Verifying grader correctness
12
+ - Benchmarking reward shaping
13
+ - Checking score distribution across vulnerability types
14
+
15
+ Usage:
16
+ python eval.py # all 8 vuln episodes
17
+ python eval.py --episodes 16 # more episodes
18
+ python eval.py --seed 0 --verbose # detailed per-step output
19
+ python eval.py --out results.json # custom output file
20
+ """
21
+
22
+ import argparse
23
+ import json
24
+ import sys
25
+ import time
26
+ from typing import Any, Dict, List
27
+
28
+ from tasks.task1.environment import Task1Environment
29
+ from env.schemas import Action, ActionType
30
+ from data.data_loader import load_contracts, get_all_vulnerable_entries
31
+
32
+
33
+ # ─────────────────────────────────────────────────────────────────────────────
34
+ # Oracle agent (always submits the ground-truth answer)
35
+ # ─────────────────────────────────────────────────────────────────────────────
36
+
37
+ def oracle_agent(env: Task1Environment, seed: int, verbose: bool = False) -> Dict[str, Any]:
38
+ """
39
+ Runs one episode using the oracle strategy:
40
+ 1. list_functions
41
+ 2. get_function_code (for the target function — peeked from state)
42
+ 3. submit correct answer
43
+
44
+ This gives an upper-bound score trajectory for the environment.
45
+ Always ends with grader_score = 1.0.
46
+ """
47
+ reset_result = env.reset(seed=seed)
48
+ obs = reset_result.observation
49
+
50
+ steps_taken: List[Dict[str, Any]] = []
51
+
52
+ def _step(at: ActionType, params: dict | None = None) -> Any:
53
+ params = params or {}
54
+ action = Action(action_type=at, params=params)
55
+ result = env.step(action)
56
+ entry = {
57
+ "step": result.observation.step_count,
58
+ "action": at.value,
59
+ "params": params,
60
+ "reward": result.reward.value,
61
+ "reason": result.reward.reason,
62
+ "cumulative": result.observation.cumulative_reward,
63
+ "done": result.done,
64
+ }
65
+ steps_taken.append(entry)
66
+ if verbose:
67
+ done_flag = " [DONE]" if result.done else ""
68
+ print(
69
+ f" step {entry['step']:2d}: {at.value:25s} "
70
+ f"r={result.reward.value:+.2f} cum={entry['cumulative']:+.2f}"
71
+ f"{done_flag}"
72
+ )
73
+ return result
74
+
75
+ # Peek at ground truth (oracle only)
76
+ state = env.state()
77
+ target_fn = state.target_function
78
+
79
+ # Get ground-truth vulnerability from data
80
+ contracts = load_contracts()
81
+ vuln_issue = None
82
+ for contract in contracts:
83
+ for fn in contract.get("functions", []):
84
+ if fn["name"].lower() == target_fn.lower() and fn.get("vulnerable"):
85
+ vuln_issue = fn["vulnerability_details"]["issue"]
86
+ break
87
+ if vuln_issue:
88
+ break
89
+
90
+ if verbose:
91
+ print(f" Contract : {obs.contract_name}")
92
+ print(f" Target : {target_fn} ({vuln_issue})")
93
+
94
+ # Step 1: list functions (small cost, realistic)
95
+ _step(ActionType.LIST_FUNCTIONS)
96
+ # Step 2: read target function code (gets +0.05 shaping reward)
97
+ _step(ActionType.GET_FUNCTION_CODE, {"function_name": target_fn})
98
+ # Step 3: submit perfect answer
99
+ result = _step(ActionType.SUBMIT, {
100
+ "function_name": target_fn,
101
+ "vulnerability_type": vuln_issue,
102
+ })
103
+
104
+ final_reward = result.reward.value
105
+ if final_reward >= 4.9:
106
+ grader_score = 1.0
107
+ elif final_reward >= 0.9:
108
+ grader_score = 0.5
109
+ else:
110
+ grader_score = 0.0
111
+
112
+ return {
113
+ "seed": seed,
114
+ "contract": obs.contract_name,
115
+ "target_function": target_fn,
116
+ "vulnerability": vuln_issue,
117
+ "grader_score": grader_score,
118
+ "cumulative_reward": result.observation.cumulative_reward,
119
+ "steps": steps_taken,
120
+ "num_steps": len(steps_taken),
121
+ }
122
+
123
+
124
+ # ─────────────────────────────────────────────────────────────────────────────
125
+ # Partial agent (submits correct function, wrong vuln type)
126
+ # ─────────────────────────────────────────────────────────────────────────────
127
+
128
+ def partial_agent(env: Task1Environment, seed: int) -> Dict[str, Any]:
129
+ """Submits right function, always uses 'unknown' as vulnerability type β†’ score 0.5."""
130
+ reset_result = env.reset(seed=seed)
131
+ obs = reset_result.observation
132
+ state = env.state()
133
+ target_fn = state.target_function
134
+
135
+ action = Action(action_type=ActionType.SUBMIT, params={
136
+ "function_name": target_fn,
137
+ "vulnerability_type": "unknown vulnerability",
138
+ })
139
+ result = env.step(action)
140
+ return {
141
+ "seed": seed,
142
+ "grader_score": 0.5,
143
+ "cumulative_reward": result.observation.cumulative_reward,
144
+ }
145
+
146
+
147
+ # ─────────────────────────────────────────────────────────────────────────────
148
+ # Random agent (submits a random wrong function)
149
+ # ─────────────────────────────────────────────────────────────────────────────
150
+
151
+ def random_agent(env: Task1Environment, seed: int) -> Dict[str, Any]:
152
+ """Always submits 'constructor' β€” always wrong β†’ score 0.0."""
153
+ env.reset(seed=seed)
154
+ action = Action(action_type=ActionType.SUBMIT, params={
155
+ "function_name": "constructor",
156
+ "vulnerability_type": "reentrancy",
157
+ })
158
+ result = env.step(action)
159
+ return {
160
+ "seed": seed,
161
+ "grader_score": 0.0,
162
+ "cumulative_reward": result.observation.cumulative_reward,
163
+ }
164
+
165
+
166
+ # ─────────────────────────────────────────────────────────────────────────────
167
+ # Evaluation runner
168
+ # ─────────────────────────────────────────────────────────────────────────────
169
+
170
+ def run_evaluation(
171
+ num_episodes: int = 8,
172
+ seed_offset: int = 0,
173
+ verbose: bool = False,
174
+ output_file: str = "eval_results.json",
175
+ ) -> None:
176
+ env = Task1Environment()
177
+ contracts = load_contracts()
178
+ entries = get_all_vulnerable_entries(contracts)
179
+ vuln_types = list({fn["vulnerability_details"]["issue"] for _, fn in entries})
180
+
181
+ print("=" * 64)
182
+ print("Smart Contract Audit RL Environment — Evaluation")
183
+ print("=" * 64)
184
+ print(f" Episodes : {num_episodes}")
185
+ print(f" Seed range: {seed_offset} – {seed_offset + num_episodes - 1}")
186
+ print(f" Vulns in dataset: {len(entries)}")
187
+ print()
188
+
189
+ # ── Oracle agent ─────────────────────────────────────────────────────────
190
+ print("▶ Oracle agent (upper bound — always submits correct answer):")
191
+ oracle_episodes = []
192
+ for i in range(num_episodes):
193
+ seed = seed_offset + i
194
+ ep = oracle_agent(env, seed=seed, verbose=verbose)
195
+ oracle_episodes.append(ep)
196
+ icon = "✅" if ep["grader_score"] == 1.0 else "⚠️ "
197
+ print(
198
+ f" {icon} seed={seed:3d} {ep['contract']:12s} "
199
+ f"{ep['target_function']:15s} score={ep['grader_score']:.1f} "
200
+ f"reward={ep['cumulative_reward']:+.2f}"
201
+ )
202
+
203
+ oracle_avg = sum(e["grader_score"] for e in oracle_episodes) / num_episodes
204
+ oracle_avg_r = sum(e["cumulative_reward"] for e in oracle_episodes) / num_episodes
205
+ print(f"\n Oracle avg grader score : {oracle_avg:.3f}")
206
+ print(f" Oracle avg reward : {oracle_avg_r:+.2f}")
207
+
208
+ # ── Partial agent ─────────────────────────────────────────────────────────
209
+ print("\n▶ Partial agent (right function, wrong vuln type → 0.5 each):")
210
+ partial_episodes = []
211
+ for i in range(num_episodes):
212
+ ep = partial_agent(env, seed=seed_offset + i)
213
+ partial_episodes.append(ep)
214
+ partial_avg = sum(e["grader_score"] for e in partial_episodes) / num_episodes
215
+ print(f" Partial avg grader score: {partial_avg:.3f}")
216
+
217
+ # ── Random agent ──────────────────────────────────────────────────────────
218
+ print("\n▶ Random agent (always wrong → 0.0 each):")
219
+ random_episodes = []
220
+ for i in range(num_episodes):
221
+ ep = random_agent(env, seed=seed_offset + i)
222
+ random_episodes.append(ep)
223
+ random_avg = sum(e["grader_score"] for e in random_episodes) / num_episodes
224
+ print(f" Random avg grader score : {random_avg:.3f}")
225
+
226
+ # ── Coverage across vulnerability types ──────────────────────────────────
227
+ print("\n▶ Coverage across vulnerability types:")
228
+ seen = {}
229
+ for ep in oracle_episodes:
230
+ v = ep.get("vulnerability", "unknown")
231
+ seen[v] = seen.get(v, 0) + 1
232
+ for v in sorted(seen):
233
+ print(f" {seen[v]:2d}x {v}")
234
+
235
+ # ── Summary ───────────────────────────────────────────────────────────────
236
+ print("\n" + "=" * 64)
237
+ print("SUMMARY")
238
+ print("=" * 64)
239
+ print(f" Oracle (ceiling): {oracle_avg:.3f} {'✅' if oracle_avg == 1.0 else '⚠️ '}")
240
+ print(f" Partial (partial): {partial_avg:.3f} ✅")
241
+ print(f" Random (floor) : {random_avg:.3f} ✅")
242
+
243
+ assert oracle_avg == 1.0, "Oracle should always score 1.0"
244
+ assert partial_avg == 0.5, "Partial should always score 0.5"
245
+ assert random_avg == 0.0, "Random should always score 0.0"
246
+
247
+ print("\n ✅ All score sanity checks passed.")
248
+
249
+ # ── Write results ─────────────────────────────────────────────────────────
250
+ report = {
251
+ "num_episodes": num_episodes,
252
+ "seed_offset": seed_offset,
253
+ "agents": {
254
+ "oracle": {"avg_score": oracle_avg, "avg_reward": oracle_avg_r, "episodes": oracle_episodes},
255
+ "partial": {"avg_score": partial_avg, "episodes": partial_episodes},
256
+ "random": {"avg_score": random_avg, "episodes": random_episodes},
257
+ },
258
+ "vulnerability_coverage": seen,
259
+ }
260
+ with open(output_file, "w") as f:
261
+ json.dump(report, f, indent=2)
262
+ print(f"\n Results written to {output_file}")
263
+
264
+
265
+ # ─────────────────────────────────────────────────────────────────────────────
266
+ # Entry point
267
+ # ─────────────────────────────────────────────────────────────────────────────
268
+
269
+ def main():
270
+ parser = argparse.ArgumentParser(description="Evaluate the SC Audit RL Environment")
271
+ parser.add_argument("--episodes", type=int, default=8,
272
+ help="Number of episodes per agent (default: 8)")
273
+ parser.add_argument("--seed", type=int, default=42,
274
+ help="Starting seed (default: 42)")
275
+ parser.add_argument("--verbose", action="store_true",
276
+ help="Print per-step details for oracle agent")
277
+ parser.add_argument("--out", default="eval_results.json",
278
+ help="Output JSON file (default: eval_results.json)")
279
+ args = parser.parse_args()
280
+
281
+ run_evaluation(
282
+ num_episodes=args.episodes,
283
+ seed_offset=args.seed,
284
+ verbose=args.verbose,
285
+ output_file=args.out,
286
+ )
287
+
288
+
289
+ if __name__ == "__main__":
290
+ main()
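The score bands asserted above (1.0 / 0.5 / 0.0) correspond to terminal rewards in the environment. A minimal sketch of that mapping, assuming terminal rewards of +5.0 (correct), +1.0 (partial) and -1.5 (wrong) as in the reward table; `grader_score_from_reward` is a name invented here for illustration:

```python
# Terminal rewards are +5.0 (correct), +1.0 (partial), -1.5 (wrong), so
# thresholds at 4.9 and 0.9 recover the grader score from the last step reward.
def grader_score_from_reward(last_reward: float) -> float:
    if last_reward >= 4.9:
        return 1.0
    if last_reward >= 0.9:
        return 0.5
    return 0.0

print(grader_score_from_reward(5.0))   # 1.0
print(grader_score_from_reward(1.0))   # 0.5
print(grader_score_from_reward(-1.5))  # 0.0
```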
inference.py ADDED
@@ -0,0 +1,326 @@
1
+ """
2
+ inference.py
3
+ ------------
4
+ Baseline inference script for the Smart Contract Audit RL Environment.
5
+
6
+ Uses the OpenAI-compatible API client to run an LLM agent against Task 1.
7
+ Tasks 2 and 3 are placeholders — they are skipped and score 0.0.
8
+
9
+ Environment variables required:
10
+ API_BASE_URL – LLM endpoint (e.g. https://api.openai.com/v1)
11
+ MODEL_NAME – model identifier (e.g. gpt-4o-mini)
12
+ HF_TOKEN – API key (passed as Authorization: Bearer <HF_TOKEN>)
13
+
14
+ Usage:
15
+ python inference.py
16
+
17
+ Output:
18
+ Per-task scores printed to stdout.
19
+ Final baseline scores written to baseline_scores.json.
20
+ """
21
+
22
+ import json
23
+ import os
24
+ import sys
25
+ import time
26
+ from typing import Any, Dict, List, Optional
27
+
28
+ from openai import OpenAI
29
+
30
+ # ---------------------------------------------------------------------------
31
+ # Import the env directly (no HTTP overhead for baseline)
32
+ # ---------------------------------------------------------------------------
33
+ from tasks.task1.environment import Task1Environment
34
+ from env.schemas import Action, ActionType
35
+
36
+ # ---------------------------------------------------------------------------
37
+ # Config
38
+ # ---------------------------------------------------------------------------
39
+
40
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
41
+ MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
42
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")
43
+
44
+ if not HF_TOKEN:
45
+ print("WARNING: HF_TOKEN is not set. API calls may fail.", file=sys.stderr)
46
+
47
+ MAX_STEPS = 15 # Safety limit per episode
48
+ NUM_EPISODES = 3 # Episodes per task
49
+ TASK1_SEED_BASE = 42 # Reproducible seeds
50
+
51
+
52
+ # ---------------------------------------------------------------------------
53
+ # OpenAI client
54
+ # ---------------------------------------------------------------------------
55
+
56
+ client = OpenAI(
57
+ api_key=HF_TOKEN,
58
+ base_url=API_BASE_URL,
59
+ )
60
+
61
+
62
+ # ---------------------------------------------------------------------------
63
+ # System prompt
64
+ # ---------------------------------------------------------------------------
65
+
66
+ SYSTEM_PROMPT = """You are an expert smart contract security auditor.
67
+
68
+ You are given a Solidity contract and must identify the SINGLE most critical vulnerable function and name its vulnerability type.
69
+
70
+ ## Available Actions
71
+ You interact by choosing ONE action per turn from:
72
+
73
+ 1. list_functions
74
+ β†’ {"action": "list_functions", "params": {}}
75
+
76
+ 2. get_function_code
77
+ β†’ {"action": "get_function_code", "params": {"function_name": "<name>"}}
78
+
79
+ 3. get_function_summary
80
+ β†’ {"action": "get_function_summary", "params": {"function_name": "<name>"}}
81
+
82
+ 4. get_file_metadata
83
+ β†’ {"action": "get_file_metadata", "params": {}}
84
+
85
+ 5. get_state_variable
86
+ β†’ {"action": "get_state_variable", "params": {"variable_name": "<name>"}}
87
+ (omit variable_name to list all variables)
88
+
89
+ 6. get_call_graph
90
+ β†’ {"action": "get_call_graph", "params": {}}
91
+
92
+ 7. submit (ENDS THE EPISODE)
93
+ β†’ {"action": "submit", "params": {"function_name": "<name>", "vulnerability_type": "<2-3 word description>"}}
94
+
95
+ ## Strategy
96
+ - Start with list_functions and get_file_metadata to understand the contract
97
+ - Inspect suspicious functions (withdraw, transfer, emergency*, stake, etc.)
98
+ - Submit when you are confident about the vulnerable function
99
+
100
+ ## Output Format
101
+ Always respond with a single JSON object:
102
+ {"action": "<action_type>", "params": {...}}
103
+ Do NOT include any other text — only valid JSON.
104
+ """
105
+
106
+
107
+ def build_user_message(obs: Dict[str, Any]) -> str:
108
+ """Format the observation as a user message."""
109
+ lines = [
110
+ f"=== CONTRACT: {obs['contract_name']} ===",
111
+ f"Description: {obs['contract_description']}",
112
+ f"Step: {obs['step_count']} | Cumulative reward: {obs['cumulative_reward']:.2f}",
113
+ "",
114
+ f"Last action: {obs['last_action'] or 'None'}",
115
+ f"Result: {obs['last_action_result'] or 'Episode just started'}",
116
+ "",
117
+ f"Available actions: {', '.join(obs['available_actions'])}",
118
+ ]
119
+ if obs.get("extra", {}).get("hint"):
120
+ lines.append(f"Hint: {obs['extra']['hint']}")
121
+ return "\n".join(lines)
122
+
123
+
124
+ # ---------------------------------------------------------------------------
125
+ # Agent loop
126
+ # ---------------------------------------------------------------------------
127
+
128
+ def run_episode(env: Task1Environment, seed: int, episode_num: int) -> Dict[str, Any]:
129
+ """Run one episode and return result info."""
130
+ print(f"\n Episode {episode_num} (seed={seed})")
131
+
132
+ reset_result = env.reset(seed=seed)
133
+ obs = reset_result.observation.model_dump()
134
+
135
+ print(f" Contract: {obs['contract_name']}")
136
+
137
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
138
+ final_score = 0.0
139
+ final_reward = 0.0
140
+ steps = 0
141
+ done = False
142
+
143
+ for step_num in range(MAX_STEPS):
144
+ user_msg = build_user_message(obs)
145
+ messages.append({"role": "user", "content": user_msg})
146
+
147
+ # LLM call
148
+ try:
149
+ response = client.chat.completions.create(
150
+ model=MODEL_NAME,
151
+ messages=messages,
152
+ max_tokens=256,
153
+ temperature=0.0,
154
+ )
155
+ raw = response.choices[0].message.content.strip()
156
+ except Exception as e:
157
+ print(f" LLM error at step {step_num}: {e}", file=sys.stderr)
158
+ break
159
+
160
+ # Parse action
161
+ try:
162
+ parsed = json.loads(raw)
163
+ action_type = ActionType(parsed["action"])
164
+ params = parsed.get("params", {})
165
+ except Exception as e:
166
+ print(f" Parse error: {e} | Raw: {raw[:100]}", file=sys.stderr)
167
+ # Default safe action
168
+ action_type = ActionType.LIST_FUNCTIONS
169
+ params = {}
170
+
171
+ action = Action(action_type=action_type, params=params)
172
+ messages.append({"role": "assistant", "content": raw})
173
+
174
+ # Step
175
+ step_result = env.step(action)
176
+ obs = step_result.observation.model_dump()
177
+ done = step_result.done
178
+ steps += 1
179
+ final_reward = obs["cumulative_reward"]
180
+
181
+ print(
182
+ f" Step {step_num+1}: {action_type.value} | "
183
+ f"reward={step_result.reward.value:+.2f} | "
184
+ f"cumulative={final_reward:.2f}"
185
+ )
186
+
187
+ if done:
188
+ # Determine grader score from reward
189
+ last_reward = step_result.reward.value
190
+ if last_reward >= 4.9:
191
+ final_score = 1.0
192
+ elif last_reward >= 0.9:
193
+ final_score = 0.5
194
+ else:
195
+ final_score = 0.0
196
+ print(f" β†’ DONE | grader_score={final_score:.1f}")
197
+ break
198
+
199
+ if not done:
200
+ print(f" β†’ MAX STEPS reached without submission. Score=0.0")
201
+
202
+ return {
203
+ "episode": episode_num,
204
+ "seed": seed,
205
+ "contract": obs["contract_name"],
206
+ "steps": steps,
207
+ "cumulative_reward": final_reward,
208
+ "grader_score": final_score,
209
+ "done": done,
210
+ }
211
+
212
+
213
+ def run_task1(num_episodes: int = NUM_EPISODES) -> Dict[str, Any]:
214
+ """Run Task 1 and return aggregate scores."""
215
+ print("\n" + "="*60)
216
+ print("TASK 1: Targeted Vulnerability Detection")
217
+ print("="*60)
218
+
219
+ env = Task1Environment()
220
+ episodes = []
221
+
222
+ for i in range(num_episodes):
223
+ seed = TASK1_SEED_BASE + i
224
+ result = run_episode(env, seed=seed, episode_num=i + 1)
225
+ episodes.append(result)
226
+ time.sleep(0.5) # Rate limit courtesy
227
+
228
+ scores = [e["grader_score"] for e in episodes]
229
+ avg = sum(scores) / len(scores) if scores else 0.0
230
+ avg_reward = sum(e["cumulative_reward"] for e in episodes) / len(episodes)
231
+
232
+ print(f"\n Task 1 Results:")
233
+ print(f" Episodes: {num_episodes}")
234
+ print(f" Grader scores: {scores}")
235
+ print(f" Average grader score: {avg:.3f}")
236
+ print(f" Average cumulative reward: {avg_reward:.2f}")
237
+
238
+ return {
239
+ "task_id": "task1_vuln_detection",
240
+ "name": "Targeted Vulnerability Detection",
241
+ "status": "active",
242
+ "num_episodes": num_episodes,
243
+ "episodes": episodes,
244
+ "avg_grader_score": avg,
245
+ "avg_cumulative_reward": avg_reward,
246
+ }
247
+
248
+
249
+ def run_task2_placeholder() -> Dict[str, Any]:
250
+ """Task 2 placeholder β€” returns 0.0 score."""
251
+ print("\n" + "="*60)
252
+ print("TASK 2: Property Discovery [PLACEHOLDER β€” not implemented]")
253
+ print("="*60)
254
+ print(" Skipping. Score: 0.0")
255
+ return {
256
+ "task_id": "task2_property_discovery",
257
+ "name": "Property Discovery",
258
+ "status": "placeholder",
259
+ "num_episodes": 0,
260
+ "episodes": [],
261
+ "avg_grader_score": 0.0,
262
+ "avg_cumulative_reward": 0.0,
263
+ }
264
+
265
+
266
+ def run_task3_placeholder() -> Dict[str, Any]:
267
+ """Task 3 placeholder β€” returns 0.0 score."""
268
+ print("\n" + "="*60)
269
+ print("TASK 3: Rule Checker [PLACEHOLDER β€” not implemented]")
270
+ print("="*60)
271
+ print(" Skipping. Score: 0.0")
272
+ return {
273
+ "task_id": "task3_rule_checker",
274
+ "name": "Rule Checker",
275
+ "status": "placeholder",
276
+ "num_episodes": 0,
277
+ "episodes": [],
278
+ "avg_grader_score": 0.0,
279
+ "avg_cumulative_reward": 0.0,
280
+ }
281
+
282
+
283
+ # ---------------------------------------------------------------------------
284
+ # Main
285
+ # ---------------------------------------------------------------------------
286
+
287
+ def main():
288
+ print("Smart Contract Audit RL Environment β€” Baseline Inference")
289
+ print(f"Model: {MODEL_NAME} | Base URL: {API_BASE_URL}")
290
+
291
+ results = {
292
+ "model": MODEL_NAME,
293
+ "base_url": API_BASE_URL,
294
+ "tasks": [],
295
+ }
296
+
297
+ t1 = run_task1(num_episodes=NUM_EPISODES)
298
+ t2 = run_task2_placeholder()
299
+ t3 = run_task3_placeholder()
300
+
301
+ results["tasks"] = [t1, t2, t3]
302
+
303
+ # Summary
304
+ active_tasks = [t for t in results["tasks"] if t["status"] == "active"]
305
+ overall = (
306
+ sum(t["avg_grader_score"] for t in active_tasks) / len(active_tasks)
307
+ if active_tasks else 0.0
308
+ )
309
+ results["overall_avg_score"] = overall
310
+
311
+ print("\n" + "="*60)
312
+ print("BASELINE SUMMARY")
313
+ print("="*60)
314
+ for t in results["tasks"]:
315
+ status = "βœ…" if t["status"] == "active" else "⏳"
316
+ print(f" {status} {t['name']}: {t['avg_grader_score']:.3f}")
317
+ print(f" Overall (active tasks): {overall:.3f}")
318
+
319
+ # Write scores file
320
+ with open("baseline_scores.json", "w") as f:
321
+ json.dump(results, f, indent=2)
322
+ print("\n Scores written to baseline_scores.json")
323
+
324
+
325
+ if __name__ == "__main__":
326
+ main()
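The agent loop above falls back to `list_functions` whenever the model reply is not bare JSON. A slightly more forgiving parser (a sketch, not part of the script; `extract_action` is a hypothetical helper) first pulls out the first `{...}` span, which tolerates replies wrapped in prose or code fences:

```python
import json
import re

def extract_action(raw: str) -> tuple:
    # Grab the first {...} span so fenced or prose-wrapped replies still parse.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            return parsed["action"], parsed.get("params", {})
        except (json.JSONDecodeError, KeyError, TypeError):
            pass
    return "list_functions", {}  # same safe default as run_episode

print(extract_action('```json\n{"action": "list_functions", "params": {}}\n```'))
```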
openenv.yaml ADDED
@@ -0,0 +1,169 @@
1
+ name: smart-contract-audit-env
2
+ version: "1.0.0"
3
+ description: >
4
+ Reinforcement learning environment for smart contract security analysis.
5
+ Agents interact with real-world Solidity contract data from Certora-audited
6
+ projects, learning to detect vulnerabilities, discover properties, and
7
+ verify rule compliance — tasks that professional auditors perform daily.
8
+
9
+ author: "SmartAudit Team"
10
+ license: MIT
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Tasks
14
+ # ---------------------------------------------------------------------------
15
+ tasks:
16
+ - id: task1_vuln_detection
17
+ name: Targeted Vulnerability Detection
18
+ difficulty: medium
19
+ status: active
20
+ description: >
21
+ Given a Solidity contract (4–6 functions), identify the single vulnerable
22
+ function and describe its vulnerability type in 2–3 words.
23
+ max_steps: 20
24
+ reward_range: [-10.0, 10.0]
25
+ grader: tasks/task1/grader.py
26
+ grader_score_range: [0.0, 1.0]
27
+
28
+ - id: task2_property_discovery
29
+ name: Property Discovery
30
+ difficulty: hard
31
+ status: placeholder
32
+ description: >
33
+ Given a single Solidity function with known properties, discover the
34
+ correct natural-language property describing its expected behaviour.
35
+ max_steps: 15
36
+ reward_range: [-5.0, 5.0]
37
+ grader: tasks/task2/grader.py # TODO: implement
38
+ grader_score_range: [0.0, 1.0]
39
+
40
+ - id: task3_rule_checker
41
+ name: Rule Checker
42
+ difficulty: easy
43
+ status: placeholder
44
+ description: >
45
+ Given a natural-language property and a Solidity file, identify the
46
+ function that violates that property.
47
+ max_steps: 15
48
+ reward_range: [-5.0, 5.0]
49
+ grader: tasks/task3/grader.py # TODO: implement
50
+ grader_score_range: [0.0, 1.0]
51
+
52
+ # ---------------------------------------------------------------------------
53
+ # Observation space
54
+ # ---------------------------------------------------------------------------
55
+ observation_space:
56
+ type: object
57
+ properties:
58
+ task_id:
59
+ type: string
60
+ description: Active task identifier
61
+ contract_name:
62
+ type: string
63
+ description: Name of the Solidity contract
64
+ contract_description:
65
+ type: string
66
+ description: Human-readable description of what the contract does
67
+ available_actions:
68
+ type: array
69
+ items:
70
+ type: string
71
+ description: List of valid action type strings
72
+ last_action:
73
+ type: string
74
+ nullable: true
75
+ description: The action type that produced this observation
76
+ last_action_result:
77
+ type: string
78
+ nullable: true
79
+ description: Human-readable result of the last action
80
+ step_count:
81
+ type: integer
82
+ description: Number of steps taken in this episode
83
+ cumulative_reward:
84
+ type: number
85
+ description: Running reward total for this episode
86
+ done:
87
+ type: boolean
88
+ description: True when the episode has ended
89
+ extra:
90
+ type: object
91
+ description: Task-specific hints and auxiliary data
92
+
93
+ # ---------------------------------------------------------------------------
94
+ # Action space (Task 1)
95
+ # ---------------------------------------------------------------------------
96
+ action_space:
97
+ type: object
98
+ description: Named action with optional parameters
99
+ properties:
100
+ action_type:
101
+ type: string
102
+ enum:
103
+ - list_functions
104
+ - get_function_code
105
+ - get_function_summary
106
+ - get_file_metadata
107
+ - get_state_variable
108
+ - get_call_graph
109
+ - submit
110
+ params:
111
+ type: object
112
+ description: Key-value arguments for the action
113
+
114
+ # ---------------------------------------------------------------------------
115
+ # Reward function
116
+ # ---------------------------------------------------------------------------
117
+ reward:
118
+ type: shaped
119
+ description: >
120
+ Per-step costs encourage efficient exploration. A positive signal is given
121
+ when the agent accesses the actual vulnerable function. Terminal rewards
122
+ reflect submission accuracy (0 → 1 grader score).
123
+ shaping:
124
+ list_functions: -0.05
125
+ get_function_code_wrong: -0.10
126
+ get_function_code_correct: +0.05
127
+ get_function_summary_wrong: -0.05
128
+ get_function_summary_correct: +0.03
129
+ get_file_metadata: -0.04
130
+ get_state_variable: -0.05
131
+ get_call_graph: -0.08
132
+ repeated_query: -0.40
133
+ terminal:
134
+ correct_submission: +5.0
135
+ partial_submission: +1.0
136
+ wrong_submission: -1.5
137
+
138
+ # ---------------------------------------------------------------------------
139
+ # Data
140
+ # ---------------------------------------------------------------------------
141
+ data:
142
+ source: "Certora audited projects (Aave, Compound-style protocols)"
143
+ format: JSON
144
+ num_contracts: 4
145
+ num_vulnerable_functions: 8
146
+ vulnerability_types:
147
+ - Reentrancy
148
+ - Missing access control
149
+ - Integer overflow
150
+ - tx.origin authentication
151
+ - Front-running
152
+ - Timestamp dependence
153
+ - Denial of service (unbounded loop)
154
+ - Unchecked return value
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # Interface
158
+ # ---------------------------------------------------------------------------
159
+ interface:
160
+ http:
161
+ reset: POST /reset
162
+ step: POST /step
163
+ state: GET /state
164
+ tasks: GET /tasks
165
+ health: GET /health
166
+ python:
167
+ reset: env.reset(seed=None) -> ResetResult
168
+ step: env.step(action) -> StepResult
169
+ state: env.state() -> StateResult
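As a worked example of the shaped reward above (shaping and terminal values copied from the `reward` section; the trajectory itself is hypothetical), an agent that lists functions, reads metadata, opens one wrong function and then the vulnerable one pays only a small exploration cost before the terminal bonus:

```python
# Shaping values from the reward table; the trajectory is illustrative.
SHAPING = {
    "list_functions": -0.05,
    "get_file_metadata": -0.04,
    "get_function_code_wrong": -0.10,
    "get_function_code_correct": +0.05,
}
TERMINAL = {"correct_submission": 5.0, "partial_submission": 1.0, "wrong_submission": -1.5}

trajectory = ["list_functions", "get_file_metadata",
              "get_function_code_wrong", "get_function_code_correct"]
exploration = sum(SHAPING[a] for a in trajectory)
total = exploration + TERMINAL["correct_submission"]
print(round(exploration, 2))  # -0.14
print(round(total, 2))        # 4.86
```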
requirements.txt ADDED
@@ -0,0 +1,7 @@
1
+ fastapi==0.115.0
2
+ uvicorn[standard]==0.30.6
3
+ pydantic==2.8.2
4
+ openai==1.51.0
5
+ httpx==0.27.2
6
+ python-multipart==0.0.9
7
+ pyyaml==6.0.2
tasks/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # tasks package
tasks/task1/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ # task1 package
2
+ from tasks.task1.environment import Task1Environment
3
+ from tasks.task1.grader import Task1Grader
4
+
5
+ __all__ = ["Task1Environment", "Task1Grader"]
tasks/task1/environment.py ADDED
@@ -0,0 +1,329 @@
1
+ """
2
+ environment.py (Task 1 – Targeted Vulnerability Detection)
3
+ ------------------------------------------------------------
4
+ Full OpenEnv-compliant environment.
5
+
6
+ Episode flow:
7
+ 1. reset() selects a random (contract, vulnerable_function) pair.
8
+ 2. The agent receives an Observation with the contract description.
9
+ 3. The agent uses actions to explore the contract (each costs a small penalty).
10
+ 4. When the agent submits, the Grader scores the answer and the episode ends.
11
+
12
+ Reward shaping:
13
+ list_functions : -0.05
14
+ get_function_code : -0.10 (wrong function) / +0.05 (correct function)
15
+ get_function_summary : -0.05 (wrong function) / +0.03 (correct function)
16
+ get_file_metadata : -0.04
17
+ get_state_variable : -0.05
18
+ get_call_graph : -0.08
19
+ submit (score=1.0) : +5.0
20
+ submit (score=0.5) : +1.0
21
+ submit (score=0.0) : -1.5
22
+ repeated query : -0.40
23
+ """
24
+
25
+ from __future__ import annotations
26
+
27
+ import random
28
+ from typing import Any, Dict, List, Optional, Set
29
+
30
+ from data.data_loader import (
31
+ load_contracts,
32
+ sample_episode,
33
+ get_function_by_name,
34
+ get_state_variable_by_name,
35
+ list_function_names,
36
+ list_state_variable_names,
37
+ )
38
+ from env.base_env import BaseEnv
39
+ from env.schemas import (
40
+ Action,
41
+ ActionType,
42
+ Observation,
43
+ Reward,
44
+ ResetResult,
45
+ StateResult,
46
+ StepResult,
47
+ )
48
+ from tasks.task1.grader import Task1Grader
49
+
50
+ TASK_ID = "task1_vuln_detection"
51
+
52
+ AVAILABLE_ACTIONS = [
53
+ ActionType.LIST_FUNCTIONS,
54
+ ActionType.GET_FUNCTION_CODE,
55
+ ActionType.GET_FUNCTION_SUMMARY,
56
+ ActionType.GET_FILE_METADATA,
57
+ ActionType.GET_STATE_VARIABLE,
58
+ ActionType.GET_CALL_GRAPH,
59
+ ActionType.SUBMIT,
60
+ ]
61
+
62
+
63
+ class Task1Environment(BaseEnv):
64
+ """Task 1: Targeted Vulnerability Detection."""
65
+
66
+ def __init__(self, contracts_path: Optional[str] = None) -> None:
67
+ self._contracts = load_contracts(contracts_path) if contracts_path else load_contracts()
68
+ self._rng = random.Random()
69
+
70
+ # Episode state (initialised by reset)
71
+ self._contract: Dict[str, Any] = {}
72
+ self._target_fn: Dict[str, Any] = {}
73
+ self._grader: Optional[Task1Grader] = None
74
+ self._step_count: int = 0
75
+ self._cumulative_reward: float = 0.0
76
+ self._done: bool = False
77
+ self._query_history: List[str] = []
78
+ self._seen_queries: Set[str] = set()
79
+
80
+ # ------------------------------------------------------------------
81
+ # OpenEnv interface
82
+ # ------------------------------------------------------------------
83
+
84
+ def reset(self, seed: Optional[int] = None) -> ResetResult:
85
+ """Start a new episode by sampling a random vulnerable function."""
86
+ if seed is not None:
87
+ self._rng.seed(seed)
88
+
89
+ self._contract, self._target_fn = sample_episode(self._contracts, self._rng)
90
+ self._grader = Task1Grader(
91
+ target_function=self._target_fn["name"],
92
+ vulnerability_issue=self._target_fn["vulnerability_details"]["issue"],
93
+ )
94
+ self._step_count = 0
95
+ self._cumulative_reward = 0.0
96
+ self._done = False
97
+ self._query_history = []
98
+ self._seen_queries = set()
99
+
100
+ obs = self._build_observation(
101
+ last_action=None,
102
+ last_result=(
103
+ f"New episode started. Contract: {self._contract['contract_name']}. "
104
+ f"Use 'list_functions' to explore the contract."
105
+ ),
106
+ )
107
+ return ResetResult(observation=obs, info={"task_id": TASK_ID})
108
+
109
+ def step(self, action: Action) -> StepResult:
110
+ """Execute one agent action."""
111
+ if self._done:
112
+ raise RuntimeError("Episode is done. Call reset() to start a new episode.")
113
+
114
+ self._step_count += 1
115
+
116
+ # Dispatch
117
+ result_text, reward = self._dispatch(action)
118
+
119
+ self._cumulative_reward += reward.value
120
+ self._query_history.append(f"[{action.action_type}] → {result_text[:120]}")
121
+
122
+ obs = self._build_observation(
123
+ last_action=action.action_type,
124
+ last_result=result_text,
125
+ )
126
+ return StepResult(
127
+ observation=obs,
128
+ reward=reward,
129
+ done=self._done,
130
+ info={
131
+ "step": self._step_count,
132
+ "cumulative_reward": self._cumulative_reward,
133
+ },
134
+ )
135
+
136
+ def state(self) -> StateResult:
137
+ return StateResult(
138
+ task_id=TASK_ID,
139
+ contract_name=self._contract.get("contract_name", ""),
140
+ target_function=self._target_fn.get("name"),
141
+ step_count=self._step_count,
142
+ cumulative_reward=self._cumulative_reward,
143
+ done=self._done,
144
+ query_history=list(self._query_history),
145
+ )
146
+
147
+ # ------------------------------------------------------------------
148
+ # Internal helpers
149
+ # ------------------------------------------------------------------
150
+
151
+ def _build_observation(
152
+ self,
153
+ last_action: Optional[str],
154
+ last_result: str,
155
+ ) -> Observation:
156
+ return Observation(
157
+ task_id=TASK_ID,
158
+ contract_name=self._contract.get("contract_name", ""),
159
+ contract_description=self._contract.get("metadata", {}).get("description", ""),
160
+ available_actions=[a.value for a in AVAILABLE_ACTIONS],
161
+ last_action=last_action,
162
+ last_action_result=last_result,
163
+ step_count=self._step_count,
164
+ cumulative_reward=self._cumulative_reward,
165
+ done=self._done,
166
+ extra={
167
+ "solidity_version": self._contract.get("metadata", {}).get("solidity_version", ""),
168
+ "hint": (
169
+ "Identify the vulnerable function and its issue. "
170
+ "Submit with action_type='submit', params={'function_name': '...', "
171
+ "'vulnerability_type': '...'}"
172
+ ),
173
+ },
174
+ )
175
+
176
+ def _query_key(self, action_type: str, params: Dict[str, Any]) -> str:
177
+ """Build a hashable key for repeated-query detection."""
178
+ return f"{action_type}:{sorted(params.items())}"
179
+
180
+ def _is_repeated(self, key: str) -> bool:
181
+ if key in self._seen_queries:
182
+ return True
183
+ self._seen_queries.add(key)
184
+ return False
185
+
186
+ def _dispatch(self, action: Action) -> tuple[str, Reward]:
187
+ at = action.action_type
188
+ params = action.params
189
+ qkey = self._query_key(at, params)
190
+
191
+ # ---- list_functions ----------------------------------------
192
+ if at == ActionType.LIST_FUNCTIONS:
193
+ if self._is_repeated(qkey):
194
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
195
+ names = list_function_names(self._contract)
196
+ return (
197
+ f"Functions in {self._contract['contract_name']}: {', '.join(names)}",
198
+ Reward(value=-0.05, reason="list_functions cost", partial=True),
199
+ )
200
+
201
+ # ---- get_function_code -------------------------------------
202
+ if at == ActionType.GET_FUNCTION_CODE:
203
+ fn_name = params.get("function_name", "")
204
+ if self._is_repeated(qkey):
205
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
206
+ fn = get_function_by_name(self._contract, fn_name)
207
+ if fn is None:
208
+ return (
209
+ f"Function '{fn_name}' not found. Available: {list_function_names(self._contract)}",
210
+ Reward(value=-0.10, reason="Wrong/unknown function name", partial=True),
211
+ )
212
+ is_target = fn["name"].lower() == self._target_fn["name"].lower()
213
+ code = fn.get("code", "// no code available")
214
+ reward_val = 0.05 if is_target else -0.10
215
+ reason = "Fetched target function code (+)" if is_target else "Fetched non-target function (-)"
216
+ return (
217
+ f"// {fn['name']}\n{code}",
218
+ Reward(value=reward_val, reason=reason, partial=True),
219
+ )
220
+
221
+ # ---- get_function_summary ----------------------------------
222
+ if at == ActionType.GET_FUNCTION_SUMMARY:
223
+ fn_name = params.get("function_name", "")
224
+ if self._is_repeated(qkey):
225
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
226
+ fn = get_function_by_name(self._contract, fn_name)
227
+ if fn is None:
228
+ return (
229
+ f"Function '{fn_name}' not found.",
230
+ Reward(value=-0.05, reason="Wrong function name", partial=True),
231
+ )
232
+ is_target = fn["name"].lower() == self._target_fn["name"].lower()
233
+ comment = fn.get("comment", "No summary available.")
234
+ reward_val = 0.03 if is_target else -0.05
235
+ reason = "Fetched target function summary (+)" if is_target else "Fetched non-target summary (-)"
236
+ return (
237
+ f"Summary of '{fn['name']}': {comment}",
238
+ Reward(value=reward_val, reason=reason, partial=True),
239
+ )
240
+
241
+ # ---- get_file_metadata -------------------------------------
242
+ if at == ActionType.GET_FILE_METADATA:
243
+ if self._is_repeated(qkey):
244
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
245
+ meta = self._contract.get("metadata", {})
246
+ result = (
247
+ f"Contract: {self._contract['contract_name']} | "
248
+ f"File: {self._contract.get('file_name', 'N/A')} | "
249
+ f"Solidity: {meta.get('solidity_version', 'N/A')} | "
250
+ f"License: {meta.get('license', 'N/A')} | "
251
+ f"Author: {meta.get('author', 'N/A')} | "
252
+ f"Description: {meta.get('description', 'N/A')}"
253
+ )
254
+ return result, Reward(value=-0.04, reason="get_file_metadata cost", partial=True)
255
+
256
+ # ---- get_state_variable ------------------------------------
257
+ if at == ActionType.GET_STATE_VARIABLE:
258
+ var_name = params.get("variable_name", "")
259
+ if self._is_repeated(qkey):
260
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
261
+ if not var_name:
262
+ # Return list of all state variables
263
+ names = list_state_variable_names(self._contract)
264
+ return (
265
+ f"State variables: {', '.join(names)}",
266
+ Reward(value=-0.05, reason="Listed state variables", partial=True),
267
+ )
268
+ sv = get_state_variable_by_name(self._contract, var_name)
269
+ if sv is None:
270
+ return (
271
+ f"Variable '{var_name}' not found.",
272
+ Reward(value=-0.05, reason="Unknown state variable", partial=True),
273
+ )
274
+ return (
275
+ f"{sv['type']} {sv['visibility']} {sv['name']}: {sv.get('description', '')}",
276
+ Reward(value=-0.05, reason="get_state_variable cost", partial=True),
277
+ )
278
+
279
+ # ---- get_call_graph ----------------------------------------
280
+ if at == ActionType.GET_CALL_GRAPH:
281
+ if self._is_repeated(qkey):
282
+ return "Repeated query.", Reward(value=-0.40, reason="Repeated query", partial=True)
283
+ cg = self._contract.get("call_graph", {})
284
+ cg_str = "; ".join(f"{fn} → [{', '.join(callees)}]" for fn, callees in cg.items())
285
+ return (
286
+ f"Call graph: {cg_str}",
287
+ Reward(value=-0.08, reason="get_call_graph cost", partial=True),
288
+ )
289
+
290
+ # ---- submit ------------------------------------------------
291
+ if at == ActionType.SUBMIT:
292
+ fn_name = params.get("function_name", "")
293
+ vuln_type = params.get("vulnerability_type", "")
294
+ if not fn_name or not vuln_type:
295
+ return (
296
+ "Submit requires 'function_name' and 'vulnerability_type' in params.",
297
+ Reward(value=-0.5, reason="Malformed submission", partial=True),
298
+ )
299
+ score = self._grader.grade_submission(fn_name, vuln_type)
300
+ reward_val = self._grader.reward_for_score(score)
301
+ self._done = True
302
+
303
+ if score == 1.0:
304
+ msg = (
305
+ f"βœ… CORRECT! '{fn_name}' is the vulnerable function. "
306
+ f"Vulnerability type '{vuln_type}' matches. Score: 1.0"
307
+ )
308
+ elif score == 0.5:
309
+ msg = (
310
+ f"⚠️ PARTIAL. '{fn_name}' is the right function, but the vulnerability type "
311
+ f"'{vuln_type}' was not precise. Score: 0.5"
312
+ )
313
+ else:
314
+ correct = self._grader.get_canonical_answer()
315
+ msg = (
316
+ f"❌ INCORRECT. '{fn_name}' is not the target vulnerable function. "
317
+ f"Correct answer: {correct['function']} ({correct['vulnerability']}). Score: 0.0"
318
+ )
319
+ return msg, Reward(
320
+ value=reward_val,
321
+ reason=f"Submission score={score:.1f}",
322
+ partial=False,
323
+ )
324
+
325
+ # ---- unknown action ----------------------------------------
326
+ return (
327
+ f"Unknown action type: {at}",
328
+ Reward(value=-0.10, reason="Unknown action", partial=True),
329
+ )
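The `_query_key` / `_is_repeated` pair in the environment can be illustrated in isolation (a standalone sketch; `RepeatDetector` is a name invented here): keys combine the action type with the sorted params, so the same lookup with params in a different order still counts as a repeat.

```python
class RepeatDetector:
    """Standalone version of the repeated-query check in Task1Environment."""

    def __init__(self) -> None:
        self._seen = set()

    def key(self, action_type: str, params: dict) -> str:
        # Sorting the items makes the key insensitive to param ordering.
        return f"{action_type}:{sorted(params.items())}"

    def is_repeated(self, action_type: str, params: dict) -> bool:
        k = self.key(action_type, params)
        if k in self._seen:
            return True
        self._seen.add(k)
        return False

d = RepeatDetector()
print(d.is_repeated("get_function_code", {"function_name": "withdraw"}))  # False
print(d.is_repeated("get_function_code", {"function_name": "withdraw"}))  # True
```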
tasks/task1/grader.py ADDED
@@ -0,0 +1,98 @@
+ """
+ grader.py (Task 1 – Targeted Vulnerability Detection)
+ -------------------------------------------------------
+ Deterministic grader. Score range: 0.0 – 1.0
+
+ 1.0 – correct function + correct vulnerability keyword
+ 0.5 – correct function + wrong/unrecognised vulnerability keyword
+ 0.0 – wrong function name
+ """
+ from __future__ import annotations
+ from typing import Dict, List, Optional
+
+ VULN_KEYWORDS: Dict[str, List[str]] = {
+ "reentrancy": [
+ "reentrancy", "re-entrancy", "reentrant", "re entrant",
+ "recursive call", "reentr",
+ ],
+ "missing access control": [
+ "access control", "missing access", "no access", "unauthorized",
+ "privilege", "permission", "onlyowner", "only owner",
+ "no modifier", "missing modifier", "no check", "anyone can call",
+ ],
+ "integer overflow": [
+ "overflow", "integer overflow", "arithmetic overflow",
+ "safemath", "safe math", "uint overflow", "wraparound",
+ "integer underflow", "underflow",
+ ],
+ "tx.origin authentication": [
+ "tx.origin", "txorigin", "tx origin", "phishing",
+ "origin authentication", "origin auth",
+ ],
+ "front-running": [
+ "front-running", "frontrunning", "front running", "mev",
+ "sandwich", "mempool", "commit reveal", "commit-reveal",
+ "gas price manipulation",
+ ],
+ "timestamp dependence": [
+ "timestamp", "block.timestamp", "time manipulation",
+ "miner timestamp", "time dependency", "timestamp dependence",
+ ],
+ "denial of service": [
+ "denial of service", " dos", "gas limit", "unbounded loop",
+ "block gas", " oog", "out of gas", "infinite loop", "unbounded array",
+ "gas exhaustion",
+ ],
+ "unchecked return value": [
+ "unchecked return", "return value", "unchecked transfer",
+ "silent failure", "safeerc20", "safe transfer", "ignored return",
+ "erc20 return",
+ ],
+ }
+
+
+ def _norm(text: str) -> str:
+ return text.strip().lower()
+
+
+ def _find_bucket(ground_truth_issue: str) -> Optional[str]:
+ """
+ Longest-match keyword search to identify canonical vulnerability bucket.
+ Longest match avoids short-keyword collisions (e.g. 'auth' in 'tx.origin authentication').
+ """
+ norm_gt = _norm(ground_truth_issue)
+ best: Optional[str] = None
+ best_len: int = 0
+ for canonical, keywords in VULN_KEYWORDS.items():
+ for kw in keywords:
+ if kw in norm_gt and len(kw) > best_len:
+ best_len = len(kw)
+ best = canonical
+ return best
+
+
+ def match_vuln_keyword(submitted: str, ground_truth_issue: str) -> bool:
+ bucket = _find_bucket(ground_truth_issue)
+ if bucket is None:
+ return _norm(submitted) in _norm(ground_truth_issue)
+ norm_sub = _norm(submitted)
+ return any(kw in norm_sub for kw in VULN_KEYWORDS[bucket])
+
+
+ class Task1Grader:
+ def __init__(self, target_function: str, vulnerability_issue: str) -> None:
+ self.target_function = target_function.lower()
+ self.vulnerability_issue = vulnerability_issue
+
+ def grade_submission(self, submitted_function: str, submitted_vuln_type: str) -> float:
+ if submitted_function.strip().lower() != self.target_function:
+ return 0.0
+ return 1.0 if match_vuln_keyword(submitted_vuln_type, self.vulnerability_issue) else 0.5
+
+ def reward_for_score(self, score: float) -> float:
+ if score == 1.0:
+ return 5.0
+ if score == 0.5:
+ return 1.0
+ return -1.5
+
+ def get_canonical_answer(self) -> Dict[str, str]:
+ return {"function": self.target_function, "vulnerability": self.vulnerability_issue}
tasks/task2/__init__.py ADDED
@@ -0,0 +1,27 @@
+ """
+ tasks/task2/__init__.py
+ -----------------------
+ Task 2: Property Discovery (PLACEHOLDER)
+
+ TODO: Implement this task.
+
+ Episode setup:
+ - One function from a Solidity file with known properties
+ - Agent must discover the natural-language property of the function
+
+ Actions (to implement):
+ - get_similar_rule : -0.20
+ - get_file_natspec : -0.03
+ - get_function_natspec : -0.08
+ - get_function_code : -0.06
+ - get_related_functions : -0.06
+ - get_io : -0.04
+ - submit_property : scored 0.0–5.0 by semantic similarity grader
+
+ See README.md for full task specification.
+ """
+
+ # TODO: Task 2 – Property Discovery
+ # from tasks.task2.environment import Task2Environment
+
+ __all__: list = []
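The per-action costs in the docstring could be encoded as a simple lookup table once Task 2 is implemented. A hypothetical sketch (the `TASK2_ACTION_COSTS` name and the -0.10 unknown-action fallback are assumptions; only the action names and values come from the docstring above):

```python
# Hypothetical cost table for the planned Task 2 actions; values are the
# shaping rewards listed in the placeholder docstring above.
TASK2_ACTION_COSTS = {
    "get_similar_rule": -0.20,
    "get_file_natspec": -0.03,
    "get_function_natspec": -0.08,
    "get_function_code": -0.06,
    "get_related_functions": -0.06,
    "get_io": -0.04,
}


def shaping_reward(action: str) -> float:
    # Unknown actions fall back to a penalty, mirroring Task 1's
    # unknown-action handling (assumed -0.10 here).
    return TASK2_ACTION_COSTS.get(action, -0.10)
```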
tasks/task3/__init__.py ADDED
@@ -0,0 +1,31 @@
+ """
+ tasks/task3/__init__.py
+ -----------------------
+ Task 3: Rule Checker (PLACEHOLDER)
+
+ TODO: Implement this task.
+
+ Episode setup:
+ - One Solidity file with at least one function breaking a given property
+ - Agent is shown the property in natural English
+
+ Actions (to implement):
+ - get_formalized_property : -0.03
+ - list_functions : -0.05
+ - get_function_metadata : -0.05
+ - get_function_code : -0.10
+ - get_state_variables : -0.05
+ - get_call_graph : -0.08
+ - submit_function :
+ - correct = +5.0
+ - subfunction of target = +1.5
+ - wrong = -1.5
+ (ONE submission per episode)
+
+ See README.md for full task specification.
+ """
+
+ # TODO: Task 3 – Rule Checker
+ # from tasks.task3.environment import Task3Environment
+
+ __all__: list = []
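The three `submit_function` reward tiers could be implemented against the contract's call graph: a callee of the target counts as a "subfunction". A hypothetical sketch (the `grade_task3` name and the one-level callee lookup are assumptions; the tier values come from the docstring above):

```python
# Hypothetical grader for Task 3's single submit_function action,
# implementing the three reward tiers from the placeholder docstring:
#   exact match            -> +5.0
#   callee of the target   -> +1.5 (treated as "subfunction of target")
#   anything else          -> -1.5
def grade_task3(submitted: str, target: str, call_graph: dict) -> float:
    sub, tgt = submitted.strip().lower(), target.lower()
    if sub == tgt:
        return 5.0
    if sub in (c.lower() for c in call_graph.get(target, [])):
        return 1.5
    return -1.5


cg = {"withdraw": ["_sendFunds", "_updateBalance"]}
```

A deeper design question for the real implementation is whether "subfunction" should mean direct callees only (as sketched) or the transitive closure of the call graph.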
utils/__init__.py ADDED
@@ -0,0 +1 @@
+ # utils package
validate.py ADDED
@@ -0,0 +1,290 @@
+ """
+ validate.py
+ -----------
+ Pre-submission validation script.
+ Checks all OpenEnv spec requirements locally before submitting.
+
+ Usage:
+ python validate.py
+
+ Exit code 0 = all checks pass.
+ Exit code 1 = one or more checks failed.
+ """
+
+ import json
+ import sys
+ from typing import Callable, List, Tuple
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Helpers
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ PASS = "✅"
+ FAIL = "❌"
+ SKIP = "⏭ "
+ results: List[Tuple[str, bool, str]] = []
+
+
+ def check(name: str, fn: Callable[[], None]) -> None:
+ try:
+ fn()
+ results.append((name, True, ""))
+ print(f" {PASS} {name}")
+ except Exception as e:
+ results.append((name, False, str(e)))
+ print(f" {FAIL} {name}")
+ print(f" {e}")
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Checks
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ def check_imports():
+ from env.schemas import Observation, Action, Reward, StepResult, ResetResult, StateResult
+ from tasks.task1.environment import Task1Environment
+ from tasks.task1.grader import Task1Grader
+ from data.data_loader import load_contracts
+
+
+ def check_openenv_yaml():
+ import yaml
+ with open("openenv.yaml") as f:
+ spec = yaml.safe_load(f)
+ assert "name" in spec
+ assert "tasks" in spec
+ assert len(spec["tasks"]) >= 3, "Need at least 3 tasks defined"
+ assert "observation_space" in spec
+ assert "action_space" in spec
+ assert "reward" in spec
+
+
+ def check_pydantic_models():
+ from env.schemas import Observation, Action, ActionType, Reward, StepResult, ResetResult, StateResult
+ # Instantiate each model
+ obs = Observation(
+ task_id="t1", contract_name="C", contract_description="D",
+ available_actions=["submit"]
+ )
+ assert obs.task_id == "t1"
+
+ action = Action(action_type=ActionType.LIST_FUNCTIONS)
+ assert action.action_type == ActionType.LIST_FUNCTIONS
+
+ reward = Reward(value=1.0, reason="test")
+ assert reward.value == 1.0
+
+ step = StepResult(observation=obs, reward=reward, done=False)
+ assert not step.done
+
+ reset = ResetResult(observation=obs)
+ assert reset.observation.task_id == "t1"
+
+ state = StateResult(task_id="t1", contract_name="C", step_count=0,
+ cumulative_reward=0.0, done=False)
+ assert state.step_count == 0
+
+
+ def check_data_loading():
+ from data.data_loader import load_contracts, get_all_vulnerable_entries
+ contracts = load_contracts()
+ assert len(contracts) >= 1, "No contracts loaded"
+ entries = get_all_vulnerable_entries(contracts)
+ assert len(entries) >= 3, f"Need >= 3 vulnerable functions, got {len(entries)}"
+ for contract, fn in entries:
+ assert fn.get("vulnerable") is True
+ assert fn.get("vulnerability_details") is not None
+ assert "issue" in fn["vulnerability_details"]
+
+
+ def check_env_reset():
+ from tasks.task1.environment import Task1Environment
+ env = Task1Environment()
+ result = env.reset(seed=42)
+ assert result.observation is not None
+ assert result.observation.task_id == "task1_vuln_detection"
+ assert result.observation.contract_name != ""
+ assert not result.observation.done
+ assert result.observation.step_count == 0
+
+
+ def check_env_step():
+ from tasks.task1.environment import Task1Environment
+ from env.schemas import Action, ActionType
+ env = Task1Environment()
+ env.reset(seed=42)
+ result = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+ assert result.observation is not None
+ assert isinstance(result.reward.value, float)
+ assert isinstance(result.done, bool)
+ assert "info" in result.model_dump()
+
+
+ def check_env_state():
+ from tasks.task1.environment import Task1Environment
+ env = Task1Environment()
+ env.reset(seed=42)
+ state = env.state()
+ assert state.task_id == "task1_vuln_detection"
+ assert state.contract_name != ""
+ assert state.target_function is not None # exposed for debugging
+
+
+ def check_grader_scores_in_range():
+ from tasks.task1.grader import Task1Grader
+ cases = [
+ ("withdraw", "Reentrancy vulnerability", "withdraw", "reentrancy", 1.0),
+ ("withdraw", "Reentrancy vulnerability", "withdraw", "something else", 0.5),
+ ("withdraw", "Reentrancy vulnerability", "deposit", "reentrancy", 0.0),
+ ]
+ for tf, issue, sf, sv, expected in cases:
+ g = Task1Grader(tf, issue)
+ score = g.grade_submission(sf, sv)
+ assert 0.0 <= score <= 1.0, f"Score {score} out of range"
+ assert abs(score - expected) < 0.01, f"Expected {expected}, got {score}"
+
+
+ def check_grader_deterministic():
+ from tasks.task1.grader import Task1Grader
+ g = Task1Grader("withdraw", "Reentrancy vulnerability")
+ s1 = g.grade_submission("withdraw", "reentrancy")
+ s2 = g.grade_submission("withdraw", "reentrancy")
+ assert s1 == s2 == 1.0, "Grader must be deterministic"
+
+
+ def check_reward_shaping():
+ """Verify reward is non-binary (multiple distinct values across steps)."""
+ from tasks.task1.environment import Task1Environment
+ from env.schemas import Action, ActionType
+ env = Task1Environment()
+ env.reset(seed=1)
+ rewards = set()
+ for at in [ActionType.LIST_FUNCTIONS, ActionType.GET_FILE_METADATA, ActionType.GET_CALL_GRAPH]:
+ r = env.step(Action(action_type=at))
+ rewards.add(round(r.reward.value, 4))
+ # Should have at least 2 distinct shaping reward values
+ assert len(rewards) >= 2, f"Expected multiple reward values, got {rewards}"
+
+
+ def check_episode_boundary():
+ """Episode must end after submit and raise on subsequent step."""
+ from tasks.task1.environment import Task1Environment
+ from env.schemas import Action, ActionType
+ env = Task1Environment()
+ env.reset(seed=2)
+ env.step(Action(action_type=ActionType.SUBMIT, params={
+ "function_name": "withdraw", "vulnerability_type": "test"
+ }))
+ try:
+ env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+ raise AssertionError("Should have raised RuntimeError after episode end")
+ except RuntimeError:
+ pass # Expected
+
+
+ def check_repeated_query_penalty():
+ from tasks.task1.environment import Task1Environment
+ from env.schemas import Action, ActionType
+ env = Task1Environment()
+ env.reset(seed=3)
+ env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+ r = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+ assert r.reward.value == -0.40, f"Expected -0.40 for repeated query, got {r.reward.value}"
+
+
+ def check_tasks_list():
+ """All three tasks must be listed (even if placeholders)."""
+ from tasks.task2 import __all__ as t2 # noqa
+ from tasks.task3 import __all__ as t3 # noqa
+
+
+ def check_dockerfile_exists():
+ import os
+ assert os.path.exists("Dockerfile"), "Dockerfile is missing"
+ with open("Dockerfile") as f:
+ content = f.read()
+ assert "7860" in content, "Dockerfile must EXPOSE 7860 (HF Spaces)"
+ assert "uvicorn" in content or "CMD" in content
+
+
+ def check_inference_script():
+ import os
+ assert os.path.exists("inference.py"), "inference.py is missing"
+ with open("inference.py") as f:
+ content = f.read()
+ assert "OPENAI_API_KEY" in content or "HF_TOKEN" in content, \
+ "inference.py must read API credentials from env vars"
+ assert "API_BASE_URL" in content
+ assert "MODEL_NAME" in content
+
+
+ def check_baseline_json_schema():
+ """baseline_scores.json must have valid schema if it exists."""
+ import os
+ if not os.path.exists("baseline_scores.json"):
+ return # OK — file is generated at runtime
+ with open("baseline_scores.json") as f:
+ data = json.load(f)
+ assert "tasks" in data
+ for task in data["tasks"]:
+ score = task["avg_grader_score"]
+ assert 0.0 <= score <= 1.0, f"Score {score} out of range"
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Runner
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ def main():
+ print("=" * 60)
+ print("OpenEnv Pre-Submission Validation")
+ print("=" * 60)
+
+ all_checks = [
+ ("Python imports", check_imports),
+ ("openenv.yaml format", check_openenv_yaml),
+ ("Pydantic model types", check_pydantic_models),
+ ("Dataset loading (3+ vulns)", check_data_loading),
+ ("env.reset() → ResetResult", check_env_reset),
+ ("env.step() → StepResult", check_env_step),
+ ("env.state() → StateResult", check_env_state),
+ ("Grader scores in [0.0, 1.0]", check_grader_scores_in_range),
+ ("Grader is deterministic", check_grader_deterministic),
+ ("Reward shaping (non-binary)", check_reward_shaping),
+ ("Episode boundary (done=True)", check_episode_boundary),
+ ("Repeated query penalty", check_repeated_query_penalty),
+ ("Task 2 & 3 placeholders", check_tasks_list),
+ ("Dockerfile exists + port", check_dockerfile_exists),
+ ("inference.py exists + vars", check_inference_script),
+ ("baseline_scores.json schema", check_baseline_json_schema),
+ ]
+
+ print()
+ for name, fn in all_checks:
+ check(name, fn)
+
+ print()
+ passed = sum(1 for _, ok, _ in results if ok)
+ total = len(results)
+ failed = [(n, msg) for n, ok, msg in results if not ok]
+
+ print("=" * 60)
+ print(f"Results: {passed}/{total} checks passed")
+
+ if failed:
+ print("\nFailed checks:")
+ for name, msg in failed:
+ print(f" {FAIL} {name}: {msg}")
+ print()
+ print("❌ VALIDATION FAILED — fix the issues above before submitting.")
+ sys.exit(1)
+ else:
+ print()
+ print("✅ ALL CHECKS PASSED — ready to submit!")
+ sys.exit(0)
+
+
+ if __name__ == "__main__":
+ main()