ajaxwin committed · Commit 9c888b7 · 1 Parent(s): 8fccda7

Task 2 added

Files changed (12)
  1. README.md +177 -169
  2. app.py +78 -116
  3. data/data_loader.py +135 -30
  4. demo.py +74 -4
  5. env/schemas.py +37 -36
  6. eval.py +259 -220
  7. inference.py +217 -234
  8. openenv.yaml +67 -91
  9. tasks/task2/__init__.py +4 -26
  10. tasks/task2/environment.py +340 -0
  11. tasks/task2/grader.py +171 -0
  12. validate.py +189 -199
README.md CHANGED
@@ -1,108 +1,120 @@
  # Smart Contract Audit RL Environment
 
  > **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
- > Agents learn to audit real-world Solidity contracts — finding vulnerabilities, discovering properties, and checking rule compliance — tasks that professional auditors perform daily.
 
- [![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-1.0-blue)](openenv.yaml)
- [![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)](https://python.org)
 
  ---
 
  ## Motivation
 
- Smart contract auditing is a $500M+ industry where human auditors painstakingly review Solidity code for security flaws. This environment lets agents practice exactly that workflow — exploring contract code through targeted queries and submitting findings — providing a challenging, real-world benchmark for reasoning and code-understanding agents.
 
- Data is sourced from **Certora-audited DeFi projects**, giving agents contracts with the same vulnerability patterns found in production exploits (reentrancy, integer overflow, access control bypasses, etc.).
 
  ---
 
- ## Environment Description
 
- The environment hosts **3 tasks** of increasing difficulty:
 
- | Task | Name | Difficulty | Status |
- |------|------|------------|--------|
- | 1 | Targeted Vulnerability Detection | Medium | ✅ Active |
- | 2 | Property Discovery | Hard | ⏳ Placeholder |
- | 3 | Rule Checker | Easy | ⏳ Placeholder |
 
- ### Task 1 — Targeted Vulnerability Detection *(Medium)*
 
- **Setup:** The agent is shown a Solidity contract (4–6 functions). One function contains a critical vulnerability.
 
- **Objective:** Identify the vulnerable function and describe the vulnerability type in 2–3 words.
 
- **Episode lifecycle:**
- 1. `reset()` — randomly selects one of 8 vulnerable (contract, function) pairs from the dataset
- 2. Agent receives the contract name and description
- 3. Agent explores using the action API (each action has a small cost)
- 4. Agent calls `submit(function_name, vulnerability_type)` to end the episode
- 5. Grader assigns 0.0–1.0 score
 
- **Vulnerability types in the dataset:**
- - Reentrancy
- - Missing access control
- - Integer overflow (Solidity <0.8)
- - tx.origin authentication
- - Front-running
- - Timestamp dependence
- - Denial of service (unbounded loop)
- - Unchecked ERC-20 return value
 
- ---
 
- ### Task 2 — Property Discovery *(Hard)* [Placeholder]
 
- Given a single Solidity function, the agent must discover its natural-language correctness property. Grading uses semantic similarity to the ground-truth property. *Implementation coming soon.*
 
  ---
 
- ### Task 3 — Rule Checker *(Easy)* [Placeholder]
 
- Given a natural-language property and a contract, the agent must identify which function violates that property. *Implementation coming soon.*
 
- ---
 
- ## Action Space
 
- All actions are described below. **Repeated identical queries cost −0.40.**
 
- | Action | Key Params | Reward |
- |--------|-----------|--------|
- | `list_functions` | — | −0.05 |
- | `get_function_code` | `function_name` | +0.05 (target) / −0.10 (other) |
- | `get_function_summary` | `function_name` | +0.03 (target) / −0.05 (other) |
- | `get_file_metadata` | — | −0.04 |
- | `get_state_variable` | `variable_name` (opt.) | −0.05 |
- | `get_call_graph` | — | −0.08 |
- | `submit` | `function_name`, `vulnerability_type` | +5.0 / +1.0 / −1.5 |
 
- **Submit scoring:**
- - **+5.0** — correct function AND correct vulnerability keyword → grader score = 1.0
- - **+1.0** — correct function, unrecognised vulnerability type → grader score = 0.5
- - **−1.5** — wrong function → grader score = 0.0
 
  ---
 
  ## Observation Space
 
- Every `step()` and `reset()` returns an `Observation` object:
 
  ```json
  {
-   "task_id": "task1_vuln_detection",
-   "contract_name": "SimpleVault",
-   "contract_description": "An ETH vault that allows users to deposit and withdraw...",
-   "available_actions": ["list_functions", "get_function_code", ...],
-   "last_action": "get_function_code",
-   "last_action_result": "// withdraw\nfunction withdraw(uint256 amount) ...",
-   "step_count": 3,
-   "cumulative_reward": -0.05,
    "done": false,
    "extra": {
-     "solidity_version": "0.8.0",
-     "hint": "Identify the vulnerable function and its issue."
    }
  }
  ```
@@ -114,63 +126,88 @@ Every `step()` and `reset()` returns an `Observation` object:
  ```
  smart-contract-env/
  ├── data/
- │   ├── contracts.json     # 4 contracts, 8 vulnerabilities
- │   └── data_loader.py     # JSON parsing and episode sampling
  ├── env/
  │   ├── base_env.py        # Abstract OpenEnv base class
- │   └── schemas.py         # Pydantic models (Observation, Action, Reward…)
  ├── tasks/
  │   ├── task1/
  │   │   ├── environment.py # Full Task 1 RL environment
- │   │   └── grader.py      # Deterministic 0.0–1.0 grader
- │   ├── task2/             # TODO: Property Discovery
- │   └── task3/             # TODO: Rule Checker
- ├── utils/
- ├── app.py                 # FastAPI server (OpenEnv HTTP interface)
- ├── inference.py           # Baseline inference script (OpenAI client)
- ├── openenv.yaml           # OpenEnv spec metadata
- ├── Dockerfile
- ├── requirements.txt
- └── README.md
  ```
 
  ---
 
  ## Setup & Usage
 
- ### Option A — Run locally
 
  ```bash
- # 1. Clone and install
- git clone <repo>
- cd smart-contract-env
  pip install -r requirements.txt
 
- # 2. Start the server
- python app.py
- # → http://localhost:7860
  ```
 
- ### Option B — Docker
 
  ```bash
  docker build -t sc-audit-env .
  docker run -p 7860:7860 sc-audit-env
  ```
 
- ### Option C — Python (no server)
 
  ```python
  from tasks.task1.environment import Task1Environment
  from env.schemas import Action, ActionType
 
  env = Task1Environment()
- result = env.reset(seed=42)
- print(result.observation.contract_name)
-
- action = Action(action_type=ActionType.LIST_FUNCTIONS)
- step = env.step(action)
- print(step.observation.last_action_result)
  ```
 
  ---
@@ -180,35 +217,28 @@ print(step.observation.last_action_result)
  | Method | Endpoint | Description |
  |--------|----------|-------------|
  | `GET` | `/health` | Liveness probe |
- | `GET` | `/tasks` | List all tasks |
- | `POST` | `/reset` | Start new episode |
- | `POST` | `/step` | Take one action |
- | `GET` | `/state` | Debug: internal state |
- | `GET` | `/action_space` | Action space definition |
- | `GET` | `/observation_space` | Observation space definition |
-
- **Example session:**
 
  ```bash
- # Reset
- curl -X POST http://localhost:7860/reset \
-      -H "Content-Type: application/json" \
-      -d '{"task_id": "task1_vuln_detection", "seed": 42}'
-
- # List functions
- curl -X POST "http://localhost:7860/step" \
-      -H "Content-Type: application/json" \
-      -d '{"action_type": "list_functions", "params": {}}'
-
- # Submit answer
- curl -X POST "http://localhost:7860/step" \
-      -H "Content-Type: application/json" \
-      -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
  ```
 
  ---
 
- ## Running the Baseline
 
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
@@ -216,86 +246,64 @@ export MODEL_NAME="gpt-4o-mini"
  export HF_TOKEN="sk-..."
 
  python inference.py
  ```
 
- Outputs results to stdout and writes `baseline_scores.json`.
-
- **Expected baseline scores (gpt-4o-mini, 3 episodes):**
 
  | Task | Avg Grader Score | Notes |
  |------|-----------------|-------|
- | Task 1 | ~0.67 | Medium difficulty; model identifies common vulns well |
- | Task 2 | 0.00 | Placeholder |
  | Task 3 | 0.00 | Placeholder |
 
  ---
 
- ## Baseline Scores
-
- ```json
- {
-   "model": "gpt-4o-mini",
-   "tasks": [
-     {
-       "task_id": "task1_vuln_detection",
-       "avg_grader_score": 0.667,
-       "avg_cumulative_reward": 2.14
-     },
-     { "task_id": "task2_property_discovery", "avg_grader_score": 0.0 },
-     { "task_id": "task3_rule_checker", "avg_grader_score": 0.0 }
-   ],
-   "overall_avg_score": 0.667
- }
- ```
-
- ---
-
- ## Grader Details
 
- The Task 1 grader is **fully deterministic**:
 
- 1. **Function name check** — case-insensitive exact match against the ground-truth vulnerable function. Wrong function → score = 0.0 immediately.
 
- 2. **Vulnerability type check** — checks whether the submitted string contains any accepted keyword from a predefined keyword table (e.g. `"reentrancy"` table includes: `reentrancy`, `re-entrancy`, `reentrant`, `recursive call`). Match → 1.0; no match → 0.5.
-
- Scores map to terminal rewards: 1.0 → +5, 0.5 → +1, 0.0 → −1.5.
-
- ---
-
- ## OpenEnv Spec Compliance
-
- - ✅ Typed `Observation`, `Action`, `Reward` Pydantic models
- - ✅ `step(action) → StepResult(observation, reward, done, info)`
- - ✅ `reset() → ResetResult(observation, info)`
- - ✅ `state() → StateResult`
- - ✅ `openenv.yaml` metadata
- - ✅ 3 tasks defined (1 active, 2 placeholders)
- - ✅ Grader scores in [0.0, 1.0]
- - ✅ Shaped rewards (not just binary)
- - ✅ Dockerfile + HF Space deployment
- - ✅ Baseline `inference.py` using OpenAI client
 
  ---
 
  ## Deploying to Hugging Face Spaces
 
- 1. Create a new **Docker** Space on [huggingface.co/spaces](https://huggingface.co/spaces)
- 2. Set the tag `openenv` in the Space metadata
- 3. Push this repository:
 
  ```bash
- git remote add hf https://huggingface.co/spaces/<your-username>/<space-name>
  git push hf main
  ```
 
- The Space will build the Docker image and serve the FastAPI app on port 7860.
-
  ---
 
- ## License
 
- MIT — see `LICENSE`.
 
- ## Data Attribution
 
- Contract vulnerability patterns inspired by and adapted from **Certora** audit findings on production DeFi protocols.
  # Smart Contract Audit RL Environment
 
  > **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
+ > Train and evaluate agents on real-world Solidity audit tasks — the same work professional auditors do every day.
 
+ [![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-1.1-blue)](openenv.yaml)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)](https://python.org)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
 
  ---
 
  ## Motivation
 
+ Smart contract auditing is a $500M+ industry where human auditors painstakingly review Solidity code for security flaws and formally specify function properties. This environment lets agents practice exactly that workflow — exploring contract code through targeted queries and submitting findings — providing a rigorous, real-world benchmark for code-reasoning agents.
 
+ Data is sourced from **Certora-audited DeFi projects**, giving agents contracts with the same vulnerability patterns found in production exploits.
 
  ---
 
+ ## Tasks
 
+ | # | Name | Difficulty | Status | Description |
+ |---|------|------------|--------|-------------|
+ | 1 | Targeted Vulnerability Detection | Medium | ✅ Active | Find the vulnerable function and name the vulnerability type |
+ | 2 | Property Discovery | Hard | ✅ Active | Write the natural-language postcondition for a given function |
+ | 3 | Rule Checker | Easy | ⏳ Placeholder | Identify which function violates a given property |
 
+ ---
 
+ ## Task 1 — Targeted Vulnerability Detection *(Medium)*
 
+ **Setup:** The agent is shown a Solidity contract (4–6 functions). One function contains a critical vulnerability.
 
+ **Objective:** Identify the vulnerable function and describe its vulnerability type in 2–3 words.
 
+ ### Actions
 
+ | Action | Params | Reward |
+ |--------|--------|--------|
+ | `list_functions` | — | −0.05 |
+ | `get_function_code` | `function_name` | +0.05 (target) / −0.10 (other) |
+ | `get_function_summary` | `function_name` | +0.03 (target) / −0.05 (other) |
+ | `get_file_metadata` | — | −0.04 |
+ | `get_state_variable` | `variable_name` (opt.) | −0.05 |
+ | `get_call_graph` | — | −0.08 |
+ | `submit` | `function_name`, `vulnerability_type` | **+5.0** / +1.0 / −1.5 |
 
+ Repeated identical queries: **−0.40**
 
+ ### Submit scoring (deterministic)
+ - **1.0** → correct function **+** correct vulnerability keyword → reward +5.0
+ - **0.5** → correct function, wrong/vague vulnerability type → reward +1.0
+ - **0.0** → wrong function → reward −1.5
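This rubric maps directly to code. A minimal sketch of the deterministic grader (the keyword table is abbreviated and partly assumed — only the `reentrancy` synonyms are documented; the real implementation lives in `tasks/task1/grader.py`):

```python
# Hypothetical sketch of the deterministic Task 1 rubric.
# Only the "reentrancy" keyword row is documented; the others are assumed.
KEYWORDS = {
    "reentrancy": ["reentrancy", "re-entrancy", "reentrant", "recursive call"],
    "missing access control": ["access control", "unprotected", "missing modifier"],
}

def grade(submitted_fn, submitted_vuln, truth_fn, truth_vuln):
    """Return (grader_score, terminal_reward) per the 0 / 0.5 / 1.0 rubric."""
    if submitted_fn.lower() != truth_fn.lower():
        return 0.0, -1.5          # wrong function: fail immediately
    text = submitted_vuln.lower()
    if any(kw in text for kw in KEYWORDS.get(truth_vuln, [])):
        return 1.0, 5.0           # right function + recognised keyword
    return 0.5, 1.0               # right function, unrecognised vulnerability type
```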
 
+ ### Vulnerability types in dataset
+ Reentrancy · Missing access control · Integer overflow · tx.origin authentication ·
+ Front-running · Timestamp dependence · Denial of service (unbounded loop) · Unchecked return value
 
  ---
 
+ ## Task 2 — Property Discovery *(Hard)*
 
+ **Setup:** The agent is shown a single Solidity function and must write its natural-language correctness property (postcondition / invariant).
 
+ **Objective:** Write a precise 2–4 sentence property describing what the function guarantees when it succeeds.
 
+ ### Actions
 
+ | Action | Params | Reward |
+ |--------|--------|--------|
+ | `get_function_code` | — | −0.06 |
+ | `get_function_natspec` | — | −0.08 |
+ | `get_file_natspec` | — | −0.03 |
+ | `get_related_functions` | — | −0.06 |
+ | `get_io` | — | −0.04 |
+ | `get_similar_rule` | — | −0.20 |
+ | `submit_property` | `property` (string) | **0.0–5.0** (scored, ONE attempt) |
 
+ Repeated identical queries: **−0.40**
+
+ ### Submit scoring (keyword-weighted)
+ ```
+ score  = 0.70 × (key_phrases_matched / total_key_phrases)
+        + 0.30 × (bonus_phrases_matched / total_bonus_phrases)
+
+ reward = score × 5.0   → range: 0.0 – 5.0
+ ```
 
+ Matching uses **word-set containment** with synonym expansion (e.g. "caller" matches "msg.sender", "sender", "user"). Phrases don't need to be adjacent — all constituent words just need to appear somewhere in the submitted text.
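The formula and matching rule above can be sketched directly. A toy version (the synonym table is assumed — the real grader lives in `tasks/task2/grader.py`):

```python
# Hypothetical sketch of the keyword-weighted Task 2 scorer.
SYNONYMS = {"caller": {"caller", "msg.sender", "sender", "user"}}  # assumed table

def expand(word):
    return SYNONYMS.get(word, {word})

def phrase_matched(phrase, submitted_words):
    # Word-set containment: every word of the phrase (or one of its synonyms)
    # must appear somewhere in the submission; adjacency is not required.
    return all(expand(w) & submitted_words for w in phrase.lower().split())

def score_property(text, key_phrases, bonus_phrases):
    words = set(text.lower().split())
    key = sum(phrase_matched(p, words) for p in key_phrases) / len(key_phrases)
    bonus = (sum(phrase_matched(p, words) for p in bonus_phrases) / len(bonus_phrases)
             if bonus_phrases else 0.0)
    score = 0.70 * key + 0.30 * bonus
    return score, score * 5.0  # (grader score, terminal reward)
```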
+
+ **One submission per episode** — choose carefully.
+
+ ### Property coverage
+ 11 functions across 4 contracts with ground-truth properties: SimpleVault (deposit, withdraw, emergencyDrain), TokenSale (buyTokens, setPrice, withdrawETH), DutchAuction (getPrice, bid, finalize), YieldFarm (stake, claimRewards).
 
  ---
 
  ## Observation Space
 
+ Every `step()` and `reset()` returns the same `Observation` structure:
 
  ```json
  {
+   "task_id": "task2_property_discovery",
+   "contract_name": "YieldFarm",
+   "contract_description": "A simple yield farming contract...",
+   "available_actions": ["get_function_code", "get_function_natspec", ...],
+   "last_action": "get_function_natspec",
+   "last_action_result": "NatSpec for 'claimRewards':\n@notice Claim all accrued...",
+   "step_count": 2,
+   "cumulative_reward": -0.14,
    "done": false,
    "extra": {
+     "target_function": "claimRewards",
+     "target_signature": "claimRewards()",
+     "solidity_version": "0.8.10",
+     "hint": "Discover the property of the target function..."
    }
  }
  ```
 
  ```
  smart-contract-env/
  ├── data/
+ │   ├── contracts.json      # 4 contracts · 8 vulnerabilities · 11 properties
+ │   └── data_loader.py      # JSON parser, episode samplers, T1 + T2 helpers
  ├── env/
  │   ├── base_env.py         # Abstract OpenEnv base class
+ │   └── schemas.py          # Pydantic: Observation, Action, Reward, StepResult…
  ├── tasks/
  │   ├── task1/
  │   │   ├── environment.py  # Full Task 1 RL environment
+ │   │   └── grader.py       # Deterministic 0/0.5/1.0 rubric + longest-match keywords
+ │   ├── task2/
+ │   │   ├── environment.py  # Full Task 2 RL environment (one submit per episode)
+ │   │   └── grader.py       # Keyword-weighted 0.0–1.0 grader + synonym expansion
+ │   └── task3/              # TODO: Rule Checker (placeholder)
+ ├── app.py                  # FastAPI server — all OpenEnv HTTP endpoints
+ ├── inference.py            # Baseline LLM agent (Task 1 + Task 2)
+ ├── eval.py                 # Oracle/partial/random evaluation harness
+ ├── demo.py                 # Colourised interactive + scripted demo
+ ├── validate.py             # 19-check pre-submission validator
+ ├── openenv.yaml            # Full OpenEnv spec metadata
+ ├── Dockerfile              # Port 7860, uvicorn, healthcheck
+ └── requirements.txt
  ```
 
  ---
 
  ## Setup & Usage
 
+ ### Local Python
 
  ```bash
+ git clone <repo> && cd smart-contract-env
  pip install -r requirements.txt
 
+ # Run the server
+ python app.py                    # → http://localhost:7860
+
+ # Run interactive demo
+ python demo.py                   # Task 1 interactive
+ python demo.py --auto            # Task 1 scripted
+ python demo.py --auto --task 2   # Task 2 scripted
+
+ # Run evaluation harness (no LLM needed)
+ python eval.py                   # Both tasks, 8 episodes each
+ python eval.py --task 2          # Task 2 only
+ python eval.py --episodes 16 --verbose
+
+ # Pre-submission validation
+ python validate.py               # 19/19 checks
  ```
 
+ ### Docker
 
  ```bash
  docker build -t sc-audit-env .
  docker run -p 7860:7860 sc-audit-env
  ```
 
+ ### Direct Python API
 
  ```python
  from tasks.task1.environment import Task1Environment
+ from tasks.task2.environment import Task2Environment
  from env.schemas import Action, ActionType
 
+ # Task 1
  env = Task1Environment()
+ r = env.reset(seed=42)
+ print(r.observation.contract_name)               # SimpleVault
+ s = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+ s = env.step(Action(action_type=ActionType.SUBMIT,
+                     params={"function_name": "emergencyDrain",
+                             "vulnerability_type": "missing access control"}))
+ print(s.reward.value)                            # +5.0
+
+ # Task 2
+ env2 = Task2Environment()
+ r2 = env2.reset(seed=42)
+ print(r2.observation.extra["target_function"])   # claimRewards
+ s2 = env2.step(Action(action_type=ActionType.GET_FUNCTION_NATSPEC))
+ s2 = env2.step(Action(action_type=ActionType.SUBMIT_PROPERTY,
+                       params={"property": "After a successful claimRewards call, all accrued reward tokens are transferred to the caller and their rewards balance is zeroed. Reverts if no rewards."}))
+ print(s2.reward.value)                           # ~4.0
  ```
 
  ---
 
  | Method | Endpoint | Description |
  |--------|----------|-------------|
  | `GET` | `/health` | Liveness probe |
+ | `GET` | `/tasks` | All tasks + status |
+ | `POST` | `/reset` | Start episode (`task_id`, `seed`) |
+ | `POST` | `/step` | Take action (`action_type`, `params`) |
+ | `GET` | `/state` | Debug: internal episode state |
+ | `GET` | `/action_space?task_id=...` | Action schema for a task |
+ | `GET` | `/observation_space` | Observation schema |
 
  ```bash
+ # Task 2 full episode
+ curl -X POST localhost:7860/reset \
+      -d '{"task_id":"task2_property_discovery","seed":42}'
+
+ curl -X POST localhost:7860/step \
+      -d '{"action_type":"get_function_natspec","params":{}}'
+
+ curl -X POST localhost:7860/step \
+      -d '{"action_type":"submit_property","params":{"property":"..."}}'
  ```
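The same episode can be driven from Python instead of curl. A minimal sketch against the endpoints above (the `post` callable is injected so the flow can be exercised without a live server; with a running server, wire it to `requests.post` as shown):

```python
# Hypothetical client sketch for the HTTP API above.
from typing import Callable

BASE = "http://localhost:7860"  # assumes the server from `python app.py` is running

def run_episode(post: Callable[[str, dict], dict]) -> dict:
    """Drive one Task 2 episode: reset, one probe, one submission."""
    post("/reset", {"task_id": "task2_property_discovery", "seed": 42})
    post("/step", {"action_type": "get_function_natspec", "params": {}})
    return post("/step", {"action_type": "submit_property",
                          "params": {"property": "..."}})

def http_post(path: str, body: dict) -> dict:
    import requests  # pip install requests
    return requests.post(BASE + path, json=body).json()

# run_episode(http_post)  # uncomment with the server running
```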
 
  ---
 
+ ## Baseline Inference
 
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export HF_TOKEN="sk-..."
 
  python inference.py
+ # → baseline_scores.json
  ```
 
+ ### Expected baseline scores (gpt-4o-mini, 3 episodes per task)
 
  | Task | Avg Grader Score | Notes |
  |------|-----------------|-------|
+ | Task 1 | ~0.67 | Good at common vulns; misses subtle ones |
+ | Task 2 | ~0.55 | Reasonable properties, but often misses specific variable names |
  | Task 3 | 0.00 | Placeholder |
 
  ---
 
+ ## Evaluation Scores
 
+ Deterministic oracle / partial / baseline tiers verified on 8 episodes (seeds 42–49):
 
+ | Task | Oracle | Partial | Floor |
+ |------|--------|---------|-------|
+ | Task 1 | **1.000** | 0.500 | 0.000 |
+ | Task 2 | **0.775** | 0.034 | 0.000 |
 
+ The clear separation confirms the grader provides a **meaningful gradient signal** for RL training.
 
  ---
 
  ## Deploying to Hugging Face Spaces
 
+ 1. Create a new **Docker** Space at [huggingface.co/spaces](https://huggingface.co/spaces)
+ 2. Add the tag `openenv` in the Space settings
+ 3. Copy the `SPACES_README.md` frontmatter into `README.md`
+ 4. Push:
 
  ```bash
+ git remote add hf https://huggingface.co/spaces/<user>/<space>
  git push hf main
  ```
 
  ---
 
+ ## OpenEnv Spec Compliance
 
+ | Requirement | Status |
+ |-------------|--------|
+ | Typed `Observation`, `Action`, `Reward` Pydantic models | ✅ |
+ | `step(action) → StepResult(obs, reward, done, info)` | ✅ |
+ | `reset() → ResetResult` | ✅ |
+ | `state() → StateResult` | ✅ |
+ | `openenv.yaml` metadata | ✅ |
+ | 3+ tasks defined | ✅ (2 active, 1 placeholder) |
+ | Grader scores in [0.0, 1.0] | ✅ |
+ | Shaped rewards (non-binary) | ✅ |
+ | Dockerfile + port 7860 | ✅ |
+ | `inference.py` with OpenAI client | ✅ |
+ | `validate.py` — all 19 checks pass | ✅ |
 
+ ---
+
+ ## License
 
+ MIT. Contract vulnerability data adapted from Certora audits on production DeFi protocols.
app.py CHANGED
@@ -4,31 +4,30 @@ app.py
  FastAPI server exposing the OpenEnv HTTP interface.
 
  Endpoints:
-     POST /reset              – start a new episode
-     POST /step               – take one action
-     GET  /state              – inspect internal state (debugging)
-     GET  /tasks              – list available tasks
-     GET  /health             – liveness probe
-     GET  /action_space       – action space description
-     GET  /observation_space  – observation space description
-
- Sessions are keyed by a UUID passed as the `session_id` query parameter.
- If omitted, a default single-session is used (fine for sequential runs).
  """
 
- import uuid
  from typing import Dict, Optional
 
  from fastapi import FastAPI, HTTPException, Query
- from fastapi.responses import JSONResponse
  from pydantic import BaseModel
 
  from env.schemas import Action, ActionType, TaskInfo
  from tasks.task1.environment import Task1Environment
 
- # ---------------------------------------------------------------------------
- # App init
- # ---------------------------------------------------------------------------
 
  app = FastAPI(
      title="Smart Contract Audit RL Environment",
@@ -36,38 +35,36 @@ app = FastAPI(
          "OpenEnv-compliant reinforcement learning environment for smart contract "
          "security analysis. Train and evaluate agents on real-world Solidity audit tasks."
      ),
-     version="1.0.0",
  )
 
- # ---------------------------------------------------------------------------
  # Session management
- # ---------------------------------------------------------------------------
 
- _sessions: Dict[str, Task1Environment] = {}
  DEFAULT_SESSION = "default"
 
- def _get_or_create_session(session_id: str, task_id: str = "task1_vuln_detection") -> Task1Environment:
-     if session_id not in _sessions:
-         env = _create_env(task_id)
-         _sessions[session_id] = env
-     return _sessions[session_id]
-
- def _create_env(task_id: str) -> Task1Environment:
-     if task_id == "task1_vuln_detection":
-         return Task1Environment()
-     # TODO: elif task_id == "task2_property_discovery": return Task2Environment()
-     # TODO: elif task_id == "task3_rule_checker": return Task3Environment()
-     raise HTTPException(
-         status_code=400,
-         detail=f"Unknown task_id '{task_id}'. Available: ['task1_vuln_detection']",
-     )
 
- # ---------------------------------------------------------------------------
- # Request/response models
- # ---------------------------------------------------------------------------
 
  class ResetRequest(BaseModel):
      task_id: str = "task1_vuln_detection"
@@ -79,48 +76,39 @@ class StepRequest(BaseModel):
      params: dict = {}
 
- # ---------------------------------------------------------------------------
  # Routes
- # ---------------------------------------------------------------------------
 
  @app.get("/health")
  def health():
-     """Liveness probe — returns 200 OK."""
-     return {"status": "ok", "version": "1.0.0"}
 
  @app.get("/tasks")
  def list_tasks():
-     """List all available tasks."""
      tasks = [
          TaskInfo(
              task_id="task1_vuln_detection",
              name="Targeted Vulnerability Detection",
              difficulty="medium",
-             description=(
-                 "Given a Solidity contract, identify the vulnerable function "
-                 "and describe the vulnerability type in 2-3 words."
-             ),
              status="active",
          ),
          TaskInfo(
              task_id="task2_property_discovery",
              name="Property Discovery",
              difficulty="hard",
-             description=(
-                 "Given a Solidity function, discover the natural-language property "
-                 "that describes its correct behaviour."
-             ),
-             status="placeholder",
          ),
          TaskInfo(
              task_id="task3_rule_checker",
              name="Rule Checker",
              difficulty="easy",
-             description=(
-                 "Given a property in English, identify which function in the contract "
-                 "violates that property."
-             ),
              status="placeholder",
          ),
      ]
@@ -144,7 +132,7 @@ def step(
      body: StepRequest,
      session_id: str = Query(default=DEFAULT_SESSION),
  ):
-     """Apply an action and advance the episode."""
      env = _sessions.get(session_id)
      if env is None:
          raise HTTPException(
@@ -156,8 +144,7 @@ def step(
      except ValueError:
          raise HTTPException(
              status_code=400,
-             detail=f"Unknown action_type '{body.action_type}'. "
-                    f"Valid: {[a.value for a in ActionType]}",
          )
      action = Action(action_type=action_type, params=body.params)
      try:
@@ -169,7 +156,7 @@ def step(
 
  @app.get("/state")
  def state(session_id: str = Query(default=DEFAULT_SESSION)):
-     """Return current internal state (for debugging; not for agents)."""
      env = _sessions.get(session_id)
      if env is None:
          raise HTTPException(
@@ -186,51 +173,26 @@ def action_space(task_id: str = "task1_vuln_detection"):
      return {
          "task_id": task_id,
          "actions": [
-             {
-                 "type": "list_functions",
-                 "params": {},
-                 "reward": -0.05,
-                 "description": "List all function names in the contract",
-             },
-             {
-                 "type": "get_function_code",
-                 "params": {"function_name": "string"},
-                 "reward": "+0.05 (target fn) / -0.10 (wrong fn)",
-                 "description": "Retrieve the full Solidity code of a function",
-             },
-             {
-                 "type": "get_function_summary",
-                 "params": {"function_name": "string"},
-                 "reward": "+0.03 (target fn) / -0.05 (wrong fn)",
-                 "description": "Retrieve the NatSpec comment/summary of a function",
-             },
-             {
-                 "type": "get_file_metadata",
-                 "params": {},
-                 "reward": -0.04,
-                 "description": "Retrieve contract-level metadata (version, author, description)",
-             },
-             {
-                 "type": "get_state_variable",
-                 "params": {"variable_name": "string (optional)"},
-                 "reward": -0.05,
-                 "description": "Retrieve a state variable or list all variables",
-             },
-             {
-                 "type": "get_call_graph",
-                 "params": {},
-                 "reward": -0.08,
-                 "description": "Retrieve the function call graph",
-             },
-             {
-                 "type": "submit",
-                 "params": {
-                     "function_name": "string",
-                     "vulnerability_type": "string",
-                 },
-                 "reward": "+5.0 (correct) / +1.0 (right fn, wrong vuln) / -1.5 (wrong)",
-                 "description": "Submit your final answer. Ends the episode.",
-             },
          ],
      }
      return {"error": f"No action space defined for task '{task_id}'"}
@@ -238,27 +200,27 @@ def action_space(task_id: str = "task1_vuln_detection"):
 
  @app.get("/observation_space")
  def observation_space():
-     """Describe the observation space."""
      return {
          "type": "object",
          "fields": {
-             "task_id": "string – active task identifier",
-             "contract_name": "string – name of the Solidity contract",
              "contract_description": "string – what the contract does",
-             "available_actions": "list[string] – valid action types",
-             "last_action": "string|null – the previous action type",
-             "last_action_result": "string|null – human-readable result of last action",
-             "step_count": "int – steps taken so far",
-             "cumulative_reward": "float – running reward total",
-             "done": "bool – True when episode is over",
-             "extra": "object – task-specific hints and metadata",
          },
      }
 
- # ---------------------------------------------------------------------------
  # Entry point
- # ---------------------------------------------------------------------------
 
  if __name__ == "__main__":
      import uvicorn
 
 FastAPI server exposing the OpenEnv HTTP interface.

 Endpoints:
+    POST /reset             – start a new episode
+    POST /step              – take one action
+    GET  /state             – inspect internal state (debugging)
+    GET  /tasks             – list available tasks
+    GET  /health            – liveness probe
+    GET  /action_space      – action space description for a task
+    GET  /observation_space – observation space description
+
+Sessions are keyed by a UUID in the `session_id` query parameter.
+If omitted, "default" is used (fine for sequential single-agent runs).
 """

 from typing import Dict, Optional
 from fastapi import FastAPI, HTTPException, Query
 from pydantic import BaseModel

 from env.schemas import Action, ActionType, TaskInfo
 from tasks.task1.environment import Task1Environment
+from tasks.task2.environment import Task2Environment

+# ─────────────────────────────────────────────────────────────────────────────
+# App
+# ─────────────────────────────────────────────────────────────────────────────

 app = FastAPI(
     title="Smart Contract Audit RL Environment",
     description=(
         "OpenEnv-compliant reinforcement learning environment for smart contract "
         "security analysis. Train and evaluate agents on real-world Solidity audit tasks."
     ),
+    version="1.1.0",
 )

+# ─────────────────────────────────────────────────────────────────────────────
 # Session management
+# ─────────────────────────────────────────────────────────────────────────────

+_sessions: Dict[str, object] = {}
 DEFAULT_SESSION = "default"

+TASK_ENV_MAP = {
+    "task1_vuln_detection": Task1Environment,
+    "task2_property_discovery": Task2Environment,
+    # TODO: "task3_rule_checker": Task3Environment,
+}


+def _create_env(task_id: str):
+    cls = TASK_ENV_MAP.get(task_id)
+    if cls is None:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unknown task_id '{task_id}'. Available: {list(TASK_ENV_MAP)}",
+        )
+    return cls()


+# ─────────────────────────────────────────────────────────────────────────────
+# Request bodies
+# ─────────────────────────────────────────────────────────────────────────────

 class ResetRequest(BaseModel):
     task_id: str = "task1_vuln_detection"

     params: dict = {}


+# ─────────────────────────────────────────────────────────────────────────────
 # Routes
+# ─────────────────────────────────────────────────────────────────────────────

 @app.get("/health")
 def health():
+    """Liveness probe."""
+    return {"status": "ok", "version": "1.1.0"}


 @app.get("/tasks")
 def list_tasks():
+    """List all tasks with their status."""
     tasks = [
         TaskInfo(
             task_id="task1_vuln_detection",
             name="Targeted Vulnerability Detection",
             difficulty="medium",
+            description="Given a Solidity contract, identify the vulnerable function and describe the vulnerability type in 2-3 words.",
             status="active",
         ),
         TaskInfo(
             task_id="task2_property_discovery",
             name="Property Discovery",
             difficulty="hard",
+            description="Given a Solidity function, write the natural-language property that describes its correct behaviour.",
+            status="active",
         ),
         TaskInfo(
             task_id="task3_rule_checker",
             name="Rule Checker",
             difficulty="easy",
+            description="Given a property in English and a Solidity contract, identify which function violates that property.",
             status="placeholder",
         ),
     ]

     body: StepRequest,
     session_id: str = Query(default=DEFAULT_SESSION),
 ):
+    """Apply one action and advance the episode."""
     env = _sessions.get(session_id)
     if env is None:
         raise HTTPException(

     except ValueError:
         raise HTTPException(
             status_code=400,
+            detail=f"Unknown action_type '{body.action_type}'. Valid: {[a.value for a in ActionType]}",
         )
     action = Action(action_type=action_type, params=body.params)
     try:


 @app.get("/state")
 def state(session_id: str = Query(default=DEFAULT_SESSION)):
+    """Return internal state for debugging (not for agents)."""
     env = _sessions.get(session_id)
     if env is None:
         raise HTTPException(

     return {
         "task_id": task_id,
         "actions": [
+            {"type": "list_functions", "params": {}, "reward": -0.05, "description": "List all function names"},
+            {"type": "get_function_code", "params": {"function_name": "string"}, "reward": "+0.05 (target) / -0.10 (other)", "description": "Get full Solidity source of a function"},
+            {"type": "get_function_summary", "params": {"function_name": "string"}, "reward": "+0.03 (target) / -0.05 (other)", "description": "Get NatSpec comment of a function"},
+            {"type": "get_file_metadata", "params": {}, "reward": -0.04, "description": "Get contract-level metadata"},
+            {"type": "get_state_variable", "params": {"variable_name": "string (optional)"}, "reward": -0.05, "description": "Get a state variable or list all"},
+            {"type": "get_call_graph", "params": {}, "reward": -0.08, "description": "Get function call graph"},
+            {"type": "submit", "params": {"function_name": "str", "vulnerability_type": "str"}, "reward": "+5.0 / +1.0 / -1.5", "description": "Submit answer. Ends episode."},
+        ],
+    }
+    if task_id == "task2_property_discovery":
+        return {
+            "task_id": task_id,
+            "actions": [
+                {"type": "get_function_code", "params": {}, "reward": -0.06, "description": "Read full source of the target function"},
+                {"type": "get_function_natspec", "params": {}, "reward": -0.08, "description": "Read NatSpec + expected behaviour"},
+                {"type": "get_file_natspec", "params": {}, "reward": -0.03, "description": "Read contract-level NatSpec"},
+                {"type": "get_related_functions", "params": {}, "reward": -0.06, "description": "List caller/callee functions with summaries"},
+                {"type": "get_io", "params": {}, "reward": -0.04, "description": "Get structured I/O + expected behaviour"},
+                {"type": "get_similar_rule", "params": {}, "reward": -0.20, "description": "Get a similar property from another contract"},
+                {"type": "submit_property", "params": {"property": "string"}, "reward": "0.0–5.0 (scored)", "description": "Submit property. ONE attempt. Ends episode."},
+            ],
+        }
     return {"error": f"No action space defined for task '{task_id}'"}


 @app.get("/observation_space")
 def observation_space():
+    """Describe the observation space (same for all tasks)."""
     return {
         "type": "object",
         "fields": {
+            "task_id": "string – active task identifier",
+            "contract_name": "string – Solidity contract name",
             "contract_description": "string – what the contract does",
+            "available_actions": "list[string] – valid action types for this task",
+            "last_action": "string|null – previous action type",
+            "last_action_result": "string|null – human-readable result of last action",
+            "step_count": "int – steps taken in this episode",
+            "cumulative_reward": "float – running reward total",
+            "done": "bool – True when episode is over",
+            "extra": "object – task-specific hints (target_function, hint, etc.)",
         },
     }


+# ─────────────────────────────────────────────────────────────────────────────
 # Entry point
+# ─────────────────────────────────────────────────────────────────────────────

 if __name__ == "__main__":
     import uvicorn
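The `TASK_ENV_MAP` registry added in this commit is a plain name-to-class dispatch. A minimal standalone sketch of the same pattern, using toy classes rather than the real `Task1Environment`/`Task2Environment` (and a `ValueError` in place of FastAPI's `HTTPException`):

```python
# Registry-based environment dispatch, mirroring TASK_ENV_MAP / _create_env.

class ToyTask1:
    task_id = "task1_vuln_detection"

class ToyTask2:
    task_id = "task2_property_discovery"

TASK_ENV_MAP = {
    "task1_vuln_detection": ToyTask1,
    "task2_property_discovery": ToyTask2,
}

def create_env(task_id: str):
    cls = TASK_ENV_MAP.get(task_id)
    if cls is None:
        # The server raises HTTPException(400) here; a plain error suffices offline.
        raise ValueError(f"Unknown task_id '{task_id}'. Available: {list(TASK_ENV_MAP)}")
    return cls()

env = create_env("task2_property_discovery")
print(env.task_id)  # → task2_property_discovery
```

Adding Task 3 later then only requires one new entry in the dict, which is why the commit leaves a `# TODO: "task3_rule_checker"` placeholder.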
data/data_loader.py CHANGED
@@ -2,8 +2,9 @@
 data_loader.py
 --------------
 Loads and indexes smart contract data from JSON files.
-Each contract is parsed into a structured dict; vulnerable functions
-are indexed for fast lookup by Task 1.
 """

 import json
@@ -16,25 +17,62 @@ DATA_DIR = os.path.join(os.path.dirname(__file__))
 DEFAULT_CONTRACTS_FILE = os.path.join(DATA_DIR, "contracts.json")
 DEFAULT_VUNERABILITIES_FILE = os.path.join(DATA_DIR, "vulnerabilities.json")

 def load_contracts(path: str = DEFAULT_CONTRACTS_FILE) -> List[Dict[str, Any]]:
     """Load and return all contracts from the JSON dataset."""
     with open(path, "r") as f:
         return json.load(f)


 def load_vulnerabilities(path: str = DEFAULT_VUNERABILITIES_FILE) -> List[Dict[str, Any]]:
     """Load and return all vulnerability entries from the JSON dataset."""
     with open(path, "r") as f:
         return json.load(f)

-
 def get_all_vulnerable_entries(
     contracts: List[Dict[str, Any]],
 ) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
     """
     Returns a flat list of (contract, function) pairs where
     function['vulnerable'] is True.
-    Used by Task 1 to populate the episode pool.
     """
     entries = []
     for contract in contracts:
@@ -48,10 +86,7 @@ def sample_episode(
     contracts: List[Dict[str, Any]],
     rng: Optional[random.Random] = None,
 ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
-    """
-    Randomly selects one (contract, vulnerable_function) pair.
-    Returns the contract dict and the target function dict.
-    """
     if rng is None:
         rng = random.Random()
     entries = get_all_vulnerable_entries(contracts)
@@ -60,31 +95,101 @@
     return rng.choice(entries)


-def get_function_by_name(
-    contract: Dict[str, Any], name: str
-) -> Optional[Dict[str, Any]]:
-    """Case-insensitive function lookup within a contract."""
-    for fn in contract.get("functions", []):
-        if fn["name"].lower() == name.lower():
-            return fn
-    return None


-def get_state_variable_by_name(
-    contract: Dict[str, Any], name: str
-) -> Optional[Dict[str, Any]]:
-    """Case-insensitive state variable lookup."""
-    for sv in contract.get("state_variables", []):
-        if sv["name"].lower() == name.lower():
-            return sv
-    return None


-def list_function_names(contract: Dict[str, Any]) -> List[str]:
-    """Return all function names in the contract."""
-    return [fn["name"] for fn in contract.get("functions", [])]


-def list_state_variable_names(contract: Dict[str, Any]) -> List[str]:
-    """Return all state variable names."""
-    return [sv["name"] for sv in contract.get("state_variables", [])]
 data_loader.py
 --------------
 Loads and indexes smart contract data from JSON files.
+
+Task 1 helpers – vulnerable function sampling
+Task 2 helpers – property function sampling, natspec, similar-rule lookup
 """

 import json

 DEFAULT_CONTRACTS_FILE = os.path.join(DATA_DIR, "contracts.json")
 DEFAULT_VUNERABILITIES_FILE = os.path.join(DATA_DIR, "vulnerabilities.json")

+
+# ────────────────────────────────────────────────────────────────
+# Core loaders
+# ────────────────────────────────────────────────────────────────
+
 def load_contracts(path: str = DEFAULT_CONTRACTS_FILE) -> List[Dict[str, Any]]:
     """Load and return all contracts from the JSON dataset."""
     with open(path, "r") as f:
         return json.load(f)


+def get_function_by_name(
+    contract: Dict[str, Any], name: str
+) -> Optional[Dict[str, Any]]:
+    """Case-insensitive function lookup within a contract."""
+    for fn in contract.get("functions", []):
+        if fn["name"].lower() == name.lower():
+            return fn
+    return None
+
+
+def get_state_variable_by_name(
+    contract: Dict[str, Any], name: str
+) -> Optional[Dict[str, Any]]:
+    """Case-insensitive state variable lookup."""
+    for sv in contract.get("state_variables", []):
+        if sv["name"].lower() == name.lower():
+            return sv
+    return None
+
+
+def list_function_names(contract: Dict[str, Any]) -> List[str]:
+    """Return all function names in the contract."""
+    return [fn["name"] for fn in contract.get("functions", [])]
+
+
+def list_state_variable_names(contract: Dict[str, Any]) -> List[str]:
+    """Return all state variable names."""
+    return [sv["name"] for sv in contract.get("state_variables", [])]
+
+
+# ────────────────────────────────────────────────────────────────
+# Task 1 helpers
+# ────────────────────────────────────────────────────────────────
+
 def load_vulnerabilities(path: str = DEFAULT_VUNERABILITIES_FILE) -> List[Dict[str, Any]]:
     """Load and return all vulnerability entries from the JSON dataset."""
     with open(path, "r") as f:
         return json.load(f)

 def get_all_vulnerable_entries(
     contracts: List[Dict[str, Any]],
 ) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
     """
     Returns a flat list of (contract, function) pairs where
     function['vulnerable'] is True.
     """
     entries = []
     for contract in contracts:

     contracts: List[Dict[str, Any]],
     rng: Optional[random.Random] = None,
 ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+    """Randomly selects one (contract, vulnerable_function) pair for Task 1."""
     if rng is None:
         rng = random.Random()
     entries = get_all_vulnerable_entries(contracts)

     return rng.choice(entries)


+# ────────────────────────────────────────────────────────────────
+# Task 2 helpers
+# ────────────────────────────────────────────────────────────────

+def get_all_property_entries(
+    contracts: List[Dict[str, Any]],
+) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
+    """
+    Returns a flat list of (contract, function) pairs where
+    function['property'] is not None.
+    Used by Task 2 to populate the episode pool.
+    """
+    entries = []
+    for contract in contracts:
+        for fn in contract.get("functions", []):
+            if fn.get("property") is not None:
+                entries.append((contract, fn))
+    return entries


+def sample_property_episode(
+    contracts: List[Dict[str, Any]],
+    rng: Optional[random.Random] = None,
+) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+    """Randomly selects one (contract, function-with-property) pair for Task 2."""
+    if rng is None:
+        rng = random.Random()
+    entries = get_all_property_entries(contracts)
+    if not entries:
+        raise ValueError("No functions with properties found in dataset.")
+    return rng.choice(entries)


+def get_related_functions(
+    contract: Dict[str, Any],
+    function_name: str,
+) -> List[str]:
+    """
+    Returns function names that are related to the given function:
+      - Functions that it calls (from call_graph)
+      - Functions that call it (reverse call_graph lookup)
+    """
+    name_lower = function_name.lower()
+    cg: Dict[str, List[str]] = contract.get("call_graph", {})
+    related = set()
+
+    # Direct callees (functions called by this function)
+    for callee in cg.get(function_name, []):
+        # Only include callees that are also functions in this contract
+        if get_function_by_name(contract, callee) is not None:
+            related.add(callee)
+
+    # Reverse: functions that call this function
+    for caller_name, callees in cg.items():
+        if any(c.lower() == name_lower for c in callees):
+            if get_function_by_name(contract, caller_name) is not None:
+                related.add(caller_name)
+
+    return sorted(related)


+def get_similar_rule(
+    contracts: List[Dict[str, Any]],
+    current_contract_name: str,
+    current_function_name: str,
+) -> Optional[Dict[str, Any]]:
+    """
+    Returns the similar_rule hint stored in the target function's property field,
+    enriched with the referenced function's natspec if available.
+
+    Returns a dict with keys: contract_name, function_name, property_hint, natspec.
+    Returns None if no similar_rule is defined.
+    """
+    # Find target function
+    for contract in contracts:
+        if contract["contract_name"] == current_contract_name:
+            fn = get_function_by_name(contract, current_function_name)
+            if fn and fn.get("property") and fn["property"].get("similar_rule"):
+                sr = fn["property"]["similar_rule"]
+                # Look up the referenced function's natspec
+                for c2 in contracts:
+                    if c2["contract_name"] == sr["contract_name"]:
+                        ref_fn = get_function_by_name(c2, sr["function_name"])
+                        if ref_fn:
+                            return {
+                                "contract_name": sr["contract_name"],
+                                "function_name": sr["function_name"],
+                                "property_hint": sr["property_hint"],
+                                "natspec": ref_fn.get("natspec", ""),
+                            }
+                # Referenced function not found — return hint only
+                return {
+                    "contract_name": sr["contract_name"],
+                    "function_name": sr["function_name"],
+                    "property_hint": sr["property_hint"],
+                    "natspec": "",
+                }
+    return None
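The new `get_related_functions` walks the stored `call_graph` in both directions and filters out callees that are not functions of the same contract (e.g. library calls). Its behaviour can be checked against a hand-built contract dict; the toy data below is illustrative, not from the real dataset:

```python
# Forward (callees) + reverse (callers) lookup over a toy call_graph,
# matching the logic added to data_loader.py in this commit.

def get_function_by_name(contract, name):
    for fn in contract.get("functions", []):
        if fn["name"].lower() == name.lower():
            return fn
    return None

def get_related_functions(contract, function_name):
    name_lower = function_name.lower()
    cg = contract.get("call_graph", {})
    related = set()
    # Callees of the target function, restricted to this contract's functions.
    for callee in cg.get(function_name, []):
        if get_function_by_name(contract, callee) is not None:
            related.add(callee)
    # Callers of the target function (reverse lookup).
    for caller_name, callees in cg.items():
        if any(c.lower() == name_lower for c in callees):
            if get_function_by_name(contract, caller_name) is not None:
                related.add(caller_name)
    return sorted(related)

toy = {
    "functions": [{"name": n} for n in ("deposit", "withdraw", "_transfer")],
    "call_graph": {
        "withdraw": ["_transfer", "SafeMath.sub"],  # external callee gets filtered out
        "deposit": ["_transfer"],
    },
}
print(get_related_functions(toy, "_transfer"))  # → ['deposit', 'withdraw'] (callers only)
print(get_related_functions(toy, "withdraw"))   # → ['_transfer']
```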
demo.py CHANGED
@@ -237,12 +237,18 @@ def _print_episode_summary(obs):
     print(f"  Steps taken : {obs.step_count}")
     print(f"  Total reward : {colour}{reward:+.2f}{RESET}")
     last = obs.last_action_result or ""
-    if "✅" in last:
         print(f"  {GREEN}Perfect score — full marks!{RESET}")
-    elif "⚠️" in last:
-        print(f"  {YELLOW}Partial credit — right function, imprecise vulnerability type.{RESET}")
-    else:
         print(f"  {RED}Incorrect — better luck next episode.{RESET}")
     print(f"{BOLD}{'═' * 64}{RESET}\n")


@@ -285,3 +291,67 @@ def main():

 if __name__ == "__main__":
     main()
     print(f"  Steps taken : {obs.step_count}")
     print(f"  Total reward : {colour}{reward:+.2f}{RESET}")
     last = obs.last_action_result or ""
+    if "✅ CORRECT" in last or "EXCELLENT" in last:
         print(f"  {GREEN}Perfect score — full marks!{RESET}")
+    elif "⚠️" in last or "PARTIAL" in last:
+        print(f"  {YELLOW}Partial credit.{RESET}")
+    elif "🟡 GOOD" in last:
+        print(f"  {YELLOW}Good — most key concepts matched!{RESET}")
+    elif "🟠" in last:
+        print(f"  {YELLOW}Partial — some key concepts matched.{RESET}")
+    elif "❌" in last:
         print(f"  {RED}Incorrect — better luck next episode.{RESET}")
+    else:
+        print(f"  {'Good effort!' if reward > 0 else 'Keep exploring next time.'}")
     print(f"{BOLD}{'═' * 64}{RESET}\n")


 if __name__ == "__main__":
     main()
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Task 2 demo
+# ─────────────────────────────────────────────────────────────────────────────
+
+DEMO_SCRIPTS_T2 = {
+    42: [
+        (ActionType.GET_FUNCTION_NATSPEC, {}, "First, read the NatSpec to understand intent and expected outputs."),
+        (ActionType.GET_IO, {}, "Check parameters, return type and expected behaviour."),
+        (ActionType.GET_FUNCTION_CODE, {}, "Read the actual Solidity code to confirm the behaviour."),
+        (ActionType.SUBMIT_PROPERTY,
+         {"property": "After a successful claimRewards call, all of the caller's accrued reward tokens are transferred to the caller and their rewards balance is set to zero. Reverts if the caller has no accrued rewards."},
+         "Confident about the property. Submitting!"),
+    ],
+}
+
+
+def run_auto_demo_t2(seed: int = 42, delay: float = 0.9):
+    """Run the scripted Task 2 demo."""
+    from tasks.task2.environment import Task2Environment
+
+    script = DEMO_SCRIPTS_T2.get(seed)
+    env = Task2Environment()
+    result = env.reset(seed=seed)
+    obs = result.observation
+
+    print()
+    print(f"{BOLD}{CYAN}╔══════════════════════════════════════════════════════════╗")
+    print(f"║   Smart Contract Audit RL Env · Task 2 Demo              ║")
+    print(f"╚══════════════════════════════════════════════════════════╝{RESET}")
+    print()
+    print(f"{BOLD}Mode:{RESET} Automated demo | {BOLD}Seed:{RESET} {seed}")
+    print(f"{BOLD}Task:{RESET} Property Discovery")
+    print()
+
+    fn_name = obs.extra.get("target_function", "?")
+    sig = obs.extra.get("target_signature", "")
+    print(f"{BOLD}Contract :{RESET} {obs.contract_name}")
+    print(f"{BOLD}Function :{RESET} {fn_name} ({sig})")
+    print(f"{BOLD}Goal     :{RESET} Write the natural-language property for '{fn_name}'")
+    print(DIVIDER)
+
+    if not script:
+        print(f"{YELLOW}No pre-written script for seed {seed}.{RESET}")
+        return
+
+    for at, params, commentary in script:
+        time.sleep(delay)
+        print(f"\n{CYAN}▶ Agent thinking:{RESET} {commentary}")
+        time.sleep(delay * 0.5)
+        step_result = env.step(Action(action_type=at, params=params))
+        sobs = step_result.observation
+        print(DIVIDER)
+        print(f"{BOLD}Step {sobs.step_count:2d}{RESET} [{at.value}] r={step_result.reward.value:+.2f} cum={sobs.cumulative_reward:+.2f}")
+        result_text = sobs.last_action_result or ""
+        colour = GREEN if step_result.reward.value > 0 else YELLOW
+        for line in result_text.split("\n")[:8]:
+            print(f"  {colour}{line[:90]}{RESET}")
+        print(DIVIDER)
+
+        if step_result.done:
+            _print_episode_summary(sobs)
+            return
env/schemas.py CHANGED
@@ -3,12 +3,12 @@ schemas.py
 ----------
 Typed Pydantic models implementing the OpenEnv interface spec.

-    Observation  - what the agent sees at each step
-    Action       - what the agent can send
-    StepResult   - returned by step()
-    ResetResult  - returned by reset()
-    StateResult  - returned by state()
-    Reward       - structured reward info
 """

 from __future__ import annotations
@@ -24,27 +24,28 @@ from pydantic import BaseModel, Field
 # ---------------------------------------------------------------------------

 class ActionType(str, Enum):
-    # Task 1 – Vulnerability Detection
-    LIST_FUNCTIONS = "list_functions"
-    GET_FUNCTION_CODE = "get_function_code"
     GET_FUNCTION_SUMMARY = "get_function_summary"
-    GET_FILE_METADATA = "get_file_metadata"
-    GET_STATE_VARIABLE = "get_state_variable"
-    GET_CALL_GRAPH = "get_call_graph"
-    SUBMIT = "submit"
-
-    # TODO: Task 2 – Property Discovery
-    # GET_SIMILAR_RULE = "get_similar_rule"
-    # GET_FILE_NATSPEC = "get_file_natspec"
-    # GET_FUNCTION_NATSPEC = "get_function_natspec"
-    # GET_RELATED_FUNCTIONS = "get_related_functions"
-    # GET_IO = "get_io"
-    # SUBMIT_PROPERTY = "submit_property"
-
-    # TODO: Task 3 – Rule Checker
     # GET_FORMALIZED_PROPERTY = "get_formalized_property"
-    # GET_FUNCTION_METADATA = "get_function_metadata"
-    # SUBMIT_FUNCTION = "submit_function"


 class Action(BaseModel):
@@ -54,7 +55,7 @@ class Action(BaseModel):
     action_type : one of ActionType enum values
     params      : optional key/value arguments, e.g.
                   {"function_name": "withdraw"} for GET_FUNCTION_CODE
-                  {"function_name": "withdraw", "vulnerability_type": "reentrancy"} for SUBMIT
     """
     action_type: ActionType
     params: Dict[str, Any] = Field(default_factory=dict)
@@ -71,16 +72,16 @@ class Observation(BaseModel):
     """
     What the agent receives from the environment.

-    task_id           : which task is active
-    contract_name     : name of the Solidity contract
     contract_description : high-level description of what the contract does
-    available_actions : list of valid ActionType strings
-    last_action       : the action that produced this observation (None on reset)
-    last_action_result: human-readable result of the last action
-    step_count        : number of steps taken so far
-    cumulative_reward : running reward total
-    done              : whether the episode has ended
-    extra             : any additional task-specific context
     """
     task_id: str
     contract_name: str
@@ -147,4 +148,4 @@ class TaskInfo(BaseModel):
     name: str
     difficulty: str
     description: str
-    status: str = "active"   # or "placeholder"
 
 ----------
 Typed Pydantic models implementing the OpenEnv interface spec.

+    Observation  – what the agent sees at each step
+    Action       – what the agent can send
+    StepResult   – returned by step()
+    ResetResult  – returned by reset()
+    StateResult  – returned by state()
+    Reward       – structured reward info
 """

 from __future__ import annotations

 # ---------------------------------------------------------------------------

 class ActionType(str, Enum):
+    # ── Task 1 – Vulnerability Detection ────────────────────────────────────
+    LIST_FUNCTIONS = "list_functions"
+    GET_FUNCTION_CODE = "get_function_code"
     GET_FUNCTION_SUMMARY = "get_function_summary"
+    GET_FILE_METADATA = "get_file_metadata"
+    GET_STATE_VARIABLE = "get_state_variable"
+    GET_CALL_GRAPH = "get_call_graph"
+    SUBMIT = "submit"
+
+    # ── Task 2 – Property Discovery ─────────────────────────────────────────
+    GET_SIMILAR_RULE = "get_similar_rule"            # -0.20
+    GET_FILE_NATSPEC = "get_file_natspec"            # -0.03
+    GET_FUNCTION_NATSPEC = "get_function_natspec"    # -0.08
+    GET_RELATED_FUNCTIONS = "get_related_functions"  # -0.06
+    GET_IO = "get_io"                                # -0.04
+    SUBMIT_PROPERTY = "submit_property"              # scored 0–5, one attempt
+
+    # ── Task 3 – Rule Checker ───────────────────────────────────────────────
+    # TODO: Task 3
     # GET_FORMALIZED_PROPERTY = "get_formalized_property"
+    # GET_FUNCTION_METADATA = "get_function_metadata"
+    # SUBMIT_FUNCTION = "submit_function"


 class Action(BaseModel):

     action_type : one of ActionType enum values
     params      : optional key/value arguments, e.g.
                   {"function_name": "withdraw"} for GET_FUNCTION_CODE
+                  {"property": "..."} for SUBMIT_PROPERTY
     """
     action_type: ActionType
     params: Dict[str, Any] = Field(default_factory=dict)

     """
     What the agent receives from the environment.

+    task_id            : which task is active
+    contract_name      : name of the Solidity contract
     contract_description : high-level description of what the contract does
+    available_actions  : list of valid ActionType strings
+    last_action        : the action that produced this observation (None on reset)
+    last_action_result : human-readable result of the last action
+    step_count         : number of steps taken so far
+    cumulative_reward  : running reward total
+    done               : whether the episode has ended
+    extra              : any additional task-specific context
     """
     task_id: str
     contract_name: str

     name: str
     difficulty: str
     description: str
+    status: str = "active"   # "active" | "placeholder"
eval.py CHANGED
@@ -3,264 +3,276 @@ eval.py
3
  -------
4
  Evaluation harness for the Smart Contract Audit RL Environment.
5
 
6
- Runs a configurable number of episodes per task, collecting grader scores
7
- and reward trajectories. Produces a detailed JSON report.
8
-
9
- Unlike inference.py (which uses an external LLM), this evaluates the
10
- *environment itself* using a built-in oracle agent β€” useful for:
11
- - Verifying grader correctness
12
- - Benchmarking reward shaping
13
- - Checking score distribution across vulnerability types
14
 
15
  Usage:
16
- python eval.py # all 8 vuln episodes
17
- python eval.py --episodes 16 # more episodes
18
- python eval.py --seed 0 --verbose # detailed per-step output
19
- python eval.py --out results.json # custom output file
 
 
20
  """
21
 
22
  import argparse
23
  import json
24
  import sys
25
- import time
26
  from typing import Any, Dict, List
27
 
28
  from tasks.task1.environment import Task1Environment
 
29
  from env.schemas import Action, ActionType
30
- from data.data_loader import load_contracts, get_all_vulnerable_entries
 
 
 
 
31
 
32
 
33
  # ─────────────────────────────────────────────────────────────────────────────
34
- # Oracle agent (always submits the ground-truth answer)
35
  # ─────────────────────────────────────────────────────────────────────────────
36
 
37
- def oracle_agent(env: Task1Environment, seed: int, verbose: bool = False) -> Dict[str, Any]:
38
- """
39
- Runs one episode using the oracle strategy:
40
- 1. list_functions
41
- 2. get_function_code (for the target function β€” peeked from state)
42
- 3. submit correct answer
43
-
44
- This gives an upper-bound score trajectory for the environment.
45
- Always ends with grader_score = 1.0.
46
- """
47
- reset_result = env.reset(seed=seed)
48
- obs = reset_result.observation
49
-
50
- steps_taken: List[Dict[str, Any]] = []
51
-
52
- def _step(at: ActionType, params: dict = None) -> Any:
53
- params = params or {}
54
- action = Action(action_type=at, params=params)
55
- result = env.step(action)
56
- entry = {
57
- "step": result.observation.step_count,
58
- "action": at.value,
59
- "params": params,
60
- "reward": result.reward.value,
61
- "reason": result.reward.reason,
62
- "cumulative": result.observation.cumulative_reward,
63
- "done": result.done,
64
- }
65
- steps_taken.append(entry)
66
- if verbose:
67
- done_flag = " [DONE]" if result.done else ""
68
- print(
69
- f" step {entry['step']:2d}: {at.value:25s} "
70
- f"r={result.reward.value:+.2f} cum={entry['cumulative']:+.2f}"
71
- f"{done_flag}"
72
- )
73
- return result
74
-
75
- # Peek at ground truth (oracle only)
76
- state = env.state()
77
- target_fn = state.target_function
78
-
79
- # Get ground-truth vulnerability from data
80
  contracts = load_contracts()
81
- vuln_issue = None
82
- for contract in contracts:
83
- for fn in contract.get("functions", []):
84
- if fn["name"].lower() == target_fn.lower() and fn.get("vulnerable"):
85
- # ! SINCE OUR MATCHER IS BASED ON FACT THAT EXPECTED STRING IS 2-3 WORDS, THIS DOESN'T MATCH WELL
86
- vuln_issue = fn["vulnerability_details"]["issue"]
87
- break
88
- if vuln_issue:
89
  break
90
 
91
  if verbose:
92
- print(f" Contract : {obs.contract_name}")
93
- print(f" Target : {target_fn} ({vuln_issue})")
94
-
95
- # Step 1: list functions (small cost, realistic)
96
- _step(ActionType.LIST_FUNCTIONS)
97
- # Step 2: read target function code (gets +0.05 shaping reward)
98
- _step(ActionType.GET_FUNCTION_CODE, {"function_name": target_fn})
99
- # Step 3: submit perfect answer
100
- result = _step(ActionType.SUBMIT, {
101
- "function_name": target_fn,
102
- "vulnerability_type": vuln_issue,
103
- })
104
-
105
- final_reward = result.reward.value
106
- if final_reward >= 4.9:
107
- grader_score = 1.0
108
- elif final_reward >= 0.9:
109
- grader_score = 0.5
110
- else:
111
- grader_score = 0.0
112
 
 
 
 
 
 
 
 
 
 
113
  return {
114
  "seed": seed,
115
  "contract": obs.contract_name,
116
- "target_function": target_fn,
117
  "vulnerability": vuln_issue,
118
- "grader_score": grader_score,
119
  "cumulative_reward": result.observation.cumulative_reward,
120
- "steps": steps_taken,
121
- "num_steps": len(steps_taken),
122
  }


# ─────────────────────────────────────────────────────────────────────────────
-# Partial agent (submits correct function, wrong vuln type)
# ─────────────────────────────────────────────────────────────────────────────

-def partial_agent(env: Task1Environment, seed: int) -> Dict[str, Any]:
-    """Submits right function, always uses 'unknown' as vulnerability type → score 0.5."""
-    reset_result = env.reset(seed=seed)
-    obs = reset_result.observation
-    state = env.state()
-    target_fn = state.target_function
-
-    action = Action(action_type=ActionType.SUBMIT, params={
-        "function_name": target_fn,
-        "vulnerability_type": "unknown vulnerability",
-    })
-    result = env.step(action)
    return {
        "seed": seed,
-        "grader_score": 0.5,
        "cumulative_reward": result.observation.cumulative_reward,
    }
146
 
147
 
148
- # ─────────────────────────────────────────────────────────────────────────────
149
- # Random agent (submits a random wrong function)
150
- # ──────────────────────────────────────────────��──────────────────────────────
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
151
 
152
- def random_agent(env: Task1Environment, seed: int) -> Dict[str, Any]:
153
- """Always submits 'constructor' β€” always wrong β†’ score 0.0."""
154
  env.reset(seed=seed)
155
- action = Action(action_type=ActionType.SUBMIT, params={
156
- "function_name": "constructor",
157
- "vulnerability_type": "reentrancy",
158
- })
159
- result = env.step(action)
160
- return {
161
- "seed": seed,
162
- "grader_score": 0.0,
163
- "cumulative_reward": result.observation.cumulative_reward,
164
- }
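The three baseline tiers are designed to produce a strict score ordering (oracle > partial > random). A tiny sanity-check sketch of that invariant, independent of the environment; the function name is mine:

```python
def check_tier_ordering(avgs: dict) -> bool:
    """Return True when the baseline tiers separate cleanly:
    oracle above partial, partial above random, over average grader scores."""
    return avgs["oracle"] > avgs["partial"] > avgs["random"]
```

If this ever fails, the reward shaping no longer distinguishes a perfect answer from a partial or wrong one.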


# ─────────────────────────────────────────────────────────────────────────────
-# Evaluation runner
# ─────────────────────────────────────────────────────────────────────────────

-def run_evaluation(
-    num_episodes: int = 8,
-    seed_offset: int = 0,
-    verbose: bool = False,
-    output_file: str = "eval_results.json",
-) -> None:
-    env = Task1Environment()
    contracts = load_contracts()
-    entries = get_all_vulnerable_entries(contracts)
-    vuln_types = list({fn["vulnerability_details"]["issue"] for _, fn in entries})

-    print("=" * 64)
-    print("Smart Contract Audit RL Environment — Evaluation")
-    print("=" * 64)
-    print(f"  Episodes  : {num_episodes}")
-    print(f"  Seed range: {seed_offset} – {seed_offset + num_episodes - 1}")
-    print(f"  Vulns in dataset: {len(entries)}")
-    print()
-
-    # ── Oracle agent ─────────────────────────────────────────────────────────
-    print("▶ Oracle agent (upper bound — always submits correct answer):")
-    oracle_episodes = []
-    for i in range(num_episodes):
-        seed = seed_offset + i
-        ep = oracle_agent(env, seed=seed, verbose=verbose)
-        oracle_episodes.append(ep)
-        icon = "✅" if ep["grader_score"] == 1.0 else "⚠️ "
-        print(
-            f"  {icon} seed={seed:3d} {ep['contract']:12s} "
-            f"{ep['target_function']:15s} score={ep['grader_score']:.1f} "
-            f"reward={ep['cumulative_reward']:+.2f}"
-        )
-
-    oracle_avg = sum(e["grader_score"] for e in oracle_episodes) / num_episodes
-    oracle_avg_r = sum(e["cumulative_reward"] for e in oracle_episodes) / num_episodes
-    print(f"\n  Oracle avg grader score : {oracle_avg:.3f}")
-    print(f"  Oracle avg reward       : {oracle_avg_r:+.2f}")
-
-    # ── Partial agent ────────────────────────────────────────────────────────
-    print("\n▶ Partial agent (right function, wrong vuln type → 0.5 each):")
-    partial_episodes = []
-    for i in range(num_episodes):
-        ep = partial_agent(env, seed=seed_offset + i)
-        partial_episodes.append(ep)
-    partial_avg = sum(e["grader_score"] for e in partial_episodes) / num_episodes
-    print(f"  Partial avg grader score: {partial_avg:.3f}")
-
-    # ── Random agent ─────────────────────────────────────────────────────────
-    print("\n▶ Random agent (always wrong → 0.0 each):")
-    random_episodes = []
    for i in range(num_episodes):
-        ep = random_agent(env, seed=seed_offset + i)
-        random_episodes.append(ep)
-    random_avg = sum(e["grader_score"] for e in random_episodes) / num_episodes
-    print(f"  Random avg grader score : {random_avg:.3f}")
-
-    # ── Score distribution ───────────────────────────────────────────────────
-    print("\n▶ Coverage across vulnerability types:")
-    seen = {}
-    for ep in oracle_episodes:
        v = ep.get("vulnerability", "unknown")
-        seen[v] = seen.get(v, 0) + 1
-    for v in sorted(seen):
-        print(f"  {seen[v]:2d}x {v}")

-    # ── Summary ──────────────────────────────────────────────────────────────
    print("\n" + "=" * 64)
-    print("SUMMARY")
    print("=" * 64)
-    print(f"  Oracle  (ceiling): {oracle_avg:.3f} {'✅' if oracle_avg == 1.0 else '⚠️ '}")
-    print(f"  Partial (partial): {partial_avg:.3f} ✅")
-    print(f"  Random  (floor)  : {random_avg:.3f} ✅")
-
-    assert oracle_avg == 1.0, "Oracle should always score 1.0"
-    assert partial_avg == 0.5, "Partial should always score 0.5"
-    assert random_avg == 0.0, "Random should always score 0.0"
-
-    print("\n  ✅ All score sanity checks passed.")
-
-    # ── Write results ────────────────────────────────────────────────────────
-    report = {
-        "num_episodes": num_episodes,
-        "seed_offset": seed_offset,
-        "agents": {
-            "oracle": {"avg_score": oracle_avg, "avg_reward": oracle_avg_r, "episodes": oracle_episodes},
-            "partial": {"avg_score": partial_avg, "episodes": partial_episodes},
-            "random": {"avg_score": random_avg, "episodes": random_episodes},
-        },
-        "vulnerability_coverage": seen,
    }
-    with open(output_file, "w") as f:
-        json.dump(report, f, indent=2)
-    print(f"\n  Results written to {output_file}")


# ─────────────────────────────────────────────────────────────────────────────
@@ -268,23 +280,50 @@ def run_evaluation(
# ─────────────────────────────────────────────────────────────────────────────

def main():
-    parser = argparse.ArgumentParser(description="Evaluate the SC Audit RL Environment")
-    parser.add_argument("--episodes", type=int, default=8,
-                        help="Number of episodes per agent (default: 8)")
-    parser.add_argument("--seed", type=int, default=42,
-                        help="Starting seed (default: 42)")
-    parser.add_argument("--verbose", action="store_true",
-                        help="Print per-step details for oracle agent")
-    parser.add_argument("--out", default="eval_results.json",
-                        help="Output JSON file (default: eval_results.json)")
    args = parser.parse_args()

-    run_evaluation(
-        num_episodes=args.episodes,
-        seed_offset=args.seed,
-        verbose=args.verbose,
-        output_file=args.out,
-    )


if __name__ == "__main__":

-------
Evaluation harness for the Smart Contract Audit RL Environment.

+Runs oracle / partial / baseline agents against Task 1 and Task 2,
+verifying that grader scores form a clear ordering and that reward
+shaping is meaningful.

Usage:
+    python eval.py                      # Task 1 + Task 2, 8 episodes each
+    python eval.py --task 1             # Task 1 only
+    python eval.py --task 2             # Task 2 only
+    python eval.py --episodes 16        # more episodes
+    python eval.py --seed 0 --verbose   # detailed per-step trace
+    python eval.py --out results.json   # custom output file
"""

import argparse
import json
import sys
from typing import Any, Dict, List

from tasks.task1.environment import Task1Environment
+from tasks.task2.environment import Task2Environment
from env.schemas import Action, ActionType
+from data.data_loader import (
+    load_contracts,
+    get_function_by_name,
+    get_all_vulnerable_entries,
+)


# ─────────────────────────────────────────────────────────────────────────────
+# Task 1 agents
# ─────────────────────────────────────────────────────────────────────────────

+def oracle_t1(env: Task1Environment, seed: int, verbose: bool = False) -> Dict[str, Any]:
+    """Always submits the exact ground-truth answer → score = 1.0."""
+    r = env.reset(seed=seed)
+    obs = r.observation
+    st = env.state()
+    fn_name = st.target_function
+
    contracts = load_contracts()
+    vuln_issue = ""
+    for c in contracts:
+        fn = get_function_by_name(c, fn_name)
+        if fn and fn.get("vulnerable"):
+            vuln_issue = fn["vulnerability_details"]["issue"]
            break

    if verbose:
+        print(f"    {obs.contract_name}.{fn_name}()  [{vuln_issue}]")

+    env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
+    env.step(Action(action_type=ActionType.GET_FUNCTION_CODE,
+                    params={"function_name": fn_name}))
+    result = env.step(Action(action_type=ActionType.SUBMIT,
+                             params={"function_name": fn_name,
+                                     "vulnerability_type": vuln_issue}))
+
+    v = result.reward.value
+    score = 1.0 if v >= 4.9 else (0.5 if v >= 0.9 else 0.0)
    return {
        "seed": seed,
        "contract": obs.contract_name,
+        "target_function": fn_name,
        "vulnerability": vuln_issue,
+        "grader_score": score,
        "cumulative_reward": result.observation.cumulative_reward,
    }


+def partial_t1(env: Task1Environment, seed: int) -> Dict[str, Any]:
+    """Right function, wrong vuln type → score = 0.5."""
+    env.reset(seed=seed)
+    fn_name = env.state().target_function
+    result = env.step(Action(action_type=ActionType.SUBMIT,
+                             params={"function_name": fn_name,
+                                     "vulnerability_type": "unknown"}))
+    v = result.reward.value
+    return {"seed": seed, "grader_score": 0.5 if v >= 0.9 else 0.0,
+            "cumulative_reward": result.observation.cumulative_reward}
+
+
+def random_t1(env: Task1Environment, seed: int) -> Dict[str, Any]:
+    """Always submits 'constructor' → score = 0.0."""
+    env.reset(seed=seed)
+    result = env.step(Action(action_type=ActionType.SUBMIT,
+                             params={"function_name": "constructor",
+                                     "vulnerability_type": "reentrancy"}))
+    return {"seed": seed, "grader_score": 0.0,
+            "cumulative_reward": result.observation.cumulative_reward}
+
+
# ─────────────────────────────────────────────────────────────────────────────
+# Task 2 agents
# ─────────────────────────────────────────────────────────────────────────────

+def oracle_t2(env: Task2Environment, seed: int, verbose: bool = False) -> Dict[str, Any]:
+    """Submits the exact ground-truth natural_language → score ≥ 0.70."""
+    r = env.reset(seed=seed)
+    obs = r.observation
+    fn_name = obs.extra["target_function"]
+    contract = obs.contract_name
+
+    contracts = load_contracts()
+    gt_text = ""
+    for c in contracts:
+        if c["contract_name"] == contract:
+            fn = get_function_by_name(c, fn_name)
+            if fn and fn.get("property"):
+                gt_text = fn["property"]["natural_language"]
+            break
+
+    if verbose:
+        print(f"    {contract}.{fn_name}()")
+
+    # read code first (realistic browsing step)
+    env.step(Action(action_type=ActionType.GET_FUNCTION_CODE))
+    result = env.step(Action(action_type=ActionType.SUBMIT_PROPERTY,
+                             params={"property": gt_text}))
+
+    r_val = result.reward.value
+    score = round(r_val / 5.0, 4) if r_val > 0 else 0.0
    return {
        "seed": seed,
+        "contract": contract,
+        "function": fn_name,
+        "grader_score": score,
        "cumulative_reward": result.observation.cumulative_reward,
    }
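Task 2 normalizes the terminal reward into a [0, 1] grader score by dividing by the 5.0 reward ceiling, with a positivity guard. A sketch of that mapping in isolation; the `round(…, 4)` and the guard follow the diff, while the helper name is mine:

```python
def normalize_t2_score(reward_value: float, max_reward: float = 5.0) -> float:
    """Scale a Task 2 terminal reward into [0, 1].

    Non-positive rewards (step penalties, invalid submissions) score 0.0;
    positive rewards are divided by the reward ceiling and rounded.
    """
    if reward_value <= 0:
        return 0.0
    return round(reward_value / max_reward, 4)
```

Unlike Task 1's discrete 0 / 0.5 / 1.0 tiers, this gives Task 2 a continuous score, matching its similarity-based grading.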


+def partial_t2(env: Task2Environment, seed: int) -> Dict[str, Any]:
+    """Submits the function's NatSpec comment — partial credit."""
+    r = env.reset(seed=seed)
+    obs = r.observation
+    contracts = load_contracts()
+    comment = ""
+    for c in contracts:
+        if c["contract_name"] == obs.contract_name:
+            fn = get_function_by_name(c, obs.extra["target_function"])
+            if fn:
+                comment = fn.get("comment", "")
+            break
+    result = env.step(Action(action_type=ActionType.SUBMIT_PROPERTY,
+                             params={"property": comment}))
+    r_val = result.reward.value
+    score = round(r_val / 5.0, 4) if r_val > 0 else 0.0
+    return {"seed": seed, "grader_score": score,
+            "cumulative_reward": result.observation.cumulative_reward}
+

+def empty_t2(env: Task2Environment, seed: int) -> Dict[str, Any]:
+    """Submits empty string → score = 0.0."""
    env.reset(seed=seed)
+    result = env.step(Action(action_type=ActionType.SUBMIT_PROPERTY,
+                             params={"property": ""}))
+    return {"seed": seed, "grader_score": 0.0,
+            "cumulative_reward": result.observation.cumulative_reward}


# ─────────────────────────────────────────────────────────────────────────────
+# Evaluation runners
# ─────────────────────────────────────────────────────────────────────────────

+def run_task1_eval(num_episodes: int, seed_offset: int, verbose: bool) -> Dict[str, Any]:
+    print("\n" + "=" * 64)
+    print("TASK 1 — Targeted Vulnerability Detection")
+    print("=" * 64)
    contracts = load_contracts()
+    entries = get_all_vulnerable_entries(contracts)
+    print(f"  Dataset: {len(contracts)} contracts, {len(entries)} vulnerable functions\n")

+    env = Task1Environment()
+
+    print("▶ Oracle agent (always submits correct answer):")
+    oracle_eps = []
    for i in range(num_episodes):
+        ep = oracle_t1(env, seed_offset + i, verbose=verbose)
+        oracle_eps.append(ep)
+        print(f"  seed={ep['seed']:3d} {ep['contract']:12s}.{ep['target_function']:18s}"
+              f" score={ep['grader_score']:.1f} reward={ep['cumulative_reward']:+.2f}")
+    oracle_avg = sum(e["grader_score"] for e in oracle_eps) / num_episodes
+    oracle_avg_r = sum(e["cumulative_reward"] for e in oracle_eps) / num_episodes
+    print(f"\n  Oracle avg score : {oracle_avg:.3f}   avg reward: {oracle_avg_r:+.2f}")
+
+    print("\n▶ Partial agent (right function, wrong vuln type → 0.5):")
+    partial_eps = [partial_t1(env, seed_offset + i) for i in range(num_episodes)]
+    partial_avg = sum(e["grader_score"] for e in partial_eps) / num_episodes
+    print(f"  Partial avg score: {partial_avg:.3f}")
+
+    print("\n▶ Random agent (always wrong → 0.0):")
+    random_eps = [random_t1(env, seed_offset + i) for i in range(num_episodes)]
+    random_avg = sum(e["grader_score"] for e in random_eps) / num_episodes
+    print(f"  Random avg score : {random_avg:.3f}")
+
+    vuln_seen: Dict[str, int] = {}
+    for ep in oracle_eps:
        v = ep.get("vulnerability", "unknown")
+        vuln_seen[v] = vuln_seen.get(v, 0) + 1
+    print("\n▶ Vulnerability type coverage:")
+    for v in sorted(vuln_seen):
+        print(f"  {vuln_seen[v]:2d}× {v}")
+
+    assert oracle_avg == 1.0, f"Oracle should be 1.0, got {oracle_avg}"
+    assert partial_avg == 0.5, f"Partial should be 0.5, got {partial_avg}"
+    assert random_avg == 0.0, f"Random should be 0.0, got {random_avg}"
+    print("\n  ✅ Task 1 score ordering: oracle(1.0) > partial(0.5) > random(0.0)")

+    return {
+        "task_id": "task1_vuln_detection",
+        "oracle": {"avg_score": oracle_avg, "avg_reward": oracle_avg_r, "episodes": oracle_eps},
+        "partial": {"avg_score": partial_avg, "episodes": partial_eps},
+        "random": {"avg_score": random_avg, "episodes": random_eps},
+        "vuln_coverage": vuln_seen,
+    }
+
+
+def run_task2_eval(num_episodes: int, seed_offset: int, verbose: bool) -> Dict[str, Any]:
    print("\n" + "=" * 64)
+    print("TASK 2 — Property Discovery")
    print("=" * 64)
+    from data.data_loader import get_all_property_entries
+    contracts = load_contracts()
+    entries = get_all_property_entries(contracts)
+    print(f"  Dataset: {len(entries)} functions with properties\n")
+
+    env = Task2Environment()
+
+    print("▶ Oracle agent (submits ground-truth natural language):")
+    oracle_eps = []
+    for i in range(num_episodes):
+        ep = oracle_t2(env, seed_offset + i, verbose=verbose)
+        oracle_eps.append(ep)
+        icon = "✅" if ep["grader_score"] >= 0.65 else "⚠️ "
+        print(f"  {icon} seed={ep['seed']:3d} {ep['contract']:12s}.{ep['function']:18s}"
+              f" score={ep['grader_score']:.3f} reward={ep['cumulative_reward']:+.2f}")
+    oracle_avg = sum(e["grader_score"] for e in oracle_eps) / num_episodes
+    oracle_avg_r = sum(e["cumulative_reward"] for e in oracle_eps) / num_episodes
+    print(f"\n  Oracle avg score : {oracle_avg:.3f}   avg reward: {oracle_avg_r:+.2f}")
+
+    print("\n▶ Partial agent (submits NatSpec comment — partial signal):")
+    partial_eps = [partial_t2(env, seed_offset + i) for i in range(num_episodes)]
+    partial_avg = sum(e["grader_score"] for e in partial_eps) / num_episodes
+    partial_avg_r = sum(e["cumulative_reward"] for e in partial_eps) / num_episodes
+    print(f"  Partial avg score: {partial_avg:.3f}   avg reward: {partial_avg_r:+.2f}")
+
+    print("\n▶ Empty agent (submits nothing → 0.0):")
+    empty_eps = [empty_t2(env, seed_offset + i) for i in range(num_episodes)]
+    empty_avg = sum(e["grader_score"] for e in empty_eps) / num_episodes
+    print(f"  Empty avg score  : {empty_avg:.3f}")
+
+    fn_seen: Dict[str, int] = {}
+    for ep in oracle_eps:
+        fn_seen[ep["function"]] = fn_seen.get(ep["function"], 0) + 1
+    print("\n▶ Function coverage:")
+    for fn in sorted(fn_seen):
+        print(f"  {fn_seen[fn]:2d}× {fn}")
+
+    assert oracle_avg > 0.60, f"Oracle avg {oracle_avg:.3f} should be > 0.60"
+    assert oracle_avg > partial_avg, "Oracle should beat partial"
+    assert partial_avg >= empty_avg, "Partial should be >= empty"
+    assert empty_avg == 0.0, f"Empty should be 0.0, got {empty_avg}"
+    print(f"\n  ✅ Task 2 score ordering: oracle({oracle_avg:.3f}) > partial({partial_avg:.3f}) > empty(0.0)")
+
+    return {
+        "task_id": "task2_property_discovery",
+        "oracle": {"avg_score": oracle_avg, "avg_reward": oracle_avg_r, "episodes": oracle_eps},
+        "partial": {"avg_score": partial_avg, "avg_reward": partial_avg_r, "episodes": partial_eps},
+        "empty": {"avg_score": empty_avg, "episodes": empty_eps},
+        "fn_coverage": fn_seen,
    }


# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────

def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate Task 1 and/or Task 2 of the SC Audit RL Environment"
+    )
+    parser.add_argument("--episodes", type=int, default=8,
+                        help="Episodes per agent tier (default: 8)")
+    parser.add_argument("--seed", type=int, default=42,
+                        help="Starting RNG seed (default: 42)")
+    parser.add_argument("--task", choices=["1", "2", "all"], default="all",
+                        help="Which task(s) to evaluate (default: all)")
+    parser.add_argument("--verbose", action="store_true",
+                        help="Print per-episode target details")
+    parser.add_argument("--out", default="eval_results.json",
+                        help="Output file (default: eval_results.json)")
    args = parser.parse_args()

+    report: Dict[str, Any] = {
+        "num_episodes": args.episodes,
+        "seed_offset": args.seed,
+    }
+
+    if args.task in ("1", "all"):
+        report["task1"] = run_task1_eval(args.episodes, args.seed, args.verbose)
+
+    if args.task in ("2", "all"):
+        report["task2"] = run_task2_eval(args.episodes, args.seed, args.verbose)
+
+    # ── Summary ──────────────────────────────────────────────────────────────
+    print("\n" + "=" * 64)
+    print("EVALUATION COMPLETE")
+    print("=" * 64)
+    if "task1" in report:
+        t1 = report["task1"]
+        print(f"  Task 1  oracle={t1['oracle']['avg_score']:.3f} "
+              f"partial={t1['partial']['avg_score']:.3f} "
+              f"random={t1['random']['avg_score']:.3f}")
+    if "task2" in report:
+        t2 = report["task2"]
+        print(f"  Task 2  oracle={t2['oracle']['avg_score']:.3f} "
+              f"partial={t2['partial']['avg_score']:.3f} "
+              f"empty={t2['empty']['avg_score']:.3f}")
+
+    with open(args.out, "w") as f:
+        json.dump(report, f, indent=2)
+    print(f"\n  Results written to {args.out}")


if __name__ == "__main__":
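The new eval.py writes a nested JSON report keyed by task. A minimal consumer sketch; the schema keys (`task1`, `oracle`, `avg_score`, …) follow the diff, while the summarizing function is mine:

```python
import json

def summarize_report(report: dict) -> dict:
    """Collapse an eval.py report into {task_key: {tier: avg_score}}."""
    summary = {}
    for task_key in ("task1", "task2"):
        task = report.get(task_key)
        if not task:
            continue
        # Keep only tier entries (dicts carrying an avg_score), skipping
        # scalar fields like task_id and coverage maps.
        summary[task_key] = {
            tier: data["avg_score"]
            for tier, data in task.items()
            if isinstance(data, dict) and "avg_score" in data
        }
    return summary

report = {
    "num_episodes": 8,
    "task1": {
        "task_id": "task1_vuln_detection",
        "oracle": {"avg_score": 1.0, "avg_reward": 5.1, "episodes": []},
        "partial": {"avg_score": 0.5, "episodes": []},
        "random": {"avg_score": 0.0, "episodes": []},
    },
}
print(json.dumps(summarize_report(report)))
# → {"task1": {"oracle": 1.0, "partial": 0.5, "random": 0.0}}
```

This is how a dashboard or CI step might ingest `eval_results.json` without depending on the per-episode payloads.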
inference.py CHANGED
@@ -2,14 +2,13 @@
inference.py
------------
Baseline inference script for the Smart Contract Audit RL Environment.

-Uses the OpenAI-compatible API client to run an LLM agent against Task 1.
-Tasks 2 and 3 are placeholders — they reset and immediately record 0.0.
-
-Environment variables required:
-    API_BASE_URL – LLM endpoint (e.g. https://api.openai.com/v1)
-    MODEL_NAME   – model identifier (e.g. gpt-4o-mini)
-    HF_TOKEN     – API key (passed as Authorization: Bearer <HF_TOKEN>)

Usage:
    python inference.py
@@ -17,306 +16,290 @@ Usage:
Output:
    Per-task scores printed to stdout.
    Final baseline scores written to baseline_scores.json.
"""

import json
import os
import sys
import time
-from typing import Any, Dict, List, Optional

from openai import OpenAI

-# ---------------------------------------------------------------------------
-# Import the env directly (no HTTP overhead for baseline)
-# ---------------------------------------------------------------------------
from tasks.task1.environment import Task1Environment
from env.schemas import Action, ActionType

-# ---------------------------------------------------------------------------
-# Config
-# ---------------------------------------------------------------------------

API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
-HF_TOKEN = os.environ.get("HF_TOKEN", "")

if not HF_TOKEN:
-    print("WARNING: HF_TOKEN is not set. API calls may fail.", file=sys.stderr)
-
-MAX_STEPS = 15        # Safety limit per episode
-NUM_EPISODES = 3      # Episodes per task
-TASK1_SEED_BASE = 42  # Reproducible seeds
-
-
-# ---------------------------------------------------------------------------
-# OpenAI client
-# ---------------------------------------------------------------------------
-
-client = OpenAI(
-    api_key=HF_TOKEN,
-    base_url=API_BASE_URL,
-)
-
-
-# ---------------------------------------------------------------------------
-# System prompt
-# ---------------------------------------------------------------------------
-
-SYSTEM_PROMPT = """You are an expert smart contract security auditor.
-
-You are given a Solidity contract and must identify the SINGLE most critical vulnerable function and name its vulnerability type.
-
-## Available Actions
-You interact by choosing ONE action per turn from:
-
-1. list_functions
-   → {"action": "list_functions", "params": {}}
-
-2. get_function_code
-   → {"action": "get_function_code", "params": {"function_name": "<name>"}}
-
-3. get_function_summary
-   → {"action": "get_function_summary", "params": {"function_name": "<name>"}}
-
-4. get_file_metadata
-   → {"action": "get_file_metadata", "params": {}}
-
-5. get_state_variable
-   → {"action": "get_state_variable", "params": {"variable_name": "<name>"}}
-   (omit variable_name to list all variables)
-
-6. get_call_graph
-   → {"action": "get_call_graph", "params": {}}
-
-7. submit (ENDS THE EPISODE)
-   → {"action": "submit", "params": {"function_name": "<name>", "vulnerability_type": "<2-3 word description>"}}
-
-## Strategy
-- Start with list_functions and get_file_metadata to understand the contract
-- Inspect suspicious functions (withdraw, transfer, emergency*, stake, etc.)
-- Submit when you are confident about the vulnerable function
-
-## Output Format
-Always respond with a single JSON object:
-{"action": "<action_type>", "params": {...}}
-Do NOT include any other text — only valid JSON.
-"""
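The system prompt above asks the model to reply with exactly one JSON object of the form `{"action": ..., "params": {...}}`. A defensive parser sketch for such replies; the fall-back to `list_functions` mirrors the script's behaviour, while the function name and the explicit action whitelist are mine:

```python
import json

# Action names from the prompt above (assumed to match ActionType values).
VALID_ACTIONS = {
    "list_functions", "get_function_code", "get_function_summary",
    "get_file_metadata", "get_state_variable", "get_call_graph", "submit",
}

def parse_agent_reply(raw: str) -> tuple[str, dict]:
    """Parse an LLM reply into (action, params).

    Falls back to the safe 'list_functions' action on malformed JSON,
    a missing "action" key, or an unknown action name, so one bad reply
    never aborts the episode.
    """
    try:
        payload = json.loads(raw)
        action = payload["action"]
        params = payload.get("params", {})
        if action not in VALID_ACTIONS or not isinstance(params, dict):
            raise ValueError(f"bad action payload: {action!r}")
        return action, params
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return "list_functions", {}
```

Whitelisting actions up front means a hallucinated action name degrades to a cheap exploration step instead of an environment error.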


-def build_user_message(obs: Dict[str, Any]) -> str:
-    """Format the observation as a user message."""
-    lines = [
-        f"=== CONTRACT: {obs['contract_name']} ===",
-        f"Description: {obs['contract_description']}",
-        f"Step: {obs['step_count']} | Cumulative reward: {obs['cumulative_reward']:.2f}",
-        "",
-        f"Last action: {obs['last_action'] or 'None'}",
-        f"Result: {obs['last_action_result'] or 'Episode just started'}",
-        "",
-        f"Available actions: {', '.join(obs['available_actions'])}",
-    ]
-    if obs.get("extra", {}).get("hint"):
-        lines.append(f"Hint: {obs['extra']['hint']}")
-    return "\n".join(lines)


-# ---------------------------------------------------------------------------
-# Agent loop
-# ---------------------------------------------------------------------------

-def run_episode(env: Task1Environment, seed: int, episode_num: int) -> Dict[str, Any]:
-    """Run one episode and return result info."""
-    print(f"\n  Episode {episode_num} (seed={seed})")

-    reset_result = env.reset(seed=seed)
-    obs = reset_result.observation.model_dump()

-    print(f"  Contract: {obs['contract_name']}")

-    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
-    final_score = 0.0
-    final_reward = 0.0
-    steps = 0
-    done = False

-    for step_num in range(MAX_STEPS):
-        user_msg = build_user_message(obs)
-        messages.append({"role": "user", "content": user_msg})

-        # LLM call
        try:
-            response = client.chat.completions.create(
-                model=MODEL_NAME,
-                messages=messages,
-                max_tokens=256,
-                temperature=0.0,
            )
-            raw = response.choices[0].message.content.strip()
        except Exception as e:
-            print(f"  LLM error at step {step_num}: {e}", file=sys.stderr)
            break

-        # Parse action
        try:
            parsed = json.loads(raw)
-            action_type = ActionType(parsed["action"])
            params = parsed.get("params", {})
-        except Exception as e:
-            print(f"  Parse error: {e} | Raw: {raw[:100]}", file=sys.stderr)
-            # Default safe action
-            action_type = ActionType.LIST_FUNCTIONS
-            params = {}

-        action = Action(action_type=action_type, params=params)
        messages.append({"role": "assistant", "content": raw})
-
-        # Step
-        step_result = env.step(action)
-        obs = step_result.observation.model_dump()
-        done = step_result.done
-        steps += 1
-        final_reward = obs["cumulative_reward"]
-
-        print(
-            f"  Step {step_num+1}: {action_type.value} | "
-            f"reward={step_result.reward.value:+.2f} | "
-            f"cumulative={final_reward:.2f}"
-        )
-
-        if done:
-            # Determine grader score from reward
-            last_reward = step_result.reward.value
-            if last_reward >= 4.9:
-                final_score = 1.0
-            elif last_reward >= 0.9:
-                final_score = 0.5
-            else:
-                final_score = 0.0
-            print(f"  → DONE | grader_score={final_score:.1f}")
            break

-    if not done:
-        print(f"  → MAX STEPS reached without submission. Score=0.0")
-
-    return {
-        "episode": episode_num,
-        "seed": seed,
-        "contract": obs["contract_name"],
-        "steps": steps,
-        "cumulative_reward": final_reward,
-        "grader_score": final_score,
-        "done": done,
-    }


-def run_task1(num_episodes: int = NUM_EPISODES) -> Dict[str, Any]:
-    """Run Task 1 and return aggregate scores."""
    print("\n" + "="*60)
    print("TASK 1: Targeted Vulnerability Detection")
    print("="*60)
-
    env = Task1Environment()
-    episodes = []
-
-    for i in range(num_episodes):
-        seed = TASK1_SEED_BASE + i
-        result = run_episode(env, seed=seed, episode_num=i + 1)
-        episodes.append(result)
-        time.sleep(0.5)  # Rate limit courtesy
-
-    scores = [e["grader_score"] for e in episodes]
-    avg = sum(scores) / len(scores) if scores else 0.0
-    avg_reward = sum(e["cumulative_reward"] for e in episodes) / len(episodes)
-
-    print(f"\n  Task 1 Results:")
-    print(f"    Episodes: {num_episodes}")
-    print(f"    Grader scores: {scores}")
-    print(f"    Average grader score: {avg:.3f}")
-    print(f"    Average cumulative reward: {avg_reward:.2f}")
-
-    return {
-        "task_id": "task1_vuln_detection",
-        "name": "Targeted Vulnerability Detection",
-        "status": "active",
-        "num_episodes": num_episodes,
-        "episodes": episodes,
-        "avg_grader_score": avg,
-        "avg_cumulative_reward": avg_reward,
-    }


-def run_task2_placeholder() -> Dict[str, Any]:
-    """Task 2 placeholder — returns 0.0 score."""
    print("\n" + "="*60)
-    print("TASK 2: Property Discovery [PLACEHOLDER — not implemented]")
    print("="*60)
-    print("  Skipping. Score: 0.0")
-    return {
-        "task_id": "task2_property_discovery",
-        "name": "Property Discovery",
-        "status": "placeholder",
-        "num_episodes": 0,
-        "episodes": [],
-        "avg_grader_score": 0.0,
-        "avg_cumulative_reward": 0.0,
-    }


def run_task3_placeholder() -> Dict[str, Any]:
-    """Task 3 placeholder — returns 0.0 score."""
    print("\n" + "="*60)
    print("TASK 3: Rule Checker [PLACEHOLDER — not implemented]")
    print("="*60)
    print("  Skipping. Score: 0.0")
-    return {
-        "task_id": "task3_rule_checker",
-        "name": "Rule Checker",
-        "status": "placeholder",
-        "num_episodes": 0,
-        "episodes": [],
-        "avg_grader_score": 0.0,
-        "avg_cumulative_reward": 0.0,
-    }


-# ---------------------------------------------------------------------------
# Main
-# ---------------------------------------------------------------------------

def main():
    print("Smart Contract Audit RL Environment — Baseline Inference")
    print(f"Model: {MODEL_NAME} | Base URL: {API_BASE_URL}")

-    results = {
-        "model": MODEL_NAME,
-        "base_url": API_BASE_URL,
-        "tasks": [],
-    }
-
-    t1 = run_task1(num_episodes=NUM_EPISODES)
-    t2 = run_task2_placeholder()
    t3 = run_task3_placeholder()

-    results["tasks"] = [t1, t2, t3]

-    # Summary
-    active_tasks = [t for t in results["tasks"] if t["status"] == "active"]
-    overall = (
-        sum(t["avg_grader_score"] for t in active_tasks) / len(active_tasks)
-        if active_tasks else 0.0
-    )
    results["overall_avg_score"] = overall

    print("\n" + "="*60)
    print("BASELINE SUMMARY")
    print("="*60)
    for t in results["tasks"]:
-        status = "✅" if t["status"] == "active" else "⏳"
-        print(f"  {status} {t['name']}: {t['avg_grader_score']:.3f}")
-    print(f"  Overall (active tasks): {overall:.3f}")

-    # Write scores file
    with open("baseline_scores.json", "w") as f:
        json.dump(results, f, indent=2)
    print("\n  Scores written to baseline_scores.json")
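Both scripts resolve their LLM connection settings from environment variables with documented defaults. A standalone sketch of that pattern; the variable names and defaults come from the docstring, while the `load_config` helper and its dict-in/dict-out shape are mine (chosen so the logic is testable without touching the real environment):

```python
import os

def load_config(env: dict) -> dict:
    """Resolve LLM client settings, falling back to the documented defaults."""
    return {
        "base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("MODEL_NAME", "gpt-4o-mini"),
        "api_key": env.get("HF_TOKEN", ""),  # empty key → startup warning
    }

cfg = load_config(dict(os.environ))
```

Passing the mapping in explicitly, rather than reading `os.environ` inside the function, makes the fallback behaviour easy to unit-test.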
 
  inference.py
  ------------
  Baseline inference script for the Smart Contract Audit RL Environment.
+ Implements Task 1 (Vulnerability Detection) and Task 2 (Property Discovery).
+ Task 3 is a placeholder that returns 0.0.

+ Environment variables:
+     API_BASE_URL – LLM API endpoint (e.g. https://api.openai.com/v1)
+     MODEL_NAME   – model identifier (e.g. gpt-4o-mini)
+     HF_TOKEN     – API key

  Usage:
      python inference.py

  Output:
      Per-task scores printed to stdout.
      Final baseline scores written to baseline_scores.json.
+
+ Runtime: < 5 minutes on 3 episodes per task with gpt-4o-mini.
  """

  import json
  import os
  import sys
  import time
+ from typing import Any, Dict, List

  from openai import OpenAI

  from tasks.task1.environment import Task1Environment
+ from tasks.task2.environment import Task2Environment
  from env.schemas import Action, ActionType

+ # ─────────────────────────────────────────────────────────────────────────────
+ # Configuration
+ # ─────────────────────────────────────────────────────────────────────────────

  API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
+ MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")

  if not HF_TOKEN:
+     print("WARNING: HF_TOKEN not set. API calls may fail.", file=sys.stderr)
+
+ MAX_STEPS_T1 = 15
+ MAX_STEPS_T2 = 10
+ NUM_EPISODES = 3
+ SEED_BASE_T1 = 42
+ SEED_BASE_T2 = 10
+
+ client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Task 1 agent
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ T1_SYSTEM = """You are an expert Solidity smart contract security auditor.
+
+ Given a contract, identify the ONE vulnerable function and its vulnerability type.
+
+ ## Actions (choose ONE per turn, respond with JSON only):
+ {"action": "list_functions", "params": {}}
+ {"action": "get_function_code", "params": {"function_name": "<name>"}}
+ {"action": "get_function_summary", "params": {"function_name": "<name>"}}
+ {"action": "get_file_metadata", "params": {}}
+ {"action": "get_state_variable", "params": {"variable_name": "<name>"}}
+ {"action": "get_call_graph", "params": {}}
+ {"action": "submit", "params": {"function_name": "<name>", "vulnerability_type": "<2-3 words>"}}
+
+ ## Strategy:
+ 1. list_functions first to see the attack surface
+ 2. Inspect suspicious functions (withdraw, drain, buy, stake, claim, setPrice, bid, finalize)
+ 3. Look for: reentrancy, missing access control, integer overflow, tx.origin, front-running,
+    timestamp dependence, denial of service, unchecked return value
+ 4. Submit when confident
+
+ Respond ONLY with valid JSON. No explanation, no markdown."""
+
+
+ def _t1_user_msg(obs: Dict[str, Any]) -> str:
+     return (
+         f"Contract: {obs['contract_name']}\n"
+         f"Description: {obs['contract_description']}\n"
+         f"Step: {obs['step_count']} | Reward: {obs['cumulative_reward']:.2f}\n\n"
+         f"Last action: {obs['last_action'] or 'None'}\n"
+         f"Result: {obs['last_action_result'] or 'Episode started.'}"
+     )
+
+
+ def run_t1_episode(env: Task1Environment, seed: int, ep: int) -> Dict[str, Any]:
+     r = env.reset(seed=seed)
+     obs = r.observation.model_dump()
+     print(f"  ep={ep} seed={seed} contract={obs['contract_name']}")
+
+     messages = [{"role": "system", "content": T1_SYSTEM}]
+     grader_score = 0.0
+     cum_reward = 0.0
+
+     for step in range(MAX_STEPS_T1):
+         messages.append({"role": "user", "content": _t1_user_msg(obs)})
+         try:
+             resp = client.chat.completions.create(
+                 model=MODEL_NAME, messages=messages,
+                 max_tokens=200, temperature=0.0,
+             )
+             raw = resp.choices[0].message.content.strip()
+         except Exception as e:
+             print(f"  LLM error: {e}", file=sys.stderr)
+             break
+
+         try:
+             parsed = json.loads(raw)
+             at = ActionType(parsed["action"])
+             params = parsed.get("params", {})
+         except Exception:
+             at, params = ActionType.LIST_FUNCTIONS, {}
+
+         messages.append({"role": "assistant", "content": raw})
+         result = env.step(Action(action_type=at, params=params))
+         obs = result.observation.model_dump()
+         print(f"  step {step+1:2d}: {at.value:25s} r={result.reward.value:+.2f}")
+
+         if result.done:
+             v = result.reward.value
+             grader_score = 1.0 if v >= 4.9 else (0.5 if v >= 0.9 else 0.0)
+             cum_reward = obs["cumulative_reward"]
+             break
+         time.sleep(0.3)
+
+     print(f"  → grader_score={grader_score:.1f} cum_reward={cum_reward:.2f}")
+     return {"episode": ep, "seed": seed, "contract": obs["contract_name"],
+             "grader_score": grader_score, "cumulative_reward": cum_reward}
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Task 2 agent
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ T2_SYSTEM = """You are a formal methods engineer specialising in Solidity smart contracts.
+
+ You will be shown a specific Solidity function. Your task is to write a precise
+ natural-language property (invariant / postcondition) that describes what the
+ function guarantees when it succeeds.
+
+ A good property covers:
+   - What state changes (balances, counters, flags)
+   - What assets are transferred (ETH, tokens, NFTs)
+   - What return value is produced (for view functions)
+   - Under what conditions it reverts
+
+ ## Actions (respond with JSON only, ONE action per turn):
+ {"action": "get_function_code", "params": {}}
+ {"action": "get_function_natspec", "params": {}}
+ {"action": "get_file_natspec", "params": {}}
+ {"action": "get_related_functions", "params": {}}
+ {"action": "get_io", "params": {}}
+ {"action": "get_similar_rule", "params": {}}
+ {"action": "submit_property", "params": {"property": "<your full property text>"}}
+
+ ## Rules:
+ - You have ONE submit_property attempt. Make it count.
+ - Use get_function_natspec and get_io first — they give the most signal.
+ - get_similar_rule costs more (-0.20) but shows a parallel property from another contract.
+ - Write 2–4 sentences. Be specific about variable names and amounts.
+ - Do NOT guess — read the code first.
+
+ Respond ONLY with valid JSON. No markdown, no explanation."""
+
+
+ def _t2_user_msg(obs: Dict[str, Any]) -> str:
+     extra = obs.get("extra", {})
+     return (
+         f"Contract : {obs['contract_name']}\n"
+         f"Function : {extra.get('target_function', '?')} "
+         f"({extra.get('target_signature', '')})\n"
+         f"Step: {obs['step_count']} | Reward: {obs['cumulative_reward']:.2f}\n\n"
+         f"Last action: {obs['last_action'] or 'None'}\n"
+         f"Result:\n{obs['last_action_result'] or 'Episode started — begin exploring.'}"
+     )
+
+
+ def run_t2_episode(env: Task2Environment, seed: int, ep: int) -> Dict[str, Any]:
+     r = env.reset(seed=seed)
+     obs = r.observation.model_dump()
+     fn = obs["extra"].get("target_function", "?")
+     print(f"  ep={ep} seed={seed} {obs['contract_name']}.{fn}()")
+
+     messages = [{"role": "system", "content": T2_SYSTEM}]
+     grader_score = 0.0
+     cum_reward = 0.0
+
+     for step in range(MAX_STEPS_T2):
+         messages.append({"role": "user", "content": _t2_user_msg(obs)})
          try:
+             resp = client.chat.completions.create(
+                 model=MODEL_NAME, messages=messages,
+                 max_tokens=400, temperature=0.0,
              )
+             raw = resp.choices[0].message.content.strip()
          except Exception as e:
+             print(f"  LLM error: {e}", file=sys.stderr)
              break

          try:
              parsed = json.loads(raw)
+             at = ActionType(parsed["action"])
              params = parsed.get("params", {})
+         except Exception:
+             at, params = ActionType.GET_FUNCTION_CODE, {}

          messages.append({"role": "assistant", "content": raw})
+         result = env.step(Action(action_type=at, params=params))
+         obs = result.observation.model_dump()
+         r_val = result.reward.value
+         print(f"  step {step+1:2d}: {at.value:25s} r={r_val:+.2f}")
+
+         if result.done:
+             grader_score = round(r_val / 5.0, 3) if r_val > 0 else 0.0
+             cum_reward = obs["cumulative_reward"]
              break
+         time.sleep(0.3)
+
+     print(f"  → grader_score={grader_score:.3f} cum_reward={cum_reward:.2f}")
+     return {"episode": ep, "seed": seed,
+             "contract": obs["contract_name"], "function": fn,
+             "grader_score": grader_score, "cumulative_reward": cum_reward}
+ # ─────────────────────────────────────────────────────────────────────────────
+ # Task runners
+ # ─────────────────────────────────────────────────────────────────────────────

+ def run_task1(n: int = NUM_EPISODES) -> Dict[str, Any]:
      print("\n" + "="*60)
      print("TASK 1: Targeted Vulnerability Detection")
      print("="*60)
      env = Task1Environment()
+     episodes = [run_t1_episode(env, SEED_BASE_T1 + i, i+1) for i in range(n)]
+     avg_s = sum(e["grader_score"] for e in episodes) / n
+     avg_r = sum(e["cumulative_reward"] for e in episodes) / n
+     print(f"\n  Avg grader score : {avg_s:.3f}")
+     print(f"  Avg cum reward   : {avg_r:.2f}")
+     return {"task_id": "task1_vuln_detection", "name": "Targeted Vulnerability Detection",
+             "status": "active", "num_episodes": n, "episodes": episodes,
+             "avg_grader_score": avg_s, "avg_cumulative_reward": avg_r}


+ def run_task2(n: int = NUM_EPISODES) -> Dict[str, Any]:
      print("\n" + "="*60)
+     print("TASK 2: Property Discovery")
      print("="*60)
+     env = Task2Environment()
+     episodes = [run_t2_episode(env, SEED_BASE_T2 + i, i+1) for i in range(n)]
+     avg_s = sum(e["grader_score"] for e in episodes) / n
+     avg_r = sum(e["cumulative_reward"] for e in episodes) / n
+     print(f"\n  Avg grader score : {avg_s:.3f}")
+     print(f"  Avg cum reward   : {avg_r:.2f}")
+     return {"task_id": "task2_property_discovery", "name": "Property Discovery",
+             "status": "active", "num_episodes": n, "episodes": episodes,
+             "avg_grader_score": avg_s, "avg_cumulative_reward": avg_r}


  def run_task3_placeholder() -> Dict[str, Any]:
      print("\n" + "="*60)
      print("TASK 3: Rule Checker [PLACEHOLDER — not implemented]")
      print("="*60)
      print("  Skipping. Score: 0.0")
+     return {"task_id": "task3_rule_checker", "name": "Rule Checker",
+             "status": "placeholder", "num_episodes": 0, "episodes": [],
+             "avg_grader_score": 0.0, "avg_cumulative_reward": 0.0}


+ # ─────────────────────────────────────────────────────────────────────────────
  # Main
+ # ─────────────────────────────────────────────────────────────────────────────

  def main():
      print("Smart Contract Audit RL Environment — Baseline Inference")
      print(f"Model: {MODEL_NAME} | Base URL: {API_BASE_URL}")

+     t1 = run_task1(NUM_EPISODES)
+     t2 = run_task2(NUM_EPISODES)
      t3 = run_task3_placeholder()

+     results = {
+         "model": MODEL_NAME, "base_url": API_BASE_URL,
+         "tasks": [t1, t2, t3],
+     }

+     active = [t for t in results["tasks"] if t["status"] == "active"]
+     overall = sum(t["avg_grader_score"] for t in active) / len(active) if active else 0.0
      results["overall_avg_score"] = overall

      print("\n" + "="*60)
      print("BASELINE SUMMARY")
      print("="*60)
      for t in results["tasks"]:
+         icon = "✅" if t["status"] == "active" else "⏳"
+         print(f"  {icon} {t['name']:40s}: {t['avg_grader_score']:.3f}")
+     print(f"\n  Overall (active tasks): {overall:.3f}")

      with open("baseline_scores.json", "w") as f:
          json.dump(results, f, indent=2)
      print("\n  Scores written to baseline_scores.json")
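The two episode runners above collapse raw terminal rewards into normalised grader scores in slightly different ways: Task 1 maps the terminal reward onto a discrete {0.0, 0.5, 1.0} scale, while Task 2 divides its 0–5 reward by 5. A minimal sketch of both mappings (the helper names are illustrative, not part of the repo):

```python
def t1_grader_score(terminal_reward: float) -> float:
    """Correct submission (+5.0) -> 1.0, partial (+1.0) -> 0.5, otherwise 0.0."""
    if terminal_reward >= 4.9:
        return 1.0
    if terminal_reward >= 0.9:
        return 0.5
    return 0.0


def t2_grader_score(terminal_reward: float) -> float:
    """Normalise Task 2's 0-5 keyword-weighted terminal reward into [0, 1]."""
    return round(terminal_reward / 5.0, 3) if terminal_reward > 0 else 0.0


print(t1_grader_score(5.0), t1_grader_score(1.0), t1_grader_score(-1.5))  # 1.0 0.5 0.0
print(t2_grader_score(3.5))  # 0.7
```

The `>= 4.9` / `>= 0.9` thresholds deliberately sit just below the exact terminal rewards (+5.0 / +1.0) so that floating-point accumulation cannot drop a correct submission into the lower bucket.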
openenv.yaml CHANGED
@@ -1,25 +1,22 @@
  name: smart-contract-audit-env
- version: "1.0.0"
  description: >
    Reinforcement learning environment for smart contract security analysis.
    Agents interact with real-world Solidity contract data from Certora-audited
-   projects, learning to detect vulnerabilities, discover properties, and
-   verify rule compliance — tasks that professional auditors perform daily.

  author: "SmartAudit Team"
  license: MIT

- # ---------------------------------------------------------------------------
- # Tasks
- # ---------------------------------------------------------------------------
  tasks:
    - id: task1_vuln_detection
      name: Targeted Vulnerability Detection
      difficulty: medium
      status: active
      description: >
-       Given a Solidity contract (4–6 functions), identify the single vulnerable
-       function and describe its vulnerability type in 2–3 words.
      max_steps: 20
      reward_range: [-10.0, 10.0]
      grader: tasks/task1/grader.py
@@ -28,13 +25,13 @@ tasks:
    - id: task2_property_discovery
      name: Property Discovery
      difficulty: hard
-     status: placeholder
      description: >
        Given a single Solidity function with known properties, discover the
-       correct natural-language property describing its expected behaviour.
      max_steps: 15
      reward_range: [-5.0, 5.0]
-     grader: tasks/task2/grader.py  # TODO: implement
      grader_score_range: [0.0, 1.0]

    - id: task3_rule_checker
@@ -46,81 +43,52 @@
      function that violates that property.
      max_steps: 15
      reward_range: [-5.0, 5.0]
-     grader: tasks/task3/grader.py  # TODO: implement
      grader_score_range: [0.0, 1.0]

- # ---------------------------------------------------------------------------
- # Observation space
- # ---------------------------------------------------------------------------
  observation_space:
    type: object
    properties:
-     task_id:
-       type: string
-       description: Active task identifier
-     contract_name:
-       type: string
-       description: Name of the Solidity contract
-     contract_description:
-       type: string
-       description: Human-readable description of what the contract does
-     available_actions:
-       type: array
-       items:
-         type: string
-       description: List of valid action type strings
-     last_action:
-       type: string
-       nullable: true
-       description: The action type that produced this observation
-     last_action_result:
-       type: string
-       nullable: true
-       description: Human-readable result of the last action
-     step_count:
-       type: integer
-       description: Number of steps taken in this episode
-     cumulative_reward:
-       type: number
-       description: Running reward total for this episode
-     done:
-       type: boolean
-       description: True when the episode has ended
-     extra:
-       type: object
-       description: Task-specific hints and auxiliary data

- # ---------------------------------------------------------------------------
- # Action space (Task 1)
- # ---------------------------------------------------------------------------
  action_space:
-   type: object
-   description: Named action with optional parameters
-   properties:
-     action_type:
-       type: string
-       enum:
-         - list_functions
-         - get_function_code
-         - get_function_summary
-         - get_file_metadata
-         - get_state_variable
-         - get_call_graph
-         - submit
-     params:
-       type: object
-       description: Key-value arguments for the action

- # ---------------------------------------------------------------------------
- # Reward function
- # ---------------------------------------------------------------------------
  reward:
    type: shaped
    description: >
-     Per-step costs encourage efficient exploration. A positive signal is given
-     when the agent accesses the actual vulnerable function. Terminal rewards
-     reflect submission accuracy (0 → 1 grader score).
-   shaping:
      list_functions: -0.05
      get_function_code_wrong: -0.10
      get_function_code_correct: +0.05
@@ -130,19 +98,28 @@ reward:
      get_state_variable: -0.05
      get_call_graph: -0.08
      repeated_query: -0.40
-   terminal:
      correct_submission: +5.0
      partial_submission: +1.0
      wrong_submission: -1.5

- # ---------------------------------------------------------------------------
- # Data
- # ---------------------------------------------------------------------------
  data:
-   source: "Certora audited projects (Aave, Compound-style protocols)"
    format: JSON
    num_contracts: 4
    num_vulnerable_functions: 8
    vulnerability_types:
      - Reentrancy
      - Missing access control
@@ -153,17 +130,16 @@ data:
      - Denial of service (unbounded loop)
      - Unchecked return value

- # ---------------------------------------------------------------------------
- # Interface
- # ---------------------------------------------------------------------------
  interface:
    http:
-     reset: POST /reset
-     step: POST /step
-     state: GET /state
-     tasks: GET /tasks
-     health: GET /health
    python:
-     reset: env.reset(seed=None) -> ResetResult
-     step: env.step(action) -> StepResult
-     state: env.state() -> StateResult
  name: smart-contract-audit-env
+ version: "1.1.0"
  description: >
    Reinforcement learning environment for smart contract security analysis.
    Agents interact with real-world Solidity contract data from Certora-audited
+   projects, learning to detect vulnerabilities and discover correctness
+   properties — tasks that professional auditors perform daily.

  author: "SmartAudit Team"
  license: MIT

  tasks:
    - id: task1_vuln_detection
      name: Targeted Vulnerability Detection
      difficulty: medium
      status: active
      description: >
+       Given a Solidity contract (4-6 functions), identify the single vulnerable
+       function and describe its vulnerability type in 2-3 words.
      max_steps: 20
      reward_range: [-10.0, 10.0]
      grader: tasks/task1/grader.py

    - id: task2_property_discovery
      name: Property Discovery
      difficulty: hard
+     status: active
      description: >
        Given a single Solidity function with known properties, discover the
+       correct natural-language postcondition describing its correct behaviour.
      max_steps: 15
      reward_range: [-5.0, 5.0]
+     grader: tasks/task2/grader.py
      grader_score_range: [0.0, 1.0]

    - id: task3_rule_checker
      function that violates that property.
      max_steps: 15
      reward_range: [-5.0, 5.0]
+     grader: tasks/task3/grader.py
      grader_score_range: [0.0, 1.0]

  observation_space:
    type: object
    properties:
+     task_id: {type: string, description: Active task identifier}
+     contract_name: {type: string, description: Solidity contract name}
+     contract_description: {type: string, description: Human-readable contract description}
+     available_actions: {type: array, items: {type: string}, description: Valid action types}
+     last_action: {type: string, nullable: true}
+     last_action_result: {type: string, nullable: true}
+     step_count: {type: integer}
+     cumulative_reward: {type: number}
+     done: {type: boolean}
+     extra: {type: object, description: Task-specific hints}

  action_space:
+   task1:
+     type: object
+     actions:
+       list_functions: {params: {}, reward: -0.05}
+       get_function_code: {params: {function_name: string}, reward: "+0.05 / -0.10"}
+       get_function_summary: {params: {function_name: string}, reward: "+0.03 / -0.05"}
+       get_file_metadata: {params: {}, reward: -0.04}
+       get_state_variable: {params: {variable_name: "string (opt)"}, reward: -0.05}
+       get_call_graph: {params: {}, reward: -0.08}
+       submit: {params: {function_name: str, vulnerability_type: str}, reward: "+5.0 / +1.0 / -1.5"}
+   task2:
+     type: object
+     actions:
+       get_function_code: {params: {}, reward: -0.06}
+       get_function_natspec: {params: {}, reward: -0.08}
+       get_file_natspec: {params: {}, reward: -0.03}
+       get_related_functions: {params: {}, reward: -0.06}
+       get_io: {params: {}, reward: -0.04}
+       get_similar_rule: {params: {}, reward: -0.20}
+       submit_property: {params: {property: string}, reward: "0.0–5.0 (keyword-weighted)"}

  reward:
    type: shaped
    description: >
+     Per-step costs encourage efficient exploration. Positive shaping rewards
+     fire when the agent inspects the actual target. Terminal rewards reflect
+     grader score accuracy.
+   task1_shaping:
      list_functions: -0.05
      get_function_code_wrong: -0.10
      get_function_code_correct: +0.05
      get_state_variable: -0.05
      get_call_graph: -0.08
      repeated_query: -0.40
+   task1_terminal:
      correct_submission: +5.0
      partial_submission: +1.0
      wrong_submission: -1.5
+   task2_shaping:
+     get_function_code: -0.06
+     get_function_natspec: -0.08
+     get_file_natspec: -0.03
+     get_related_functions: -0.06
+     get_io: -0.04
+     get_similar_rule: -0.20
+     repeated_query: -0.40
+   task2_terminal:
+     score_range: [0.0, 5.0]
+     formula: "score * 5.0 where score = 0.70*(key_matches/total_key) + 0.30*(bonus_matches/total_bonus)"

  data:
+   source: "Certora audited DeFi projects"
    format: JSON
    num_contracts: 4
    num_vulnerable_functions: 8
+   num_property_functions: 11
    vulnerability_types:
      - Reentrancy
      - Missing access control
      - Denial of service (unbounded loop)
      - Unchecked return value

  interface:
    http:
+     reset: POST /reset
+     step: POST /step
+     state: GET /state
+     tasks: GET /tasks
+     health: GET /health
+     action_space: GET /action_space?task_id=<id>
+     observation_space: GET /observation_space
    python:
+     reset: env.reset(seed=None) -> ResetResult
+     step: env.step(action) -> StepResult
+     state: env.state() -> StateResult
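The `task2_terminal` formula declared above can be written out directly. A sketch under the stated weights (the function name and count arguments are illustrative stand-ins for whatever keyword counts the grader produces):

```python
def task2_terminal_reward(key_matches: int, total_key: int,
                          bonus_matches: int, total_bonus: int) -> float:
    """Terminal reward = 5.0 * (0.70 * key fraction + 0.30 * bonus fraction)."""
    # Guard against empty keyword sets so the formula never divides by zero.
    key_frac = key_matches / total_key if total_key else 0.0
    bonus_frac = bonus_matches / total_bonus if total_bonus else 0.0
    score = 0.70 * key_frac + 0.30 * bonus_frac
    return score * 5.0
```

With these weights, matching every key phrase but no bonus phrases caps the reward at 3.5 of 5.0, which is why the baseline's normalised grader score for a keywords-only property tops out at 0.7.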
tasks/task2/__init__.py CHANGED
@@ -1,27 +1,5 @@
- """
- tasks/task2/__init__.py
- -----------------------
- Task 2: Property Discovery (PLACEHOLDER)

- TODO: Implement this task.
-
- Episode setup:
-   - One function from a Solidity file with known properties
-   - Agent must discover the natural-language property of the function
-
- Actions (to implement):
-   - get_similar_rule      : -0.20
-   - get_file_natspec      : -0.03
-   - get_function_natspec  : -0.08
-   - get_function_code     : -0.06
-   - get_related_functions : -0.06
-   - get_io                : -0.04
-   - submit_property       : scored 0.0–5.0 by semantic similarity grader
-
- See README.md for full task specification.
- """
-
- # TODO: Task 2 – Property Discovery
- # from tasks.task2.environment import Task2Environment
-
- __all__: list = []

+ # Task 2: Property Discovery
+ from tasks.task2.environment import Task2Environment
+ from tasks.task2.grader import Task2Grader
+
+ __all__ = ["Task2Environment", "Task2Grader"]
tasks/task2/environment.py ADDED
@@ -0,0 +1,340 @@
+ """
+ environment.py (Task 2 – Property Discovery)
+ --------------------------------------------
+ OpenEnv-compliant RL environment.
+
+ Episode setup:
+     - One function from a Solidity contract that has a known property.
+     - The agent sees: contract description + function name + function signature.
+     - The agent must discover the natural-language property of the function.
+
+ Actions & rewards:
+     get_function_code      -0.06  (full source — always useful context)
+     get_function_natspec   -0.08  (strongest hint — natspec has param/return docs)
+     get_file_natspec       -0.03  (broad contract-level context)
+     get_related_functions  -0.06  (shows callers/callees)
+     get_io                 -0.04  (structured input/output description)
+     get_similar_rule       -0.20  (shows a similar property from another contract)
+     submit_property        scored 0–5 (ONE attempt, ends episode)
+     repeated_query         -0.40
+
+ Episode ends when:
+     - submit_property is called (scored), OR
+     - max_steps is reached without submission (reward = -1.0)
+ """
+
+ from __future__ import annotations
+
+ import random
+ from typing import Any, Dict, List, Optional, Set
+
+ from data.data_loader import (
+     load_contracts,
+     sample_property_episode,
+     get_function_by_name,
+     get_related_functions,
+     get_similar_rule,
+ )
+ from env.base_env import BaseEnv
+ from env.schemas import (
+     Action,
+     ActionType,
+     Observation,
+     Reward,
+     ResetResult,
+     StateResult,
+     StepResult,
+ )
+ from tasks.task2.grader import Task2Grader
+
+ TASK_ID = "task2_property_discovery"
+ MAX_STEPS = 15
+
+ AVAILABLE_ACTIONS = [
+     ActionType.GET_FUNCTION_CODE,
+     ActionType.GET_FUNCTION_NATSPEC,
+     ActionType.GET_FILE_NATSPEC,
+     ActionType.GET_RELATED_FUNCTIONS,
+     ActionType.GET_IO,
+     ActionType.GET_SIMILAR_RULE,
+     ActionType.SUBMIT_PROPERTY,
+ ]
+
+
+ class Task2Environment(BaseEnv):
+     """Task 2: Property Discovery."""
+
+     def __init__(self, contracts_path: Optional[str] = None) -> None:
+         self._contracts = load_contracts(contracts_path) if contracts_path else load_contracts()
+         self._rng = random.Random()
+
+         # Episode state – initialised by reset()
+         self._contract: Dict[str, Any] = {}
+         self._target_fn: Dict[str, Any] = {}
+         self._grader: Optional[Task2Grader] = None
+         self._step_count: int = 0
+         self._cum_reward: float = 0.0
+         self._done: bool = False
+         self._submitted: bool = False  # only one submit_property allowed
+         self._query_hist: List[str] = []
+         self._seen: Set[str] = set()
+
+     # ── OpenEnv interface ────────────────────────────────────────────────────
+
+     def reset(self, seed: Optional[int] = None) -> ResetResult:
+         if seed is not None:
+             self._rng.seed(seed)
+
+         self._contract, self._target_fn = sample_property_episode(
+             self._contracts, self._rng
+         )
+         self._grader = Task2Grader(
+             function_name=self._target_fn["name"],
+             property_data=self._target_fn["property"],
+         )
+         self._step_count = 0
+         self._cum_reward = 0.0
+         self._done = False
+         self._submitted = False
+         self._query_hist = []
+         self._seen = set()
+
+         obs = self._build_obs(
+             last_action=None,
+             last_result=(
+                 f"New episode started.\n"
+                 f"Contract : {self._contract['contract_name']}\n"
+                 f"Function : {self._target_fn['name']} "
+                 f"({self._target_fn.get('signature', '')})\n"
+                 f"Your task : Discover the natural-language property of "
+                 f"'{self._target_fn['name']}' and submit it with submit_property."
+             ),
+         )
+         return ResetResult(observation=obs, info={"task_id": TASK_ID})
+
+     def step(self, action: Action) -> StepResult:
+         if self._done:
+             raise RuntimeError("Episode is done. Call reset() to start a new episode.")
+
+         self._step_count += 1
+         result_text, reward = self._dispatch(action)
+         self._cum_reward += reward.value
+         self._query_hist.append(f"[{action.action_type}] → {result_text[:100]}")
+
+         obs = self._build_obs(
+             last_action=action.action_type,
+             last_result=result_text,
+         )
+         return StepResult(
+             observation=obs,
+             reward=reward,
+             done=self._done,
+             info={
+                 "step": self._step_count,
+                 "cumulative_reward": self._cum_reward,
+             },
+         )
+
+     def state(self) -> StateResult:
+         return StateResult(
+             task_id=TASK_ID,
+             contract_name=self._contract.get("contract_name", ""),
+             target_function=self._target_fn.get("name"),
+             step_count=self._step_count,
+             cumulative_reward=self._cum_reward,
+             done=self._done,
+             query_history=list(self._query_hist),
+         )
+
+     # ── Internal helpers ─────────────────────────────────────────────────────
+
+     def _build_obs(self, last_action: Optional[str], last_result: str) -> Observation:
+         return Observation(
+             task_id=TASK_ID,
+             contract_name=self._contract.get("contract_name", ""),
+             contract_description=self._contract.get("metadata", {}).get("description", ""),
+             available_actions=[a.value for a in AVAILABLE_ACTIONS],
+             last_action=last_action,
+             last_action_result=last_result,
+             step_count=self._step_count,
+             cumulative_reward=self._cum_reward,
+             done=self._done,
+             extra={
+                 "target_function": self._target_fn.get("name", ""),
+                 "target_signature": self._target_fn.get("signature", ""),
+                 "solidity_version": self._contract.get("metadata", {}).get("solidity_version", ""),
+                 "hint": (
+                     "Discover the property of the target function. "
+                     "Use get_function_code, get_function_natspec, or get_similar_rule for hints. "
+                     "Submit with submit_property, params={'property': '<your property text>'}. "
+                     "ONE submission attempt only."
+                 ),
+             },
+         )
+
+     def _qkey(self, at: str, params: Dict[str, Any]) -> str:
+         return f"{at}:{sorted(params.items())}"
+
+     def _is_repeated(self, key: str) -> bool:
+         if key in self._seen:
+             return True
+         self._seen.add(key)
+         return False
+
+     def _dispatch(self, action: Action) -> tuple[str, Reward]:
+         at = action.action_type
+         params = action.params
+         qkey = self._qkey(at, params)
+         fn = self._target_fn
+         name = fn["name"]
+
+         # ── get_function_code ────────────────────────────────────────────────
+         if at == ActionType.GET_FUNCTION_CODE:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             code = fn.get("code", "// no code available")
+             return (
+                 f"// {name}\n{code}",
+                 Reward(value=-0.06, reason="get_function_code cost"),
+             )
+
+         # ── get_function_natspec ─────────────────────────────────────────────
+         if at == ActionType.GET_FUNCTION_NATSPEC:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             natspec = fn.get("natspec") or fn.get("comment") or "No NatSpec available."
+             # Also include output_property if present
+             out_prop = fn.get("output_property", "")
+             result = f"NatSpec for '{name}':\n{natspec}"
+             if out_prop:
+                 result += f"\n\nExpected output: {out_prop}"
+             return result, Reward(value=-0.08, reason="get_function_natspec cost")
+
+         # ── get_file_natspec ─────────────────────────────────────────────────
+         if at == ActionType.GET_FILE_NATSPEC:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             meta = self._contract.get("metadata", {})
+             natspec = meta.get("natspec") or meta.get("description", "No file NatSpec available.")
+             return (
+                 f"File NatSpec for {self._contract['contract_name']}:\n{natspec}",
+                 Reward(value=-0.03, reason="get_file_natspec cost"),
+             )
+
+         # ── get_related_functions ────────────────────────────────────────────
+         if at == ActionType.GET_RELATED_FUNCTIONS:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             related = get_related_functions(self._contract, name)
+             if not related:
+                 text = f"No related functions found for '{name}'."
+             else:
+                 summaries = []
+                 for rn in related:
+                     rfn = get_function_by_name(self._contract, rn)
+                     if rfn:
+                         sig = rfn.get("signature", rn)
+                         comment = rfn.get("comment", "")
+                         summaries.append(f"  • {sig} — {comment}")
+                 text = f"Related functions for '{name}':\n" + "\n".join(summaries)
+             return text, Reward(value=-0.06, reason="get_related_functions cost")
+
+         # ── get_io ───────────────────────────────────────────────────────────
+         if at == ActionType.GET_IO:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             params_list = fn.get("parameters", [])
+             returns = fn.get("returns", "") or "void"
+             out_prop = fn.get("output_property", "")
+             visibility = fn.get("visibility", "")
+             modifiers = fn.get("modifiers", [])
+
+             lines = [f"Function: {fn.get('signature', name)}"]
+             lines.append(f"Visibility: {visibility}" + (f"  Modifiers: {', '.join(modifiers)}" if modifiers else ""))
+             if params_list:
+                 lines.append("Parameters:")
+                 for p in params_list:
+                     lines.append(f"  • {p['type']} {p['name']}: {p.get('description', '')}")
+             else:
+                 lines.append("Parameters: none (payable)" if "payable" in fn.get("code", "") else "Parameters: none")
+             lines.append(f"Returns: {returns}")
+             if out_prop:
+                 lines.append(f"Expected behaviour: {out_prop}")
+             return "\n".join(lines), Reward(value=-0.04, reason="get_io cost")
+
+         # ── get_similar_rule ─────────────────────────────────────────────────
+         if at == ActionType.GET_SIMILAR_RULE:
+             if self._is_repeated(qkey):
+                 return "Repeated query.", Reward(value=-0.40, reason="Repeated query")
+             sr = get_similar_rule(
+                 self._contracts,
+                 self._contract["contract_name"],
+                 name,
+             )
+             if sr is None:
275
+ return (
276
+ "No similar rule available for this function.",
277
+ Reward(value=-0.20, reason="get_similar_rule cost (not found)"),
278
+ )
279
+ lines = [
280
+ f"Similar property from {sr['contract_name']}.{sr['function_name']}():",
281
+ f" {sr['property_hint']}",
282
+ ]
283
+ if sr.get("natspec"):
284
+ lines.append(f"\nFunction NatSpec:\n {sr['natspec']}")
285
+ return "\n".join(lines), Reward(value=-0.20, reason="get_similar_rule cost")
286
+
287
+ # ── submit_property ──────────────────────────────────────────────────
288
+ if at == ActionType.SUBMIT_PROPERTY:
289
+ if self._submitted:
290
+ return (
291
+ "❌ You have already submitted a property for this episode. "
292
+ "Only one submission is allowed.",
293
+ Reward(value=-1.0, reason="Second submit_property attempt", partial=False),
294
+ )
295
+ submitted_text = params.get("property", "").strip()
296
+ if not submitted_text:
297
+ return (
298
+ "Submit requires 'property' key in params with a non-empty string.",
299
+ Reward(value=-0.5, reason="Empty property submission"),
300
+ )
301
+
302
+ self._submitted = True
303
+ self._done = True
304
+
305
+ score = self._grader.grade(submitted_text)
306
+ reward = self._grader.reward_for_score(score)
307
+ bd = self._grader.breakdown(submitted_text)
308
+
309
+ pct = int(score * 100)
310
+ if score >= 0.85:
311
+ emoji = "βœ…"
312
+ label = "EXCELLENT"
313
+ elif score >= 0.60:
314
+ emoji = "🟑"
315
+ label = "GOOD"
316
+ elif score >= 0.35:
317
+ emoji = "🟠"
318
+ label = "PARTIAL"
319
+ else:
320
+ emoji = "❌"
321
+ label = "POOR"
322
+
323
+ msg = (
324
+ f"{emoji} {label} β€” Score: {score:.2f}/1.00 β†’ Reward: {reward:.2f}/5.00 ({pct}%)\n"
325
+ f"Key concepts matched : {len(bd['key_matched'])}/{len(bd['key_matched'])+len(bd['key_missed'])} "
326
+ f"{bd['key_matched']}\n"
327
+ f"Bonus concepts matched : {len(bd['bonus_matched'])}/{len(bd['bonus_matched'])+len(bd['bonus_missed'])} "
328
+ f"{bd['bonus_matched']}"
329
+ )
330
+ return msg, Reward(
331
+ value=reward,
332
+ reason=f"Property submission score={score:.3f}",
333
+ partial=False,
334
+ )
335
+
336
+ # ── unknown action ────────────────────────────────────────────────────
337
+ return (
338
+ f"Unknown action type: '{at}'. Valid: {[a.value for a in AVAILABLE_ACTIONS]}",
339
+ Reward(value=-0.10, reason="Unknown action"),
340
+ )
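The `_dispatch` hunk above fixes the episode economics: each browse action pays a small negative shaping reward, any repeated query pays -0.40, and the single `submit_property` pays up to +5.0. A minimal standalone sketch of that accounting, with the per-action values copied from the hunk (the `episode_reward` helper is illustrative only, not part of the environment API):

```python
# Per-query shaping costs, as wired into _dispatch above.
QUERY_COST = {
    "get_function_code": -0.06,
    "get_function_natspec": -0.08,
    "get_file_natspec": -0.03,
    "get_related_functions": -0.06,
    "get_io": -0.04,
    "get_similar_rule": -0.20,
}
REPEAT_PENALTY = -0.40  # charged instead of the normal cost on a repeat


def episode_reward(queries, final_score):
    """Cumulative reward: query costs plus the terminal score x 5.0."""
    seen, total = set(), 0.0
    for q in queries:
        total += REPEAT_PENALTY if q in seen else QUERY_COST[q]
        seen.add(q)
    return total + final_score * 5.0


# Two cheap queries, one wasteful repeat, then a 0.85-scoring submission.
print(round(episode_reward(["get_io", "get_function_code", "get_function_code"], 0.85), 2))  # → 3.75
```

The repeat penalty dominates the honest query costs, so an agent that re-reads the same view loses more than it would by simply remembering the first answer.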
tasks/task2/grader.py ADDED
@@ -0,0 +1,171 @@
+"""
+grader.py (Task 2 – Property Discovery)
+-----------------------------------------
+Deterministic scorer for natural-language property submissions.
+
+Score formula
+─────────────
+    key_phrases   weight = 0.70
+    bonus_phrases weight = 0.30
+
+    score = 0.70 * (matched_key / total_key)
+          + 0.30 * (matched_bonus / total_bonus)
+
+Phrase matching
+───────────────
+A phrase is considered matched if ALL its words (after normalisation)
+appear in the submitted text. This is intentionally lenient — it
+doesn't require the words to be adjacent, so "balance increases by
+msg.value" is matched by "the caller's vault balance increases by the
+sent msg.value amount".
+
+Synonym expansion allows common paraphrases to also match
+(e.g. "caller" → "msg.sender", "sender", "user").
+
+Terminal reward = score × 5.0 (range: 0.0 – 5.0)
+One submission attempt per episode.
+"""
+
+from __future__ import annotations
+
+import string
+from typing import Dict, List, Optional
+
+
+# ── Text normalisation ────────────────────────────────────────────────────────
+
+_PUNCT = str.maketrans("", "", string.punctuation)
+
+
+def _norm(text: str) -> str:
+    """Lowercase, strip punctuation, collapse whitespace → word string."""
+    return " ".join(text.lower().translate(_PUNCT).split())
+
+
+def _word_set(text: str) -> set:
+    """Return normalised set of words."""
+    return set(_norm(text).split())
+
+
+# ── Synonym table ─────────────────────────────────────────────────────────────
+# Maps canonical word → list of accepted synonyms (all lowercase, no punct).
+
+SYNONYMS: Dict[str, List[str]] = {
+    "caller": ["caller", "sender", "user", "msgsender", "msg sender"],
+    "balance": ["balance", "holdings", "amount held"],
+    "increases": ["increases", "incremented", "added", "grows", "rise"],
+    "decreases": ["decreases", "decremented", "reduced", "subtracted", "falls"],
+    "transfers": ["transfers", "sends", "moved", "forwarded", "sent"],
+    "reverts": ["reverts", "fails", "rejected", "throws", "require"],
+    "zero": ["zero", "0", "nothing", "none", "empty"],
+    "owner": ["owner", "admin", "authorized"],
+    "entire": ["entire", "full", "whole", "all", "total"],
+    "returned": ["returned", "sent back", "refunded", "transferred back"],
+    "reset": ["reset", "zeroed", "set to zero", "cleared"],
+    "only": ["only", "exclusively", "restricted"],
+    "price": ["price", "cost", "rate"],
+    "tokens": ["tokens", "token amount"],
+    "rewards": ["rewards", "reward tokens", "accrued"],
+    "staked": ["staked", "deposited", "locked"],
+    "winner": ["winner", "winning bidder", "successful bidder"],
+}
+
+
+def _expand_words(phrase_words: List[str]) -> List[List[str]]:
+    """
+    For each word in the phrase, generate synonym variants.
+    Returns a list of word-list variants to try.
+    Only substitutes ONE word at a time to avoid combinatorial explosion.
+    """
+    variants = [phrase_words]  # original
+    for i, word in enumerate(phrase_words):
+        if word in SYNONYMS:
+            for syn in SYNONYMS[word]:
+                syn_words = _norm(syn).split()
+                new_variant = phrase_words[:i] + syn_words + phrase_words[i + 1:]
+                variants.append(new_variant)
+    return variants
+
+
+def _phrase_matched(text_words: set, phrase: str) -> bool:
+    """
+    True if ALL words in the phrase (or a synonym variant) appear in text_words.
+    Uses word-set containment, not substring adjacency.
+    """
+    norm_words = _norm(phrase).split()
+    for variant_words in _expand_words(norm_words):
+        if all(w in text_words for w in variant_words):
+            return True
+    return False
+
+
+# ── Grader ────────────────────────────────────────────────────────────────────
+
+class Task2Grader:
+    """
+    Grades a Task 2 property submission.
+
+    Parameters
+    ----------
+    function_name : name of the target function
+    property_data : the 'property' dict from the dataset
+                    Must have: natural_language, key_phrases, bonus_phrases
+    """
+
+    KEY_WEIGHT = 0.70
+    BONUS_WEIGHT = 0.30
+
+    def __init__(self, function_name: str, property_data: Dict) -> None:
+        self.function_name = function_name
+        self.natural_language = property_data.get("natural_language", "")
+        self.key_phrases = property_data.get("key_phrases", [])
+        self.bonus_phrases = property_data.get("bonus_phrases", [])
+
+    # ── Public API ────────────────────────────────────────────────────────────
+
+    def grade(self, submitted: str) -> float:
+        """Deterministic score in [0.0, 1.0]."""
+        if not submitted or not submitted.strip():
+            return 0.0
+        tw = _word_set(submitted)
+        key_score = self._phrase_score(tw, self.key_phrases)
+        bonus_score = self._phrase_score(tw, self.bonus_phrases)
+        raw = self.KEY_WEIGHT * key_score + self.BONUS_WEIGHT * bonus_score
+        return round(min(max(raw, 0.0), 1.0), 4)
+
+    def reward_for_score(self, score: float) -> float:
+        """Maps [0.0, 1.0] → [0.0, 5.0]."""
+        return round(score * 5.0, 4)
+
+    def breakdown(self, submitted: str) -> Dict:
+        """Detailed scoring breakdown for debugging."""
+        tw = _word_set(submitted)
+        key_hits = [p for p in self.key_phrases if _phrase_matched(tw, p)]
+        bonus_hits = [p for p in self.bonus_phrases if _phrase_matched(tw, p)]
+        score = self.grade(submitted)
+        return {
+            "score": score,
+            "reward": self.reward_for_score(score),
+            "key_matched": key_hits,
+            "key_missed": [p for p in self.key_phrases if p not in key_hits],
+            "bonus_matched": bonus_hits,
+            "bonus_missed": [p for p in self.bonus_phrases if p not in bonus_hits],
+            "key_score": self._phrase_score(tw, self.key_phrases),
+            "bonus_score": self._phrase_score(tw, self.bonus_phrases),
+        }
+
+    def get_canonical_answer(self) -> Dict:
+        """For debugging / logging only."""
+        return {
+            "function": self.function_name,
+            "natural_language": self.natural_language,
+            "key_phrases": self.key_phrases,
+        }
+
+    # ── Internal ──────────────────────────────────────────────────────────────
+
+    def _phrase_score(self, text_words: set, phrases: List[str]) -> float:
+        if not phrases:
+            return 1.0
+        matched = sum(1 for p in phrases if _phrase_matched(text_words, p))
+        return matched / len(phrases)
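The weighted word-set rule in grader.py is easy to sanity-check in isolation. A self-contained sketch of the core scoring idea (the bare `words`/`matched`/`score` helpers are illustrative stand-ins, not the module's API, and they assume non-empty phrase lists, the case the real `_phrase_score` guards against):

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def words(text):
    # Normalise like the grader: lowercase, drop punctuation, split to a set.
    return set(text.lower().translate(_PUNCT).split())


def matched(text_words, phrase):
    # Lenient containment: every word of the phrase appears somewhere.
    return all(w in text_words for w in words(phrase))


def score(submission, key_phrases, bonus_phrases):
    # 0.70 / 0.30 weighting of key vs. bonus phrase hit rates.
    tw = words(submission)
    key = sum(matched(tw, p) for p in key_phrases) / len(key_phrases)
    bonus = sum(matched(tw, p) for p in bonus_phrases) / len(bonus_phrases)
    return 0.70 * key + 0.30 * bonus


s = score(
    "After deposit, the caller's balance increases by msg.value.",
    key_phrases=["balance increases", "msg.value"],
    bonus_phrases=["caller"],
)
print(round(s, 2))  # → 0.7
```

Both key phrases match ("msg.value." normalises to "msgvalue", and non-adjacent words are fine), but "caller's" normalises to "callers" and misses the bare-word bonus phrase: exactly the paraphrase gap the `SYNONYMS` table in the real grader is there to close.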
validate.py CHANGED
@@ -1,290 +1,280 @@
 """
 validate.py
 -----------
-Pre-submission validation script.
-Checks all OpenEnv spec requirements locally before submitting.

-Usage:
-    python validate.py
-
-Exit code 0 = all checks pass.
-Exit code 1 = one or more checks failed.
 """

-import json
-import sys
-import traceback
 from typing import Callable, List, Tuple

-# ─────────────────────────────────────────────────────────────────────────────
-# Helpers
-# ─────────────────────────────────────────────────────────────────────────────
-
-PASS = "✅"
-FAIL = "❌"
-SKIP = "⏭ "
 results: List[Tuple[str, bool, str]] = []

-
 def check(name: str, fn: Callable[[], None]) -> None:
     try:
-        fn()
-        results.append((name, True, ""))
         print(f"  {PASS} {name}")
     except Exception as e:
-        tb = traceback.format_exc(limit=3)
         results.append((name, False, str(e)))
-        print(f"  {FAIL} {name}")
-        print(f"      {e}")
-

-# ─────────────────────────────────────────────────────────────────────────────
-# Checks
-# ─────────────────────────────────────────────────────────────────────────────

 def check_imports():
-    from env.schemas import Observation, Action, Reward, StepResult, ResetResult, StateResult
     from tasks.task1.environment import Task1Environment
     from tasks.task1.grader import Task1Grader
     from data.data_loader import load_contracts

-
 def check_openenv_yaml():
     import yaml
-    with open("openenv.yaml") as f:
-        spec = yaml.safe_load(f)
     assert "name" in spec
-    assert "tasks" in spec
-    assert len(spec["tasks"]) >= 3, "Need at least 3 tasks defined"
     assert "observation_space" in spec
     assert "action_space" in spec
     assert "reward" in spec

-
 def check_pydantic_models():
     from env.schemas import Observation, Action, ActionType, Reward, StepResult, ResetResult, StateResult
-    # Instantiate each model
-    obs = Observation(
-        task_id="t1", contract_name="C", contract_description="D",
-        available_actions=["submit"]
-    )
     assert obs.task_id == "t1"
-
-    action = Action(action_type=ActionType.LIST_FUNCTIONS)
-    assert action.action_type == ActionType.LIST_FUNCTIONS
-
-    reward = Reward(value=1.0, reason="test")
-    assert reward.value == 1.0
-
-    step = StepResult(observation=obs, reward=reward, done=False)
-    assert not step.done
-
-    reset = ResetResult(observation=obs)
-    assert reset.observation.task_id == "t1"
-
-    state = StateResult(task_id="t1", contract_name="C", step_count=0,
-                        cumulative_reward=0.0, done=False)
-    assert state.step_count == 0
-

 def check_data_loading():
-    from data.data_loader import load_contracts, get_all_vulnerable_entries
     contracts = load_contracts()
-    assert len(contracts) >= 1, "No contracts loaded"
-    entries = get_all_vulnerable_entries(contracts)
-    assert len(entries) >= 3, f"Need >= 3 vulnerable functions, got {len(entries)}"
-    for contract, fn in entries:
-        assert fn.get("vulnerable") is True
-        assert fn.get("vulnerability_details") is not None
-        assert "issue" in fn["vulnerability_details"]
-
-
-def check_env_reset():
-    from tasks.task1.environment import Task1Environment
-    env = Task1Environment()
-    result = env.reset(seed=42)
-    assert result.observation is not None
-    assert result.observation.task_id == "task1_vuln_detection"
-    assert result.observation.contract_name != ""
-    assert not result.observation.done
-    assert result.observation.step_count == 0
-
-
-def check_env_step():
     from tasks.task1.environment import Task1Environment
     from env.schemas import Action, ActionType
     env = Task1Environment()
-    env.reset(seed=42)
-    result = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
-    assert result.observation is not None
-    assert isinstance(result.reward.value, float)
-    assert isinstance(result.done, bool)
-    assert "info" in result.model_dump()
-
-
-def check_env_state():
-    from tasks.task1.environment import Task1Environment
-    env = Task1Environment()
-    env.reset(seed=42)
-    state = env.state()
-    assert state.task_id == "task1_vuln_detection"
-    assert state.contract_name != ""
-    assert state.target_function is not None  # exposed for debugging
-

-def check_grader_scores_in_range():
     from tasks.task1.grader import Task1Grader
     cases = [
-        ("withdraw", "Reentrancy vulnerability", "withdraw", "reentrancy", 1.0),
-        ("withdraw", "Reentrancy vulnerability", "withdraw", "something else", 0.5),
-        ("withdraw", "Reentrancy vulnerability", "deposit", "reentrancy", 0.0),
     ]
     for tf, issue, sf, sv, expected in cases:
         g = Task1Grader(tf, issue)
         score = g.grade_submission(sf, sv)
-        assert 0.0 <= score <= 1.0, f"Score {score} out of range"
         assert abs(score - expected) < 0.01, f"Expected {expected}, got {score}"

-
-def check_grader_deterministic():
-    from tasks.task1.grader import Task1Grader
-    g = Task1Grader("withdraw", "Reentrancy vulnerability")
-    s1 = g.grade_submission("withdraw", "reentrancy")
-    s2 = g.grade_submission("withdraw", "reentrancy")
-    assert s1 == s2 == 1.0, "Grader must be deterministic"
-

 def check_reward_shaping():
-    """Verify reward is non-binary (multiple distinct values across steps)."""
-    from tasks.task1.environment import Task1Environment
     from env.schemas import Action, ActionType
-    env = Task1Environment()
     env.reset(seed=1)
-    rewards = set()
-    for at in [ActionType.LIST_FUNCTIONS, ActionType.GET_FILE_METADATA, ActionType.GET_CALL_GRAPH]:
-        r = env.step(Action(action_type=at))
-        rewards.add(round(r.reward.value, 4))
-    # Should have at least 2 distinct shaping reward values
-    assert len(rewards) >= 2, f"Expected multiple reward values, got {rewards}"
-

-def check_episode_boundary():
-    """Episode must end after submit and raise on subsequent step."""
     from tasks.task1.environment import Task1Environment
     from env.schemas import Action, ActionType
     env = Task1Environment()
     env.reset(seed=2)
-    env.step(Action(action_type=ActionType.SUBMIT, params={
-        "function_name": "withdraw", "vulnerability_type": "test"
-    }))
     try:
         env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
-        raise AssertionError("Should have raised RuntimeError after episode end")
     except RuntimeError:
-        pass  # Expected
-

 def check_repeated_query_penalty():
     from tasks.task1.environment import Task1Environment
     from env.schemas import Action, ActionType
-    env = Task1Environment()
-    env.reset(seed=3)
     env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
     r = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
-    assert r.reward.value == -0.40, f"Expected -0.40 for repeated query, got {r.reward.value}"
-

-def check_tasks_list():
-    """All three tasks must be listed (even if placeholders)."""
-    from tasks.task2 import __all__ as t2  # noqa
-    from tasks.task3 import __all__ as t3  # noqa

-def check_dockerfile_exists():
     import os
-    assert os.path.exists("Dockerfile"), "Dockerfile is missing"
-    with open("Dockerfile") as f:
-        content = f.read()
-    assert "7860" in content, "Dockerfile must EXPOSE 7860 (HF Spaces)"
-    assert "uvicorn" in content or "CMD" in content
-

 def check_inference_script():
     import os
-    assert os.path.exists("inference.py"), "inference.py is missing"
-    with open("inference.py") as f:
-        content = f.read()
-    assert "OPENAI_API_KEY" in content or "HF_TOKEN" in content, \
-        "inference.py must read API credentials from env vars"
-    assert "API_BASE_URL" in content
-    assert "MODEL_NAME" in content
-
-
-def check_baseline_json_schema():
-    """baseline_scores.json must have valid schema if it exists."""
     import os
-    if not os.path.exists("baseline_scores.json"):
-        return  # OK — file is generated at runtime
-    with open("baseline_scores.json") as f:
-        data = json.load(f)
     assert "tasks" in data
-    for task in data["tasks"]:
-        score = task["avg_grader_score"]
-        assert 0.0 <= score <= 1.0, f"Score {score} out of range"

-
-# ─────────────────────────────────────────────────────────────────────────────
-# Runner
-# ─────────────────────────────────────────────────────────────────────────────

 def main():
-    print("=" * 60)
-    print("OpenEnv Pre-Submission Validation")
-    print("=" * 60)
-
-    all_checks = [
-        ("Python imports", check_imports),
-        ("openenv.yaml format", check_openenv_yaml),
-        ("Pydantic model types", check_pydantic_models),
-        ("Dataset loading (3+ vulns)", check_data_loading),
-        ("env.reset() → ResetResult", check_env_reset),
-        ("env.step() → StepResult", check_env_step),
-        ("env.state() → StateResult", check_env_state),
-        ("Grader scores in [0.0, 1.0]", check_grader_scores_in_range),
-        ("Grader is deterministic", check_grader_deterministic),
-        ("Reward shaping (non-binary)", check_reward_shaping),
-        ("Episode boundary (done=True)", check_episode_boundary),
-        ("Repeated query penalty", check_repeated_query_penalty),
-        ("Task 2 & 3 placeholders", check_tasks_list),
-        ("Dockerfile exists + port", check_dockerfile_exists),
-        ("inference.py exists + vars", check_inference_script),
-        ("baseline_scores.json schema", check_baseline_json_schema),
-    ]
-
     print()
-    for name, fn in all_checks:
         check(name, fn)

-    print()
     passed = sum(1 for _, ok, _ in results if ok)
-    total = len(results)
-    failed = [(n, msg) for n, ok, msg in results if not ok]

-    print("=" * 60)
     print(f"Results: {passed}/{total} checks passed")
-
     if failed:
         print("\nFailed checks:")
-        for name, msg in failed:
-            print(f"  {FAIL} {name}: {msg}")
-        print()
-        print("❌ VALIDATION FAILED — fix the issues above before submitting.")
         sys.exit(1)
     else:
-        print()
-        print("✅ ALL CHECKS PASSED — ready to submit!")
         sys.exit(0)
-

 if __name__ == "__main__":
     main()
 
1
  """
2
  validate.py
3
  -----------
4
+ Pre-submission validation. Checks all OpenEnv spec requirements.
 
5
 
6
+ Usage: python validate.py
7
+ Exit 0 = all checks pass. Exit 1 = one or more failures.
 
 
 
8
  """
9
 
10
+ import json, sys, traceback
 
 
11
  from typing import Callable, List, Tuple
12
 
13
+ PASS = "βœ…"; FAIL = "❌"
 
 
 
 
 
 
14
  results: List[Tuple[str, bool, str]] = []
15
 
 
16
  def check(name: str, fn: Callable[[], None]) -> None:
17
  try:
18
+ fn(); results.append((name, True, ""))
 
19
  print(f" {PASS} {name}")
20
  except Exception as e:
 
21
  results.append((name, False, str(e)))
22
+ print(f" {FAIL} {name}\n {e}")
 
 
23
 
24
+ # ── Checks ────────────────────────────────────────────────────────────────────
 
 
25
 
26
  def check_imports():
27
+ from env.schemas import Observation, Action, Reward, StepResult, ResetResult, StateResult, ActionType
28
  from tasks.task1.environment import Task1Environment
29
  from tasks.task1.grader import Task1Grader
30
+ from tasks.task2.environment import Task2Environment
31
+ from tasks.task2.grader import Task2Grader
32
  from data.data_loader import load_contracts
33
 
 
34
  def check_openenv_yaml():
35
  import yaml
36
+ with open("openenv.yaml") as f: spec = yaml.safe_load(f)
 
37
  assert "name" in spec
38
+ assert len(spec.get("tasks", [])) >= 3
 
39
  assert "observation_space" in spec
40
  assert "action_space" in spec
41
  assert "reward" in spec
42
 
 
43
  def check_pydantic_models():
44
  from env.schemas import Observation, Action, ActionType, Reward, StepResult, ResetResult, StateResult
45
+ obs = Observation(task_id="t1", contract_name="C", contract_description="D", available_actions=["submit"])
 
 
 
 
46
  assert obs.task_id == "t1"
47
+ action = Action(action_type=ActionType.LIST_FUNCTIONS); assert action.action_type == ActionType.LIST_FUNCTIONS
48
+ action2 = Action(action_type=ActionType.SUBMIT_PROPERTY); assert action2.action_type == ActionType.SUBMIT_PROPERTY
49
+ reward = Reward(value=1.0, reason="test"); assert reward.value == 1.0
50
+ step = StepResult(observation=obs, reward=reward, done=False); assert not step.done
51
+ reset = ResetResult(observation=obs); assert reset.observation.task_id == "t1"
 
 
 
 
 
 
 
 
 
 
 
 
52
 
53
  def check_data_loading():
54
+ from data.data_loader import load_contracts, get_all_vulnerable_entries, get_all_property_entries
55
  contracts = load_contracts()
56
+ assert len(contracts) >= 1
57
+ vuln_entries = get_all_vulnerable_entries(contracts)
58
+ assert len(vuln_entries) >= 3, f"Need >=3 vulnerable fns, got {len(vuln_entries)}"
59
+ prop_entries = get_all_property_entries(contracts)
60
+ assert len(prop_entries) >= 3, f"Need >=3 property fns, got {len(prop_entries)}"
61
+ for _, fn in prop_entries:
62
+ p = fn["property"]
63
+ assert "natural_language" in p
64
+ assert "key_phrases" in p
65
+ assert "bonus_phrases" in p
66
+ assert len(p["key_phrases"]) >= 2
67
+
68
+ def check_t1_env():
 
 
 
 
 
 
 
 
69
  from tasks.task1.environment import Task1Environment
70
  from env.schemas import Action, ActionType
71
  env = Task1Environment()
72
+ r = env.reset(seed=42); assert r.observation.task_id == "task1_vuln_detection"
73
+ s = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
74
+ assert isinstance(s.reward.value, float)
75
+ assert s.observation.step_count == 1
76
+ st = env.state(); assert st.target_function is not None
77
+
78
+ def check_t2_env():
79
+ from tasks.task2.environment import Task2Environment
80
+ from env.schemas import Action, ActionType
81
+ env = Task2Environment()
82
+ r = env.reset(seed=42)
83
+ assert r.observation.task_id == "task2_property_discovery"
84
+ assert "target_function" in r.observation.extra
85
+ # test each action type
86
+ for at in [ActionType.GET_FUNCTION_CODE, ActionType.GET_FUNCTION_NATSPEC,
87
+ ActionType.GET_FILE_NATSPEC, ActionType.GET_IO, ActionType.GET_RELATED_FUNCTIONS]:
88
+ s = env.step(Action(action_type=at)); assert s.reward.value < 0
89
+ s = env.step(Action(action_type=ActionType.GET_SIMILAR_RULE))
90
+ assert s.reward.value == -0.20
91
+
92
+ def check_t2_env_submit():
93
+ from tasks.task2.environment import Task2Environment
94
+ from data.data_loader import load_contracts, get_function_by_name
95
+ from env.schemas import Action, ActionType
96
+ env = Task2Environment()
97
+ r = env.reset(seed=42)
98
+ fn_name = r.observation.extra["target_function"]
99
+ contract = r.observation.contract_name
100
+ contracts = load_contracts()
101
+ gt_text = ""
102
+ for c in contracts:
103
+ if c["contract_name"] == contract:
104
+ fn = get_function_by_name(c, fn_name)
105
+ if fn and fn.get("property"):
106
+ gt_text = fn["property"]["natural_language"]
107
+ result = env.step(Action(action_type=ActionType.SUBMIT_PROPERTY, params={"property": gt_text}))
108
+ assert result.done
109
+ assert result.reward.value > 0, f"GT text should score >0, got {result.reward.value}"
110
+
111
+ def check_t2_one_submit_only():
112
+ from tasks.task2.environment import Task2Environment
113
+ from env.schemas import Action, ActionType
114
+ env = Task2Environment()
115
+ env.reset(seed=5)
116
+ env.step(Action(action_type=ActionType.SUBMIT_PROPERTY, params={"property": "test"}))
117
+ # Second submit must either fail (episode done β†’ RuntimeError) or return negative reward
118
+ try:
119
+ s2 = env.step(Action(action_type=ActionType.SUBMIT_PROPERTY, params={"property": "test2"}))
120
+ # If it doesn't raise, the reward must be negative
121
+ assert s2.reward.value < 0, "Second submit should penalise"
122
+ except RuntimeError:
123
+ pass # expected
124
 
125
+ def check_t1_grader():
126
  from tasks.task1.grader import Task1Grader
127
  cases = [
128
+ ("withdraw", "Reentrancy vulnerability", "withdraw", "reentrancy", 1.0),
129
+ ("withdraw", "Reentrancy vulnerability", "withdraw", "something else", 0.5),
130
+ ("withdraw", "Reentrancy vulnerability", "deposit", "reentrancy", 0.0),
131
  ]
132
  for tf, issue, sf, sv, expected in cases:
133
  g = Task1Grader(tf, issue)
134
  score = g.grade_submission(sf, sv)
135
+ assert 0.0 <= score <= 1.0
136
  assert abs(score - expected) < 0.01, f"Expected {expected}, got {score}"
137
 
138
+ def check_t2_grader():
139
+ from tasks.task2.grader import Task2Grader
140
+ from data.data_loader import load_contracts, get_all_property_entries
141
+ contracts = load_contracts()
142
+ entries = get_all_property_entries(contracts)
143
+ for contract, fn in entries:
144
+ g = Task2Grader(fn["name"], fn["property"])
145
+ # Ground truth must score β‰₯ 0.65
146
+ gt_score = g.grade(fn["property"]["natural_language"])
147
+ assert gt_score >= 0.65, f"{fn['name']}: gt_score={gt_score} < 0.65"
148
+ # Empty must be 0.0
149
+ assert g.grade("") == 0.0
150
+ # Deterministic
151
+ assert g.grade("test text") == g.grade("test text")
152
+ # Score in [0,1]
153
+ assert 0.0 <= gt_score <= 1.0
154
+ # Reward maps correctly
155
+ assert abs(g.reward_for_score(gt_score) - gt_score * 5.0) < 0.01
156
 
157
  def check_reward_shaping():
158
+ from tasks.task2.environment import Task2Environment
 
159
  from env.schemas import Action, ActionType
160
+ env = Task2Environment()
161
  env.reset(seed=1)
162
+ rewards = {env.step(Action(action_type=at)).reward.value
163
+ for at in [ActionType.GET_FUNCTION_CODE, ActionType.GET_FILE_NATSPEC, ActionType.GET_IO]}
164
+ assert len(rewards) >= 2, f"Need multiple reward values, got {rewards}"
 
 
 
 
165
 
166
+ def check_t1_episode_boundary():
 
167
  from tasks.task1.environment import Task1Environment
168
  from env.schemas import Action, ActionType
169
  env = Task1Environment()
170
  env.reset(seed=2)
171
+ env.step(Action(action_type=ActionType.SUBMIT,
172
+ params={"function_name": "withdraw", "vulnerability_type": "test"}))
 
173
  try:
174
  env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
175
+ raise AssertionError("Should raise RuntimeError after done")
176
  except RuntimeError:
177
+ pass
 
178
 
179
  def check_repeated_query_penalty():
180
  from tasks.task1.environment import Task1Environment
181
  from env.schemas import Action, ActionType
182
+ env = Task1Environment(); env.reset(seed=3)
 
183
  env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
184
  r = env.step(Action(action_type=ActionType.LIST_FUNCTIONS))
185
+ assert r.reward.value == -0.40
 
186
 
187
+ def check_t2_repeated_penalty():
188
+ from tasks.task2.environment import Task2Environment
189
+ from env.schemas import Action, ActionType
190
+ env = Task2Environment(); env.reset(seed=3)
191
+ env.step(Action(action_type=ActionType.GET_FUNCTION_CODE))
192
+ r = env.step(Action(action_type=ActionType.GET_FUNCTION_CODE))
193
+ assert r.reward.value == -0.40
194
 
195
+ def check_task_placeholders():
196
+ from tasks.task3 import __all__ as t3
197
 
198
+ def check_dockerfile():
199
  import os
200
+ assert os.path.exists("Dockerfile")
201
+ with open("Dockerfile") as f: c = f.read()
202
+ assert "7860" in c
203
+ assert "uvicorn" in c or "CMD" in c
 
 
204
 
205
  def check_inference_script():
206
  import os
207
+ assert os.path.exists("inference.py")
208
+ with open("inference.py") as f: c = f.read()
209
+ assert "HF_TOKEN" in c
210
+ assert "API_BASE_URL" in c
211
+ assert "MODEL_NAME" in c
212
+ assert "task2" in c.lower() or "Task2" in c or "TASK 2" in c
213
+
214
+ def check_baseline_json():
 
 
 
215
  import os
216
+ if not os.path.exists("baseline_scores.json"): return
217
+ with open("baseline_scores.json") as f: data = json.load(f)
 
 
218
  assert "tasks" in data
219
+ for t in data["tasks"]:
220
+ assert 0.0 <= t["avg_grader_score"] <= 1.0
 
+def check_similar_rule_lookup():
+    from data.data_loader import load_contracts, get_similar_rule
+    contracts = load_contracts()
+    sr = get_similar_rule(contracts, "SimpleVault", "withdraw")
+    assert sr is not None, "similar_rule should exist for withdraw"
+    assert "property_hint" in sr
+    assert "contract_name" in sr
+
+
+# ── Runner ────────────────────────────────────────────────────────────────────
+
+ALL_CHECKS = [
+    ("Python imports (T1 + T2)",             check_imports),
+    ("openenv.yaml format",                  check_openenv_yaml),
+    ("Pydantic models (incl T2 actions)",    check_pydantic_models),
+    ("Dataset: vuln + property entries",     check_data_loading),
+    ("Task 1: reset / step / state",         check_t1_env),
+    ("Task 2: reset + all 6 browse actions", check_t2_env),
+    ("Task 2: submit_property scores > 0",   check_t2_env_submit),
+    ("Task 2: one submit only",              check_t2_one_submit_only),
+    ("Task 1 grader: 0/0.5/1.0 rubric",      check_t1_grader),
+    ("Task 2 grader: all 11 properties",     check_t2_grader),
+    ("Reward shaping (multi-value)",         check_reward_shaping),
+    ("T1 episode boundary",                  check_t1_episode_boundary),
+    ("T1 repeated query penalty (-0.40)",    check_repeated_query_penalty),
+    ("T2 repeated query penalty (-0.40)",    check_t2_repeated_penalty),
+    ("Task 3 placeholder exists",            check_task_placeholders),
+    ("Dockerfile + port 7860",               check_dockerfile),
+    ("inference.py: creds + Task 2 code",    check_inference_script),
+    ("baseline_scores.json schema",          check_baseline_json),
+    ("similar_rule data lookup",             check_similar_rule_lookup),
+]


 
254
  def main():
255
+ print("=" * 64)
256
+ print("OpenEnv Pre-Submission Validation (Task 1 + Task 2)")
257
+ print("=" * 64)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
258
     print()
+    for name, fn in ALL_CHECKS:
         check(name, fn)

     passed = sum(1 for _, ok, _ in results if ok)
+    total = len(results)
+    failed = [(n, m) for n, ok, m in results if not ok]

+    print()
+    print("=" * 64)
     print(f"Results: {passed}/{total} checks passed")

     if failed:
         print("\nFailed checks:")
+        for n, m in failed:
+            print(f"  {FAIL} {n}: {m}")
+        print("\n❌ VALIDATION FAILED — fix the issues above before submitting.")
         sys.exit(1)
     else:
+        print("\n✅ ALL CHECKS PASSED — ready to submit!")
         sys.exit(0)


 if __name__ == "__main__":
     main()