ajaxwin committed on
Commit ·
73c779f
1
Parent(s): 88875f7
Readme -> Docs
- Docs.md +296 -0
- README.md +40 -303
- SPACES_README.md +0 -57
Docs.md
ADDED
|
@@ -0,0 +1,296 @@
|
|
| 1 |
+
# Smart Contract Audit RL Environment
|
| 2 |
+
|
| 3 |
+
> **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
|
| 4 |
+
> Three fully implemented tasks covering the core workflow of a professional Solidity auditor.
|
| 5 |
+
|
| 6 |
+
[](openenv.yaml)
|
| 7 |
+
[](https://python.org)
|
| 8 |
+
[](LICENSE)
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## Motivation
|
| 13 |
+
|
| 14 |
+
Smart contract auditing is a $500M+ industry where human experts identify security flaws, write formal properties, and check whether code satisfies those properties. This environment lets agents practise exactly those three tasks using real Solidity contracts from Certora-audited DeFi projects.
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## Tasks at a Glance
|
| 19 |
+
|
| 20 |
+
| # | Name | Difficulty | Status | One-line description |
|
| 21 |
+
|---|------|-----------|--------|---------------------|
|
| 22 |
+
| 1 | Targeted Vulnerability Detection | Medium | ✅ Active | Find which function is vulnerable and name the vulnerability |
|
| 23 |
+
| 2 | Property Discovery | Hard | ✅ Active | Write the natural-language postcondition for a given function |
|
| 24 |
+
| 3 | Rule Checker | Easy | ✅ Active | Identify which function violates a given property |
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## Task 1 — Targeted Vulnerability Detection *(Medium)*
|
| 29 |
+
|
| 30 |
+
**Setup:** A Solidity contract (4–6 functions) is shown. One function contains a critical vulnerability.
|
| 31 |
+
|
| 32 |
+
**Objective:** Name the vulnerable function and describe its vulnerability type in 2–3 words.
|
| 33 |
+
|
| 34 |
+
### Actions
|
| 35 |
+
|
| 36 |
+
| Action | Params | Reward |
|
| 37 |
+
|--------|--------|--------|
|
| 38 |
+
| `list_functions` | — | −0.05 |
|
| 39 |
+
| `get_function_code` | `function_name` | +0.05 if target / −0.10 if other |
|
| 40 |
+
| `get_function_summary` | `function_name` | +0.03 if target / −0.05 if other |
|
| 41 |
+
| `get_file_metadata` | — | −0.04 |
|
| 42 |
+
| `get_state_variable` | `variable_name` (opt.) | −0.05 |
|
| 43 |
+
| `get_call_graph` | — | −0.08 |
|
| 44 |
+
| `submit` | `function_name`, `vulnerability_type` | **+5.0 / +1.0 / −1.5** |
|
| 45 |
+
|
| 46 |
+
Repeated queries: **−0.40**
|
| 47 |
+
|
| 48 |
+
### Grader
|
| 49 |
+
|
| 50 |
+
- **1.0** → correct function + correct vulnerability keyword → reward **+5.0**
|
| 51 |
+
- **0.5** → correct function, vague/wrong vulnerability type → reward **+1.0**
|
| 52 |
+
- **0.0** → wrong function → reward **−1.5**
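
The three tiers above can be sketched as a small function. This is an illustrative sketch only, not the code in `tasks/task1/grader.py`: the name `grade_task1`, its parameters, the substring keyword test, and the case-insensitive function-name comparison are all assumptions.

```python
# Hypothetical sketch of the Task 1 tier mapping (names are illustrative).
def grade_task1(submitted_fn, submitted_vuln, target_fn, vuln_keywords):
    """Return (grader_score, reward) for one Task 1 submission."""
    # Wrong function: floor tier, regardless of the vulnerability text.
    # Case-insensitive comparison is an assumption here.
    if submitted_fn.strip().lower() != target_fn.lower():
        return 0.0, -1.5
    # Correct function: look for any accepted keyword in the description.
    text = submitted_vuln.lower()
    if any(keyword.lower() in text for keyword in vuln_keywords):
        return 1.0, 5.0
    # Correct function but vague or wrong vulnerability type.
    return 0.5, 1.0
```

For example, `grade_task1("withdraw", "reentrancy attack", "withdraw", ["reentrancy"])` falls in the top tier under these assumptions.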
|
| 53 |
+
|
| 54 |
+
### Vulnerability types covered
|
| 55 |
+
Reentrancy · Missing access control · Integer overflow · tx.origin authentication ·
|
| 56 |
+
Front-running · Timestamp dependence · Denial of service · Unchecked return value
|
| 57 |
+
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
## Task 2 — Property Discovery *(Hard)*
|
| 61 |
+
|
| 62 |
+
**Setup:** A single Solidity function is shown. The agent must discover its natural-language correctness property.
|
| 63 |
+
|
| 64 |
+
**Objective:** Write a precise 2–4 sentence postcondition describing what the function guarantees on success.
|
| 65 |
+
|
| 66 |
+
### Actions
|
| 67 |
+
|
| 68 |
+
| Action | Params | Reward |
|
| 69 |
+
|--------|--------|--------|
|
| 70 |
+
| `get_function_code` | — | −0.06 |
|
| 71 |
+
| `get_function_natspec` | — | −0.08 |
|
| 72 |
+
| `get_file_natspec` | — | −0.03 |
|
| 73 |
+
| `get_related_functions` | — | −0.06 |
|
| 74 |
+
| `get_signature` | — | −0.04 |
|
| 75 |
+
| `get_similar_rule` | — | −0.20 |
|
| 76 |
+
| `submit_property` | `property` (string) | **0.0–5.0** scored, ONE attempt |
|
| 77 |
+
|
| 78 |
+
### Grader (keyword-weighted)
|
| 79 |
+
|
| 80 |
+
```
|
| 81 |
+
score = 0.70 × (key_phrases_matched / total_key)
|
| 82 |
+
+ 0.30 × (bonus_phrases_matched / total_bonus)
|
| 83 |
+
reward = score × 5.0
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
Matching uses **word-set containment + synonym expansion** — words don't need to be adjacent.
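
A minimal sketch of how the formula and the matching rule above could fit together. It is not the code in `tasks/task2/grader.py`: the function names, the tokenisation regex, and the shape of the `synonyms` mapping (canonical word to a list of alternatives) are assumptions made for illustration.

```python
import re

# Hypothetical sketch of a keyword-weighted grader (names are illustrative).
def _tokens(text):
    """Lower-case word set of a text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def _expand(words, synonyms):
    """Add the canonical word whenever one of its synonyms is present."""
    expanded = set(words)
    for canonical, alternatives in synonyms.items():
        if words & set(alternatives):
            expanded.add(canonical)
    return expanded

def grade_task2(submission, key_phrases, bonus_phrases, synonyms):
    """Return (score, reward) using the 0.70 / 0.30 weighting."""
    words = _expand(_tokens(submission), synonyms)
    # Word-set containment: every word of the phrase must appear somewhere
    # in the submission, in any order and not necessarily adjacent.
    key_hits = sum(_tokens(p) <= words for p in key_phrases)
    bonus_hits = sum(_tokens(p) <= words for p in bonus_phrases)
    key_part = key_hits / len(key_phrases) if key_phrases else 0.0
    bonus_part = bonus_hits / len(bonus_phrases) if bonus_phrases else 0.0
    score = 0.70 * key_part + 0.30 * bonus_part
    return score, score * 5.0
```

Under this sketch, a submission that matches one of two key phrases and no bonus phrase scores 0.35, which corresponds to a reward of 1.75.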
|
| 87 |
+
|
| 88 |
+
---
|
| 89 |
+
|
| 90 |
+
## Task 3 — Rule Checker *(Easy)*
|
| 91 |
+
|
| 92 |
+
**Setup:** A Solidity contract is shown alongside a violated property in natural English. One function breaks that property.
|
| 93 |
+
|
| 94 |
+
**Objective:** Identify which function violates the property.
|
| 95 |
+
|
| 96 |
+
### Actions
|
| 97 |
+
|
| 98 |
+
| Action | Params | Reward |
|
| 99 |
+
|--------|--------|--------|
|
| 100 |
+
| `list_functions` | — | −0.05 |
|
| 101 |
+
| `get_function_metadata` | `function_name` | −0.05 |
|
| 102 |
+
| `get_function_code` | `function_name` | −0.10 |
|
| 103 |
+
| `get_state_variable` | `variable_name` (opt.) | −0.05 |
|
| 104 |
+
| `get_call_graph` | — | −0.08 |
|
| 105 |
+
| `get_property_specification` | — | **−0.03** (cheapest — read this first!) |
|
| 106 |
+
| `submit_function` | `function_name` | **+5.0 / +1.5 / −1.5**, ONE attempt |
|
| 107 |
+
|
| 108 |
+
### Grader (three-tier deterministic)
|
| 109 |
+
|
| 110 |
+
- **1.0** → exact target function (case-insensitive) → reward **+5.0**
|
| 111 |
+
- **0.3** → a direct internal subfunction of the target → reward **+1.5**
|
| 112 |
+
- **0.0** → anything else → reward **−1.5**
|
| 113 |
+
|
| 114 |
+
`get_property_specification` returns the precise pre/post-condition (`rule_broken_specs`). Reading it costs only −0.03 and usually provides enough information to identify the violating function without inspecting all code.
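
The three-tier rule can be sketched as follows. This is an assumption-laden illustration rather than the code in `tasks/task3/grader.py`: the name `grade_task3`, the `direct_subfunctions` argument, and applying case-insensitive matching to subfunction names as well are all hypothetical.

```python
# Hypothetical sketch of the Task 3 three-tier grader (names are illustrative).
def grade_task3(submitted_fn, target_fn, direct_subfunctions):
    """Return (grader_score, reward) for one Task 3 submission."""
    name = submitted_fn.strip().lower()
    # Tier 1: exact target function, compared case-insensitively.
    if name == target_fn.lower():
        return 1.0, 5.0
    # Tier 2: a direct internal subfunction called by the target.
    if name in {fn.lower() for fn in direct_subfunctions}:
        return 0.3, 1.5
    # Tier 3: anything else.
    return 0.0, -1.5
```

The `_drain` helper used in a call such as `grade_task3("_drain", "emergencyDrain", ["_drain"])` is a made-up subfunction name, chosen only to show the middle tier.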
|
| 115 |
+
|
| 116 |
+
---
|
| 117 |
+
|
| 118 |
+
## Observation Space
|
| 119 |
+
|
| 120 |
+
All tasks share the same `Observation` structure:
|
| 121 |
+
|
| 122 |
+
```json
|
| 123 |
+
{
|
| 124 |
+
"task_id": "task3_rule_checker",
|
| 125 |
+
"contract_name": "SimpleVault",
|
| 126 |
+
"contract_description": "An ETH vault that allows users to deposit...",
|
| 127 |
+
"available_actions": ["list_functions", "get_function_metadata", "..."],
|
| 128 |
+
"last_action": "get_property_specification",
|
| 129 |
+
"last_action_result": "Formal property:\nPre: caller != owner...",
|
| 130 |
+
"step_count": 1,
|
| 131 |
+
"cumulative_reward": -0.03,
|
| 132 |
+
"done": false,
|
| 133 |
+
"extra": {
|
| 134 |
+
"property_english": "Only the owner should be able to drain the vault...",
|
| 135 |
+
"solidity_version": "0.8.0",
|
| 136 |
+
"hint": "Find the function that violates this property..."
|
| 137 |
+
}
|
| 138 |
+
}
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
+
For Task 2, `extra` contains `target_function` and `target_signature`.
|
| 142 |
+
For Task 3, `extra` contains `property_english`.
|
| 143 |
+
|
| 144 |
+
---
|
| 145 |
+
|
| 146 |
+
## Project Structure
|
| 147 |
+
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
---
|
| 153 |
+
|
| 154 |
+
## Setup
|
| 155 |
+
|
| 156 |
+
### Local Python
|
| 157 |
+
|
| 158 |
+
```bash
|
| 159 |
+
pip install -r requirements.txt
|
| 160 |
+
|
| 161 |
+
# Start the server
|
| 162 |
+
python app.py # → http://localhost:7860
|
| 163 |
+
|
| 164 |
+
# Interactive / scripted demos
|
| 165 |
+
python demo.py --auto # Task 1 scripted demo
|
| 166 |
+
python demo.py --auto --seed 42 # Task 2 (same flag, different env seed)
|
| 167 |
+
|
| 168 |
+
# Full evaluation harness (no LLM required)
|
| 169 |
+
python eval.py # All 3 tasks, 8 episodes each
|
| 170 |
+
python eval.py --task 3 # Task 3 only
|
| 171 |
+
python eval.py --episodes 16 --verbose
|
| 172 |
+
|
| 173 |
+
# Pre-submission validation
|
| 174 |
+
python validate-submission.py # 23/23 checks
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
### Docker
|
| 178 |
+
|
| 179 |
+
```bash
|
| 180 |
+
docker build -t sc-audit-env .
|
| 181 |
+
docker run -p 7860:7860 sc-audit-env
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
### Direct Python API
|
| 185 |
+
|
| 186 |
+
```python
|
| 187 |
+
# Task 3 example
|
| 188 |
+
from tasks.task3.environment import Task3Environment
|
| 189 |
+
from env.schemas import Action, ActionType
|
| 190 |
+
|
| 191 |
+
env = Task3Environment()
|
| 192 |
+
r = env.reset(seed=42)
|
| 193 |
+
print(r.observation.extra["property_english"])
|
| 194 |
+
# "Only the owner should be able to drain the vault..."
|
| 195 |
+
|
| 196 |
+
s = env.step(Action(action_type=ActionType.GET_PROPERTY_SPECIFICATION))
|
| 197 |
+
s = env.step(Action(action_type=ActionType.SUBMIT_FUNCTION,
|
| 198 |
+
params={"function_name": "emergencyDrain"}))
|
| 199 |
+
print(s.reward.value) # +5.0
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
+
|
| 204 |
+
## HTTP API
|
| 205 |
+
|
| 206 |
+
| Method | Endpoint | Description |
|
| 207 |
+
|--------|----------|-------------|
|
| 208 |
+
| `GET` | `/health` | Liveness probe |
|
| 209 |
+
| `GET` | `/tasks` | All tasks + status |
|
| 210 |
+
| `POST` | `/reset` | Start episode (`task_id`, `seed`) |
|
| 211 |
+
| `POST` | `/step` | Take action (`action_type`, `params`) |
|
| 212 |
+
| `GET` | `/state` | Internal debug state |
|
| 213 |
+
| `GET` | `/action_space?task_id=...` | Action schema |
|
| 214 |
+
| `GET` | `/observation_space` | Observation schema |
|
| 215 |
+
|
| 216 |
+
```bash
|
| 217 |
+
# Full Task 3 episode
|
| 218 |
+
curl -X POST localhost:7860/reset \
|
| 219 |
+
-H "Content-Type: application/json" \
|
| 220 |
+
-d '{"task_id":"task3_rule_checker","seed":42}'
|
| 221 |
+
|
| 222 |
+
curl -X POST localhost:7860/step \
|
| 223 |
+
-H "Content-Type: application/json" \
|
| 224 |
+
-d '{"action_type":"get_property_specification","params":{}}'
|
| 225 |
+
|
| 226 |
+
curl -X POST localhost:7860/step \
|
| 227 |
+
-H "Content-Type: application/json" \
|
| 228 |
+
-d '{"action_type":"submit_function","params":{"function_name":"emergencyDrain"}}'
|
| 229 |
+
```
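
The same episode can be driven from Python with only the standard library. The sketch below assumes the server accepts the JSON bodies shown in the curl commands and replies with JSON; the helper names `build_post` and `send` are illustrative and not part of the project.

```python
import json
import urllib.request

# Hypothetical helpers mirroring the curl calls above (names are illustrative).
def build_post(base_url, path, payload):
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and decode the response body (assumed to be JSON)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    base = "http://localhost:7860"  # requires the server to be running
    send(build_post(base, "/reset", {"task_id": "task3_rule_checker", "seed": 42}))
    send(build_post(base, "/step", {"action_type": "get_property_specification", "params": {}}))
    result = send(build_post(base, "/step", {
        "action_type": "submit_function",
        "params": {"function_name": "emergencyDrain"},
    }))
    print(result)
```

Building the request separately from sending it keeps the payload construction testable without a live server.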
|
| 230 |
+
|
| 231 |
+
---
|
| 232 |
+
|
| 233 |
+
## Baseline Inference
|
| 234 |
+
|
| 235 |
+
```bash
|
| 236 |
+
export API_BASE_URL="https://api.openai.com/v1"
|
| 237 |
+
export MODEL_NAME="gpt-4o-mini"
|
| 238 |
+
export HF_TOKEN="sk-..."
|
| 239 |
+
python inference.py
|
| 240 |
+
```
|
| 241 |
+
|
| 242 |
+
### Expected scores (gpt-4o-mini, 3 episodes per task)
|
| 243 |
+
|
| 244 |
+
| Task | Avg Grader Score | Notes |
|
| 245 |
+
|------|-----------------|-------|
|
| 246 |
+
| Task 1 | ~0.67 | Good at classic vulns; struggles with subtle ones |
|
| 247 |
+
| Task 2 | ~0.55 | Reasonable properties; misses specific variable names |
|
| 248 |
+
| Task 3 | ~0.78 | Property text gives strong signal; usually correct in 3–4 steps |
|
| 249 |
+
|
| 250 |
+
---
|
| 251 |
+
|
| 252 |
+
## Evaluation Summary
|
| 253 |
+
|
| 254 |
+
Deterministic oracle / partial / floor tiers verified on 8 episodes (seeds 42–49):
|
| 255 |
+
|
| 256 |
+
| Task | Oracle | Partial/Sub | Floor | Ordering |
|
| 257 |
+
|------|--------|-------------|-------|----------|
|
| 258 |
+
| Task 1 | **1.000** | 0.500 | 0.000 | ✅ 1.0 > 0.5 > 0.0 |
|
| 259 |
+
| Task 2 | **0.775** | 0.034 | 0.000 | ✅ 0.775 > 0.034 > 0.0 |
|
| 260 |
+
| Task 3 | **1.000** | 0.037 | 0.000 | ✅ 1.0 > 0.037 > 0.0 |
|
| 261 |
+
|
| 262 |
+
The clear separation across all three tasks confirms the graders provide **meaningful gradient signal** across the full reward range — a core requirement for RL training environments.
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
## OpenEnv Spec Compliance
|
| 267 |
+
|
| 268 |
+
| Requirement | Status |
|
| 269 |
+
|-------------|--------|
|
| 270 |
+
| Typed `Observation`, `Action`, `Reward` Pydantic models | ✅ |
|
| 271 |
+
| `step(action) → StepResult(obs, reward, done, info)` | ✅ |
|
| 272 |
+
| `reset() → ResetResult` | ✅ |
|
| 273 |
+
| `state() → StateResult` | ✅ |
|
| 274 |
+
| `openenv.yaml` metadata | ✅ |
|
| 275 |
+
| 3 tasks, all active | ✅ |
|
| 276 |
+
| Grader scores in [0.0, 1.0] | ✅ |
|
| 277 |
+
| Shaped rewards (non-binary signal) | ✅ |
|
| 278 |
+
| Dockerfile + port 7860 | ✅ |
|
| 279 |
+
| `inference.py` with OpenAI client | ✅ |
|
| 280 |
+
| `validate.py` — 23/23 checks pass | ✅ |
|
| 281 |
+
|
| 282 |
+
---
|
| 283 |
+
|
| 284 |
+
## Deploying to Hugging Face Spaces
|
| 285 |
+
|
| 286 |
+
```bash
|
| 287 |
+
# Copy the HF frontmatter into README.md, then:
|
| 288 |
+
git remote add hf https://huggingface.co/spaces/<user>/<space>
|
| 289 |
+
git push hf main
|
| 290 |
+
```
|
| 291 |
+
|
| 292 |
+
---
|
| 293 |
+
|
| 294 |
+
## License
|
| 295 |
+
|
| 296 |
+
MIT. Contract vulnerability patterns adapted from Certora audits on production DeFi protocols.
|
README.md
CHANGED
|
@@ -1,320 +1,57 @@
|
|
| 1 |
-
# Smart Contract Audit RL Environment
|
| 2 |
-
|
| 3 |
-
> **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
|
| 4 |
-
> Three fully implemented tasks covering the core workflow of a professional Solidity auditor.
|
| 5 |
-
|
| 6 |
-
[](openenv.yaml)
|
| 7 |
-
[](https://python.org)
|
| 8 |
-
[](LICENSE)
|
| 9 |
-
|
| 10 |
-
---
|
| 11 |
-
|
| 12 |
-
## Motivation
|
| 13 |
-
|
| 14 |
-
Smart contract auditing is a $500M+ industry where human experts identify security flaws, write formal properties, and check whether code satisfies those properties. This environment lets agents practise exactly those three tasks using real Solidity contracts from Certora-audited DeFi projects.
|
| 15 |
-
|
| 16 |
-
---
|
| 17 |
-
|
| 18 |
-
## Tasks at a Glance
|
| 19 |
-
|
| 20 |
-
| # | Name | Difficulty | Status | One-line description |
|
| 21 |
-
|---|------|-----------|--------|---------------------|
|
| 22 |
-
| 1 | Targeted Vulnerability Detection | Medium | ✅ Active | Find which function is vulnerable and name the vulnerability |
|
| 23 |
-
| 2 | Property Discovery | Hard | ✅ Active | Write the natural-language postcondition for a given function |
|
| 24 |
-
| 3 | Rule Checker | Easy | ✅ Active | Identify which function violates a given property |
|
| 25 |
-
|
| 26 |
-
---
|
| 27 |
-
|
| 28 |
-
## Task 1 — Targeted Vulnerability Detection *(Medium)*
|
| 29 |
-
|
| 30 |
-
**Setup:** A Solidity contract (4–6 functions) is shown. One function contains a critical vulnerability.
|
| 31 |
-
|
| 32 |
-
**Objective:** Name the vulnerable function and describe its vulnerability type in 2–3 words.
|
| 33 |
-
|
| 34 |
-
### Actions
|
| 35 |
-
|
| 36 |
-
| Action | Params | Reward |
|
| 37 |
-
|--------|--------|--------|
|
| 38 |
-
| `list_functions` | — | −0.05 |
|
| 39 |
-
| `get_function_code` | `function_name` | +0.05 if target / −0.10 if other |
|
| 40 |
-
| `get_function_summary` | `function_name` | +0.03 if target / −0.05 if other |
|
| 41 |
-
| `get_file_metadata` | — | −0.04 |
|
| 42 |
-
| `get_state_variable` | `variable_name` (opt.) | −0.05 |
|
| 43 |
-
| `get_call_graph` | — | −0.08 |
|
| 44 |
-
| `submit` | `function_name`, `vulnerability_type` | **+5.0 / +1.0 / −1.5** |
|
| 45 |
-
|
| 46 |
-
Repeated queries: **−0.40**
|
| 47 |
-
|
| 48 |
-
### Grader
|
| 49 |
-
|
| 50 |
-
- **1.0** → correct function + correct vulnerability keyword → reward **+5.0**
|
| 51 |
-
- **0.5** → correct function, vague/wrong vulnerability type → reward **+1.0**
|
| 52 |
-
- **0.0** → wrong function → reward **−1.5**
|
| 53 |
-
|
| 54 |
-
### Vulnerability types covered
|
| 55 |
-
Reentrancy · Missing access control · Integer overflow · tx.origin authentication ·
|
| 56 |
-
Front-running · Timestamp dependence · Denial of service · Unchecked return value
|
| 57 |
-
|
| 58 |
---
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
| `get_io` | — | −0.04 |
|
| 75 |
-
| `get_similar_rule` | — | −0.20 |
|
| 76 |
-
| `submit_property` | `property` (string) | **0.0–5.0** scored, ONE attempt |
|
| 77 |
-
|
| 78 |
-
### Grader (keyword-weighted)
|
| 79 |
-
|
| 80 |
-
```
|
| 81 |
-
score = 0.70 × (key_phrases_matched / total_key)
|
| 82 |
-
+ 0.30 × (bonus_phrases_matched / total_bonus)
|
| 83 |
-
reward = score × 5.0
|
| 84 |
-
```
|
| 85 |
-
|
| 86 |
-
Matching uses **word-set containment + synonym expansion** — words don't need to be adjacent.
|
| 87 |
-
|
| 88 |
-
---
|
| 89 |
-
|
| 90 |
-
## Task 3 — Rule Checker *(Easy)*
|
| 91 |
-
|
| 92 |
-
**Setup:** A Solidity contract is shown alongside a violated property in natural English. One function breaks that property.
|
| 93 |
-
|
| 94 |
-
**Objective:** Identify which function violates the property.
|
| 95 |
-
|
| 96 |
-
### Actions
|
| 97 |
-
|
| 98 |
-
| Action | Params | Reward |
|
| 99 |
-
|--------|--------|--------|
|
| 100 |
-
| `list_functions` | — | −0.05 |
|
| 101 |
-
| `get_function_metadata` | `function_name` | −0.05 |
|
| 102 |
-
| `get_function_code` | `function_name` | −0.10 |
|
| 103 |
-
| `get_state_variable` | `variable_name` (opt.) | −0.05 |
|
| 104 |
-
| `get_call_graph` | — | −0.08 |
|
| 105 |
-
| `get_property_specification` | — | **−0.03** (cheapest — read this first!) |
|
| 106 |
-
| `submit_function` | `function_name` | **+5.0 / +1.5 / −1.5**, ONE attempt |
|
| 107 |
-
|
| 108 |
-
### Grader (three-tier deterministic)
|
| 109 |
-
|
| 110 |
-
- **1.0** → exact target function (case-insensitive) → reward **+5.0**
|
| 111 |
-
- **0.3** → a direct internal subfunction of the target → reward **+1.5**
|
| 112 |
-
- **0.0** → anything else → reward **−1.5**
|
| 113 |
-
|
| 114 |
-
`get_property_specification` returns the precise pre/post-condition (`rule_broken_specs`). Reading it costs only −0.03 and usually provides enough information to identify the violating function without inspecting all code.
|
| 115 |
-
|
| 116 |
-
---
|
| 117 |
-
|
| 118 |
-
## Observation Space
|
| 119 |
-
|
| 120 |
-
All tasks share the same `Observation` structure:
|
| 121 |
-
|
| 122 |
-
```json
|
| 123 |
-
{
|
| 124 |
-
"task_id": "task3_rule_checker",
|
| 125 |
-
"contract_name": "SimpleVault",
|
| 126 |
-
"contract_description": "An ETH vault that allows users to deposit...",
|
| 127 |
-
"available_actions": ["list_functions", "get_function_metadata", "..."],
|
| 128 |
-
"last_action": "get_property_specification",
|
| 129 |
-
"last_action_result": "Formal property:\nPre: caller != owner...",
|
| 130 |
-
"step_count": 1,
|
| 131 |
-
"cumulative_reward": -0.03,
|
| 132 |
-
"done": false,
|
| 133 |
-
"extra": {
|
| 134 |
-
"property_english": "Only the owner should be able to drain the vault...",
|
| 135 |
-
"solidity_version": "0.8.0",
|
| 136 |
-
"hint": "Find the function that violates this property..."
|
| 137 |
-
}
|
| 138 |
-
}
|
| 139 |
-
```
|
| 140 |
-
|
| 141 |
-
For Task 2, `extra` contains `target_function` and `target_signature`.
|
| 142 |
-
For Task 3, `extra` contains `property_english`.
|
| 143 |
-
|
| 144 |
---
|
| 145 |
|
| 146 |
-
#
|
| 147 |
-
|
| 148 |
-
```
|
| 149 |
-
smart-contract-env/
|
| 150 |
-
├── data/
|
| 151 |
-
│ ├── contracts.json # 4 contracts, 8 vulns, 11 properties, 8 rule episodes
|
| 152 |
-
│ └── data_loader.py # loaders for all three tasks
|
| 153 |
-
├── env/
|
| 154 |
-
│ ├── base_env.py # Abstract OpenEnv base class
|
| 155 |
-
│ └── schemas.py # Typed Pydantic models (all ActionTypes)
|
| 156 |
-
├── tasks/
|
| 157 |
-
│ ├── task1/
|
| 158 |
-
│ │ ├── environment.py # Vulnerability detection environment
|
| 159 |
-
│ │ └── grader.py # Longest-match keyword grader (0/0.5/1.0)
|
| 160 |
-
│ ├── task2/
|
| 161 |
-
│ │ ├── environment.py # Property discovery (one submit_property)
|
| 162 |
-
│ │ └── grader.py # Word-set + synonym grader (0.0–1.0)
|
| 163 |
-
│ └── task3/
|
| 164 |
-
│ ├── environment.py # Rule checker (one submit_function)
|
| 165 |
-
│ └── grader.py # Three-tier grader (1.0/0.3/0.0)
|
| 166 |
-
├── app.py # FastAPI — all OpenEnv HTTP endpoints
|
| 167 |
-
├── inference.py # Baseline LLM agent (all 3 tasks)
|
| 168 |
-
├── eval.py # Oracle/partial/floor evaluation harness
|
| 169 |
-
├── demo.py # Colourised scripted demos for all 3 tasks
|
| 170 |
-
├── validate.py # 23-check pre-submission validator
|
| 171 |
-
├── openenv.yaml # Full OpenEnv spec metadata
|
| 172 |
-
├── Dockerfile # Port 7860, healthcheck
|
| 173 |
-
└── requirements.txt
|
| 174 |
-
```
|
| 175 |
-
|
| 176 |
-
---
|
| 177 |
-
|
| 178 |
-
## Setup
|
| 179 |
-
|
| 180 |
-
### Local Python
|
| 181 |
-
|
| 182 |
-
```bash
|
| 183 |
-
pip install -r requirements.txt
|
| 184 |
-
|
| 185 |
-
# Start the server
|
| 186 |
-
python app.py # → http://localhost:7860
|
| 187 |
-
|
| 188 |
-
# Interactive / scripted demos
|
| 189 |
-
python demo.py --auto # Task 1 scripted demo
|
| 190 |
-
python demo.py --auto --seed 42 # Task 2 (same flag, different env seed)
|
| 191 |
-
|
| 192 |
-
# Full evaluation harness (no LLM required)
|
| 193 |
-
python eval.py # All 3 tasks, 8 episodes each
|
| 194 |
-
python eval.py --task 3 # Task 3 only
|
| 195 |
-
python eval.py --episodes 16 --verbose
|
| 196 |
-
|
| 197 |
-
# Pre-submission validation
|
| 198 |
-
python validate.py # 23/23 checks
|
| 199 |
-
```
|
| 200 |
-
|
| 201 |
-
### Docker
|
| 202 |
-
|
| 203 |
-
```bash
|
| 204 |
-
docker build -t sc-audit-env .
|
| 205 |
-
docker run -p 7860:7860 sc-audit-env
|
| 206 |
-
```
|
| 207 |
-
|
| 208 |
-
### Direct Python API
|
| 209 |
-
|
| 210 |
-
```python
|
| 211 |
-
# Task 3 example
|
| 212 |
-
from tasks.task3.environment import Task3Environment
|
| 213 |
-
from env.schemas import Action, ActionType
|
| 214 |
-
|
| 215 |
-
env = Task3Environment()
|
| 216 |
-
r = env.reset(seed=42)
|
| 217 |
-
print(r.observation.extra["property_english"])
|
| 218 |
-
# "Only the owner should be able to drain the vault..."
|
| 219 |
-
|
| 220 |
-
s = env.step(Action(action_type=ActionType.GET_PROPERTY_SPECIFICATION))
|
| 221 |
-
s = env.step(Action(action_type=ActionType.SUBMIT_FUNCTION,
|
| 222 |
-
params={"function_name": "emergencyDrain"}))
|
| 223 |
-
print(s.reward.value) # +5.0
|
| 224 |
-
```
|
| 225 |
|
| 226 |
-
-
|
| 227 |
|
| 228 |
-
|
| 229 |
|
| 230 |
-
|
| 231 |
-
|--------|----------|-------------|
|
| 232 |
-
| `GET` | `/health` | Liveness probe |
|
| 233 |
-
| `GET` | `/tasks` | All tasks + status |
|
| 234 |
-
| `POST` | `/reset` | Start episode (`task_id`, `seed`) |
|
| 235 |
-
| `POST` | `/step` | Take action (`action_type`, `params`) |
|
| 236 |
-
| `GET` | `/state` | Internal debug state |
|
| 237 |
-
| `GET` | `/action_space?task_id=...` | Action schema |
|
| 238 |
-
| `GET` | `/observation_space` | Observation schema |
|
| 239 |
|
| 240 |
```bash
|
| 241 |
-
#
|
| 242 |
-
curl -X POST
|
| 243 |
-H "Content-Type: application/json" \
|
| 244 |
-
-d '{"task_id":"
|
| 245 |
|
| 246 |
-
|
| 247 |
-H "Content-Type: application/json" \
|
| 248 |
-
-d '{"action_type":"
|
| 249 |
|
| 250 |
-
|
| 251 |
-H "Content-Type: application/json" \
|
| 252 |
-
-d '{"action_type":"
|
| 253 |
```
|
| 254 |
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
## Baseline Inference
|
| 258 |
-
|
| 259 |
-
```bash
|
| 260 |
-
export API_BASE_URL="https://api.openai.com/v1"
|
| 261 |
-
export MODEL_NAME="gpt-4o-mini"
|
| 262 |
-
export HF_TOKEN="sk-..."
|
| 263 |
-
python inference.py
|
| 264 |
-
```
|
| 265 |
-
|
| 266 |
-
### Expected scores (gpt-4o-mini, 3 episodes per task)
|
| 267 |
-
|
| 268 |
-
| Task | Avg Grader Score | Notes |
|
| 269 |
-
|------|-----------------|-------|
|
| 270 |
-
| Task 1 | ~0.67 | Good at classic vulns; struggles with subtle ones |
|
| 271 |
-
| Task 2 | ~0.55 | Reasonable properties; misses specific variable names |
|
| 272 |
-
| Task 3 | ~0.78 | Property text gives strong signal; usually correct in 3–4 steps |
|
| 273 |
-
|
| 274 |
-
---
|
| 275 |
-
|
| 276 |
-
## Evaluation Summary
|
| 277 |
-
|
| 278 |
-
Deterministic oracle / partial / floor tiers verified on 8 episodes (seeds 42–49):
|
| 279 |
-
|
| 280 |
-
| Task | Oracle | Partial/Sub | Floor | Ordering |
|
| 281 |
-
|------|--------|-------------|-------|----------|
|
| 282 |
-
| Task 1 | **1.000** | 0.500 | 0.000 | ✅ 1.0 > 0.5 > 0.0 |
|
| 283 |
-
| Task 2 | **0.775** | 0.034 | 0.000 | ✅ 0.775 > 0.034 > 0.0 |
|
| 284 |
-
| Task 3 | **1.000** | 0.037 | 0.000 | ✅ 1.0 > 0.037 > 0.0 |
|
| 285 |
-
|
| 286 |
-
The clear separation across all three tasks confirms the graders provide **meaningful gradient signal** across the full reward range — a core requirement for RL training environments.
|
| 287 |
-
|
| 288 |
-
---
|
| 289 |
-
|
| 290 |
-
## OpenEnv Spec Compliance
|
| 291 |
-
|
| 292 |
-
| Requirement | Status |
|
| 293 |
-
|-------------|--------|
|
| 294 |
-
| Typed `Observation`, `Action`, `Reward` Pydantic models | ✅ |
|
| 295 |
-
| `step(action) → StepResult(obs, reward, done, info)` | ✅ |
|
| 296 |
-
| `reset() → ResetResult` | ✅ |
|
| 297 |
-
| `state() → StateResult` | ✅ |
|
| 298 |
-
| `openenv.yaml` metadata | ✅ |
|
| 299 |
-
| 3 tasks, all active | ✅ |
|
| 300 |
-
| Grader scores in [0.0, 1.0] | ✅ |
|
| 301 |
-
| Shaped rewards (non-binary signal) | ✅ |
|
| 302 |
-
| Dockerfile + port 7860 | ✅ |
|
| 303 |
-
| `inference.py` with OpenAI client | ✅ |
|
| 304 |
-
| `validate.py` — 23/23 checks pass | ✅ |
|
| 305 |
-
|
| 306 |
-
---
|
| 307 |
-
|
| 308 |
-
## Deploying to Hugging Face Spaces
|
| 309 |
-
|
| 310 |
-
```bash
|
| 311 |
-
# Copy the HF frontmatter into README.md, then:
|
| 312 |
-
git remote add hf https://huggingface.co/spaces/<user>/<space>
|
| 313 |
-
git push hf main
|
| 314 |
-
```
|
| 315 |
-
|
| 316 |
-
---
|
| 317 |
|
| 318 |
-
|
| 319 |
|
| 320 |
-
|
| 1 |
---
|
| 2 |
+
title: Smart Contract Audit RL Environment
|
| 3 |
+
emoji: 🔍
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: indigo
|
| 6 |
+
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
+
tags:
|
| 9 |
+
- openenv
|
| 10 |
+
- reinforcement-learning
|
| 11 |
+
- smart-contracts
|
| 12 |
+
- solidity
|
| 13 |
+
- security
|
| 14 |
+
- evaluation
|
| 15 |
+
license: mit
|
| 16 |
+
short_description: OpenEnv RL environment for smart contract security auditing
|
| 17 |
---
|
| 18 |
|
| 19 |
+
# Smart Contract Audit RL Environment
|
| 20 |
|
| 21 |
+
> OpenEnv-compliant RL environment for Solidity security analysis.
|
| 22 |
|
| 23 |
+
This Space exposes the full OpenEnv HTTP interface for **Task 1: Targeted Vulnerability Detection**.
|
| 24 |
+
Agents explore Solidity contracts using a structured action API and identify vulnerable functions.
|
| 25 |
|
| 26 |
+
## Quick start
|
| 27 |
|
| 28 |
```bash
|
| 29 |
+
# Reset — start a new episode
|
| 30 |
+
curl -X POST $SPACE_URL/reset \
|
| 31 |
-H "Content-Type: application/json" \
|
| 32 |
+
-d '{"task_id": "task1_vuln_detection", "seed": 42}'
|
| 33 |
|
| 34 |
+
# Step — list contract functions
|
| 35 |
+
curl -X POST $SPACE_URL/step \
|
| 36 |
-H "Content-Type: application/json" \
|
| 37 |
+
-d '{"action_type": "list_functions", "params": {}}'
|
| 38 |
|
| 39 |
+
# Submit answer
|
| 40 |
+
curl -X POST $SPACE_URL/step \
|
| 41 |
-H "Content-Type: application/json" \
|
| 42 |
+
-d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
|
| 43 |
```
|
| 44 |
|
| 45 |
+
## Endpoints
|
| 46 |
|
| 47 |
+
| Method | Path | Description |
|
| 48 |
+
|--------|------|-------------|
|
| 49 |
+
| GET | `/health` | Liveness probe |
|
| 50 |
+
| GET | `/tasks` | All tasks + status |
|
| 51 |
+
| POST | `/reset` | New episode |
|
| 52 |
+
| POST | `/step` | Take action |
|
| 53 |
+
| GET | `/state` | Debug state |
|
| 54 |
+
| GET | `/action_space` | Action schema |
|
| 55 |
+
| GET | `/observation_space` | Observation schema |
|
| 56 |
|
| 57 |
+
See [Docs.md](Docs.md) for detailed documentation.
|
SPACES_README.md
DELETED
|
@@ -1,57 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: Smart Contract Audit RL Environment
|
| 3 |
-
emoji: 🔍
|
| 4 |
-
colorFrom: blue
|
| 5 |
-
colorTo: indigo
|
| 6 |
-
sdk: docker
|
| 7 |
-
app_port: 7860
|
| 8 |
-
tags:
|
| 9 |
-
- openenv
|
| 10 |
-
- reinforcement-learning
|
| 11 |
-
- smart-contracts
|
| 12 |
-
- solidity
|
| 13 |
-
- security
|
| 14 |
-
- evaluation
|
| 15 |
-
license: mit
|
| 16 |
-
short_description: OpenEnv RL environment for smart contract security auditing
|
| 17 |
-
---
|
| 18 |
-
|
| 19 |
-
# Smart Contract Audit RL Environment
|
| 20 |
-
|
| 21 |
-
> OpenEnv-compliant RL environment for Solidity security analysis.
|
| 22 |
-
|
| 23 |
-
This Space exposes the full OpenEnv HTTP interface for **Task 1: Targeted Vulnerability Detection**.
|
| 24 |
-
Agents explore Solidity contracts using a structured action API and identify vulnerable functions.
|
| 25 |
-
|
| 26 |
-
## Quick start
|
| 27 |
-
|
| 28 |
-
```bash
|
| 29 |
-
# Reset — start a new episode
|
| 30 |
-
curl -X POST $SPACE_URL/reset \
|
| 31 |
-
-H "Content-Type: application/json" \
|
| 32 |
-
-d '{"task_id": "task1_vuln_detection", "seed": 42}'
|
| 33 |
-
|
| 34 |
-
# Step — list contract functions
|
| 35 |
-
curl -X POST $SPACE_URL/step \
|
| 36 |
-
-H "Content-Type: application/json" \
|
| 37 |
-
-d '{"action_type": "list_functions", "params": {}}'
|
| 38 |
-
|
| 39 |
-
# Submit answer
|
| 40 |
-
curl -X POST $SPACE_URL/step \
|
| 41 |
-
-H "Content-Type: application/json" \
|
| 42 |
-
-d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
|
| 43 |
-
```
|
| 44 |
-
|
| 45 |
-
## Endpoints
|
| 46 |
-
|
| 47 |
-
| Method | Path | Description |
|
| 48 |
-
|--------|------|-------------|
|
| 49 |
-
| GET | `/health` | Liveness probe |
|
| 50 |
-
| GET | `/tasks` | All tasks + status |
|
| 51 |
-
| POST | `/reset` | New episode |
|
| 52 |
-
| POST | `/step` | Take action |
|
| 53 |
-
| GET | `/state` | Debug state |
|
| 54 |
-
| GET | `/action_space` | Action schema |
|
| 55 |
-
| GET | `/observation_space` | Observation schema |
|
| 56 |
-
|
| 57 |
-
See the full [README](README.md) for detailed documentation.
|