ajaxwin committed on
Commit 73c779f · 1 Parent(s): 88875f7

Readme -> Docs

Files changed (3)
  1. Docs.md +296 -0
  2. README.md +40 -303
  3. SPACES_README.md +0 -57
Docs.md ADDED
@@ -0,0 +1,296 @@
1
+ # Smart Contract Audit RL Environment
2
+
3
+ > **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
4
+ > Three fully implemented tasks covering the core workflow of a professional Solidity auditor.
5
+
6
+ [![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-1.2-blue)](openenv.yaml)
7
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)](https://python.org)
8
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
9
+
10
+ ---
11
+
12
+ ## Motivation
13
+
14
+ Smart contract auditing is a $500M+ industry where human experts identify security flaws, write formal properties, and check whether code satisfies those properties. This environment lets agents practise exactly those three tasks using real Solidity contracts from Certora-audited DeFi projects.
15
+
16
+ ---
17
+
18
+ ## Tasks at a Glance
19
+
20
+ | # | Name | Difficulty | Status | One-line description |
21
+ |---|------|-----------|--------|---------------------|
22
+ | 1 | Targeted Vulnerability Detection | Medium | ✅ Active | Find which function is vulnerable and name the vulnerability |
23
+ | 2 | Property Discovery | Hard | ✅ Active | Write the natural-language postcondition for a given function |
24
+ | 3 | Rule Checker | Easy | ✅ Active | Identify which function violates a given property |
25
+
26
+ ---
27
+
28
+ ## Task 1 — Targeted Vulnerability Detection *(Medium)*
29
+
30
+ **Setup:** A Solidity contract (4–6 functions) is shown. One function contains a critical vulnerability.
31
+
32
+ **Objective:** Name the vulnerable function and describe its vulnerability type in 2–3 words.
33
+
34
+ ### Actions
35
+
36
+ | Action | Params | Reward |
37
+ |--------|--------|--------|
38
+ | `list_functions` | — | −0.05 |
39
+ | `get_function_code` | `function_name` | +0.05 if target / −0.10 if other |
40
+ | `get_function_summary` | `function_name` | +0.03 if target / −0.05 if other |
41
+ | `get_file_metadata` | — | −0.04 |
42
+ | `get_state_variable` | `variable_name` (opt.) | −0.05 |
43
+ | `get_call_graph` | — | −0.08 |
44
+ | `submit` | `function_name`, `vulnerability_type` | **+5.0 / +1.0 / −1.5** |
45
+
46
+ Repeated queries: **−0.40**
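A minimal sketch of how the shaped exploration rewards in the table above add up over one episode. The reward values are copied from the table; the function name `exploration_reward`, the example episode, and the treatment of a repeated query (it earns only the −0.40 penalty, not its usual reward as well) are assumptions for illustration, not the environment's actual code.

```python
# Per-action rewards from the Task 1 table: (reward if target, reward otherwise).
ACTION_REWARDS = {
    "list_functions": (-0.05, -0.05),
    "get_function_code": (0.05, -0.10),
    "get_function_summary": (0.03, -0.05),
    "get_file_metadata": (-0.04, -0.04),
    "get_state_variable": (-0.05, -0.05),
    "get_call_graph": (-0.08, -0.08),
}
REPEAT_PENALTY = -0.40


def exploration_reward(actions, target_function):
    """Sum shaped rewards for a list of (action, argument_or_None) queries.

    Assumption: a repeated (action, argument) pair earns only the repeat
    penalty instead of its usual reward.
    """
    seen = set()
    total = 0.0
    for action, arg in actions:
        if (action, arg) in seen:
            total += REPEAT_PENALTY
            continue
        seen.add((action, arg))
        on_target, off_target = ACTION_REWARDS[action]
        total += on_target if arg == target_function else off_target
    return round(total, 2)


episode = [
    ("list_functions", None),
    ("get_function_code", "deposit"),
    ("get_function_code", "withdraw"),
    ("get_function_code", "withdraw"),  # repeated query
]
print(exploration_reward(episode, target_function="withdraw"))  # -0.5
```

Under these assumptions a single repeated query costs more than all other exploration in the episode combined, which is why an agent would want to track what it has already asked.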
47
+
48
+ ### Grader
49
+
50
+ - **1.0** → correct function + correct vulnerability keyword → reward **+5.0**
51
+ - **0.5** → correct function, vague/wrong vulnerability type → reward **+1.0**
52
+ - **0.0** → wrong function → reward **−1.5**
53
+
54
+ ### Vulnerability types covered
55
+ Reentrancy · Missing access control · Integer overflow · tx.origin authentication ·
56
+ Front-running · Timestamp dependence · Denial of service · Unchecked return value
57
+
58
+ ---
59
+
60
+ ## Task 2 — Property Discovery *(Hard)*
61
+
62
+ **Setup:** A single Solidity function is shown. The agent must discover its natural-language correctness property.
63
+
64
+ **Objective:** Write a precise 2–4 sentence postcondition describing what the function guarantees on success.
65
+
66
+ ### Actions
67
+
68
+ | Action | Params | Reward |
69
+ |--------|--------|--------|
70
+ | `get_function_code` | — | −0.06 |
71
+ | `get_function_natspec` | — | −0.08 |
72
+ | `get_file_natspec` | — | −0.03 |
73
+ | `get_related_functions` | — | −0.06 |
74
+ | `get_signature` | — | −0.04 |
75
+ | `get_similar_rule` | — | −0.20 |
76
+ | `submit_property` | `property` (string) | **0.0–5.0** scored, ONE attempt |
77
+
78
+ ### Grader (keyword-weighted)
79
+
80
+ ```
81
+ score = 0.70 × (key_phrases_matched / total_key)
82
+ + 0.30 × (bonus_phrases_matched / total_bonus)
83
+ reward = score × 5.0
84
+ ```
85
+
86
+ Matching uses **word-set containment + synonym expansion** — words don't need to be adjacent.
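The scoring formula and the word-set matching rule can be sketched as follows. This is an illustration only: the `SYNONYMS` table, the helper names, and the example phrases are hypothetical, and the real grader's phrase lists and synonym data are not shown in this document.

```python
import re

# Hypothetical synonym table: canonical word -> accepted variants.
SYNONYMS = {
    "increase": {"increases", "increased", "grows", "raises"},
    "balance": {"balances", "funds"},
}
# Reverse lookup: variant -> canonical word.
CANONICAL = {
    variant: canon for canon, variants in SYNONYMS.items() for variant in variants
}


def normalize(text):
    """Lower-case, split into words, and map each word to its canonical form."""
    return {CANONICAL.get(w, w) for w in re.findall(r"[a-z0-9_]+", text.lower())}


def phrase_matches(phrase, submission_words):
    """A phrase matches when every one of its words occurs in the submission,
    in any position (word-set containment, no adjacency required)."""
    return normalize(phrase) <= submission_words


def score(submission, key_phrases, bonus_phrases):
    """0.70 * fraction of key phrases matched + 0.30 * fraction of bonus phrases."""
    words = normalize(submission)
    key = sum(phrase_matches(p, words) for p in key_phrases)
    bonus = sum(phrase_matches(p, words) for p in bonus_phrases)
    key_frac = key / len(key_phrases) if key_phrases else 0.0
    bonus_frac = bonus / len(bonus_phrases) if bonus_phrases else 0.0
    return 0.70 * key_frac + 0.30 * bonus_frac


def reward(grader_score):
    """Scale the grader score in [0, 1] to the 0.0 to 5.0 reward range."""
    return grader_score * 5.0
```

For example, a submission that covers every key phrase but no bonus phrase scores 0.70 and earns a reward of about 3.5, so the key phrases carry most of the weight.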
87
+
88
+ ---
89
+
90
+ ## Task 3 — Rule Checker *(Easy)*
91
+
92
+ **Setup:** A Solidity contract is shown alongside a violated property in natural English. One function breaks that property.
93
+
94
+ **Objective:** Identify which function violates the property.
95
+
96
+ ### Actions
97
+
98
+ | Action | Params | Reward |
99
+ |--------|--------|--------|
100
+ | `list_functions` | — | −0.05 |
101
+ | `get_function_metadata` | `function_name` | −0.05 |
102
+ | `get_function_code` | `function_name` | −0.10 |
103
+ | `get_state_variable` | `variable_name` (opt.) | −0.05 |
104
+ | `get_call_graph` | — | −0.08 |
105
+ | `get_property_specification` | — | **−0.03** (cheapest — read this first!) |
106
+ | `submit_function` | `function_name` | **+5.0 / +1.5 / −1.5**, ONE attempt |
107
+
108
+ ### Grader (three-tier deterministic)
109
+
110
+ - **1.0** → exact target function (case-insensitive) → reward **+5.0**
111
+ - **0.3** → a direct internal subfunction of the target → reward **+1.5**
112
+ - **0.0** → anything else → reward **−1.5**
113
+
114
+ `get_property_specification` returns the precise pre/post-condition (`rule_broken_specs`). Reading it costs only −0.03 and usually provides enough information to identify the violating function without inspecting all code.
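The three grader tiers above map onto a short function. This is a sketch under assumptions: the name `grade_task3`, its signature, and the way subfunctions are passed in are hypothetical, and the example function names are placeholders; only the scores and rewards come from the list above.

```python
def grade_task3(submitted, target, subfunctions):
    """Return (grader_score, reward) for a submitted function name.

    subfunctions: names of the internal functions called directly by the
    target (an assumed input; how the real grader obtains them is not shown).
    """
    name = submitted.strip().lower()
    if name == target.lower():
        return 1.0, 5.0   # exact target, case-insensitive
    if name in {s.lower() for s in subfunctions}:
        return 0.3, 1.5   # direct internal subfunction of the target
    return 0.0, -1.5      # anything else


print(grade_task3("EmergencyDrain", "emergencyDrain", ["_transferAll"]))  # (1.0, 5.0)
```

Because only one `submit_function` attempt is allowed, the partial tier is the sole source of intermediate credit at submission time.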
115
+
116
+ ---
117
+
118
+ ## Observation Space
119
+
120
+ All tasks share the same `Observation` structure:
121
+
122
+ ```json
123
+ {
124
+ "task_id": "task3_rule_checker",
125
+ "contract_name": "SimpleVault",
126
+ "contract_description": "An ETH vault that allows users to deposit...",
127
+ "available_actions": ["list_functions", "get_function_metadata", "..."],
128
+ "last_action": "get_property_specification",
129
+ "last_action_result": "Formal property:\nPre: caller != owner...",
130
+ "step_count": 1,
131
+ "cumulative_reward": -0.03,
132
+ "done": false,
133
+ "extra": {
134
+ "property_english": "Only the owner should be able to drain the vault...",
135
+ "solidity_version": "0.8.0",
136
+ "hint": "Find the function that violates this property..."
137
+ }
138
+ }
139
+ ```
140
+
141
+ For Task 2, `extra` contains `target_function` and `target_signature`.
142
+ For Task 3, `extra` contains `property_english`.
143
+
144
+ ---
145
+
146
+ ## Project Structure
147
+
148
+ ```
149
+
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Setup
155
+
156
+ ### Local Python
157
+
158
+ ```bash
159
+ pip install -r requirements.txt
160
+
161
+ # Start the server
162
+ python app.py # → http://localhost:7860
163
+
164
+ # Interactive / scripted demos
165
+ python demo.py --auto # Task 1 scripted demo
166
+ python demo.py --auto --seed 42 # Task 2 (same flag, different env seed)
167
+
168
+ # Full evaluation harness (no LLM required)
169
+ python eval.py # All 3 tasks, 8 episodes each
170
+ python eval.py --task 3 # Task 3 only
171
+ python eval.py --episodes 16 --verbose
172
+
173
+ # Pre-submission validation
174
+ python validate-submission.py # 23/23 checks
175
+ ```
176
+
177
+ ### Docker
178
+
179
+ ```bash
180
+ docker build -t sc-audit-env .
181
+ docker run -p 7860:7860 sc-audit-env
182
+ ```
183
+
184
+ ### Direct Python API
185
+
186
+ ```python
187
+ # Task 3 example
188
+ from tasks.task3.environment import Task3Environment
189
+ from env.schemas import Action, ActionType
190
+
191
+ env = Task3Environment()
192
+ r = env.reset(seed=42)
193
+ print(r.observation.extra["property_english"])
194
+ # "Only the owner should be able to drain the vault..."
195
+
196
+ s = env.step(Action(action_type=ActionType.GET_PROPERTY_SPECIFICATION))
197
+ s = env.step(Action(action_type=ActionType.SUBMIT_FUNCTION,
198
+ params={"function_name": "emergencyDrain"}))
199
+ print(s.reward.value) # +5.0
200
+ ```
201
+
202
+ ---
203
+
204
+ ## HTTP API
205
+
206
+ | Method | Endpoint | Description |
207
+ |--------|----------|-------------|
208
+ | `GET` | `/health` | Liveness probe |
209
+ | `GET` | `/tasks` | All tasks + status |
210
+ | `POST` | `/reset` | Start episode (`task_id`, `seed`) |
211
+ | `POST` | `/step` | Take action (`action_type`, `params`) |
212
+ | `GET` | `/state` | Internal debug state |
213
+ | `GET` | `/action_space?task_id=...` | Action schema |
214
+ | `GET` | `/observation_space` | Observation schema |
215
+
216
+ ```bash
217
+ # Full Task 3 episode
218
+ curl -X POST localhost:7860/reset \
219
+ -H "Content-Type: application/json" \
220
+ -d '{"task_id":"task3_rule_checker","seed":42}'
221
+
222
+ curl -X POST localhost:7860/step \
223
+ -H "Content-Type: application/json" \
224
+ -d '{"action_type":"get_property_specification","params":{}}'
225
+
226
+ curl -X POST localhost:7860/step \
227
+ -H "Content-Type: application/json" \
228
+ -d '{"action_type":"submit_function","params":{"function_name":"emergencyDrain"}}'
229
+ ```
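The same Task 3 episode can be driven from Python with only the standard library. This is a sketch, assuming the server from the Setup section is running on `localhost:7860` and returns JSON bodies; `build_post` and `run_episode` are hypothetical helper names, and the payloads are the ones used in the curl commands above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed local server address


def build_post(path, payload):
    """Create a JSON POST request for an endpoint without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_episode():
    """Send reset, one query, and the final submission; print each response."""
    requests = [
        build_post("/reset", {"task_id": "task3_rule_checker", "seed": 42}),
        build_post("/step", {"action_type": "get_property_specification", "params": {}}),
        build_post("/step", {"action_type": "submit_function",
                             "params": {"function_name": "emergencyDrain"}}),
    ]
    for request in requests:
        with urllib.request.urlopen(request) as response:
            print(json.loads(response.read()))
```

Call `run_episode()` once the server is up; building the requests separately from sending them keeps the payloads easy to inspect or log.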
230
+
231
+ ---
232
+
233
+ ## Baseline Inference
234
+
235
+ ```bash
236
+ export API_BASE_URL="https://api.openai.com/v1"
237
+ export MODEL_NAME="gpt-4o-mini"
238
+ export HF_TOKEN="sk-..."
239
+ python inference.py
240
+ ```
241
+
242
+ ### Expected scores (gpt-4o-mini, 3 episodes per task)
243
+
244
+ | Task | Avg Grader Score | Notes |
245
+ |------|-----------------|-------|
246
+ | Task 1 | ~0.67 | Good at classic vulns; struggles with subtle ones |
247
+ | Task 2 | ~0.55 | Reasonable properties; misses specific variable names |
248
+ | Task 3 | ~0.78 | Property text gives strong signal; usually correct in 3–4 steps |
249
+
250
+ ---
251
+
252
+ ## Evaluation Summary
253
+
254
+ Deterministic oracle / partial / floor tiers verified on 8 episodes (seeds 42–49):
255
+
256
+ | Task | Oracle | Partial/Sub | Floor | Ordering |
257
+ |------|--------|-------------|-------|----------|
258
+ | Task 1 | **1.000** | 0.500 | 0.000 | ✅ 1.0 > 0.5 > 0.0 |
259
+ | Task 2 | **0.775** | 0.034 | 0.000 | ✅ 0.775 > 0.034 > 0.0 |
260
+ | Task 3 | **1.000** | 0.037 | 0.000 | ✅ 1.0 > 0.037 > 0.0 |
261
+
262
+ The clear separation across all three tasks confirms the graders provide **meaningful gradient signal** across the full reward range — a core requirement for RL training environments.
263
+
264
+ ---
265
+
266
+ ## OpenEnv Spec Compliance
267
+
268
+ | Requirement | Status |
269
+ |-------------|--------|
270
+ | Typed `Observation`, `Action`, `Reward` Pydantic models | ✅ |
271
+ | `step(action) → StepResult(obs, reward, done, info)` | ✅ |
272
+ | `reset() → ResetResult` | ✅ |
273
+ | `state() → StateResult` | ✅ |
274
+ | `openenv.yaml` metadata | ✅ |
275
+ | 3 tasks, all active | ✅ |
276
+ | Grader scores in [0.0, 1.0] | ✅ |
277
+ | Shaped rewards (non-binary signal) | ✅ |
278
+ | Dockerfile + port 7860 | ✅ |
279
+ | `inference.py` with OpenAI client | ✅ |
280
+ | `validate.py` — 23/23 checks pass | ✅ |
281
+
282
+ ---
283
+
284
+ ## Deploying to Hugging Face Spaces
285
+
286
+ ```bash
287
+ # Copy the HF frontmatter into README.md, then:
288
+ git remote add hf https://huggingface.co/spaces/<user>/<space>
289
+ git push hf main
290
+ ```
291
+
292
+ ---
293
+
294
+ ## License
295
+
296
+ MIT. Contract vulnerability patterns adapted from Certora audits on production DeFi protocols.
README.md CHANGED
@@ -1,320 +1,57 @@
1
- # Smart Contract Audit RL Environment
2
-
3
- > **OpenEnv-compliant reinforcement learning environment for smart contract security analysis.**
4
- > Three fully implemented tasks covering the core workflow of a professional Solidity auditor.
5
-
6
- [![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-1.2-blue)](openenv.yaml)
7
- [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)](https://python.org)
8
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
9
-
10
- ---
11
-
12
- ## Motivation
13
-
14
- Smart contract auditing is a $500M+ industry where human experts identify security flaws, write formal properties, and check whether code satisfies those properties. This environment lets agents practise exactly those three tasks using real Solidity contracts from Certora-audited DeFi projects.
15
-
16
- ---
17
-
18
- ## Tasks at a Glance
19
-
20
- | # | Name | Difficulty | Status | One-line description |
21
- |---|------|-----------|--------|---------------------|
22
- | 1 | Targeted Vulnerability Detection | Medium | ✅ Active | Find which function is vulnerable and name the vulnerability |
23
- | 2 | Property Discovery | Hard | ✅ Active | Write the natural-language postcondition for a given function |
24
- | 3 | Rule Checker | Easy | ✅ Active | Identify which function violates a given property |
25
-
26
- ---
27
-
28
- ## Task 1 — Targeted Vulnerability Detection *(Medium)*
29
-
30
- **Setup:** A Solidity contract (4–6 functions) is shown. One function contains a critical vulnerability.
31
-
32
- **Objective:** Name the vulnerable function and describe its vulnerability type in 2–3 words.
33
-
34
- ### Actions
35
-
36
- | Action | Params | Reward |
37
- |--------|--------|--------|
38
- | `list_functions` | — | −0.05 |
39
- | `get_function_code` | `function_name` | +0.05 if target / −0.10 if other |
40
- | `get_function_summary` | `function_name` | +0.03 if target / −0.05 if other |
41
- | `get_file_metadata` | — | −0.04 |
42
- | `get_state_variable` | `variable_name` (opt.) | −0.05 |
43
- | `get_call_graph` | — | −0.08 |
44
- | `submit` | `function_name`, `vulnerability_type` | **+5.0 / +1.0 / −1.5** |
45
-
46
- Repeated queries: **−0.40**
47
-
48
- ### Grader
49
-
50
- - **1.0** → correct function + correct vulnerability keyword → reward **+5.0**
51
- - **0.5** → correct function, vague/wrong vulnerability type → reward **+1.0**
52
- - **0.0** → wrong function → reward **−1.5**
53
-
54
- ### Vulnerability types covered
55
- Reentrancy · Missing access control · Integer overflow · tx.origin authentication ·
56
- Front-running · Timestamp dependence · Denial of service · Unchecked return value
57
-
58
  ---
59
-
60
- ## Task 2 — Property Discovery *(Hard)*
61
-
62
- **Setup:** A single Solidity function is shown. The agent must discover its natural-language correctness property.
63
-
64
- **Objective:** Write a precise 2–4 sentence postcondition describing what the function guarantees on success.
65
-
66
- ### Actions
67
-
68
- | Action | Params | Reward |
69
- |--------|--------|--------|
70
- | `get_function_code` | — | −0.06 |
71
- | `get_function_natspec` | — | −0.08 |
72
- | `get_file_natspec` | — | −0.03 |
73
- | `get_related_functions` | | −0.06 |
74
- | `get_io` | — | −0.04 |
75
- | `get_similar_rule` | — | −0.20 |
76
- | `submit_property` | `property` (string) | **0.0–5.0** scored, ONE attempt |
77
-
78
- ### Grader (keyword-weighted)
79
-
80
- ```
81
- score = 0.70 × (key_phrases_matched / total_key)
82
- + 0.30 × (bonus_phrases_matched / total_bonus)
83
- reward = score × 5.0
84
- ```
85
-
86
- Matching uses **word-set containment + synonym expansion** — words don't need to be adjacent.
87
-
88
- ---
89
-
90
- ## Task 3 — Rule Checker *(Easy)*
91
-
92
- **Setup:** A Solidity contract is shown alongside a violated property in natural English. One function breaks that property.
93
-
94
- **Objective:** Identify which function violates the property.
95
-
96
- ### Actions
97
-
98
- | Action | Params | Reward |
99
- |--------|--------|--------|
100
- | `list_functions` | — | −0.05 |
101
- | `get_function_metadata` | `function_name` | −0.05 |
102
- | `get_function_code` | `function_name` | −0.10 |
103
- | `get_state_variable` | `variable_name` (opt.) | −0.05 |
104
- | `get_call_graph` | — | −0.08 |
105
- | `get_property_specification` | — | **−0.03** (cheapest — read this first!) |
106
- | `submit_function` | `function_name` | **+5.0 / +1.5 / −1.5**, ONE attempt |
107
-
108
- ### Grader (three-tier deterministic)
109
-
110
- - **1.0** → exact target function (case-insensitive) → reward **+5.0**
111
- - **0.3** → a direct internal subfunction of the target → reward **+1.5**
112
- - **0.0** → anything else → reward **−1.5**
113
-
114
- `get_property_specification` returns the precise pre/post-condition (`rule_broken_specs`). Reading it costs only −0.03 and usually provides enough information to identify the violating function without inspecting all code.
115
-
116
- ---
117
-
118
- ## Observation Space
119
-
120
- All tasks share the same `Observation` structure:
121
-
122
- ```json
123
- {
124
- "task_id": "task3_rule_checker",
125
- "contract_name": "SimpleVault",
126
- "contract_description": "An ETH vault that allows users to deposit...",
127
- "available_actions": ["list_functions", "get_function_metadata", "..."],
128
- "last_action": "get_property_specification",
129
- "last_action_result": "Formal property:\nPre: caller != owner...",
130
- "step_count": 1,
131
- "cumulative_reward": -0.03,
132
- "done": false,
133
- "extra": {
134
- "property_english": "Only the owner should be able to drain the vault...",
135
- "solidity_version": "0.8.0",
136
- "hint": "Find the function that violates this property..."
137
- }
138
- }
139
- ```
140
-
141
- For Task 2, `extra` contains `target_function` and `target_signature`.
142
- For Task 3, `extra` contains `property_english`.
143
-
144
  ---
145
 
146
- ## Project Structure
147
-
148
- ```
149
- smart-contract-env/
150
- ├── data/
151
- │ ├── contracts.json # 4 contracts, 8 vulns, 11 properties, 8 rule episodes
152
- │ └── data_loader.py # loaders for all three tasks
153
- ├── env/
154
- │ ├── base_env.py # Abstract OpenEnv base class
155
- │ └── schemas.py # Typed Pydantic models (all ActionTypes)
156
- ├── tasks/
157
- │ ├── task1/
158
- │ │ ├── environment.py # Vulnerability detection environment
159
- │ │ └── grader.py # Longest-match keyword grader (0/0.5/1.0)
160
- │ ├── task2/
161
- │ │ ├── environment.py # Property discovery (one submit_property)
162
- │ │ └── grader.py # Word-set + synonym grader (0.0–1.0)
163
- │ └── task3/
164
- │ ├── environment.py # Rule checker (one submit_function)
165
- │ └── grader.py # Three-tier grader (1.0/0.3/0.0)
166
- ├── app.py # FastAPI — all OpenEnv HTTP endpoints
167
- ├── inference.py # Baseline LLM agent (all 3 tasks)
168
- ├── eval.py # Oracle/partial/floor evaluation harness
169
- ├── demo.py # Colourised scripted demos for all 3 tasks
170
- ├── validate.py # 23-check pre-submission validator
171
- ├── openenv.yaml # Full OpenEnv spec metadata
172
- ├── Dockerfile # Port 7860, healthcheck
173
- └── requirements.txt
174
- ```
175
-
176
- ---
177
-
178
- ## Setup
179
-
180
- ### Local Python
181
-
182
- ```bash
183
- pip install -r requirements.txt
184
-
185
- # Start the server
186
- python app.py # → http://localhost:7860
187
-
188
- # Interactive / scripted demos
189
- python demo.py --auto # Task 1 scripted demo
190
- python demo.py --auto --seed 42 # Task 2 (same flag, different env seed)
191
-
192
- # Full evaluation harness (no LLM required)
193
- python eval.py # All 3 tasks, 8 episodes each
194
- python eval.py --task 3 # Task 3 only
195
- python eval.py --episodes 16 --verbose
196
-
197
- # Pre-submission validation
198
- python validate.py # 23/23 checks
199
- ```
200
-
201
- ### Docker
202
-
203
- ```bash
204
- docker build -t sc-audit-env .
205
- docker run -p 7860:7860 sc-audit-env
206
- ```
207
-
208
- ### Direct Python API
209
-
210
- ```python
211
- # Task 3 example
212
- from tasks.task3.environment import Task3Environment
213
- from env.schemas import Action, ActionType
214
-
215
- env = Task3Environment()
216
- r = env.reset(seed=42)
217
- print(r.observation.extra["property_english"])
218
- # "Only the owner should be able to drain the vault..."
219
-
220
- s = env.step(Action(action_type=ActionType.GET_PROPERTY_SPECIFICATION))
221
- s = env.step(Action(action_type=ActionType.SUBMIT_FUNCTION,
222
- params={"function_name": "emergencyDrain"}))
223
- print(s.reward.value) # +5.0
224
- ```
225
 
226
- ---
227
 
228
- ## HTTP API
229
 
230
- | Method | Endpoint | Description |
231
- |--------|----------|-------------|
232
- | `GET` | `/health` | Liveness probe |
233
- | `GET` | `/tasks` | All tasks + status |
234
- | `POST` | `/reset` | Start episode (`task_id`, `seed`) |
235
- | `POST` | `/step` | Take action (`action_type`, `params`) |
236
- | `GET` | `/state` | Internal debug state |
237
- | `GET` | `/action_space?task_id=...` | Action schema |
238
- | `GET` | `/observation_space` | Observation schema |
239
 
240
  ```bash
241
- # Full Task 3 episode
242
- curl -X POST localhost:7860/reset \
243
  -H "Content-Type: application/json" \
244
- -d '{"task_id":"task3_rule_checker","seed":42}'
245
 
246
- curl -X POST localhost:7860/step \
247
  -H "Content-Type: application/json" \
248
- -d '{"action_type":"get_property_specification","params":{}}'
249
 
250
- curl -X POST localhost:7860/step \
251
  -H "Content-Type: application/json" \
252
- -d '{"action_type":"submit_function","params":{"function_name":"emergencyDrain"}}'
253
  ```
254
 
255
- ---
256
-
257
- ## Baseline Inference
258
-
259
- ```bash
260
- export API_BASE_URL="https://api.openai.com/v1"
261
- export MODEL_NAME="gpt-4o-mini"
262
- export HF_TOKEN="sk-..."
263
- python inference.py
264
- ```
265
-
266
- ### Expected scores (gpt-4o-mini, 3 episodes per task)
267
-
268
- | Task | Avg Grader Score | Notes |
269
- |------|-----------------|-------|
270
- | Task 1 | ~0.67 | Good at classic vulns; struggles with subtle ones |
271
- | Task 2 | ~0.55 | Reasonable properties; misses specific variable names |
272
- | Task 3 | ~0.78 | Property text gives strong signal; usually correct in 3–4 steps |
273
-
274
- ---
275
-
276
- ## Evaluation Summary
277
-
278
- Deterministic oracle / partial / floor tiers verified on 8 episodes (seeds 42–49):
279
-
280
- | Task | Oracle | Partial/Sub | Floor | Ordering |
281
- |------|--------|-------------|-------|----------|
282
- | Task 1 | **1.000** | 0.500 | 0.000 | ✅ 1.0 > 0.5 > 0.0 |
283
- | Task 2 | **0.775** | 0.034 | 0.000 | ✅ 0.775 > 0.034 > 0.0 |
284
- | Task 3 | **1.000** | 0.037 | 0.000 | ✅ 1.0 > 0.037 > 0.0 |
285
-
286
- The clear separation across all three tasks confirms the graders provide **meaningful gradient signal** across the full reward range — a core requirement for RL training environments.
287
-
288
- ---
289
-
290
- ## OpenEnv Spec Compliance
291
-
292
- | Requirement | Status |
293
- |-------------|--------|
294
- | Typed `Observation`, `Action`, `Reward` Pydantic models | ✅ |
295
- | `step(action) → StepResult(obs, reward, done, info)` | ✅ |
296
- | `reset() → ResetResult` | ✅ |
297
- | `state() → StateResult` | ✅ |
298
- | `openenv.yaml` metadata | ✅ |
299
- | 3 tasks, all active | ✅ |
300
- | Grader scores in [0.0, 1.0] | ✅ |
301
- | Shaped rewards (non-binary signal) | ✅ |
302
- | Dockerfile + port 7860 | ✅ |
303
- | `inference.py` with OpenAI client | ✅ |
304
- | `validate.py` — 23/23 checks pass | ✅ |
305
-
306
- ---
307
-
308
- ## Deploying to Hugging Face Spaces
309
-
310
- ```bash
311
- # Copy the HF frontmatter into README.md, then:
312
- git remote add hf https://huggingface.co/spaces/<user>/<space>
313
- git push hf main
314
- ```
315
-
316
- ---
317
 
318
- ## License
319
 
320
- MIT. Contract vulnerability patterns adapted from Certora audits on production DeFi protocols.
1
  ---
2
+ title: Smart Contract Audit RL Environment
3
+ emoji: 🔍
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: docker
7
+ app_port: 7860
8
+ tags:
9
+ - openenv
10
+ - reinforcement-learning
11
+ - smart-contracts
12
+ - solidity
13
+ - security
14
+ - evaluation
15
+ license: mit
16
+ short_description: OpenEnv RL environment for smart contract security auditing
17
  ---
18
 
19
+ # Smart Contract Audit RL Environment
20
 
21
+ > OpenEnv-compliant RL environment for Solidity security analysis.
22
 
23
+ This Space exposes the full OpenEnv HTTP interface for **Task 1: Targeted Vulnerability Detection**.
24
+ Agents explore Solidity contracts using a structured action API and identify vulnerable functions.
25
 
26
+ ## Quick start
27
 
28
  ```bash
29
+ # Reset: start a new episode
30
+ curl -X POST $SPACE_URL/reset \
31
  -H "Content-Type: application/json" \
32
+ -d '{"task_id": "task1_vuln_detection", "seed": 42}'
33
 
34
+ # Step: list contract functions
35
+ curl -X POST $SPACE_URL/step \
36
  -H "Content-Type: application/json" \
37
+ -d '{"action_type": "list_functions", "params": {}}'
38
 
39
+ # Submit answer
40
+ curl -X POST $SPACE_URL/step \
41
  -H "Content-Type: application/json" \
42
+ -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
43
  ```
44
 
45
+ ## Endpoints
46
 
47
+ | Method | Path | Description |
48
+ |--------|------|-------------|
49
+ | GET | `/health` | Liveness probe |
50
+ | GET | `/tasks` | All tasks + status |
51
+ | POST | `/reset` | New episode |
52
+ | POST | `/step` | Take action |
53
+ | GET | `/state` | Debug state |
54
+ | GET | `/action_space` | Action schema |
55
+ | GET | `/observation_space` | Observation schema |
56
 
57
+ See the full [README](README.md) for detailed documentation.
SPACES_README.md DELETED
@@ -1,57 +0,0 @@
1
- ---
2
- title: Smart Contract Audit RL Environment
3
- emoji: 🔍
4
- colorFrom: blue
5
- colorTo: indigo
6
- sdk: docker
7
- app_port: 7860
8
- tags:
9
- - openenv
10
- - reinforcement-learning
11
- - smart-contracts
12
- - solidity
13
- - security
14
- - evaluation
15
- license: mit
16
- short_description: OpenEnv RL environment for smart contract security auditing
17
- ---
18
-
19
- # Smart Contract Audit RL Environment
20
-
21
- > OpenEnv-compliant RL environment for Solidity security analysis.
22
-
23
- This Space exposes the full OpenEnv HTTP interface for **Task 1: Targeted Vulnerability Detection**.
24
- Agents explore Solidity contracts using a structured action API and identify vulnerable functions.
25
-
26
- ## Quick start
27
-
28
- ```bash
29
- # Reset — start a new episode
30
- curl -X POST $SPACE_URL/reset \
31
- -H "Content-Type: application/json" \
32
- -d '{"task_id": "task1_vuln_detection", "seed": 42}'
33
-
34
- # Step — list contract functions
35
- curl -X POST $SPACE_URL/step \
36
- -H "Content-Type: application/json" \
37
- -d '{"action_type": "list_functions", "params": {}}'
38
-
39
- # Submit answer
40
- curl -X POST $SPACE_URL/step \
41
- -H "Content-Type: application/json" \
42
- -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
43
- ```
44
-
45
- ## Endpoints
46
-
47
- | Method | Path | Description |
48
- |--------|------|-------------|
49
- | GET | `/health` | Liveness probe |
50
- | GET | `/tasks` | All tasks + status |
51
- | POST | `/reset` | New episode |
52
- | POST | `/step` | Take action |
53
- | GET | `/state` | Debug state |
54
- | GET | `/action_space` | Action schema |
55
- | GET | `/observation_space` | Observation schema |
56
-
57
- See the full [README](README.md) for detailed documentation.