DataBoySu commited on
Commit Β·
acfb96b
1
Parent(s): dfd1faa
readme
Browse files
LICENSE
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
BSD 3-Clause License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2024-present, OpenEnv Contributors
|
| 4 |
+
All rights reserved.
|
| 5 |
+
|
| 6 |
+
Redistribution and use in source and binary forms, with or without
|
| 7 |
+
modification, are permitted provided that the following conditions are met:
|
| 8 |
+
|
| 9 |
+
1. Redistributions of source code must retain the above copyright notice, this
|
| 10 |
+
list of conditions and the following disclaimer.
|
| 11 |
+
|
| 12 |
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
| 13 |
+
this list of conditions and the following disclaimer in the documentation
|
| 14 |
+
and/or other materials provided with the distribution.
|
| 15 |
+
|
| 16 |
+
3. Neither the name of the copyright holder nor the names of its
|
| 17 |
+
contributors may be used to endorse or promote products derived from
|
| 18 |
+
this software without specific prior written permission.
|
| 19 |
+
|
| 20 |
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
| 21 |
+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
| 22 |
+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
| 23 |
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
| 24 |
+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
| 25 |
+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
| 26 |
+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
| 27 |
+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
| 28 |
+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
| 29 |
+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
README.md
CHANGED
|
@@ -11,7 +11,7 @@ tags:
|
|
| 11 |
|
| 12 |
<div align="center">
|
| 13 |
|
| 14 |
-
# π΅οΈ AML Investigator OpenEnv RL Environment
|
| 15 |
|
| 16 |
**A financial crime investigation environment for training and evaluating LLM agents**
|
| 17 |
|
|
@@ -81,11 +81,11 @@ Every investigation runs as a sequence of steps between agent and environment. T
|
|
| 81 |
|
| 82 |
```mermaid
|
| 83 |
sequenceDiagram
|
| 84 |
-
participant A as Agent
|
| 85 |
-
participant E as Environment
|
| 86 |
-
participant D as Data Layer
|
| 87 |
|
| 88 |
-
E-->>A: reset()
|
| 89 |
|
| 90 |
loop Until submit_decision or budget=0
|
| 91 |
A->>E: step(AmlAction)
|
|
@@ -96,7 +96,7 @@ sequenceDiagram
|
|
| 96 |
|
| 97 |
A->>E: step(submit_decision, evidence=[...])
|
| 98 |
E->>E: Run Grader
|
| 99 |
-
E-->>A: AmlObservation<br/>(done=True, reward=0.0
|
| 100 |
```
|
| 101 |
|
| 102 |
---
|
|
@@ -112,7 +112,7 @@ The agent communicates exclusively through **typed Pydantic actions**. No regex
|
|
| 112 |
| `get_kyc_record` | `entity_id` | Retrieve address, entity type, and corporate directors. |
|
| 113 |
| `submit_decision` | `decision: FRAUD\|CLEAR`, `evidence_links: List[str]` | Terminal action. Ends the episode and triggers the grader. |
|
| 114 |
|
| 115 |
-
> **Why Pydantic?** The LLM is the router. Strict schemas with `Field(description="...")` mean the model reads the tool contract, not a prompt full of prose instructions. Malformed output is caught at validation, not execution
|
| 116 |
|
| 117 |
---
|
| 118 |
|
|
@@ -147,7 +147,7 @@ The trap is the jurisdiction flag. A naive model panics and submits `FRAUD`. A w
|
|
| 147 |
|
| 148 |
```mermaid
|
| 149 |
flowchart LR
|
| 150 |
-
A([Alert:<br/>ACC-101
|
| 151 |
|
| 152 |
subgraph Investigation
|
| 153 |
B[query_transactions<br/>ACC-101] --> C{Memo:<br/>'Heavy Machinery<br/>Purchase - Unit 4'}
|
|
@@ -157,7 +157,7 @@ flowchart LR
|
|
| 157 |
F --> G{50 inbound payments<br/>from global firms}
|
| 158 |
end
|
| 159 |
|
| 160 |
-
G --> H([submit_decision<br/>CLEAR])
|
| 161 |
|
| 162 |
style A fill:#ef4444,color:#fff
|
| 163 |
style H fill:#22c55e,color:#fff
|
|
@@ -175,15 +175,15 @@ The agent must paginate through hundreds of normal car-sale transactions to surf
|
|
| 175 |
|
| 176 |
```mermaid
|
| 177 |
flowchart TD
|
| 178 |
-
A([Alert:<br/>ACC-200 deposit velocity spike]) --> B
|
| 179 |
|
| 180 |
-
subgraph Investigation["Paginate
|
| 181 |
B[query_transactions<br/>ACC-200<br/>offset 0, 10, 20...] --> C{14 deposits<br/>$9,900 and $9,500<br/>below $10k threshold}
|
| 182 |
C --> D[get_kyc_record<br/>ACC-301, ACC-302, ACC-303]
|
| 183 |
D --> E{All 3 accounts:<br/>Opened same day<br/>Occupation: Student}
|
| 184 |
end
|
| 185 |
|
| 186 |
-
E --> F([submit_decision<br/>FRAUD<br/>evidence: ACC-301, ACC-302, ACC-303])
|
| 187 |
|
| 188 |
style A fill:#f97316,color:#fff
|
| 189 |
style F fill:#dc2626,color:#fff
|
|
@@ -203,28 +203,29 @@ This is the full haystack. `ACC-500` has 500+ transactions. `ACC-700` has hundre
|
|
| 203 |
|
| 204 |
```mermaid
|
| 205 |
flowchart TD
|
| 206 |
-
A([Alert:<br/>ACC-500
|
| 207 |
|
| 208 |
-
subgraph Trap["The Bait
|
| 209 |
-
X["$100 transfer<br/>to Watchlist Target"]
|
| 210 |
end
|
| 211 |
|
| 212 |
subgraph Investigation["The Real Loop"]
|
| 213 |
B --> C["search_transactions<br/>ACC-700<br/>keyword: 'consulting'"]
|
| 214 |
-
C --> D{48hrs later:<br/>ACC-700
|
| 215 |
D --> E[get_kyc_record<br/>ACC-888]
|
| 216 |
E --> F{Director:<br/>Robert House}
|
| 217 |
F --> G[get_kyc_record<br/>ACC-500]
|
| 218 |
G --> H{Director:<br/>Apex Management Corp}
|
| 219 |
H --> I[get_kyc_record<br/>Apex Management Corp]
|
| 220 |
-
I --> J{CEO:<br/>Robert House same person}
|
| 221 |
end
|
| 222 |
|
| 223 |
A -.->|naive agent wastes budget| X
|
| 224 |
-
J --> K([submit_decision<br/>FRAUD<br/>evidence: ACC-500, ACC-700, ACC-888])
|
| 225 |
|
| 226 |
style A fill:#ef4444,color:#fff
|
| 227 |
style X fill:#6b7280,color:#fff,stroke-dasharray: 5 5
|
|
|
|
| 228 |
style K fill:#dc2626,color:#fff
|
| 229 |
style J fill:#fbbf24,color:#000
|
| 230 |
```
|
|
@@ -277,6 +278,88 @@ Fraud scenarios are injected with camouflage: 5β10 "normal" bridging transacti
|
|
| 277 |
|
| 278 |
---
|
| 279 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 280 |
## Core Engineering Principles
|
| 281 |
|
| 282 |
These principles govern how the environment is designed and why each decision was made.
|
|
|
|
| 11 |
|
| 12 |
<div align="center">
|
| 13 |
|
| 14 |
+
# π΅οΈ AML Investigator β OpenEnv RL Environment
|
| 15 |
|
| 16 |
**A financial crime investigation environment for training and evaluating LLM agents**
|
| 17 |
|
|
|
|
| 81 |
|
| 82 |
```mermaid
|
| 83 |
sequenceDiagram
|
| 84 |
+
participant A as π€ Agent
|
| 85 |
+
participant E as βοΈ Environment
|
| 86 |
+
participant D as ποΈ Data Layer
|
| 87 |
|
| 88 |
+
E-->>A: reset() β AmlObservation<br/>(alert_details, budget=N)
|
| 89 |
|
| 90 |
loop Until submit_decision or budget=0
|
| 91 |
A->>E: step(AmlAction)
|
|
|
|
| 96 |
|
| 97 |
A->>E: step(submit_decision, evidence=[...])
|
| 98 |
E->>E: Run Grader
|
| 99 |
+
E-->>A: AmlObservation<br/>(done=True, reward=0.0β1.0)
|
| 100 |
```
|
| 101 |
|
| 102 |
---
|
|
|
|
| 112 |
| `get_kyc_record` | `entity_id` | Retrieve address, entity type, and corporate directors. |
|
| 113 |
| `submit_decision` | `decision: FRAUD\|CLEAR`, `evidence_links: List[str]` | Terminal action. Ends the episode and triggers the grader. |
|
| 114 |
|
| 115 |
+
> **Why Pydantic?** The LLM is the router. Strict schemas with `Field(description="...")` mean the model reads the tool contract, not a prompt full of prose instructions. Malformed output is caught at validation, not execution β preventing silent failures and hallucinated account IDs from crashing the environment.
|
| 116 |
|
| 117 |
---
|
| 118 |
|
|
|
|
| 147 |
|
| 148 |
```mermaid
|
| 149 |
flowchart LR
|
| 150 |
+
A([π¨ Alert:<br/>ACC-101 β ACC-909<br/>$50,000]) --> B
|
| 151 |
|
| 152 |
subgraph Investigation
|
| 153 |
B[query_transactions<br/>ACC-101] --> C{Memo:<br/>'Heavy Machinery<br/>Purchase - Unit 4'}
|
|
|
|
| 157 |
F --> G{50 inbound payments<br/>from global firms}
|
| 158 |
end
|
| 159 |
|
| 160 |
+
G --> H([β
submit_decision<br/>CLEAR])
|
| 161 |
|
| 162 |
style A fill:#ef4444,color:#fff
|
| 163 |
style H fill:#22c55e,color:#fff
|
|
|
|
| 175 |
|
| 176 |
```mermaid
|
| 177 |
flowchart TD
|
| 178 |
+
A([π¨ Alert:<br/>ACC-200 deposit velocity spike]) --> B
|
| 179 |
|
| 180 |
+
subgraph Investigation["Paginate β Spot β Cross-Reference"]
|
| 181 |
B[query_transactions<br/>ACC-200<br/>offset 0, 10, 20...] --> C{14 deposits<br/>$9,900 and $9,500<br/>below $10k threshold}
|
| 182 |
C --> D[get_kyc_record<br/>ACC-301, ACC-302, ACC-303]
|
| 183 |
D --> E{All 3 accounts:<br/>Opened same day<br/>Occupation: Student}
|
| 184 |
end
|
| 185 |
|
| 186 |
+
E --> F([π¨ submit_decision<br/>FRAUD<br/>evidence: ACC-301, ACC-302, ACC-303])
|
| 187 |
|
| 188 |
style A fill:#f97316,color:#fff
|
| 189 |
style F fill:#dc2626,color:#fff
|
|
|
|
| 203 |
|
| 204 |
```mermaid
|
| 205 |
flowchart TD
|
| 206 |
+
A([π¨ Alert:<br/>ACC-500 β ACC-700<br/>$2.5M]) --> B
|
| 207 |
|
| 208 |
+
subgraph Trap["β The Bait β Don't Take It"]
|
| 209 |
+
X["$100 transfer<br/>to 'Watchlist Target'"]
|
| 210 |
end
|
| 211 |
|
| 212 |
subgraph Investigation["The Real Loop"]
|
| 213 |
B --> C["search_transactions<br/>ACC-700<br/>keyword: 'consulting'"]
|
| 214 |
+
C --> D{48hrs later:<br/>ACC-700 β ACC-888<br/>$2.4M offshore}
|
| 215 |
D --> E[get_kyc_record<br/>ACC-888]
|
| 216 |
E --> F{Director:<br/>Robert House}
|
| 217 |
F --> G[get_kyc_record<br/>ACC-500]
|
| 218 |
G --> H{Director:<br/>Apex Management Corp}
|
| 219 |
H --> I[get_kyc_record<br/>Apex Management Corp]
|
| 220 |
+
I --> J{CEO:<br/>Robert House β same person}
|
| 221 |
end
|
| 222 |
|
| 223 |
A -.->|naive agent wastes budget| X
|
| 224 |
+
J --> K([π¨ submit_decision<br/>FRAUD<br/>evidence: ACC-500, ACC-700, ACC-888])
|
| 225 |
|
| 226 |
style A fill:#ef4444,color:#fff
|
| 227 |
style X fill:#6b7280,color:#fff,stroke-dasharray: 5 5
|
| 228 |
+
style Trap fill:#1f2937,color:#9ca3af
|
| 229 |
style K fill:#dc2626,color:#fff
|
| 230 |
style J fill:#fbbf24,color:#000
|
| 231 |
```
|
|
|
|
| 278 |
|
| 279 |
---
|
| 280 |
|
| 281 |
+
## Baseline Results
|
| 282 |
+
|
| 283 |
+
> **Model:** `openai/gpt-oss-20b` Β· **CoT:** enabled Β· **Run:** single pass, no fine-tuning
|
| 284 |
+
|
| 285 |
+
| Task | Steps Used | Budget | Grader Score | Net Reward | Verdict | Result |
|
| 286 |
+
|---|---|---|---|---|---|---|
|
| 287 |
+
| `aml_easy` | 3 / 5 | 2 remaining | 0.75 | **+0.69** | `CLEAR` β | β
Pass |
|
| 288 |
+
| `aml_medium` | 6 / 12 | 6 remaining | 0.75 | **+0.63** | `FRAUD` β | β
Pass |
|
| 289 |
+
| `aml_hard` | 16 / 20 | 0 remaining | 0.00 | **β0.32** | none | β Fail |
|
| 290 |
+
|
| 291 |
+
Net reward = grader score β (steps Γ 0.02)
|
| 292 |
+
|
| 293 |
+
### Per-Task Analysis
|
| 294 |
+
|
| 295 |
+
**`aml_easy` β Pass (0.75 / 1.0)**
|
| 296 |
+
|
| 297 |
+
The agent navigated the task in the minimum viable number of steps: one transaction query, one KYC lookup, then `CLEAR`. It correctly ignored the high-risk jurisdiction flag after reading the memo. The score stopped at `0.75` rather than `1.0` because `evidence_links` was submitted empty β the grader expects at least the cleared account ID as documented evidence of the reasoning chain.
|
| 298 |
+
|
| 299 |
+
```
|
| 300 |
+
[STEP] query_transactions ACC-9001
|
| 301 |
+
[STEP] get_kyc_record ENT-9001
|
| 302 |
+
[STEP] submit_decision CLEAR evidence=[] β missing evidence β capped at 0.75
|
| 303 |
+
```
|
| 304 |
+
|
| 305 |
+
**`aml_medium` β Pass (0.75 / 1.0)**
|
| 306 |
+
|
| 307 |
+
The agent identified structuring activity and correctly returned a `FRAUD` verdict, but submitted only one of the three smurf accounts (`ACC-9010`) in evidence. The grader applies partial credit proportional to smurf accounts found β `1/3` identified yields `0.75`. The agent also issued a `search_transactions` call with keyword `"Invoice"` which was not relevant to the structuring pattern, suggesting mild reasoning noise before it converged on the correct account.
|
| 308 |
+
|
| 309 |
+
```
|
| 310 |
+
[STEP] query_transactions ACC-9010 (offset 0)
|
| 311 |
+
[STEP] query_transactions ACC-9011 (offset 0)
|
| 312 |
+
[STEP] get_kyc_record ENT-9010
|
| 313 |
+
[STEP] search_transactions ACC-9010 keyword="Invoice" β off-path call
|
| 314 |
+
[STEP] get_kyc_record ENT-0159
|
| 315 |
+
[STEP] submit_decision FRAUD evidence=["ACC-9010"] β found 1/3 smurfs β 0.75
|
| 316 |
+
```
|
| 317 |
+
|
| 318 |
+
**`aml_hard` β Fail (0.00)**
|
| 319 |
+
|
| 320 |
+
The model completed two valid steps (paginating `ACC-9021` at offset 0 and 10), then entered a catastrophic failure loop. From step 3 onward, the model produced empty or non-JSON output on every turn, triggering the recovery action, which defaulted to `query_transactions(ACC-9021, offset=0)` β the same call, 14 times in a row. The budget was exhausted without a `submit_decision` ever being issued.
|
| 321 |
+
|
| 322 |
+
```
|
| 323 |
+
[STEP] query_transactions ACC-9021 offset=0 β valid
|
| 324 |
+
[STEP] query_transactions ACC-9021 offset=10 β valid
|
| 325 |
+
[DEBUG] Non-JSON/invalid model action Γ 14 β context collapse
|
| 326 |
+
[END] score=0.00 budget exhausted
|
| 327 |
+
```
|
| 328 |
+
|
| 329 |
+
The root cause is context window pressure. By step 2, the sliding window already contained two large paginated transaction payloads.
|
| 330 |
+
|
| 331 |
+
### Failure Mode Summary
|
| 332 |
+
|
| 333 |
+
```mermaid
|
| 334 |
+
flowchart LR
|
| 335 |
+
A[Step 2: Two large<br/>transaction payloads<br/>in context] --> B[Model outputs<br/>prose instead of JSON]
|
| 336 |
+
B --> C[Recovery action:<br/>query_transactions<br/>offset=0]
|
| 337 |
+
C --> D[Same large payload<br/>re-injected into context]
|
| 338 |
+
D --> B
|
| 339 |
+
D --> E{Budget = 0}
|
| 340 |
+
E --> F([score = 0.00])
|
| 341 |
+
|
| 342 |
+
style B fill:#ef4444,color:#fff
|
| 343 |
+
style F fill:#7f1d1d,color:#fff
|
| 344 |
+
```
|
| 345 |
+
|
| 346 |
+
### What This Tells Us
|
| 347 |
+
|
| 348 |
+
The tasks are correctly difficulty-stratified.
|
| 349 |
+
The easy and medium tasks are solvable by an instruction-following model with chain-of-thought, but not perfectly β both runs left score on the table due to incomplete evidence submission.
|
| 350 |
+
The hard task exposes a genuine capability gap: multi-hop KYC cross-referencing under token pressure requires either a larger model, a tighter context compaction strategy, or both.
|
| 351 |
+
|
| 352 |
+
The `[DEBUG] Non-JSON/invalid model action` recovery path is functioning as designed β the environment did not crash, and each recovery action was logged and penalized correctly.
|
| 353 |
+
|
| 354 |
+
| Failure Mode | Observed In | Environment Response |
|
| 355 |
+
|---|---|---|
|
| 356 |
+
| Empty `evidence_links` on correct verdict | Easy, Medium | Grader caps score; no crash |
|
| 357 |
+
| Off-path tool calls | Medium | Step penalty applied; agent self-corrects |
|
| 358 |
+
| Context collapse β non-JSON output | Hard | Recovery action fired; logged as `[DEBUG]` |
|
| 359 |
+
| Recovery loop exhausts budget | Hard | Episode terminates cleanly; score `0.00` |
|
| 360 |
+
|
| 361 |
+
---
|
| 362 |
+
|
| 363 |
## Core Engineering Principles
|
| 364 |
|
| 365 |
These principles govern how the environment is designed and why each decision was made.
|