DataBoySu committed on
Commit
acfb96b
·
1 Parent(s): dfd1faa
Files changed (2)
  1. LICENSE +29 -0
  2. README.md +101 -18
LICENSE ADDED
@@ -0,0 +1,29 @@
1
+ BSD 3-Clause License
2
+
3
+ Copyright (c) 2024-present, OpenEnv Contributors
4
+ All rights reserved.
5
+
6
+ Redistribution and use in source and binary forms, with or without
7
+ modification, are permitted provided that the following conditions are met:
8
+
9
+ 1. Redistributions of source code must retain the above copyright notice, this
10
+ list of conditions and the following disclaimer.
11
+
12
+ 2. Redistributions in binary form must reproduce the above copyright notice,
13
+ this list of conditions and the following disclaimer in the documentation
14
+ and/or other materials provided with the distribution.
15
+
16
+ 3. Neither the name of the copyright holder nor the names of its
17
+ contributors may be used to endorse or promote products derived from
18
+ this software without specific prior written permission.
19
+
20
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
11
 
12
  <div align="center">
13
 
14
- # 🕵️ AML Investigator OpenEnv RL Environment
15
 
16
  **A financial crime investigation environment for training and evaluating LLM agents**
17
 
@@ -81,11 +81,11 @@ Every investigation runs as a sequence of steps between agent and environment. T
81
 
82
  ```mermaid
83
  sequenceDiagram
84
- participant A as Agent
85
- participant E as Environment
86
- participant D as Data Layer
87
 
88
- E-->>A: reset() -> AmlObservation<br/>(alert_details, budget=N)
89
 
90
  loop Until submit_decision or budget=0
91
  A->>E: step(AmlAction)
@@ -96,7 +96,7 @@ sequenceDiagram
96
 
97
  A->>E: step(submit_decision, evidence=[...])
98
  E->>E: Run Grader
99
- E-->>A: AmlObservation<br/>(done=True, reward=0.0-1.0)
100
  ```
101
 
102
  ---
@@ -112,7 +112,7 @@ The agent communicates exclusively through **typed Pydantic actions**. No regex
112
  | `get_kyc_record` | `entity_id` | Retrieve address, entity type, and corporate directors. |
113
  | `submit_decision` | `decision: FRAUD\|CLEAR`, `evidence_links: List[str]` | Terminal action. Ends the episode and triggers the grader. |
114
 
115
- > **Why Pydantic?** The LLM is the router. Strict schemas with `Field(description="...")` mean the model reads the tool contract, not a prompt full of prose instructions. Malformed output is caught at validation, not execution, preventing silent failures and hallucinated account IDs from crashing the environment.
116
 
117
  ---
118
 
@@ -147,7 +147,7 @@ The trap is the jurisdiction flag. A naive model panics and submits `FRAUD`. A w
147
 
148
  ```mermaid
149
  flowchart LR
150
- A([Alert:<br/>ACC-101 to ACC-909<br/>$50,000]) --> B
151
 
152
  subgraph Investigation
153
  B[query_transactions<br/>ACC-101] --> C{Memo:<br/>'Heavy Machinery<br/>Purchase - Unit 4'}
@@ -157,7 +157,7 @@ flowchart LR
157
  F --> G{50 inbound payments<br/>from global firms}
158
  end
159
 
160
- G --> H([submit_decision<br/>CLEAR])
161
 
162
  style A fill:#ef4444,color:#fff
163
  style H fill:#22c55e,color:#fff
@@ -175,15 +175,15 @@ The agent must paginate through hundreds of normal car-sale transactions to surf
175
 
176
  ```mermaid
177
  flowchart TD
178
- A([Alert:<br/>ACC-200 deposit velocity spike]) --> B
179
 
180
- subgraph Investigation["Paginate -> Spot -> Cross-Reference"]
181
  B[query_transactions<br/>ACC-200<br/>offset 0, 10, 20...] --> C{14 deposits<br/>$9,900 and $9,500<br/>below $10k threshold}
182
  C --> D[get_kyc_record<br/>ACC-301, ACC-302, ACC-303]
183
  D --> E{All 3 accounts:<br/>Opened same day<br/>Occupation: Student}
184
  end
185
 
186
- E --> F([submit_decision<br/>FRAUD<br/>evidence: ACC-301, ACC-302, ACC-303])
187
 
188
  style A fill:#f97316,color:#fff
189
  style F fill:#dc2626,color:#fff
@@ -203,28 +203,29 @@ This is the full haystack. `ACC-500` has 500+ transactions. `ACC-700` has hundre
203
 
204
  ```mermaid
205
  flowchart TD
206
- A([Alert:<br/>ACC-500 to ACC-700<br/>$2.5M]) --> B
207
 
208
- subgraph Trap["The Bait - Do Not Take It"]
209
- X["$100 transfer<br/>to Watchlist Target"]
210
  end
211
 
212
  subgraph Investigation["The Real Loop"]
213
  B --> C["search_transactions<br/>ACC-700<br/>keyword: 'consulting'"]
214
- C --> D{48hrs later:<br/>ACC-700 to ACC-888<br/>$2.4M offshore}
215
  D --> E[get_kyc_record<br/>ACC-888]
216
  E --> F{Director:<br/>Robert House}
217
  F --> G[get_kyc_record<br/>ACC-500]
218
  G --> H{Director:<br/>Apex Management Corp}
219
  H --> I[get_kyc_record<br/>Apex Management Corp]
220
- I --> J{CEO:<br/>Robert House same person}
221
  end
222
 
223
  A -.->|naive agent wastes budget| X
224
- J --> K([submit_decision<br/>FRAUD<br/>evidence: ACC-500, ACC-700, ACC-888])
225
 
226
  style A fill:#ef4444,color:#fff
227
  style X fill:#6b7280,color:#fff,stroke-dasharray: 5 5
 
228
  style K fill:#dc2626,color:#fff
229
  style J fill:#fbbf24,color:#000
230
  ```
@@ -277,6 +278,88 @@ Fraud scenarios are injected with camouflage: 5–10 "normal" bridging transacti
277
 
278
  ---
279
 
280
  ## Core Engineering Principles
281
 
282
  These principles govern how the environment is designed and why each decision was made.
 
11
 
12
  <div align="center">
13
 
14
+ # 🕵️ AML Investigator - OpenEnv RL Environment
15
 
16
  **A financial crime investigation environment for training and evaluating LLM agents**
17
 
 
81
 
82
  ```mermaid
83
  sequenceDiagram
84
+ participant A as 🤖 Agent
85
+ participant E as ⚙️ Environment
86
+ participant D as 🗄️ Data Layer
87
 
88
+ E-->>A: reset() → AmlObservation<br/>(alert_details, budget=N)
89
 
90
  loop Until submit_decision or budget=0
91
  A->>E: step(AmlAction)
 
96
 
97
  A->>E: step(submit_decision, evidence=[...])
98
  E->>E: Run Grader
99
+ E-->>A: AmlObservation<br/>(done=True, reward=0.0–1.0)
100
  ```
101
 
102
  ---
 
112
  | `get_kyc_record` | `entity_id` | Retrieve address, entity type, and corporate directors. |
113
  | `submit_decision` | `decision: FRAUD\|CLEAR`, `evidence_links: List[str]` | Terminal action. Ends the episode and triggers the grader. |
114
 
115
+ > **Why Pydantic?** The LLM is the router. Strict schemas with `Field(description="...")` mean the model reads the tool contract, not a prompt full of prose instructions. Malformed output is caught at validation, not execution - preventing silent failures and hallucinated account IDs from crashing the environment.
116
 
117
  ---
118
 
 
147
 
148
  ```mermaid
149
  flowchart LR
150
+ A([🚨 Alert:<br/>ACC-101 → ACC-909<br/>$50,000]) --> B
151
 
152
  subgraph Investigation
153
  B[query_transactions<br/>ACC-101] --> C{Memo:<br/>'Heavy Machinery<br/>Purchase - Unit 4'}
 
157
  F --> G{50 inbound payments<br/>from global firms}
158
  end
159
 
160
+ G --> H([✅ submit_decision<br/>CLEAR])
161
 
162
  style A fill:#ef4444,color:#fff
163
  style H fill:#22c55e,color:#fff
 
175
 
176
  ```mermaid
177
  flowchart TD
178
+ A([🚨 Alert:<br/>ACC-200 deposit velocity spike]) --> B
179
 
180
+ subgraph Investigation["Paginate → Spot → Cross-Reference"]
181
  B[query_transactions<br/>ACC-200<br/>offset 0, 10, 20...] --> C{14 deposits<br/>$9,900 and $9,500<br/>below $10k threshold}
182
  C --> D[get_kyc_record<br/>ACC-301, ACC-302, ACC-303]
183
  D --> E{All 3 accounts:<br/>Opened same day<br/>Occupation: Student}
184
  end
185
 
186
+ E --> F([🚨 submit_decision<br/>FRAUD<br/>evidence: ACC-301, ACC-302, ACC-303])
187
 
188
  style A fill:#f97316,color:#fff
189
  style F fill:#dc2626,color:#fff
 
203
 
204
  ```mermaid
205
  flowchart TD
206
+ A([🚨 Alert:<br/>ACC-500 → ACC-700<br/>$2.5M]) --> B
207
 
208
+ subgraph Trap["❌ The Bait - Don't Take It"]
209
+ X["$100 transfer<br/>to 'Watchlist Target'"]
210
  end
211
 
212
  subgraph Investigation["The Real Loop"]
213
  B --> C["search_transactions<br/>ACC-700<br/>keyword: 'consulting'"]
214
+ C --> D{48hrs later:<br/>ACC-700 → ACC-888<br/>$2.4M offshore}
215
  D --> E[get_kyc_record<br/>ACC-888]
216
  E --> F{Director:<br/>Robert House}
217
  F --> G[get_kyc_record<br/>ACC-500]
218
  G --> H{Director:<br/>Apex Management Corp}
219
  H --> I[get_kyc_record<br/>Apex Management Corp]
220
+ I --> J{CEO:<br/>Robert House ← same person}
221
  end
222
 
223
  A -.->|naive agent wastes budget| X
224
+ J --> K([🚨 submit_decision<br/>FRAUD<br/>evidence: ACC-500, ACC-700, ACC-888])
225
 
226
  style A fill:#ef4444,color:#fff
227
  style X fill:#6b7280,color:#fff,stroke-dasharray: 5 5
228
+ style Trap fill:#1f2937,color:#9ca3af
229
  style K fill:#dc2626,color:#fff
230
  style J fill:#fbbf24,color:#000
231
  ```
 
278
 
279
  ---
280
 
281
+ ## Baseline Results
282
+
283
+ > **Model:** `openai/gpt-oss-20b` · **CoT:** enabled · **Run:** single pass, no fine-tuning
284
+
285
+ | Task | Steps Used | Budget | Grader Score | Net Reward | Verdict | Result |
286
+ |---|---|---|---|---|---|---|
287
+ | `aml_easy` | 3 / 5 | 2 remaining | 0.75 | **+0.69** | `CLEAR` ✓ | ✅ Pass |
288
+ | `aml_medium` | 6 / 12 | 6 remaining | 0.75 | **+0.63** | `FRAUD` ✓ | ✅ Pass |
289
+ | `aml_hard` | 16 / 20 | 0 remaining | 0.00 | **−0.32** | none | ❌ Fail |
290
+
291
+ Net reward = grader score − (steps × 0.02)
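
The net-reward formula above can be checked against the table with a few lines of Python. This is a minimal sketch, not the environment's grader: the function name `net_reward` and the constant `STEP_PENALTY` are illustrative, and the inputs are the grader scores and step counts reported in the table.

```python
STEP_PENALTY = 0.02  # per-step cost, taken from the formula above


def net_reward(grader_score: float, steps: int, step_penalty: float = STEP_PENALTY) -> float:
    """Grader score minus a fixed penalty for every step taken."""
    return grader_score - steps * step_penalty


# (grader score, steps used) for each task, as reported in the table
runs = {
    "aml_easy": (0.75, 3),
    "aml_medium": (0.75, 6),
    "aml_hard": (0.00, 16),
}

for task, (score, steps) in runs.items():
    print(f"{task}: {net_reward(score, steps):+.2f}")
```

The three results round to +0.69, +0.63, and -0.32, matching the Net Reward column.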
292
+
293
+ ### Per-Task Analysis
294
+
295
+ **`aml_easy`: Pass (0.75 / 1.0)**
296
+
297
+ The agent navigated the task in the minimum viable number of steps: one transaction query, one KYC lookup, then `CLEAR`. It correctly ignored the high-risk jurisdiction flag after reading the memo. The score was capped at `0.75` rather than `1.0` because `evidence_links` was submitted empty; the grader expects at least the cleared account ID as documented evidence of the reasoning chain.
298
+
299
+ ```
300
+ [STEP] query_transactions ACC-9001
301
+ [STEP] get_kyc_record ENT-9001
302
+ [STEP] submit_decision CLEAR evidence=[] ← missing evidence → capped at 0.75
303
+ ```
304
+
305
+ **`aml_medium`: Pass (0.75 / 1.0)**
306
+
307
+ The agent identified structuring activity and correctly returned a `FRAUD` verdict, but submitted only one of the three smurf accounts (`ACC-9010`) in evidence. The grader awards partial credit in proportion to the smurf accounts found: with `1/3` identified, the score is `0.75`. The agent also issued a `search_transactions` call with keyword `"Invoice"`, which was not relevant to the structuring pattern, suggesting mild reasoning noise before it converged on the correct account.
308
+
309
+ ```
310
+ [STEP] query_transactions ACC-9010 (offset 0)
311
+ [STEP] query_transactions ACC-9011 (offset 0)
312
+ [STEP] get_kyc_record ENT-9010
313
+ [STEP] search_transactions ACC-9010 keyword="Invoice" ← off-path call
314
+ [STEP] get_kyc_record ENT-0159
315
+ [STEP] submit_decision FRAUD evidence=["ACC-9010"] ← found 1/3 smurfs → 0.75
316
+ ```
317
+
318
+ **`aml_hard`: Fail (0.00)**
319
+
320
+ The model completed two valid steps (paginating `ACC-9021` at offsets 0 and 10), then entered a catastrophic failure loop. From step 3 onward, the model produced empty or non-JSON output on every turn, triggering the recovery action, which defaulted to `query_transactions(ACC-9021, offset=0)`: the same call, 14 times in a row. The budget was exhausted without a `submit_decision` ever being issued.
321
+
322
+ ```
323
+ [STEP] query_transactions ACC-9021 offset=0 ← valid
324
+ [STEP] query_transactions ACC-9021 offset=10 ← valid
325
+ [DEBUG] Non-JSON/invalid model action × 14 ← context collapse
326
+ [END] score=0.00 budget exhausted
327
+ ```
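
The loop in this log can be reproduced with a small standard-library sketch. It is an illustration under stated assumptions, not the environment's code: the JSON field names (`tool`, `account_id`, `offset`), the helper names, and the budget of 16 (chosen so that the 16 logged steps exhaust it) are all hypothetical, and the hand-rolled check stands in for the Pydantic validation the README describes. Only the four tool names and the default recovery call come from the text.

```python
import json
from typing import Optional

# Tool names as listed in the README's action table.
ALLOWED_TOOLS = {"query_transactions", "search_transactions",
                 "get_kyc_record", "submit_decision"}


def parse_action(raw: str) -> Optional[dict]:
    """Return the action dict, or None for empty, non-JSON, or unknown-tool output."""
    try:
        action = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(action, dict) or action.get("tool") not in ALLOWED_TOOLS:
        return None
    return action


def recovery_action(account_id: str) -> dict:
    """Fallback used when the model output is invalid (field names are assumed)."""
    return {"tool": "query_transactions", "account_id": account_id, "offset": 0}


def run_episode(model_outputs, budget: int, account_id: str):
    """Spend one unit of budget per step; return (steps, recoveries, submitted)."""
    steps = recoveries = 0
    for raw in model_outputs:
        if steps >= budget:
            break
        action = parse_action(raw)
        if action is None:
            action = recovery_action(account_id)
            recoveries += 1
        steps += 1
        if action["tool"] == "submit_decision":
            return steps, recoveries, True
    return steps, recoveries, False


# Two valid paginated queries, then 14 empty model outputs.
outputs = [
    json.dumps({"tool": "query_transactions", "account_id": "ACC-9021", "offset": 0}),
    json.dumps({"tool": "query_transactions", "account_id": "ACC-9021", "offset": 10}),
] + [""] * 14

print(run_episode(outputs, budget=16, account_id="ACC-9021"))  # (16, 14, False)
```

Because every fallback re-issues the same first-page query, the episode never reaches `submit_decision`; a recovery action that varied its call or submitted a default verdict would behave differently, but that is a design question outside what the log shows.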
328
+
329
+ The root cause is context window pressure. By step 2, the sliding window already contained two large paginated transaction payloads.
330
+
331
+ ### Failure Mode Summary
332
+
333
+ ```mermaid
334
+ flowchart LR
335
+ A[Step 2: Two large<br/>transaction payloads<br/>in context] --> B[Model outputs<br/>prose instead of JSON]
336
+ B --> C[Recovery action:<br/>query_transactions<br/>offset=0]
337
+ C --> D[Same large payload<br/>re-injected into context]
338
+ D --> B
339
+ D --> E{Budget = 0}
340
+ E --> F([score = 0.00])
341
+
342
+ style B fill:#ef4444,color:#fff
343
+ style F fill:#7f1d1d,color:#fff
344
+ ```
345
+
346
+ ### What This Tells Us
347
+
348
+ The tasks are correctly difficulty-stratified.
349
+ The easy and medium tasks are solvable by an instruction-following model with chain-of-thought, but not perfectly: both runs left score on the table due to incomplete evidence submission.
350
+ The hard task exposes a genuine capability gap: multi-hop KYC cross-referencing under token pressure requires either a larger model, a tighter context compaction strategy, or both.
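
One possible shape for a context compaction strategy is sketched below. It is purely hypothetical and not part of the environment or the baseline agent: the function name `compact_history` and its parameters are invented for illustration. The idea is to keep the most recent observations intact and truncate older tool payloads, so repeated paginated results do not crowd the prompt.

```python
from typing import List


def compact_history(observations: List[str], keep_last: int = 1,
                    max_chars: int = 200) -> List[str]:
    """Truncate all but the most recent `keep_last` observations to `max_chars`."""
    if keep_last <= 0:
        cutoff = len(observations)  # nothing is exempt from truncation
    else:
        cutoff = max(len(observations) - keep_last, 0)
    compacted = []
    for i, obs in enumerate(observations):
        if i < cutoff and len(obs) > max_chars:
            dropped = len(obs) - max_chars
            compacted.append(obs[:max_chars] + f" ...[truncated {dropped} chars]")
        else:
            compacted.append(obs)
    return compacted


# Three large payloads: only the newest one is kept in full.
history = ["x" * 1000, "y" * 1000, "z" * 1000]
compacted = compact_history(history, keep_last=1, max_chars=100)
print([len(obs) for obs in compacted])
```

Plain truncation can discard the very transaction the agent needs, so a real strategy would more likely summarize or filter payloads; this sketch only shows where such a step would sit.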
351
+
352
+ The `[DEBUG] Non-JSON/invalid model action` recovery path is functioning as designed: the environment did not crash, and each recovery action was logged and penalized correctly.
353
+
354
+ | Failure Mode | Observed In | Environment Response |
355
+ |---|---|---|
356
+ | Empty or incomplete `evidence_links` on correct verdict | Easy, Medium | Grader caps score; no crash |
357
+ | Off-path tool calls | Medium | Step penalty applied; agent self-corrects |
358
+ | Context collapse → non-JSON output | Hard | Recovery action fired; logged as `[DEBUG]` |
359
+ | Recovery loop exhausts budget | Hard | Episode terminates cleanly; score `0.00` |
360
+
361
+ ---
362
+
363
  ## Core Engineering Principles
364
 
365
  These principles govern how the environment is designed and why each decision was made.