Space: SolusOps/AML_env

DataBoySu committed 19 days ago

Commit

dfd1faa

1 Parent(s): a4c032a

agent working


Files changed (4)

README.md +394 -193
inference.py +113 -23
models.py +21 -1
server/AML_env_environment.py +9 -2

README.md CHANGED

@@ -9,281 +9,482 @@ tags:
   - openenv
 ---
-# AML Investigator Environment
-A financial crime investigation environment for Reinforcement Learning agents.
-The agent must query a mock banking system (transactions, KYC records) under a strict API budget
-to investigate flagged accounts and submit a final fraud/clear decision.
-## Quick Start
-The simplest way to use the Aml Env environment is through the `AmlEnv` class:
-```python
-from AML_env import AmlAction, AmlEnv
-try:
-    # Create environment from Docker image (built from root Dockerfile)
-    env = AmlEnv.from_docker_image("aml-env:latest")
-    # Reset to a specific task
-    obs = env.reset(task="aml_easy")
-    print(f"Alert: {obs.observation.alert_details}")
-    print(f"Budget: {obs.observation.budget_remaining}")
-    # Query transactions
-    result = env.step(AmlAction(action={
-        "action_type": "query_transactions",
-        "account_id": "ACC-9001",
-        "limit": 10,
-        "offset": 0,
-    }))
-    print(f"Transactions: {result.observation.last_action_result}")
-    # Submit final decision
-    result = env.step(AmlAction(action={
-        "action_type": "submit_decision",
-        "decision": "CLEAR",
-        "evidence_links": [],
-    }))
-    print(f"Done: {result.done}, Reward: {result.reward}")
-finally:
-    env.close()
 ```
-That's it! The `AmlEnv.from_docker_image()` method handles:
-- Starting the Docker container
-- Waiting for the server to be ready
-- Connecting to the environment
-- Container cleanup when you call `close()`
-## Building the Docker Image
-Before using the environment, you need to build the Docker image:
-```bash
-# From project root
-docker build -t aml-env:latest .
 ```
-## Deploying to Hugging Face Spaces
-You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
-```bash
-# From the environment directory (where openenv.yaml is located)
-openenv push
-# Or specify options
-openenv push --namespace my-org --private
 ```
-The `openenv push` command will:
-1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
-2. Prepare a custom build for Hugging Face Docker space (enables web interface)
-3. Upload to Hugging Face (ensuring you're logged in)
-### Prerequisites
-- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
-### Options
-- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
-- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
-- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
-- `--private`: Deploy the space as private (default: public)
-### Examples
-```bash
-# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
-openenv push
-# Push to a specific repository
-openenv push --repo-id my-org/my-env
-# Push with a custom base image
-openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
-# Push as a private space
-openenv push --private
-# Combine options
-openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
 ```
-After deployment, your space will be available at:
-`https://huggingface.co/spaces/<repo-id>`
-The deployed space includes:
-- **Web Interface** at `/web` - Interactive UI for exploring the environment
-- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
-- **Health Check** at `/health` - Container health monitoring
-- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-## Environment Details
-### Action Space
-**AmlAction** wraps one of four tool calls (discriminated by `action_type`):
-| Tool | Fields | Description |
-|---|---|---|
-| `query_transactions` | `account_id`, `limit`, `offset` | Paginated transaction history for an account |
-| `search_transactions` | `account_id`, `keyword` | Search memo_text of transactions |
-| `get_kyc_record` | `entity_id` | Retrieve KYC data for an entity |
-| `submit_decision` | `decision` (`FRAUD`\|`CLEAR`), `evidence_links` | Final verdict — ends the episode |
-### Observation Space
-**AmlObservation** is returned after every `reset()` and `step()`:
-| Field | Type | Description |
-|---|---|---|
-| `alert_details` | `str` | The investigation mission (constant per episode) |
-| `budget_remaining` | `int` | API calls left before forced termination |
-| `last_action` | `str \| None` | Name of the last tool called |
-| `last_action_result` | `Any` | Payload returned by the last tool |
-| `error_message` | `str \| None` | Error string if the last action failed |
-| `done` | `bool` | Whether the episode has ended |
-| `reward` | `float` | Per-step reward signal |
-### Reward
-- **Per step:** `-0.02` (efficiency penalty discourages random looping)
-- **Submit FRAUD (correct):** grader returns `0.4`–`1.0` depending on evidence quality
-- **Submit CLEAR (correct false positive):** grader returns `1.0`
-- **Budget exhausted without submission:** episode ends with accumulated negative rewards
-## Advanced Usage
-### Connecting to an Existing Server
-If you already have a Aml Env environment server running, you can connect directly:
-```python
-from AML_env import AmlEnv
-# Connect to existing server
-AML_envenv = AmlEnv(base_url="<ENV_HTTP_URL_HERE>")
-# Use as normal
-result = AML_envenv.reset()
-result = AML_envenv.step(AmlAction(message="Hello!"))
 ```
-Note: When connecting to an existing server, `AML_envenv.close()` will NOT stop the server.
-### Using the Context Manager
-The client supports context manager usage for automatic connection management:
-```python
-from AML_env import AmlAction, AmlEnv
-# Connect with context manager (auto-connects and closes)
-with AmlEnv(base_url="http://localhost:8000") as env:
-    result = env.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Multiple steps with low latency
-    for msg in ["Hello", "World", "!"]:
-        result = env.step(AmlAction(message=msg))
-        print(f"Echoed: {result.observation.echoed_message}")
 ```
-The client uses WebSocket connections for:
-- **Lower latency**: No HTTP connection overhead per request
-- **Persistent session**: Server maintains your environment state
-- **Efficient for episodes**: Better for many sequential steps
-### Concurrent WebSocket Sessions
-The server supports multiple concurrent WebSocket connections. To enable this,
-modify `server/app.py` to use factory mode:
-```python
-# In server/app.py - use factory mode for concurrent sessions
-app = create_app(
-    AmlEnvironment,  # Pass class, not instance
-    AmlAction,
-    AmlObservation,
-    max_concurrent_envs=4,  # Allow 4 concurrent sessions
-)
 ```
-Then multiple clients can connect simultaneously:
 ```python
 from AML_env import AmlAction, AmlEnv
-from concurrent.futures import ThreadPoolExecutor
-def run_episode(client_id: int):
-    with AmlEnv(base_url="http://localhost:8000") as env:
-        result = env.reset()
-        for i in range(10):
-            result = env.step(AmlAction(message=f"Client {client_id}, step {i}"))
-        return client_id, result.observation.message_length
-# Run 4 episodes concurrently
-with ThreadPoolExecutor(max_workers=4) as executor:
-    results = list(executor.map(run_episode, range(4)))
 ```
-## Development & Testing
-### Direct Environment Testing
-Test the environment logic directly without starting the HTTP server:
 ```bash
-# From the server directory
-python3 server/AML_env_environment.py
 ```
-This verifies that:
-- Environment resets correctly
-- Step executes actions properly
-- State tracking works
-- Rewards are calculated correctly
-### Running Locally
-Run the server locally for development:
 ```bash
-uvicorn server.app:app --reload
 ```
 ## Project Structure
 ```
 AML_env/
-├── Dockerfile                    # Container image (root, HF Spaces compliant)
-├── .dockerignore                 # Docker build exclusions
-├── .hfignore                     # HF Space upload exclusions
-├── .gitignore                    # Git exclusions
-├── __init__.py                   # Package exports (AmlEnv, AmlAction, AmlObservation)
-├── client.py                     # AmlEnv WebSocket client
-├── models.py                     # Pydantic action/observation schemas
-├── inference.py                  # Baseline RL agent (OpenAI client, [START]/[STEP]/[END] logs)
-├── openenv.yaml                  # OpenEnv manifest (tasks, graders, port)
-├── pyproject.toml                # Project metadata and uv dependencies
-├── uv.lock                       # Locked dependency graph
-├── README.md                     # This file (also HF Space card)
 ├── data/
-│   ├── entities.json             # 312 KYC entity records
-│   ├── accounts.json             # 410 bank accounts
-│   └── transactions.json         # 5,079 transactions (haystack + fraud scenarios)
 ├── graders/
-│   ├── __init__.py
-│   ├── aml_easy.py               # "The False Positive" grader
-│   ├── aml_medium.py             # "The Smurf Network" grader
-│   └── aml_hard.py               # "The Corporate Mirage" grader
 ├── server/
-│   ├── __init__.py
-│   ├── AML_env_environment.py    # Core OpenEnv environment (reset/step/state)
-│   ├── app.py                    # FastAPI server (CORS, create_app wrapper)
-│   └── requirements.txt          # Pip fallback requirements
 └── tools/
-    ├── haystack.py               # Financial graph generator
-    └── tasks.json                # Manual fraud scenario definitions
 ```

   - openenv
 ---
+<div align="center">
+# 🕵️ AML Investigator OpenEnv RL Environment
+**A financial crime investigation environment for training and evaluating LLM agents**
+[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-6366f1?style=flat-square)](https://github.com/openenv)
+[![FastAPI](https://img.shields.io/badge/FastAPI-async-009688?style=flat-square&logo=fastapi)](https://fastapi.tiangolo.com)
+[![Pydantic](https://img.shields.io/badge/Pydantic-v2-e92063?style=flat-square)](https://docs.pydantic.dev)
+[![Docker](https://img.shields.io/badge/Docker-ready-2496ED?style=flat-square&logo=docker)](https://www.docker.com)
+[![HF Spaces](https://img.shields.io/badge/HuggingFace-Spaces-FFD21E?style=flat-square&logo=huggingface)](https://huggingface.co/spaces)
+</div>
+---
+## What Is This?
+Most RL benchmarks for language models test knowledge retrieval or reasoning in isolation. This environment tests something harder and more practical: **can an LLM agent act as a financial investigator?**
+The agent is given a banking system alert and a budget of API calls. It must use tools to query transaction ledgers, search memo fields, pull KYC records, and finally submit a verdict (`FRAUD` or `CLEAR`) with evidence. The agent is rewarded for correctness and efficiency: every API call costs a small step penalty, so wasted calls directly lower the score.
+What makes this environment non-trivial:
+- **The haystack is real noise.** 5,000+ transactions of legitimate payroll, utility bills, and vendor invoices surround every fraud signal.
+- **Pagination is mandatory.** Corporate accounts hold 150–500 transactions. Dumping them all into a single response overflows the model's context window, so the agent must learn to search and paginate strategically.
+- **False flags are everywhere.** The hard task contains a $100 transfer to an entity with a watchlist name — designed specifically to bait the agent into wasting its budget.
+- **KYC cross-referencing.** The hardest task cannot be solved by reading transactions alone. The agent must chain multiple `get_kyc_record` calls to trace hidden ownership loops.
+---
+## Architecture Overview
+```mermaid
+graph TD
+    subgraph Agent["LLM Agent (inference.py)"]
+        P[Prompt + Alert Details]
+        T[Tool Selection via Pydantic JSON]
+        C[Sliding Context Window]
+    end
+    subgraph Server["OpenEnv Server (FastAPI)"]
+        E[AML Environment<br/>Reset / Step]
+        G[Grader<br/>aml_easy, aml_medium, aml_hard]
+    end
+    subgraph Data["Mock Banking Database /data"]
+        ENT[entities.json<br/>312 KYC Records]
+        ACC[accounts.json<br/>410 Bank Accounts]
+        TXN[transactions.json<br/>5,079 Transactions]
+    end
+    P -->|AmlAction JSON| E
+    E -->|AmlObservation| C
+    C --> T
+    T --> P
+    E <-->|O1 dict lookups| ENT
+    E <-->|O1 dict lookups| ACC
+    E <-->|O1 dict lookups| TXN
+    E -->|submit_decision| G
+    G -->|score 0.0-1.0| E
 ```
+---
+## The Episode Loop
+Every investigation runs as a sequence of steps between agent and environment. The agent sees no state beyond what it has explicitly queried.
+```mermaid
+sequenceDiagram
+    participant A as Agent
+    participant E as Environment
+    participant D as Data Layer
+    E-->>A: reset() -> AmlObservation<br/>(alert_details, budget=N)
+    loop Until submit_decision or budget=0
+        A->>E: step(AmlAction)
+        E->>D: dict lookup (O(1))
+        D-->>E: raw records
+        E-->>A: AmlObservation<br/>(last_action_result, budget-=1, reward-=0.02)
+    end
+    A->>E: step(submit_decision, evidence=[...])
+    E->>E: Run Grader
+    E-->>A: AmlObservation<br/>(done=True, reward=0.0-1.0)
 ```
+---
+## Action Space
+The agent communicates exclusively through **typed Pydantic actions**. No regex parsing. No free-form text commands. Every action dispatches to exactly one tool.
+| Action | Key Parameters | Purpose |
+|---|---|---|
+| `query_transactions` | `account_id`, `limit=10`, `offset=0` | Paginated ledger history. **Must paginate** for corporate accounts. |
+| `search_transactions` | `account_id`, `keyword` | Filter `memo_text` fields. Cuts noise without burning pagination budget. |
+| `get_kyc_record` | `entity_id` | Retrieve address, entity type, and corporate directors. |
+| `submit_decision` | `decision: FRAUD\|CLEAR`, `evidence_links: List[str]` | Terminal action. Ends the episode and triggers the grader. |
+> **Why Pydantic?** The LLM is the router. Strict schemas with `Field(description="...")` mean the model reads the tool contract, not a prompt full of prose instructions. Malformed output is caught at validation, not execution, which prevents silent failures. Well-formed but hallucinated account IDs pass validation and are instead returned to the agent as an `error_message`, so neither case crashes the environment.
+---
+## Observation Space
+Every `reset()` and `step()` returns an `AmlObservation` containing the agent's full situational picture.
+```python
+class AmlObservation(BaseModel):
+    alert_details: str          # Investigation mission — constant per episode
+    budget_remaining: int       # API calls left before forced termination
+    last_action: str | None     # Name of the last tool called
+    last_action_result: Any     # Exact payload returned by the last tool
+    error_message: str | None   # Formatted error if the last call failed (not a crash)
+    done: bool                  # Whether the episode has ended
+    reward: float               # Cumulative reward signal
 ```
+> **Errors are data, not exceptions.** If the agent hallucinates `ACC-9999`, the environment catches the `KeyError`, formats it as `"Account 'ACC-9999' not found"`, and returns it as `error_message`. The container never crashes. The agent can read the error and self-correct on the next step.
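The error-as-observation behaviour described above can be illustrated with a minimal, stdlib-only sketch. The `LEDGER` dictionary and the `safe_query` helper are hypothetical stand-ins, not the environment's actual code; only the error string format is taken from the README text.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory ledger keyed by account ID (not the real data files).
LEDGER: Dict[str, list] = {
    "ACC-101": [
        {"to": "ACC-909", "amount": 50_000, "memo_text": "Heavy Machinery Purchase - Unit 4"}
    ],
}


def safe_query(account_id: str) -> Dict[str, Optional[Any]]:
    """Look up an account and report failures as data instead of raising."""
    try:
        return {"last_action_result": LEDGER[account_id], "error_message": None}
    except KeyError:
        # A hallucinated ID becomes a readable message the agent can act on.
        return {
            "last_action_result": None,
            "error_message": f"Account '{account_id}' not found",
        }


ok = safe_query("ACC-101")
bad = safe_query("ACC-9999")
print(bad["error_message"])
```

Because the failure travels in the same dictionary shape as a success, the caller never needs its own `try/except` around a step.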
+---
+## The Three Tasks
+The environment ships with three investigation scenarios of escalating difficulty, each targeting a distinct AML typology.
+### Task 1 — The False Positive `aml_easy`
+> **Alert:** `ACC-101` (local construction company) transferred $50,000 to `ACC-909`, a newly registered entity in a high-risk jurisdiction.
+The trap is the jurisdiction flag. A naive model panics and submits `FRAUD`. A well-reasoned agent reads the memo, pulls the KYC record, and discovers a legitimate equipment supplier.
+```mermaid
+flowchart LR
+    A([Alert:<br/>ACC-101 to ACC-909<br/>$50,000]) --> B
+    subgraph Investigation
+        B[query_transactions<br/>ACC-101] --> C{Memo:<br/>'Heavy Machinery<br/>Purchase - Unit 4'}
+        C --> D[get_kyc_record<br/>ACC-909]
+        D --> E{Registered as:<br/>Global Tractor Sales Ltd}
+        E --> F[query_transactions<br/>ACC-909]
+        F --> G{50 inbound payments<br/>from global firms}
+    end
+    G --> H([submit_decision<br/>CLEAR])
+    style A fill:#ef4444,color:#fff
+    style H fill:#22c55e,color:#fff
 ```
+**Reward:** `1.0` for `CLEAR`. The agent proves it can dismiss noise without over-indexing on surface-level signals.
+---
+### Task 2 — The Smurf Network `aml_medium`
+> **Alert:** `ACC-200` (used car dealership) shows a spike in cash deposits over a 5-day window.
+The agent must paginate through hundreds of normal car-sale transactions to surface 14 cash deposits — all for exactly $9,900 or $9,500, just below the $10,000 AML reporting threshold. The three sender accounts (`ACC-301`, `ACC-302`, `ACC-303`) were all opened on the same day with the same occupation listed: `Student`.
+```mermaid
+flowchart TD
+    A([Alert:<br/>ACC-200 deposit velocity spike]) --> B
+    subgraph Investigation["Paginate -> Spot -> Cross-Reference"]
+        B[query_transactions<br/>ACC-200<br/>offset 0, 10, 20...] --> C{14 deposits<br/>$9,900 and $9,500<br/>below $10k threshold}
+        C --> D[get_kyc_record<br/>ACC-301, ACC-302, ACC-303]
+        D --> E{All 3 accounts:<br/>Opened same day<br/>Occupation: Student}
+    end
+    E --> F([submit_decision<br/>FRAUD<br/>evidence: ACC-301, ACC-302, ACC-303])
+    style A fill:#f97316,color:#fff
+    style F fill:#dc2626,color:#fff
+```
+**Partial credit scoring:** The grader awards proportional reward based on how many of the three smurf accounts are included in `evidence_links`. Identifying 1 of 3 scores higher than 0 but lower than the full `1.0`.
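As a rough illustration of proportional scoring, the sketch below credits one third per smurf account cited. The function name, the zero score for a wrong verdict, and the purely linear mapping are assumptions; the actual `graders/aml_medium.py` is not shown in this diff and may map results onto a different range.

```python
from typing import Iterable

SMURF_ACCOUNTS = frozenset({"ACC-301", "ACC-302", "ACC-303"})


def smurf_score(decision: str, evidence_links: Iterable[str]) -> float:
    """Hypothetical grader: credit proportional to smurf accounts cited."""
    if decision != "FRAUD":
        return 0.0  # assumed: a wrong verdict earns nothing
    found = SMURF_ACCOUNTS & set(evidence_links)
    return len(found) / len(SMURF_ACCOUNTS)


full = smurf_score("FRAUD", ["ACC-301", "ACC-302", "ACC-303"])
partial = smurf_score("FRAUD", ["ACC-301", "ACC-999"])
print(full, partial)
```

Intersecting with a fixed set means unrelated IDs in `evidence_links` neither help nor hurt in this sketch; a stricter grader could also penalise them.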
+---
+### Task 3 — The Corporate Mirage `aml_hard`
+> **Alert:** `ACC-500` (major logistics firm) transferred $2.5M to `ACC-700` (generic consulting agency).
+This is the full haystack. `ACC-500` has 500+ transactions. `ACC-700` has hundreds of outbound payments to vendors, charities, and payroll. Hidden inside: 48 hours after receiving $2.5M, `ACC-700` moves $2.4M offshore. The ownership chain requires three chained KYC lookups to resolve.
+**The false flag trap:** `ACC-500` also made a $100 payment to an entity named `Al-Qaeda Watchlist Target`. This is deliberate bait. Agents that investigate the $100 transfer instead of the $2.5M loop receive a score of `0.05`.
+```mermaid
+flowchart TD
+    A([Alert:<br/>ACC-500 to ACC-700<br/>$2.5M]) --> B
+    subgraph Trap["The Bait - Do Not Take It"]
+        X["$100 transfer<br/>to Watchlist Target"]
+    end
+    subgraph Investigation["The Real Loop"]
+        B --> C["search_transactions<br/>ACC-700<br/>keyword: 'consulting'"]
+        C --> D{48hrs later:<br/>ACC-700 to ACC-888<br/>$2.4M offshore}
+        D --> E[get_kyc_record<br/>ACC-888]
+        E --> F{Director:<br/>Robert House}
+        F --> G[get_kyc_record<br/>ACC-500]
+        G --> H{Director:<br/>Apex Management Corp}
+        H --> I[get_kyc_record<br/>Apex Management Corp]
+        I --> J{CEO:<br/>Robert House same person}
+    end
+    A -.->|naive agent wastes budget| X
+    J --> K([submit_decision<br/>FRAUD<br/>evidence: ACC-500, ACC-700, ACC-888])
+    style A fill:#ef4444,color:#fff
+    style X fill:#6b7280,color:#fff,stroke-dasharray: 5 5
+    style K fill:#dc2626,color:#fff
+    style J fill:#fbbf24,color:#000
 ```
+**Scoring:** Full `1.0` for identifying all three accounts with the circular KYC loop documented. `0.05` if the agent chases the false flag instead.
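The ownership loop in the diagram can be traced with a small recursive lookup. This is a sketch over toy records that mirror the diagram: the `KYC` dictionary, the `directors` field name, and the `controllers` helper are assumptions for illustration, and the diagram's "CEO" link is modelled here as a director entry. In the environment, each hop would be one `get_kyc_record` call.

```python
from typing import Dict, List, Optional, Set

# Toy KYC records mirroring the diagram; the field name "directors" is assumed.
KYC: Dict[str, Dict[str, List[str]]] = {
    "ACC-888": {"directors": ["Robert House"]},
    "ACC-500": {"directors": ["Apex Management Corp"]},
    "Apex Management Corp": {"directors": ["Robert House"]},
}


def controllers(entity_id: str, seen: Optional[Set[str]] = None) -> Set[str]:
    """Follow director links until reaching names with no KYC record of their own."""
    seen = set() if seen is None else seen
    if entity_id in seen:
        return set()  # guard against circular ownership
    seen.add(entity_id)
    people: Set[str] = set()
    for director in KYC.get(entity_id, {}).get("directors", []):
        if director in KYC:
            people |= controllers(director, seen)  # corporate director: recurse
        else:
            people.add(director)  # natural person: end of the chain
    return people


shared = controllers("ACC-500") & controllers("ACC-888")
print(shared)
```

A non-empty intersection is the signal that sender and final recipient share a controlling person.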
+---
+## Reward Structure
+```
+Episode reward = Σ(step penalties) + terminal reward
+Step penalty:    −0.02  per API call  (discourages random exploration)
+FRAUD correct:   +0.4 to +1.0        (scales with evidence quality)
+CLEAR correct:   +1.0                 (false positives must be dismissed confidently)
+Budget exhaust:   0.0                 (no terminal reward — accumulated penalties only)
 ```
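The formula can be checked with a few lines of arithmetic. The helper below is a sketch that assumes the final `submit_decision` call is charged the step penalty like any other call, which is what the sample log's total of `0.79` implies; `episode_reward` is a hypothetical name.

```python
STEP_PENALTY = -0.02  # per API call, from the reward breakdown


def episode_reward(api_calls: int, terminal_reward: float = 0.0) -> float:
    """Sum of step penalties plus the grader's terminal reward."""
    return round(api_calls * STEP_PENALTY + terminal_reward, 2)


# Three calls ending in a submission graded 0.85.
print(episode_reward(3, 0.85))
# Five calls with no submission: penalties only.
print(episode_reward(5))
```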
+Budget scales with task difficulty:
+| Task | Budget | Rationale |
+|---|---|---|
+| `aml_easy` | 5 calls | 4 tool calls are sufficient; any more suggests confusion |
+| `aml_medium` | 12 calls | Pagination required; partial paths need room |
+| `aml_hard` | 20 calls | Three KYC hops + pagination across two high-volume accounts |
+---
+## The Mock Knowledge Graph
+The haystack is a procedurally generated slice of a fictional bank, seeded for reproducibility.
 ```
+entities.json     312 records    80% Individual, 20% Corporate (with directors list)
+accounts.json     410 records    95% Active, 5% Closed
+transactions.json 5,079 rows     Procedural noise + 3 injected fraud scenarios
+```
+Transaction `memo_text` is typed by sender/receiver pair to simulate realistic commerce:
+| Flow | Example Memos | Amount Range |
+|---|---|---|
+| Corporate → Individual | `Payroll`, `Salary Q3`, `Expense Reimbursement` | $2,000–$10,000 |
+| Corporate → Corporate | `Server Hosting`, `Consulting Retainer`, `Invoice #XXXX` | $500–$50,000 |
+| Individual → Corporate | `Utility Bill`, `Gym Membership`, `Coffee` | $5–$200 |
+| Individual → Individual | `Dinner split`, `Rent share`, `Birthday gift` | $10–$500 |
+Fraud scenarios are injected with camouflage: 5–10 "normal" bridging transactions connect each manual account to the procedural haystack so no fraud node appears as an isolated island in the graph.
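One way to type memos by sender/receiver pair is a lookup table keyed on the two entity types. The sketch below copies memo pools and amount ranges from the table above, but the `FLOWS` structure and `make_txn` helper are assumptions, not the contents of `tools/haystack.py`.

```python
import random
from typing import Dict, List, Tuple

# Memo pools and amount ranges from the table; the structure itself is assumed.
FLOWS: Dict[Tuple[str, str], Tuple[List[str], Tuple[float, float]]] = {
    ("Corporate", "Individual"): (["Payroll", "Salary Q3", "Expense Reimbursement"], (2_000, 10_000)),
    ("Corporate", "Corporate"): (["Server Hosting", "Consulting Retainer"], (500, 50_000)),
    ("Individual", "Corporate"): (["Utility Bill", "Gym Membership", "Coffee"], (5, 200)),
    ("Individual", "Individual"): (["Dinner split", "Rent share", "Birthday gift"], (10, 500)),
}


def make_txn(rng: random.Random, sender_type: str, receiver_type: str) -> Dict[str, object]:
    """Draw a memo and an amount appropriate to the sender/receiver pair."""
    memos, (low, high) = FLOWS[(sender_type, receiver_type)]
    return {"memo_text": rng.choice(memos), "amount": round(rng.uniform(low, high), 2)}


rng = random.Random(42)  # fixed seed keeps the generated noise reproducible
txn = make_txn(rng, "Individual", "Corporate")
print(txn)
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the seed local, so regenerating the data yields the same rows.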
+---
+## Core Engineering Principles
+These principles govern how the environment is designed and why each decision was made.
+<details>
+<summary><strong>1. You don't design the control flow</strong></summary>
+The `step()` function is a pure reactive state machine. If the agent queries the same account five times in a row, the environment returns the result five times. It never forces a sequence or nudges toward the solution path. The agent is in the driver's seat.
+</details>
+<details>
+<summary><strong>2. Errors are data, not control flow</strong></summary>
+Hallucinated account IDs, missing entity records, malformed queries — all are caught with `try/except`, formatted as human-readable strings, and returned as `error_message` in the observation. The container never crashes on bad agent output.
+</details>
+<details>
+<summary><strong>3. The conversation is the database</strong></summary>
+The environment keeps no investigation notes on the agent's behalf between calls; it tracks only episode state such as the remaining budget. The agent's only memory is the `AmlObservation` history it has accumulated. Every response includes `budget_remaining`, `last_action`, and the full `last_action_result` payload so nothing is lost between turns.
+</details>
+<details>
+<summary><strong>4. No regex. Pydantic is the contract.</strong></summary>
+Actions are strictly typed Pydantic models with `Field(description="...")` on every parameter. The LLM reads the schema to understand how to use each tool. Invalid JSON is caught at validation — not mid-execution.
+</details>
+<details>
+<summary><strong>5. Pagination is an OOM prevention mechanism</strong></summary>
+Corporate accounts have 150–500 transactions. Returning them all in one response would blow up the context window. The `query_transactions` tool enforces a `limit` parameter (default 10, max configurable). The agent must learn to paginate or use keyword search to find signals in high-volume accounts.
+</details>
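A client-side pagination loop might look like the sketch below. The `paginate` helper, the `fake_fetch` stand-in for `query_transactions`, and the page cap are all hypothetical; in the real environment each page would also cost one budget call.

```python
from typing import Callable, Dict, Iterator, List

Txn = Dict[str, object]


def paginate(
    fetch: Callable[[int, int], List[Txn]], limit: int = 10, max_pages: int = 5
) -> Iterator[Txn]:
    """Yield transactions page by page, stopping at a short page or the page cap."""
    for page in range(max_pages):
        batch = fetch(limit, page * limit)
        yield from batch
        if len(batch) < limit:
            break  # a short page means the ledger is exhausted


# Stand-in for query_transactions over a hypothetical 23-row ledger.
ROWS: List[Txn] = [{"id": i, "amount": 100 + i} for i in range(23)]


def fake_fetch(limit: int, offset: int) -> List[Txn]:
    return ROWS[offset : offset + limit]


seen = list(paginate(fake_fetch))
print(len(seen))
```

The `max_pages` cap is the piece that protects the budget: an agent that stops early and switches to `search_transactions` spends fewer calls than one that reads every page.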
+<details>
+<summary><strong>6. Context compaction is layered</strong></summary>
+The inference script maintains a sliding window over conversation history (the last 5 steps, via `history[-5:]`). Internal chain-of-thought reasoning is routed to `stderr`, keeping `stdout` clean for the grader's `[START]`/`[STEP]`/`[END]` log parsing.
+</details>
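The two layers described above can be sketched with a bounded deque and a `stderr` print. `inference.py` itself slices a plain list with `history[-5:]`; the `deque` variant and the `record_step` and `think` helpers here are illustrative alternatives, not the script's code.

```python
import sys
from collections import deque

WINDOW = 5  # same width as the history[-5:] slice in inference.py

history = deque(maxlen=WINDOW)  # old entries fall off automatically


def record_step(step: int, action: str, result: str) -> str:
    """Append one step summary and return the prompt-ready history block."""
    history.append(f"Step {step}: Action: {action} -> Result: {result}")
    return "\n".join(history)


def think(note: str) -> None:
    """Route reasoning to stderr so stdout carries only log lines."""
    print(note, file=sys.stderr, flush=True)


for i in range(1, 8):
    block = record_step(i, "query_transactions", f"page {i}")
think("window now holds steps 3-7")
```

With `maxlen` set, the prompt size stays bounded no matter how long the episode runs.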
+<details>
+<summary><strong>7. The prompt is code, not config</strong></summary>
+The `alert_details` string returned by `reset()` is the agent's mission statement. It defines the goal, names the flagged account, and sets the investigation frame. Vague alerts produce vague investigations.
+</details>
+---
+## Quick Start
+### Prerequisites
+```bash
+pip install faker  # for haystack generation
+docker build -t aml-env:latest .
+```
+### Running an Episode
 ```python
 from AML_env import AmlAction, AmlEnv
+try:
+    env = AmlEnv.from_docker_image("aml-env:latest")
+    # Choose task: "aml_easy" | "aml_medium" | "aml_hard"
+    obs = env.reset(task="aml_medium")
+    print(f"Alert:  {obs.observation.alert_details}")
+    print(f"Budget: {obs.observation.budget_remaining}")
+    # Page through transactions
+    result = env.step(AmlAction(action={
+        "action_type": "query_transactions",
+        "account_id": "ACC-200",
+        "limit": 10,
+        "offset": 0,
+    }))
+    print(result.observation.last_action_result)
+    # Search by keyword to cut noise
+    result = env.step(AmlAction(action={
+        "action_type": "search_transactions",
+        "account_id": "ACC-700",
+        "keyword": "consulting",
+    }))
+    # Pull KYC record
+    result = env.step(AmlAction(action={
+        "action_type": "get_kyc_record",
+        "entity_id": "ENT-0042",
+    }))
+    # Submit final verdict
+    result = env.step(AmlAction(action={
+        "action_type": "submit_decision",
+        "decision": "FRAUD",
+        "evidence_links": ["ACC-301", "ACC-302", "ACC-303"],
+    }))
+    print(f"Done: {result.done}  |  Reward: {result.reward:.3f}")
+finally:
+    env.close()
 ```
+### Connect to an Existing Server
+```python
+env = AmlEnv(base_url="http://localhost:8760")
+```
+### Regenerate the Haystack
 ```bash
+# Procedural noise only
+python tools/haystack.py
+# Inject hand-written fraud scenarios
+python tools/haystack.py --inject tools/tasks.json --output-dir data/
 ```
+---
+## Deployment
+### Local Development
+```bash
+uvicorn server.app:app --reload --port 8760
+```
+### Hugging Face Spaces
 ```bash
+# From environment directory
+openenv push
+# Private space with custom repo
+openenv push --repo-id my-org/aml-investigator --private
 ```
+After deployment, the space exposes:
+| Endpoint | Description |
+|---|---|
+| `/web` | Interactive UI for manual exploration |
+| `/docs` | Swagger / OpenAPI interface |
+| `/ws` | WebSocket endpoint for low-latency agent sessions |
+| `/health` | Container health check |
+---
 ## Project Structure
 ```
 AML_env/
+├── Dockerfile                       # HF Spaces compliant; exposes port 8760
+├── openenv.yaml                     # Task manifest: aml_easy, aml_medium, aml_hard
+├── models.py                        # Pydantic AmlAction + AmlObservation schemas
+├── client.py                        # AmlEnv WebSocket client
+├── inference.py                     # Baseline agent: asyncio, sliding window, stderr CoT
+│
 ├── data/
+│   ├── entities.json                # 312 KYC entity records
+│   ├── accounts.json                # 410 bank accounts
+│   └── transactions.json            # 5,079 transactions (haystack + fraud)
+│
 ├── graders/
+│   ├── aml_easy.py                  # False positive — reward CLEAR, penalise over-flagging
+│   ├── aml_medium.py                # Smurf network — partial credit per smurf account found
+│   └── aml_hard.py                  # Corporate mirage — 0.05 if false-flag bait taken
+│
 ├── server/
+│   ├── AML_env_environment.py       # Core state machine: reset(), step(), budget, grader dispatch
+│   ├── app.py                       # FastAPI wrapper with CORS
+│   └── requirements.txt
+│
 └── tools/
+    ├── haystack.py                  # Procedural KB generator (Faker + random)
+    └── tasks.json                   # Hand-written fraud scenario definitions
 ```
+---
+## Evaluation Log Format
+The inference script emits strict single-line logs to `stdout` for automated grading:
+```
+[START] {"task": "aml_hard", "budget": 20}
+[STEP]  {"action": "query_transactions", "reward": -0.02, "done": false, "budget": 19}
+[STEP]  {"action": "get_kyc_record",     "reward": -0.02, "done": false, "budget": 18}
+[STEP]  {"action": "submit_decision",    "reward":  0.85, "done": true,  "budget": 17}
+[END]   {"total_reward": 0.79, "steps": 3, "decision": "FRAUD"}
+```
+Internal chain-of-thought reasoning routes to `stderr` and is never visible to the grader.
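A consumer of these logs could split each line into a tag and a JSON payload, as in the sketch below. It assumes the JSON-payload format shown in the sample above; the `log_start` and `log_end` helpers visible in the `inference.py` diff print `key=value` pairs instead, so a parser for that output would need a different pattern. `parse_log` and `LINE_RE` are hypothetical names.

```python
import json
import re
from typing import Dict, List, Tuple

# Tag in square brackets, whitespace, then a JSON object to end of line.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(\{.*\})$")


def parse_log(lines: List[str]) -> List[Tuple[str, Dict]]:
    """Return (tag, payload) pairs, skipping lines that do not match."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append((match.group(1), json.loads(match.group(2))))
    return events


sample = [
    '[START] {"task": "aml_hard", "budget": 20}',
    '[STEP]  {"action": "submit_decision",    "reward":  0.85, "done": true,  "budget": 17}',
    "stray debug output that should be ignored",
    '[END]   {"total_reward": 0.79, "steps": 3, "decision": "FRAUD"}',
]
events = parse_log(sample)
print([tag for tag, _ in events])
```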
+---
+<div align="center">
+Built with [OpenEnv](https://github.com/openenv) · Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)
+</div>

inference.py CHANGED

@@ -7,6 +7,7 @@ import os
 import json
 import textwrap
 import sys
 from typing import List, Optional
 from openai import OpenAI
@@ -39,6 +40,13 @@ SYSTEM_PROMPT = textwrap.dedent(
     2. {"action": {"action_type": "search_transactions", "account_id": "ACC-XXXX", "keyword": "invoice"}}
     3. {"action": {"action_type": "get_kyc_record", "entity_id": "ENT-XXXX"}}
     4. {"action": {"action_type": "submit_decision", "decision": "FRAUD", "evidence_links": ["ACC-1234"]}} (Use "CLEAR" for False Positives with empty evidence_links).
     """
 ).strip()
@@ -106,6 +114,68 @@ def _extract_text_from_completions_api(completion: object) -> str:
     raise ValueError("Completions API response text is empty")
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
@@ -119,6 +189,18 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}", flush=True)
 def get_model_message(client: OpenAI, obs_dict: dict, history: List[str]) -> str:
     history_block = "\n".join(history[-5:]) if history else "No previous steps."
     user_prompt = f"Observation:\n{json.dumps(obs_dict, indent=2)}\n\nHistory:\n{history_block}\n\nProvide your next JSON action:"
@@ -130,10 +212,11 @@ def get_model_message(client: OpenAI, obs_dict: dict, history: List[str]) -> str
                 {"role": "system", "content": SYSTEM_PROMPT},
                 {"role": "user", "content": user_prompt},
             ],
-            temperature=0.1,
-            max_tokens=200,
         )
-        return _extract_text_from_chat_completion(completion)
     except Exception as chat_exc:
         # Retry via Responses API for OpenAI-compatible providers that do not
         # populate chat.completions choices consistently.
@@ -142,18 +225,18 @@ def get_model_message(client: OpenAI, obs_dict: dict, history: List[str]) -> str
                 model=MODEL_NAME,
                 instructions=SYSTEM_PROMPT,
                 input=user_prompt,
-                max_output_tokens=200,
             )
-            return _extract_text_from_responses_api(response)
         except Exception as responses_exc:
             try:
                 completion = client.completions.create(
                     model=MODEL_NAME,
                     prompt=f"{SYSTEM_PROMPT}\n\n{user_prompt}",
-                    temperature=0.1,
                     max_tokens=200,
                 )
-                return _extract_text_from_completions_api(completion)
             except Exception as completions_exc:
                 print(
                     (
@@ -163,7 +246,7 @@ def get_model_message(client: OpenAI, obs_dict: dict, history: List[str]) -> str
                     file=sys.stderr,
                     flush=True,
                 )
-        return FALLBACK_ACTION_JSON
 async def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
@@ -177,6 +260,7 @@ async def main() -> None:
         steps_taken = 0
         score = 0.0
         success = False
         log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
@@ -191,25 +275,26 @@ async def main() -> None:
                 action_str = get_model_message(client, obs_dict, history)
                 # Parse LLM string to Pydantic Model
                 try:
-                    # Strip possible markdown backticks
-                    clean_str = action_str.replace("```json", "").replace("```", "").strip()
                     action_json = json.loads(clean_str)
                     action_obj = AmlAction.model_validate(action_json)
                     error = None
                 except Exception as e:
                     # Errors are data! If the LLM writes bad JSON, we catch it and force a dummy action
                     # so the environment can return a schema error to the LLM.
                     error = f"JSON Parse/Schema Error: {str(e)}"
-                    action_obj = AmlAction.model_validate(
-                        {
-                            "action": {
-                                "action_type": "submit_decision",
-                                "decision": "CLEAR",
-                                "evidence_links": [],
-                            }
-                        }
-                    )
                 obs = env.step(action_obj)
@@ -219,16 +304,21 @@ async def main() -> None:
                 rewards.append(reward)
                 steps_taken = step
-                log_step(step=step, action=action_str.replace('\n', ''), reward=reward, done=done, error=error)
                 history.append(f"Step {step}: Action: {action_str} -> Result: {obs.last_action_result} | Error: {obs.error_message}")
                 if done:
                     break
-            # Calculate a baseline score for the stdout logs (Graders handle real scoring)
-            score = sum(rewards) + 1.0 if "submit_decision" in (obs.last_action or "") else 0.0
             score = min(max(score, 0.01), 0.99)
-            success = score > 0.5
         finally:
             log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

 import json
 import textwrap
 import sys
+import re
 from typing import List, Optional
 from openai import OpenAI
     2. {"action": {"action_type": "search_transactions", "account_id": "ACC-XXXX", "keyword": "invoice"}}
     3. {"action": {"action_type": "get_kyc_record", "entity_id": "ENT-XXXX"}}
     4. {"action": {"action_type": "submit_decision", "decision": "FRAUD", "evidence_links": ["ACC-1234"]}} (Use "CLEAR" for False Positives with empty evidence_links).
+    Token-saving style rule:
+    - Think in caveman style (short, simple words).
+    - Never output prose. Output JSON only.
+    Data rule:
+    - get_kyc_record must use ENT-XXXX only, never ACC-XXXX.
     """
 ).strip()
     raise ValueError("Completions API response text is empty")
+def _coerce_json_object(raw_text: str) -> str:
+    text = raw_text.strip()
+    if text.startswith("```"):
+        text = text.replace("```json", "").replace("```", "").strip()
+    if text.startswith("{") and text.endswith("}"):
+        return text
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end > start:
+        return text[start : end + 1]
+    return text
+def _build_recovery_action_from_obs(obs_dict: dict) -> dict:
+    """Build a fallback action for malformed model output: re-query the alerted account if one is named, else submit CLEAR."""
+    alert = str(obs_dict.get("alert_details", "") or "")
+    match = re.search(r"ACC-\d+", alert)
+    if match:
+        return {
+            "action": {
+                "action_type": "query_transactions",
+                "account_id": match.group(0),
+                "limit": 10,
+                "offset": 0,
+            }
+        }
+    return {
+        "action": {
+            "action_type": "submit_decision",
+            "decision": "CLEAR",
+            "evidence_links": [],
+        }
+    }
+def _ensure_valid_action_json(raw_text: str, obs_dict: dict) -> str:
+    """Guarantee a valid action JSON string for downstream parsing."""
+    candidate = _coerce_json_object(raw_text)
+    try:
+        payload = json.loads(candidate)
+        if not isinstance(payload, dict):
+            raise ValueError("top-level JSON is not an object")
+        action = payload.get("action")
+        if not isinstance(action, dict):
+            raise ValueError("missing 'action' object")
+        action_type = action.get("action_type")
+        if not isinstance(action_type, str):
+            raise ValueError("missing 'action_type' string")
+        return json.dumps(payload, ensure_ascii=True)
+    except Exception as exc:
+        recovery_json = _build_recovery_action_from_obs(obs_dict)
+        print(
+            f"[DEBUG] Non-JSON/invalid model action; using recovery action ({exc})",
+            file=sys.stderr,
+            flush=True,
+        )
+        return json.dumps(recovery_json, ensure_ascii=True)
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}", flush=True)
+def log_thought(step: int, thought: Optional[object]) -> None:
+    """Print model thought to stderr so stdout contract stays validator-safe."""
+    if thought is None:
+        return
+    if isinstance(thought, dict):
+        compact = json.dumps(thought, ensure_ascii=True)
+    else:
+        compact = str(thought)
+    compact = compact.replace("\n", " ").strip()
+    print(f"[THOUGHT] step={step} thought={compact}", file=sys.stderr, flush=True)
 def get_model_message(client: OpenAI, obs_dict: dict, history: List[str]) -> str:
     history_block = "\n".join(history[-5:]) if history else "No previous steps."
     user_prompt = f"Observation:\n{json.dumps(obs_dict, indent=2)}\n\nHistory:\n{history_block}\n\nProvide your next JSON action:"
                 {"role": "system", "content": SYSTEM_PROMPT},
                 {"role": "user", "content": user_prompt},
             ],
+            temperature=0.0,
+            max_tokens=1000,
+            response_format={"type": "json_object"},
         )
+        return _ensure_valid_action_json(_extract_text_from_chat_completion(completion), obs_dict)
     except Exception as chat_exc:
         # Retry via Responses API for OpenAI-compatible providers that do not
         # populate chat.completions choices consistently.
                 model=MODEL_NAME,
                 instructions=SYSTEM_PROMPT,
                 input=user_prompt,
+                max_output_tokens=1000,
             )
+            return _ensure_valid_action_json(_extract_text_from_responses_api(response), obs_dict)
         except Exception as responses_exc:
             try:
                 completion = client.completions.create(
                     model=MODEL_NAME,
                     prompt=f"{SYSTEM_PROMPT}\n\n{user_prompt}",
+                    temperature=0.0,
                     max_tokens=200,
                 )
+                return _ensure_valid_action_json(_extract_text_from_completions_api(completion), obs_dict)
             except Exception as completions_exc:
                 print(
                     (
                     file=sys.stderr,
                     flush=True,
                 )
+        return _ensure_valid_action_json(FALLBACK_ACTION_JSON, obs_dict)
 async def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
         steps_taken = 0
         score = 0.0
         success = False
+        had_parse_error = False
         log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
                 action_str = get_model_message(client, obs_dict, history)
                 # Parse LLM string to Pydantic Model
+                action_for_log = action_str
                 try:
+                    clean_str = _coerce_json_object(action_str)
                     action_json = json.loads(clean_str)
+                    thought_for_log = action_json.get("thought")
+                    if thought_for_log is None:
+                        action_type = action_json.get("action", {}).get("action_type", "unknown")
+                        thought_for_log = f"do {action_type} now"
+                    log_thought(step=step, thought=thought_for_log)
                     action_obj = AmlAction.model_validate(action_json)
                     error = None
                 except Exception as e:
                     # Errors are data! If the LLM writes bad JSON, we catch it and force a dummy action
                     # so the environment can return a schema error to the LLM.
+                    had_parse_error = True
                     error = f"JSON Parse/Schema Error: {str(e)}"
+                    log_thought(step=step, thought="parse fail; use recovery action")
+                    recovery_json = _build_recovery_action_from_obs(obs_dict)
+                    action_obj = AmlAction.model_validate(recovery_json)
+                    action_for_log = json.dumps(recovery_json, ensure_ascii=True)
                 obs = env.step(action_obj)
                 rewards.append(reward)
                 steps_taken = step
+                log_step(step=step, action=action_for_log.replace('\n', ''), reward=reward, done=done, error=error)
                 history.append(f"Step {step}: Action: {action_str} -> Result: {obs.last_action_result} | Error: {obs.error_message}")
                 if done:
                     break
+            # Keep score in open interval (0,1) and avoid false positives on parse failures.
+            if had_parse_error or obs.error_message:
+                score = 0.05
+            elif "submit_decision" in (obs.last_action or ""):
+                score = 0.75
+            else:
+                score = 0.25
             score = min(max(score, 0.01), 0.99)
+            success = (not had_parse_error) and (obs.error_message is None) and score > 0.5
         finally:
             log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
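Review note on the inference.py change: the new helpers form a small recovery path (coerce the raw model text to a JSON object, check its shape, and fall back to an action derived from the observation). The following is a minimal standalone sketch of that path, not the committed code. The function names drop the leading underscore, `FENCE` is a hypothetical constant used to avoid literal backtick runs, and the observation is assumed to be a plain dict with an `alert_details` string.

```python
import json
import re

FENCE = "`" * 3  # hypothetical helper: a Markdown code fence marker


def coerce_json_object(raw_text: str) -> str:
    """Strip Markdown fences, then return the outermost {...} span if present."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        text = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return text[start : end + 1]
    return text


def build_recovery_action(obs: dict) -> dict:
    """Re-query the alerted account if one is named; otherwise submit CLEAR."""
    alert = str(obs.get("alert_details", "") or "")
    match = re.search(r"ACC-\d+", alert)
    if match:
        return {
            "action": {
                "action_type": "query_transactions",
                "account_id": match.group(0),
                "limit": 10,
                "offset": 0,
            }
        }
    return {
        "action": {
            "action_type": "submit_decision",
            "decision": "CLEAR",
            "evidence_links": [],
        }
    }


def ensure_valid_action_json(raw_text: str, obs: dict) -> str:
    """Return the model's action JSON if well-formed, else a recovery action."""
    try:
        payload = json.loads(coerce_json_object(raw_text))
        action = payload.get("action") if isinstance(payload, dict) else None
        if not isinstance(action, dict):
            raise ValueError("missing 'action' object")
        if not isinstance(action.get("action_type"), str):
            raise ValueError("missing 'action_type' string")
        return json.dumps(payload, ensure_ascii=True)
    except Exception:
        return json.dumps(build_recovery_action(obs), ensure_ascii=True)


if __name__ == "__main__":
    obs = {"alert_details": "Unusual wire activity on ACC-9001"}
    print(ensure_valid_action_json("I think we should look closer.", obs))
```

One consequence worth noting: when the alert text names no account, the fallback is still a terminal `submit_decision` with `CLEAR`, so a malformed reply can end the episode in that case.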

models.py CHANGED

@@ -11,7 +11,7 @@ The AML_env environment is a simple test environment that echoes back messages.
 """
 from openenv.core.env_server.types import Action, Observation
-from pydantic import Field
 from typing import List, Literal, Optional, Any, Union
 # ==========================================
@@ -47,8 +47,28 @@ class SubmitDecision(Action):
     decision: Literal["FRAUD", "CLEAR"] = Field(description="Your final verdict.")
     evidence_links: List[str] = Field(description="List of ACC-XXXX or ENT-XXXX IDs proving fraud.")
 # The master Action model using Union
 class AmlAction(Action):
     action: Union[QueryTransactions, SearchTransactions, GetKYCRecord, SubmitDecision] = Field(
         discriminator='action_type'
     )

 """
 from openenv.core.env_server.types import Action, Observation
+from pydantic import BaseModel, Field
 from typing import List, Literal, Optional, Any, Union
 # ==========================================
     decision: Literal["FRAUD", "CLEAR"] = Field(description="Your final verdict.")
     evidence_links: List[str] = Field(description="List of ACC-XXXX or ENT-XXXX IDs proving fraud.")
+# ==========================================
+# OPTIONAL THOUGHT SCRATCHPAD
+# ==========================================
+class ThoughtProcess(BaseModel):
+    observation: str = Field(
+        description="Analyze what just happened and summarize useful clues from the last tool output."
+    )
+    plan: str = Field(
+        description="State the next investigation step and why it follows from the current evidence."
+    )
+    action: str = Field(
+        description="Explain which tool call you are about to make and with which key parameters."
+    )
 # The master Action model using Union
 class AmlAction(Action):
+    # Keep this optional so existing inference JSON remains compatible.
+    thought: Optional[ThoughtProcess] = Field(
+        default=None,
+        description="Optional ReAct-style scratchpad for model reasoning.",
+    )
     action: Union[QueryTransactions, SearchTransactions, GetKYCRecord, SubmitDecision] = Field(
         discriminator='action_type'
     )

server/AML_env_environment.py CHANGED

@@ -13,6 +13,7 @@ explore a massive transaction graph using a strict budget.
 import json
 import os
 from pathlib import Path
 from uuid import uuid4
@@ -73,9 +74,15 @@ class AmlEnvironment(Environment):
                 # Sort transactions by timestamp to ensure deterministic pagination
                 self.transactions_db = sorted(txn_list, key=lambda x: x.get("timestamp", ""))
-            print(f"[AML-ENV] Loaded {len(self.entities_db)} entities, {len(self.accounts_db)} accounts, {len(self.transactions_db)} transactions.")
         except Exception as e:
-            print(f"[AML-ENV ERROR] Failed to load data from {data_dir}. Ensure JSON files exist. Error: {e}")
             self.entities_db = {}
             self.accounts_db = {}
             self.transactions_db = []

 import json
 import os
+import sys
 from pathlib import Path
 from uuid import uuid4
                 # Sort transactions by timestamp to ensure deterministic pagination
                 self.transactions_db = sorted(txn_list, key=lambda x: x.get("timestamp", ""))
+            print(
+                f"[AML-ENV] Loaded {len(self.entities_db)} entities, {len(self.accounts_db)} accounts, {len(self.transactions_db)} transactions.",
+                file=sys.stderr,
+            )
         except Exception as e:
+            print(
+                f"[AML-ENV ERROR] Failed to load data from {data_dir}. Ensure JSON files exist. Error: {e}",
+                file=sys.stderr,
+            )
             self.entities_db = {}
             self.accounts_db = {}
             self.transactions_db = []
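Review note on the logging change: moving the `[AML-ENV]` messages (and the new `[THOUGHT]` lines in inference.py) to stderr keeps stdout reserved for the `[START]`, step, and `[END]` lines that a validator parses. The sketch below illustrates that separation under stated assumptions: `log_start` mirrors the format shown in the diff, while `log_diagnostic` and `capture` are hypothetical helpers added only for the demonstration.

```python
import contextlib
import io
import sys
from typing import Callable, Tuple


def log_start(task: str, env: str, model: str) -> None:
    """Contract line: goes to stdout, where the validator reads it."""
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_diagnostic(message: str) -> None:
    """Diagnostic line: goes to stderr so it cannot pollute the stdout contract."""
    print(f"[AML-ENV] {message}", file=sys.stderr, flush=True)


def capture(fn: Callable[[], None]) -> Tuple[str, str]:
    """Run fn and return (stdout_text, stderr_text) for inspection."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        fn()
    return out.getvalue(), err.getvalue()


def demo() -> None:
    log_diagnostic("Loaded 3 entities, 5 accounts, 40 transactions.")
    log_start(task="aml_easy", env="AML_env", model="demo-model")


if __name__ == "__main__":
    stdout_text, stderr_text = capture(demo)
    # Only the contract line should appear on stdout.
    print(repr(stdout_text))
    print(repr(stderr_text), file=sys.stderr)
```

When running the real script, the same split can be checked from a shell by redirecting stderr to a file and confirming that stdout contains only contract lines.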