ps2181/invoice-processing-pipeline

ps2181 Claude Sonnet 4.6 committed on Apr 4

Commit 0bf71ce (1 parent: 347eb5c)

Add full invoice processing pipeline environment

- FastAPI server with /reset, /step, /state, /health, /tasks, /grader endpoints
- 3 tasks: easy (extraction), medium (batch cleaning), hard (PO reconciliation)
- Pydantic models, OpenEnv spec (openenv.yaml), partial-credit graders
- Baseline inference.py scoring easy:1.0, medium:1.0, hard:0.895 (avg 0.965)
- Dockerfile for HF Spaces (non-root UID 1000, port 7860)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (13)

.gitignore +4 -0
Dockerfile +23 -0
README.md +267 -6
__init__.py +5 -0
client.py +106 -0
inference.py +332 -0
models.py +71 -0
openenv.yaml +45 -0
pyproject.toml +0 -0
requirements.txt +5 -0
server/__init__.py +1 -0
server/app.py +158 -0
server/environment.py +638 -0

.gitignore ADDED

	@@ -0,0 +1,4 @@

+.env
+__pycache__/
+*.pyc
+*.pyo

Dockerfile ADDED

	@@ -0,0 +1,23 @@

+FROM python:3.11-slim
+# HF Spaces requires a non-root user with UID 1000
+RUN useradd -m -u 1000 user
+WORKDIR /app
+# Install dependencies first (layer caching)
+COPY --chown=user requirements.txt .
+RUN pip install --no-cache-dir --upgrade -r requirements.txt
+# Copy application code
+COPY --chown=user . /app
+# Switch to non-root user
+USER user
+ENV HOME=/home/user \
+    PATH=/home/user/.local/bin:$PATH
+# HF Spaces default port
+EXPOSE 7860
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED

@@ -1,11 +1,272 @@
 ---
 title: Invoice Processing Pipeline
-emoji: 🐨
-colorFrom: gray
-colorTo: yellow
 sdk: docker
-pinned: false
-short_description: openenv
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 title: Invoice Processing Pipeline
+emoji: 🧾
+colorFrom: blue
+colorTo: green
 sdk: docker
+app_port: 7860
+tags:
+  - openenv
 ---
+# Invoice Processing Pipeline — OpenEnv Environment
+An OpenEnv environment where an AI agent learns to **extract**, **clean**, and **reconcile** invoice data — a task that mirrors real-world accounts-payable workflows affecting every business.
+The agent receives raw invoice text (simulating OCR output or messy CSV imports), processes it into structured data, and receives graded scores (0.0–1.0) with detailed feedback at every step.
+---
+## Motivation
+Invoice processing is one of the most common, tedious, and error-prone tasks in business operations. Finance teams spend countless hours:
+- **Extracting** vendor names, dates, line items, and totals from unstructured documents
+- **Cleaning** inconsistent formats (dates, currencies, vendor name variations)
+- **Reconciling** invoices against purchase orders to catch overcharges, missing items, and billing errors
+This environment provides a controlled, reproducible setting to train and evaluate AI agents on these tasks, with clear partial-credit signals that make it suitable for RL training.
+---
+## Project Structure
+```
+invoice_processing_pipeline/
+├── models.py              Pydantic models: InvoiceAction, InvoiceObservation, InvoiceState
+├── client.py              Python client (sync + async) for training code
+├── inference.py           LLM baseline agent (OpenAI-compatible)
+├── server/
+│   ├── __init__.py
+│   ├── environment.py     Core logic: invoice generation, graders, reward computation
+│   └── app.py             FastAPI server with /reset, /step, /state endpoints
+├── openenv.yaml           OpenEnv metadata
+├── Dockerfile             Container build
+├── requirements.txt       Python dependencies
+├── pyproject.toml         Package configuration
+└── README.md              This file
+```
+---
+## Tasks
+| Task | Difficulty | Description |
+|------|-----------|-------------|
+| `easy` | Easy | Extract structured fields from a **single, clean** invoice |
+| `medium` | Medium | Clean and normalise a **batch of messy** invoices (3–5 invoices) |
+| `hard` | Hard | Extract, clean, AND **reconcile against purchase orders** with discrepancy detection |
+### Easy: Single Invoice Extraction
+The agent receives a well-formatted invoice with clear structure. It must extract: vendor name, date, currency, total, and all line items with descriptions, quantities, unit prices, and amounts.
+### Medium: Batch Invoice Cleaning
+The agent receives 3–5 invoices with realistic messiness:
+- **Date format chaos**: `01/15/2024`, `15-01-2024`, `January 15, 2024`, `15.01.2024`
+- **Vendor name typos**: `"Acme Crp"`, `"GloablTech Solutions"`, `"Prmie Office Supplies"`
+- **Mixed currency formats**: `$`, `€`, `£` symbols instead of `USD`, `EUR`, `GBP` codes
+- **String/number mixing**: amounts like `"$149.99"` instead of `149.99`
+- **Math errors**: `qty × unit_price ≠ amount` in some line items
+### Hard: Invoice-PO Reconciliation
+The agent receives messy invoices PLUS purchase orders and must:
+1. Clean all invoice data (same as medium)
+2. Compare each invoice against its corresponding PO
+3. Flag discrepancies: overcharges, extra items, and missing items
+---
+## Observation Space
+| Field | Type | Description |
+|-------|------|-------------|
+| `raw_text` | string | Raw invoice text (OCR-style or batch format) |
+| `task_id` | string | `easy`, `medium`, or `hard` |
+| `difficulty` | string | Same as `task_id` |
+| `task_description` | string | What the agent should do |
+| `attempt_number` | int | Current attempt (0 = just reset) |
+| `max_attempts` | int | Maximum allowed attempts (5) |
+| `feedback` | string | Detailed grader feedback from last attempt |
+| `hint` | string | Appears after 2+ failed attempts |
+| `reference_data` | string | Purchase order data (hard task only) |
+---
+## Action Space
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `extracted_data` | JSON object | Yes | Structured invoice data (format depends on task) |
+| `explanation` | string | No | Agent reasoning (optional) |
+### Expected `extracted_data` format by task:
+**Easy:**
+```json
+{
+    "vendor": "Acme Corp",
+    "date": "2024-06-15",
+    "currency": "USD",
+    "total": 1249.95,
+    "line_items": [
+        {"description": "Laptop Computer", "qty": 1, "unit_price": 1099.99, "amount": 1099.99},
+        {"description": "Wireless Mouse", "qty": 5, "unit_price": 29.99, "amount": 149.95}
+    ]
+}
+```
+**Medium:**
+```json
+{
+    "invoices": [
+        {"vendor": "...", "date": "YYYY-MM-DD", "currency": "USD", "total": 0.0, "line_items": [...]}
+    ]
+}
+```
+**Hard:**
+```json
+{
+    "invoices": [...],
+    "discrepancies": [
+        {"invoice_idx": 0, "type": "overcharge", "item_description": "Laptop Computer", "detail": "Invoice price 1199.99 vs PO price 1099.99"}
+    ]
+}
+```
+---
+## Reward Function
+Rewards are provided at **every step** (not just terminal), giving agents a rich training signal.
+### Easy Task Scoring (0.0–1.0)
+| Component | Weight | Condition |
+|-----------|--------|-----------|
+| Vendor name | 0.15 | Exact match (case-insensitive) |
+| Date | 0.10 | Exact match (YYYY-MM-DD) |
+| Currency | 0.05 | Exact match (3-letter code) |
+| Total | 0.20 | Within ±0.01 |
+| Line items | 0.50 | Per-item matching on description, qty, unit_price, amount |
+### Medium Task Scoring
+Average of per-invoice scores using the Easy grading rubric across the full batch.
+### Hard Task Scoring
+| Component | Weight |
+|-----------|--------|
+| Extraction + Cleaning | 60% (same as Medium grading) |
+| Discrepancy Detection | 40% (precision + recall of flagged discrepancies) |
+### Attempt Penalty
+If all 5 attempts are exhausted without reaching 95% score, a **0.85× multiplier** is applied to the final reward.
+---
+## Setup and Usage
+### Local Development
+```bash
+# Clone the repository
+git clone <your-repo-url>
+cd invoice_processing_pipeline
+# Install dependencies
+pip install -r requirements.txt
+# Start the server
+uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
+# Test with curl
+curl http://localhost:7860/health
+curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "easy"}'
+```
+### Docker
+```bash
+docker build -t invoice-env .
+docker run -p 7860:7860 invoice-env
+```
+### Running the Baseline
+```bash
+export HF_TOKEN=your_token_here
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+export ENV_URL=http://localhost:7860
+python inference.py
+```
+### Python Client
+```python
+from client import InvoiceEnvClient
+with InvoiceEnvClient("http://localhost:7860") as env:
+    result = env.reset(task_id="easy")
+    print(result["observation"]["raw_text"])
+    result = env.step({
+        "vendor": "Acme Corp",
+        "date": "2024-06-15",
+        "currency": "USD",
+        "total": 1249.95,
+        "line_items": [...]
+    })
+    print(f"Score: {result['reward']}")
+    print(f"Feedback: {result['observation']['feedback']}")
+```
+---
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset` | POST | Start a new episode (`{"task_id": "easy\|medium\|hard"}`) |
+| `/step` | POST | Submit extracted data, get reward + feedback |
+| `/state` | GET | Get current episode metadata |
+| `/tasks` | GET | List all tasks with schemas |
+| `/grader` | POST | Score a submission without modifying state |
+| `/health` | GET | Health check |
+| `/docs` | GET | Swagger API docs |
+---
+## Baseline Scores
+| Agent | Easy | Medium | Hard | Average |
+|-------|------|--------|------|---------|
+| Oracle (ground truth) | 1.00 | 1.00 | 1.00 | 1.00 |
+| Qwen2.5-72B-Instruct | ~0.90 | ~0.65 | ~0.45 | ~0.67 |
+| Random (empty JSON) | 0.00 | 0.00 | 0.00 | 0.00 |
+*Scores are approximate and may vary due to random invoice generation.*
+---
+## Design Decisions
+- **Synthetic data generation**: Every episode creates fresh invoices, preventing memorisation and ensuring reproducibility via random seeds.
+- **Partial credit at every step**: The grader scores each component independently (vendor, date, line items, etc.), giving agents fine-grained reward signal.
+- **Progressive difficulty**: Easy tests pure extraction, Medium adds data quality issues, Hard adds cross-document reasoning.
+- **Realistic noise**: Vendor typos, date format variations, and currency symbol mixing are modelled after actual OCR and data entry errors.
+- **Attempt-based penalty**: Encourages agents to get it right early rather than brute-forcing over many attempts.
+---
+## Links
+- OpenEnv GitHub: https://github.com/meta-pytorch/OpenEnv
+- Hugging Face Environment Hub: https://huggingface.co/openenv
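
To make the reward tables in the README above concrete, the rubric could be sketched roughly as follows. The component weights, the 60/40 split, and the 0.85 multiplier come from the README. Everything else is an assumption: the function names are hypothetical, line items are compared by position, and the discrepancy component uses F1 as one plausible reading of "precision + recall". The real grader lives in server/environment.py and may differ.

```python
from typing import Any, Dict, List

# Weights taken from the "Easy Task Scoring" table in the README.
WEIGHTS = {"vendor": 0.15, "date": 0.10, "currency": 0.05, "total": 0.20, "line_items": 0.50}


def _close(a: Any, b: Any, tol: float = 0.01) -> bool:
    """True when both values are numeric and within tol (plus a little float slack)."""
    try:
        return abs(float(a) - float(b)) <= tol + 1e-9
    except (TypeError, ValueError):
        return False


def score_line_items(pred: List[dict], truth: List[dict]) -> float:
    """Assumed scheme: fraction of ground-truth items matched exactly, by position."""
    if not truth:
        return 1.0 if not pred else 0.0
    hits = 0
    for p, t in zip(pred, truth):
        same_text = str(p.get("description", "")).strip().lower() == t["description"].lower()
        same_nums = p.get("qty") == t["qty"] and all(
            _close(p.get(k), t[k]) for k in ("unit_price", "amount")
        )
        if same_text and same_nums:
            hits += 1
    return hits / len(truth)


def score_easy(pred: Dict[str, Any], truth: Dict[str, Any]) -> float:
    """Weighted sum of the five components from the Easy rubric."""
    score = 0.0
    if str(pred.get("vendor", "")).strip().lower() == truth["vendor"].lower():
        score += WEIGHTS["vendor"]
    if pred.get("date") == truth["date"]:
        score += WEIGHTS["date"]
    if pred.get("currency") == truth["currency"]:
        score += WEIGHTS["currency"]
    if _close(pred.get("total"), truth["total"]):
        score += WEIGHTS["total"]
    score += WEIGHTS["line_items"] * score_line_items(pred.get("line_items", []), truth["line_items"])
    return round(score, 4)


def score_hard(extraction: float, precision: float, recall: float) -> float:
    """60% extraction + 40% discrepancy detection (using F1 here is an assumption)."""
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return round(0.6 * extraction + 0.4 * f1, 4)


def apply_attempt_penalty(best: float, attempts_used: int, max_attempts: int = 5) -> float:
    """Apply the 0.85x multiplier when all attempts are used without reaching 0.95."""
    if attempts_used >= max_attempts and best < 0.95:
        return round(best * 0.85, 4)
    return best
```

Under this sketch, a submission that gets everything right except the date would score 0.90, since only the 0.10 date component is lost.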

__init__.py ADDED

	@@ -0,0 +1,5 @@

+"""Invoice Processing Pipeline — OpenEnv Environment."""
+from models import InvoiceAction, InvoiceObservation, InvoiceState
+__all__ = ["InvoiceAction", "InvoiceObservation", "InvoiceState"]

client.py ADDED

	@@ -0,0 +1,106 @@

+"""
+Python client for the Invoice Processing Pipeline environment.
+Usage:
+    from client import InvoiceEnvClient
+    from models import InvoiceAction
+    client = InvoiceEnvClient(base_url="http://localhost:7860")
+    result = client.reset(task_id="easy")
+    print(result["observation"]["raw_text"])
+    result = client.step({"vendor": "Acme Corp", "date": "2024-06-15", ...})
+    print(result["reward"])
+"""
+from __future__ import annotations
+from typing import Any, Dict, Optional
+import httpx
+class InvoiceEnvClient:
+    """Synchronous HTTP client for the Invoice Processing Pipeline."""
+    def __init__(self, base_url: str = "http://localhost:7860", timeout: float = 30.0):
+        self.base_url = base_url.rstrip("/")
+        self._client = httpx.Client(timeout=timeout)
+    def reset(self, task_id: str = "easy") -> Dict[str, Any]:
+        """Reset the environment for a new episode."""
+        resp = self._client.post(f"{self.base_url}/reset", json={"task_id": task_id})
+        resp.raise_for_status()
+        return resp.json()
+    def step(self, extracted_data: Dict[str, Any], explanation: str = "") -> Dict[str, Any]:
+        """Submit extracted/cleaned data and get reward + feedback."""
+        resp = self._client.post(
+            f"{self.base_url}/step",
+            json={"extracted_data": extracted_data, "explanation": explanation},
+        )
+        resp.raise_for_status()
+        return resp.json()
+    def state(self) -> Dict[str, Any]:
+        """Get current episode state."""
+        resp = self._client.get(f"{self.base_url}/state")
+        resp.raise_for_status()
+        return resp.json()
+    def tasks(self) -> Dict[str, Any]:
+        """List available tasks and schemas."""
+        resp = self._client.get(f"{self.base_url}/tasks")
+        resp.raise_for_status()
+        return resp.json()
+    def health(self) -> Dict[str, Any]:
+        """Check server health."""
+        resp = self._client.get(f"{self.base_url}/health")
+        resp.raise_for_status()
+        return resp.json()
+    def close(self):
+        """Close the HTTP client."""
+        self._client.close()
+    def __enter__(self):
+        return self
+    def __exit__(self, *args):
+        self.close()
+class AsyncInvoiceEnvClient:
+    """Async HTTP client for the Invoice Processing Pipeline."""
+    def __init__(self, base_url: str = "http://localhost:7860", timeout: float = 30.0):
+        self.base_url = base_url.rstrip("/")
+        self._client = httpx.AsyncClient(timeout=timeout)
+    async def reset(self, task_id: str = "easy") -> Dict[str, Any]:
+        resp = await self._client.post(f"{self.base_url}/reset", json={"task_id": task_id})
+        resp.raise_for_status()
+        return resp.json()
+    async def step(self, extracted_data: Dict[str, Any], explanation: str = "") -> Dict[str, Any]:
+        resp = await self._client.post(
+            f"{self.base_url}/step",
+            json={"extracted_data": extracted_data, "explanation": explanation},
+        )
+        resp.raise_for_status()
+        return resp.json()
+    async def state(self) -> Dict[str, Any]:
+        resp = await self._client.get(f"{self.base_url}/state")
+        resp.raise_for_status()
+        return resp.json()
+    async def close(self):
+        await self._client.aclose()
+    async def __aenter__(self):
+        return self
+    async def __aexit__(self, *args):
+        await self.close()
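
One way to drive either client is a bounded retry loop that stops when the server reports `done`. The sketch below runs that loop against a small in-memory stand-in so it can be executed without a server. `FakeEnv`, `run_episode`, and the `policy` callable are hypothetical names, the toy reward is not the real grader, and the response shape (`observation`, `reward`, `done`, `info`) is taken from the client docstring and the README example.

```python
from typing import Any, Callable, Dict, List


class FakeEnv:
    """In-memory stand-in with the same reset/step shape as InvoiceEnvClient (assumption)."""

    def __init__(self, target: Dict[str, Any], max_attempts: int = 5):
        self.target = target
        self.max_attempts = max_attempts
        self.attempts = 0

    def reset(self, task_id: str = "easy") -> Dict[str, Any]:
        self.attempts = 0
        obs = {"raw_text": "INVOICE (placeholder)", "feedback": "", "attempt_number": 0}
        return {"observation": obs, "reward": 0.0, "done": False, "info": {}}

    def step(self, extracted_data: Dict[str, Any], explanation: str = "") -> Dict[str, Any]:
        self.attempts += 1
        # Toy reward: share of target fields reproduced exactly.
        hits = sum(1 for k, v in self.target.items() if extracted_data.get(k) == v)
        reward = hits / len(self.target)
        done = reward >= 0.95 or self.attempts >= self.max_attempts
        obs = {
            "raw_text": "INVOICE (placeholder)",
            "feedback": f"{hits} field(s) correct",
            "attempt_number": self.attempts,
        }
        return {"observation": obs, "reward": reward, "done": done, "info": {}}


def run_episode(
    env: Any,
    policy: Callable[[Dict[str, Any]], Dict[str, Any]],
    task_id: str = "easy",
    max_steps: int = 5,
) -> List[float]:
    """Call policy(observation) until the episode ends; return the per-step rewards."""
    result = env.reset(task_id=task_id)
    rewards: List[float] = []
    for _ in range(max_steps):
        if result.get("done", False):
            break
        result = env.step(policy(result["observation"]))
        rewards.append(result.get("reward", 0.0))
    return rewards
```

With a running server, an `InvoiceEnvClient` instance could presumably be passed in place of `FakeEnv`, since it exposes `reset(task_id=...)` and `step(extracted_data, ...)` with the same call shape.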

inference.py ADDED

	@@ -0,0 +1,332 @@

+"""
+Inference Script — Invoice Processing Pipeline
+================================================
+Runs an LLM agent against all 3 tasks (easy, medium, hard) and produces
+structured stdout logs in the mandatory [START]/[STEP]/[END] format.
+Environment variables:
+    API_BASE_URL   LLM endpoint (default: HF router)
+    MODEL_NAME     Model identifier
+    HF_TOKEN       API key
+"""
+import json
+import os
+import textwrap
+from typing import Any, Dict, List, Optional
+import httpx
+from dotenv import load_dotenv
+from openai import OpenAI
+load_dotenv()
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
+BENCHMARK = "invoice_processing_pipeline"
+MAX_STEPS = 5
+TEMPERATURE = 0.3
+MAX_TOKENS = 2048
+SUCCESS_THRESHOLD = 0.5
+# ---------------------------------------------------------------------------
+# Logging helpers (mandatory format)
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    # Truncate action for readability
+    action_short = action[:200].replace("\n", " ") if action else "null"
+    print(
+        f"[STEP] step={step} action={action_short} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+# ---------------------------------------------------------------------------
+# System prompts per task
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPTS = {
+    "easy": textwrap.dedent("""
+        You are an invoice data extraction agent. You receive raw invoice text and must
+        extract structured data from it.
+        RESPOND WITH ONLY A VALID JSON OBJECT (no markdown, no explanation, no backticks).
+        Required JSON structure:
+        {
+            "vendor": "string",
+            "date": "YYYY-MM-DD",
+            "currency": "USD|EUR|GBP",
+            "total": number,
+            "line_items": [
+                {"description": "string", "qty": integer, "unit_price": number, "amount": number}
+            ]
+        }
+        Rules:
+        - Date must be in YYYY-MM-DD format
+        - Currency must be a 3-letter code (USD, EUR, GBP)
+        - Total and amounts must be numbers, not strings
+        - Include ALL line items from the invoice
+        - amount = qty * unit_price
+    """).strip(),
+    "medium": textwrap.dedent("""
+        You are an invoice data cleaning agent. You receive a batch of messy invoices
+        and must clean and normalise them.
+        RESPOND WITH ONLY A VALID JSON OBJECT (no markdown, no explanation, no backticks).
+        Required JSON structure:
+        {
+            "invoices": [
+                {
+                    "vendor": "corrected vendor name",
+                    "date": "YYYY-MM-DD",
+                    "currency": "USD|EUR|GBP",
+                    "total": number,
+                    "line_items": [
+                        {"description": "string", "qty": integer, "unit_price": number, "amount": number}
+                    ]
+                }
+            ]
+        }
+        Cleaning rules:
+        - Fix vendor name typos (e.g. "Acme Crp" -> "Acme Corp")
+        - Normalise dates to YYYY-MM-DD
+        - Convert currency symbols ($, €, £) to codes (USD, EUR, GBP)
+        - Strip currency symbols from amounts and ensure they are numbers
+        - Verify line item math: amount = qty * unit_price. If wrong, recalculate amount.
+        - Recalculate totals as sum of line item amounts
+    """).strip(),
+    "hard": textwrap.dedent("""
+        You are an invoice reconciliation agent. You receive messy invoices AND purchase
+        orders. You must clean the invoices AND identify discrepancies between invoices
+        and their corresponding purchase orders.
+        RESPOND WITH ONLY A VALID JSON OBJECT (no markdown, no explanation, no backticks).
+        Required JSON structure:
+        {
+            "invoices": [
+                {
+                    "vendor": "corrected name",
+                    "date": "YYYY-MM-DD",
+                    "currency": "USD|EUR|GBP",
+                    "total": number,
+                    "line_items": [
+                        {"description": "string", "qty": integer, "unit_price": number, "amount": number}
+                    ]
+                }
+            ],
+            "discrepancies": [
+                {
+                    "invoice_idx": 0,
+                    "type": "overcharge|extra_item|missing_item",
+                    "item_description": "string",
+                    "detail": "description of the discrepancy"
+                }
+            ]
+        }
+        Discrepancy types:
+        - "overcharge": invoice unit_price > PO unit_price for same item
+        - "extra_item": item on invoice but not on PO
+        - "missing_item": item on PO but not on invoice
+        Also apply all cleaning rules: fix vendor names, normalise dates, convert currencies, fix amounts.
+    """).strip(),
+}
+# ---------------------------------------------------------------------------
+# Agent logic
+# ---------------------------------------------------------------------------
+def build_user_prompt(task_id: str, observation: Dict[str, Any], step: int) -> str:
+    """Build the user prompt from the observation."""
+    parts = [f"Step {step} of {observation['max_attempts']}"]
+    if observation.get("feedback"):
+        parts.append(f"\nFeedback from previous attempt:\n{observation['feedback']}")
+    if observation.get("hint"):
+        parts.append(f"\nHint: {observation['hint']}")
+    parts.append(f"\nTask: {observation['task_description']}")
+    parts.append(f"\n--- RAW INVOICE DATA ---\n{observation['raw_text']}")
+    if observation.get("reference_data"):
+        parts.append(f"\n--- PURCHASE ORDER DATA ---\n{observation['reference_data']}")
+    parts.append("\nExtract/clean the data and respond with ONLY valid JSON:")
+    return "\n".join(parts)
+def get_model_response(client: OpenAI, task_id: str, observation: Dict[str, Any], step: int) -> Dict[str, Any]:
+    """Call the LLM and parse its JSON response."""
+    user_prompt = build_user_prompt(task_id, observation, step)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPTS[task_id]},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        raw = (completion.choices[0].message.content or "").strip()
+        # Strip markdown code fences if present
+        if raw.startswith("```"):
+            raw = raw.split("\n", 1)[-1] if "\n" in raw else raw[3:]
+            if raw.endswith("```"):
+                raw = raw[:-3]
+            raw = raw.strip()
+        return json.loads(raw)
+    except json.JSONDecodeError as e:
+        print(f"[DEBUG] JSON parse error: {e}", flush=True)
+        print(f"[DEBUG] Raw response: {raw[:500]}", flush=True)
+        return {}
+    except Exception as e:
+        print(f"[DEBUG] Model request failed: {e}", flush=True)
+        return {}
+# ---------------------------------------------------------------------------
+# Environment HTTP client
+# ---------------------------------------------------------------------------
+class EnvClient:
+    """Simple HTTP client for the Invoice Processing Pipeline environment."""
+    def __init__(self, base_url: str):
+        self.base_url = base_url.rstrip("/")
+        self.client = httpx.Client(timeout=30.0)
+    def reset(self, task_id: str = "easy") -> Dict[str, Any]:
+        resp = self.client.post(f"{self.base_url}/reset", json={"task_id": task_id})
+        resp.raise_for_status()
+        return resp.json()
+    def step(self, extracted_data: Dict[str, Any], explanation: str = "") -> Dict[str, Any]:
+        resp = self.client.post(
+            f"{self.base_url}/step",
+            json={"extracted_data": extracted_data, "explanation": explanation},
+        )
+        resp.raise_for_status()
+        return resp.json()
+    def state(self) -> Dict[str, Any]:
+        resp = self.client.get(f"{self.base_url}/state")
+        resp.raise_for_status()
+        return resp.json()
+    def close(self):
+        self.client.close()
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def run_task(client: OpenAI, env: EnvClient, task_id: str) -> float:
+    """Run a single task and return the final score."""
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    try:
+        result = env.reset(task_id=task_id)
+        observation = result["observation"]
+        for step in range(1, MAX_STEPS + 1):
+            if result.get("done", False):
+                break
+            extracted = get_model_response(client, task_id, observation, step)
+            action_str = json.dumps(extracted)[:200]
+            result = env.step(extracted_data=extracted)
+            observation = result["observation"]
+            reward = result.get("reward", 0.0)
+            done = result.get("done", False)
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step=step, action=action_str, reward=reward, done=done, error=None)
+            if done:
+                break
+        score = max(rewards) if rewards else 0.0
+        success = score >= SUCCESS_THRESHOLD
+    except Exception as e:
+        print(f"[DEBUG] Task {task_id} error: {e}", flush=True)
+        log_step(step=steps_taken + 1, action="error", reward=0.0, done=True, error=str(e))
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return score
+def main() -> None:
+    """Run all 3 tasks and report scores."""
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = EnvClient(ENV_URL)
+    scores = {}
+    try:
+        for task_id in ["easy", "medium", "hard"]:
+            scores[task_id] = run_task(client, env, task_id)
+            print(flush=True)
+        avg = sum(scores.values()) / len(scores) if scores else 0.0
+        print(f"\n=== BASELINE SCORES ===", flush=True)
+        for tid, sc in scores.items():
+            print(f"  {tid}: {sc:.3f}", flush=True)
+        print(f"  average: {avg:.3f}", flush=True)
+    finally:
+        env.close()
+if __name__ == "__main__":
+    main()
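
`get_model_response` strips a leading code fence before calling `json.loads`. A slightly more defensive variant, shown here as a standalone sketch, also falls back to the outermost `{...}` span when the model wraps its JSON in prose. `parse_json_reply` is a hypothetical helper and not part of this commit.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe


def parse_json_reply(raw: str) -> Dict[str, Any]:
    """Best-effort extraction of a JSON object from an LLM reply; returns {} on failure."""
    text = (raw or "").strip()
    # Drop a Markdown fence (optionally tagged "json") if the reply starts with one.
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else text[3:]
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-3]
        text = text.strip()
    candidates = [text]
    # Fallback: the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # The environment expects an object, so lists and scalars are rejected.
        if isinstance(parsed, dict):
            return parsed
    return {}
```

Returning an empty dict on failure mirrors the behaviour of the original function, so the episode loop can still submit a step and receive grader feedback.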

models.py ADDED

	@@ -0,0 +1,71 @@

+"""
+Pydantic models for the Invoice Processing Pipeline environment.
+Action:  Agent submits extracted/cleaned/reconciled invoice data as JSON.
+Observation: Agent receives raw invoice text, feedback, and task context.
+State:   Tracks episode progress, attempts, and scores.
+"""
+from typing import Any, Dict, List, Optional
+from pydantic import BaseModel, Field
+# ---------------------------------------------------------------------------
+# Action
+# ---------------------------------------------------------------------------
+class InvoiceAction(BaseModel):
+    """Action the agent submits each step."""
+    extracted_data: Dict[str, Any] = Field(
+        ...,
+        description=(
+            "JSON object with extracted/cleaned invoice fields. "
+            "Structure depends on the task. "
+            "Easy: {vendor, date, currency, total, line_items: [{description, qty, unit_price, amount}]}. "
+            "Medium: {invoices: [{vendor, date, currency, total, line_items}]} (batch of cleaned invoices). "
+            "Hard: {invoices: [...], discrepancies: [{invoice_idx, type, item_description, detail}]}."
+        ),
+    )
+    explanation: str = Field(
+        default="",
+        description="Optional reasoning about extraction or cleaning decisions.",
+    )
+# ---------------------------------------------------------------------------
+# Observation
+# ---------------------------------------------------------------------------
+class InvoiceObservation(BaseModel):
+    """What the agent sees each turn."""
+    raw_text: str = Field(..., description="Raw invoice text (OCR-style or CSV-style)")
+    task_id: str = Field(..., description="easy | medium | hard")
+    difficulty: str = Field(..., description="Same as task_id")
+    task_description: str = Field(..., description="What the agent should do")
+    attempt_number: int = Field(default=0, description="Current attempt (0 = just reset)")
+    max_attempts: int = Field(default=5, description="Max allowed attempts")
+    feedback: str = Field(default="", description="Detailed grader feedback from last attempt")
+    hint: str = Field(default="", description="Hint shown after 2+ failed attempts")
+    reference_data: str = Field(
+        default="",
+        description="For hard task: purchase order data to reconcile against",
+    )
+# ---------------------------------------------------------------------------
+# State
+# ---------------------------------------------------------------------------
+class InvoiceState(BaseModel):
+    """Internal episode state."""
+    episode_id: str = Field(default="")
+    task_id: str = Field(default="easy")
+    step_count: int = Field(default=0)
+    done: bool = Field(default=False)
+    last_reward: float = Field(default=0.0)
+    best_reward: float = Field(default=0.0)
+    rewards: List[float] = Field(default_factory=list)
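
The `discrepancies` list in the hard-task action uses three types named in the inference prompt: `overcharge`, `extra_item`, and `missing_item`. Once invoice and PO line items are structured, these can be computed deterministically. The following is a sketch under assumptions: items are matched on case-insensitive description, `find_discrepancies` is a hypothetical helper, and the output keys follow the README's hard-task example rather than a confirmed grader schema.

```python
from typing import Any, Dict, List


def find_discrepancies(
    invoice_idx: int,
    invoice_items: List[Dict[str, Any]],
    po_items: List[Dict[str, Any]],
    tol: float = 0.01,
) -> List[Dict[str, Any]]:
    """Compare one invoice's line items with those of its purchase order."""
    po_by_name = {item["description"].strip().lower(): item for item in po_items}
    seen = set()
    found: List[Dict[str, Any]] = []

    for item in invoice_items:
        key = item["description"].strip().lower()
        seen.add(key)
        po_item = po_by_name.get(key)
        if po_item is None:
            # On the invoice but not on the PO.
            found.append({
                "invoice_idx": invoice_idx,
                "type": "extra_item",
                "item_description": item["description"],
                "detail": "Item appears on the invoice but not on the PO",
            })
        elif item["unit_price"] - po_item["unit_price"] > tol:
            # Same item, but billed at a higher unit price than ordered.
            found.append({
                "invoice_idx": invoice_idx,
                "type": "overcharge",
                "item_description": item["description"],
                "detail": f"Invoice price {item['unit_price']} vs PO price {po_item['unit_price']}",
            })

    for key, po_item in po_by_name.items():
        if key not in seen:
            # Ordered on the PO but absent from the invoice.
            found.append({
                "invoice_idx": invoice_idx,
                "type": "missing_item",
                "item_description": po_item["description"],
                "detail": "Item appears on the PO but not on the invoice",
            })
    return found
```

Exact description matching is the weakest part of this sketch; real invoices would likely need fuzzy matching or SKU-based keys.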

openenv.yaml ADDED

	@@ -0,0 +1,45 @@

+name: invoice_processing_pipeline
+version: "1.0.0"
+description: >
+  An OpenEnv environment for training AI agents on real-world invoice processing:
+  data extraction from OCR text, batch cleaning & normalisation, and
+  reconciliation against purchase orders with discrepancy detection.
+author: "OpenEnv Challenge Submission"
+license: "MIT"
+tags:
+  - openenv
+  - invoice
+  - data-extraction
+  - data-cleaning
+  - reconciliation
+  - finance
+environment:
+  module: server.app
+  class: InvoiceEnvironment
+  action: models.InvoiceAction
+  observation: models.InvoiceObservation
+tasks:
+  - id: easy
+    name: "Single Invoice Extraction"
+    description: "Extract structured fields (vendor, date, currency, total, line items) from a single invoice."
+    difficulty: easy
+  - id: medium
+    name: "Batch Invoice Cleaning"
+    description: "Clean and normalise a batch of messy invoices: fix dates, vendor typos, currency codes, and amounts."
+    difficulty: medium
+  - id: hard
+    name: "Invoice-PO Reconciliation"
+    description: "Extract, clean, and reconcile invoices against purchase orders. Flag overcharges, extra items, and missing items."
+    difficulty: hard
+endpoints:
+  reset: /reset
+  step: /step
+  state: /state
+  health: /health
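
The medium task's cleaning steps (dates, currency codes, numeric amounts) lend themselves to small deterministic helpers. The sketch below is illustrative only: the helper names are hypothetical, the date formats are the four listed in the README plus ISO, and slash-separated dates are assumed to be month-first, which will not hold for all real data. Vendor typo correction is deliberately left out.

```python
import re
from datetime import datetime

# Formats listed in the README; slash dates are assumed to be month-first.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y", "%d.%m.%Y")
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}


def normalise_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")


def normalise_currency(value: str) -> str:
    """Map a currency symbol to its 3-letter code; upper-case anything else."""
    value = value.strip()
    return CURRENCY_CODES.get(value, value.upper())


def parse_amount(value) -> float:
    """Accept 149.99, "149.99", "$149.99" or "1,099.99" and return a float."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = re.sub(r"[^\d.\-]", "", str(value))
    return float(cleaned)


def fix_line_item(item: dict) -> dict:
    """Recompute amount as qty * unit_price, rounded to cents."""
    qty = int(item["qty"])
    unit_price = parse_amount(item["unit_price"])
    return {
        "description": item["description"],
        "qty": qty,
        "unit_price": unit_price,
        "amount": round(qty * unit_price, 2),
    }
```

After fixing line items, the invoice total can be recomputed as the sum of the corrected `amount` values, as the medium prompt in inference.py instructs.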

pyproject.toml ADDED

File without changes

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+fastapi>=0.104.0
+uvicorn[standard]>=0.24.0
+pydantic>=2.5.0
+httpx>=0.25.0
+openai>=1.0.0
+python-dotenv>=1.0.0

server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Server package for Invoice Processing Pipeline."""

server/app.py ADDED

	@@ -0,0 +1,158 @@

+"""
+FastAPI server for Invoice Processing Pipeline environment.
+Exposes /reset, /step, /state, /health, /tasks, /grader endpoints.
+"""
+from __future__ import annotations
+import json
+from typing import Any, Dict, Optional
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from models import InvoiceAction, InvoiceObservation, InvoiceState
+from server.environment import InvoiceEnvironment
+app = FastAPI(
+    title="Invoice Processing Pipeline",
+    description="OpenEnv environment for invoice data extraction, cleaning, and reconciliation.",
+    version="1.0.0",
+)
+# Single environment instance (one episode at a time for the HF Space)
+env = InvoiceEnvironment()
+# ---------------------------------------------------------------------------
+# Request / Response schemas
+# ---------------------------------------------------------------------------
+class ResetRequest(BaseModel):
+    task_id: str = "easy"
+class StepRequest(BaseModel):
+    extracted_data: Dict[str, Any]
+    explanation: str = ""
+class ResetResponse(BaseModel):
+    observation: Dict[str, Any]
+    reward: float
+    done: bool
+    info: Dict[str, Any]
+class StepResponse(BaseModel):
+    observation: Dict[str, Any]
+    reward: float
+    done: bool
+    info: Dict[str, Any]
+class StateResponse(BaseModel):
+    episode_id: str
+    task_id: str
+    step_count: int
+    done: bool
+    last_reward: float
+    best_reward: float
+    rewards: list
+# ---------------------------------------------------------------------------
+# Endpoints
+# ---------------------------------------------------------------------------
+@app.get("/health")
+def health():
+    return {"status": "ok", "environment": "invoice_processing_pipeline"}
+@app.get("/tasks")
+def list_tasks():
+    """List available tasks with descriptions."""
+    tasks = []
+    for tid, info in InvoiceEnvironment.TASKS.items():
+        tasks.append({
+            "task_id": tid,
+            "description": info["description"],
+            "max_attempts": info["max_attempts"],
+        })
+    return {
+        "tasks": tasks,
+        "action_schema": InvoiceAction.model_json_schema(),
+        "observation_schema": InvoiceObservation.model_json_schema(),
+    }
+@app.post("/reset")
+def reset(req: ResetRequest = ResetRequest()):
+    obs, reward, done, info = env.reset(task_id=req.task_id)
+    return ResetResponse(
+        observation=obs.model_dump(),
+        reward=reward,
+        done=done,
+        info=info,
+    )
+@app.post("/step")
+def step(req: StepRequest):
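+    if env._ground_truth is None:
+        raise HTTPException(status_code=400, detail="No active episode. Call /reset first.")
+    if env.state.done:
+        raise HTTPException(status_code=400, detail="Episode is done. Call /reset first.")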
+    action = InvoiceAction(
+        extracted_data=req.extracted_data,
+        explanation=req.explanation,
+    )
+    obs, reward, done, info = env.step(action)
+    return StepResponse(
+        observation=obs.model_dump(),
+        reward=reward,
+        done=done,
+        info=info,
+    )
+@app.get("/state")
+def get_state():
+    s = env.state
+    return StateResponse(
+        episode_id=s.episode_id,
+        task_id=s.task_id,
+        step_count=s.step_count,
+        done=s.done,
+        last_reward=s.last_reward,
+        best_reward=s.best_reward,
+        rewards=s.rewards,
+    )
+@app.post("/grader")
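+def grader(req: StepRequest):
+    """Score a submission without modifying episode state (for testing)."""
+    # The graders are pure functions, so no snapshot of the episode state is needed.
+    if env._ground_truth is None:
+        raise HTTPException(status_code=400, detail="No active episode. Call /reset first.")
+    action = InvoiceAction(extracted_data=req.extracted_data, explanation=req.explanation)
+    task_id = env.state.task_id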
+    if task_id == "easy":
+        from server.environment import _grade_easy
+        score, feedback = _grade_easy(action.extracted_data, env._ground_truth)
+    elif task_id == "medium":
+        from server.environment import _grade_medium
+        score, feedback = _grade_medium(action.extracted_data, env._ground_truth)
+    else:
+        from server.environment import _grade_hard
+        score, feedback = _grade_hard(
+            action.extracted_data, env._ground_truth, env._expected_discrepancies
+        )
+    return {"score": score, "feedback": feedback}
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860)
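A request body for `/step` can be assembled on the client side before it is posted. The following is a minimal sketch, assuming the field names given in the easy task description and the `StepRequest` schema above. `build_step_payload` is a hypothetical helper, not part of this commit, and it assumes every line-item amount equals `qty * unit_price` rounded to cents.

```python
import json
from typing import Any, Dict, List


def build_step_payload(vendor: str, date_iso: str, currency: str,
                       line_items: List[Dict[str, Any]],
                       explanation: str = "") -> Dict[str, Any]:
    """Assemble a /step body for the easy task (hypothetical client helper)."""
    items = []
    for it in line_items:
        qty = int(it["qty"])
        unit_price = round(float(it["unit_price"]), 2)
        items.append({
            "description": str(it["description"]).strip(),
            "qty": qty,
            "unit_price": unit_price,
            # Assumption: amount is always qty * unit_price, rounded to cents.
            "amount": round(qty * unit_price, 2),
        })
    return {
        "extracted_data": {
            "vendor": vendor.strip(),
            "date": date_iso,  # expected as YYYY-MM-DD
            "currency": currency.strip().upper(),
            "total": round(sum(i["amount"] for i in items), 2),
            "line_items": items,
        },
        "explanation": explanation,
    }


payload = build_step_payload(
    "Acme Corp", "2024-03-15", "usd",
    [{"description": "Keyboard", "qty": 2, "unit_price": "49.99"}],
)
body = json.dumps(payload)  # JSON text that a client could POST to /step
```

Recomputing the amounts and the total on the client keeps the submission internally consistent, which matters for the medium and hard tasks where the rendered text may contain deliberately wrong amounts.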

server/environment.py ADDED Viewed

	@@ -0,0 +1,638 @@

+"""
+Invoice Processing Pipeline — Core Environment
+Three tasks:
+  easy   — Extract structured fields from a single, relatively clean invoice.
+  medium — Clean & normalise a batch of messy invoices (date formats, vendor
+           name typos, currency symbols, inconsistent line-item amounts).
+  hard   — Extract, clean, AND reconcile against purchase orders; flag
+           mismatches, overcharges, and missing items.
+Each episode generates fresh synthetic data so the agent cannot memorize.
+"""
+from __future__ import annotations
+import copy
+import json
+import random
+import re
+import string
+import uuid
+from datetime import date, timedelta
+from typing import Any, Dict, List, Optional, Tuple
+from models import InvoiceAction, InvoiceObservation, InvoiceState
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+VENDORS = [
+    "Acme Corp", "GlobalTech Solutions", "Prime Office Supplies",
+    "DataStream Inc", "CloudNine Services", "Metro Logistics",
+    "Pinnacle Electronics", "Summit Consulting", "Vertex Manufacturing",
+    "Horizon Digital", "NexGen Software", "BluePeak Analytics",
+]
+ITEMS = [
+    ("Laptop Computer", 899.99, 1299.99),
+    ("Wireless Mouse", 19.99, 49.99),
+    ("USB-C Hub", 29.99, 79.99),
+    ("Monitor Stand", 39.99, 89.99),
+    ("Keyboard", 49.99, 149.99),
+    ("Webcam HD", 59.99, 129.99),
+    ("Desk Lamp", 24.99, 69.99),
+    ("Notebook Pack", 9.99, 29.99),
+    ("Printer Paper (Ream)", 7.99, 14.99),
+    ("Whiteboard Markers (Set)", 5.99, 12.99),
+    ("External SSD 1TB", 79.99, 149.99),
+    ("Headset", 39.99, 99.99),
+    ("Cable Management Kit", 14.99, 34.99),
+    ("Ergonomic Chair", 299.99, 599.99),
+    ("Standing Desk Converter", 199.99, 399.99),
+]
+CURRENCIES = ["USD", "EUR", "GBP"]
+CURRENCY_SYMBOLS = {"USD": "$", "EUR": "€", "GBP": "£"}
+def _rand_date(start_year: int = 2024, end_year: int = 2025) -> date:
+    start = date(start_year, 1, 1)
+    end = date(end_year, 12, 31)
+    delta = (end - start).days
+    return start + timedelta(days=random.randint(0, delta))
+def _format_date_clean(d: date) -> str:
+    return d.strftime("%Y-%m-%d")
+def _format_date_messy(d: date) -> str:
+    """Return a randomly-chosen messy date format."""
+    formats = [
+        "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y", "%d %b %Y",
+        "%m-%d-%y", "%d.%m.%Y", "%Y/%m/%d",
+    ]
+    return d.strftime(random.choice(formats))
+def _typo_vendor(name: str) -> str:
+    """Introduce a subtle typo into a vendor name."""
+    strategies = ["swap", "drop", "double", "case"]
+    strat = random.choice(strategies)
+    idx = random.randint(1, max(1, len(name) - 2))
+    if strat == "swap" and idx < len(name) - 1:
+        return name[:idx] + name[idx + 1] + name[idx] + name[idx + 2:]
+    elif strat == "drop":
+        return name[:idx] + name[idx + 1:]
+    elif strat == "double":
+        return name[:idx] + name[idx] + name[idx:]
+    else:
+        return name[:idx] + name[idx].swapcase() + name[idx + 1:]
+def _generate_line_items(n: int) -> List[Dict[str, Any]]:
+    chosen = random.sample(ITEMS, min(n, len(ITEMS)))
+    items = []
+    for desc, lo, hi in chosen:
+        qty = random.randint(1, 20)
+        unit_price = round(random.uniform(lo, hi), 2)
+        amount = round(qty * unit_price, 2)
+        items.append({
+            "description": desc,
+            "qty": qty,
+            "unit_price": unit_price,
+            "amount": amount,
+        })
+    return items
+def _generate_invoice(vendor: str | None = None, currency: str | None = None) -> Dict[str, Any]:
+    vendor = vendor or random.choice(VENDORS)
+    currency = currency or random.choice(CURRENCIES)
+    inv_date = _rand_date()
+    line_items = _generate_line_items(random.randint(2, 6))
+    total = round(sum(it["amount"] for it in line_items), 2)
+    return {
+        "invoice_id": f"INV-{random.randint(10000, 99999)}",
+        "vendor": vendor,
+        "date": _format_date_clean(inv_date),
+        "currency": currency,
+        "total": total,
+        "line_items": line_items,
+    }
+# ===================================================================
+# TASK: EASY — single invoice extraction
+# ===================================================================
+def _render_clean_invoice(inv: Dict[str, Any]) -> str:
+    """Render a single invoice as semi-structured text (OCR-style)."""
+    sym = CURRENCY_SYMBOLS.get(inv["currency"], "$")
+    lines = [
+        "INVOICE",
+        "-------",
+        f"Invoice #: {inv['invoice_id']}",
+        f"Vendor: {inv['vendor']}",
+        f"Date: {inv['date']}",
+        f"Currency: {inv['currency']}",
+        "",
+        "Items:",
+        f"{'Description':<30} {'Qty':>5} {'Unit Price':>12} {'Amount':>12}",
+        f"{'-'*30} {'-'*5} {'-'*12} {'-'*12}",
+    ]
+    for it in inv["line_items"]:
+        lines.append(
+            f"{it['description']:<30} {it['qty']:>5} {sym}{it['unit_price']:>10.2f} {sym}{it['amount']:>10.2f}"
+        )
+    lines.append(f"{'':>30} {'':>5} {'TOTAL':>12} {sym}{inv['total']:>10.2f}")
+    return "\n".join(lines)
+def _grade_easy(submitted: Dict[str, Any], ground_truth: Dict[str, Any]) -> Tuple[float, str]:
+    """Grade single-invoice extraction. Returns (score, feedback)."""
+    score = 0.0
+    feedback_parts = []
+    # Vendor (0.15)
+    sub_vendor = str(submitted.get("vendor") or "").strip()  # tolerate None / non-string values
+    if sub_vendor.lower() == ground_truth["vendor"].lower():
+        score += 0.15
+        feedback_parts.append("Vendor: correct")
+    else:
+        feedback_parts.append(f"Vendor: wrong (expected '{ground_truth['vendor']}', got '{sub_vendor}')")
+    # Date (0.10)
+    sub_date = str(submitted.get("date") or "").strip()
+    if sub_date == ground_truth["date"]:
+        score += 0.10
+        feedback_parts.append("Date: correct")
+    else:
+        feedback_parts.append(f"Date: wrong (expected '{ground_truth['date']}', got '{sub_date}')")
+    # Currency (0.05)
+    sub_cur = str(submitted.get("currency") or "").strip().upper()
+    if sub_cur == ground_truth["currency"]:
+        score += 0.05
+        feedback_parts.append("Currency: correct")
+    else:
+        feedback_parts.append(f"Currency: wrong (expected '{ground_truth['currency']}', got '{sub_cur}')")
+    # Total (0.20)
+    try:
+        sub_total = float(submitted.get("total", 0))
+        if abs(sub_total - ground_truth["total"]) < 0.01:
+            score += 0.20
+            feedback_parts.append("Total: correct")
+        else:
+            feedback_parts.append(f"Total: wrong (expected {ground_truth['total']}, got {sub_total})")
+    except (ValueError, TypeError):
+        feedback_parts.append("Total: could not parse")
+    # Line items (0.50)
+    sub_items = submitted.get("line_items", [])
+    gt_items = ground_truth["line_items"]
+    if not isinstance(sub_items, list):
+        feedback_parts.append("Line items: not a list")
+    else:
+        item_score = _grade_line_items(sub_items, gt_items)
+        score += item_score * 0.50
+        feedback_parts.append(f"Line items: {item_score:.0%} match ({len(sub_items)} submitted, {len(gt_items)} expected)")
+    return round(min(score, 1.0), 4), "; ".join(feedback_parts)
+def _grade_line_items(submitted: List[Dict], expected: List[Dict]) -> float:
+    """Compare line items, return fraction matched (0-1)."""
+    if not expected:
+        return 1.0 if not submitted else 0.0
+    matched = 0
+    used = set()
+    for gt in expected:
+        best = -1
+        best_score = 0.0
+        for i, sub in enumerate(submitted):
+            if i in used:
+                continue
+            s = _item_similarity(sub, gt)
+            if s > best_score:
+                best_score = s
+                best = i
+        if best >= 0 and best_score > 0.3:
+            matched += best_score
+            used.add(best)
+    return matched / len(expected)
+def _item_similarity(sub: Dict, gt: Dict) -> float:
+    """Score a single line item match (0-1)."""
+    s = 0.0
+    # description
+    sd = str(sub.get("description") or "").lower().strip()
+    gd = gt["description"].lower().strip()
+    if sd == gd:
+        s += 0.25
+    elif sd in gd or gd in sd:
+        s += 0.15
+    # qty
+    try:
+        if int(sub.get("qty", -1)) == gt["qty"]:
+            s += 0.25
+    except (ValueError, TypeError):
+        pass
+    # unit_price
+    try:
+        if abs(float(sub.get("unit_price", -1)) - gt["unit_price"]) < 0.01:
+            s += 0.25
+    except (ValueError, TypeError):
+        pass
+    # amount
+    try:
+        if abs(float(sub.get("amount", -1)) - gt["amount"]) < 0.01:
+            s += 0.25
+    except (ValueError, TypeError):
+        pass
+    return s
+# ===================================================================
+# TASK: MEDIUM — batch cleaning & normalisation
+# ===================================================================
+def _make_messy_invoice(inv: Dict[str, Any]) -> Dict[str, Any]:
+    """Take a clean invoice dict and introduce messiness."""
+    messy = copy.deepcopy(inv)
+    # Messy date
+    d = date.fromisoformat(inv["date"])
+    messy["date"] = _format_date_messy(d)
+    # Possibly typo the vendor
+    if random.random() < 0.5:
+        messy["vendor"] = _typo_vendor(inv["vendor"])
+    # Sometimes replace the currency code with its symbol, and embed the symbol in amounts
+    sym = CURRENCY_SYMBOLS.get(inv["currency"], "$")
+    if random.random() < 0.4:
+        messy["currency"] = sym  # symbol instead of code
+    if random.random() < 0.3:
+        messy["total"] = f"{sym}{inv['total']}"  # string instead of number
+    # Mess up some line item amounts
+    for it in messy["line_items"]:
+        if random.random() < 0.3:
+            it["amount"] = f"{sym}{it['amount']}"
+        if random.random() < 0.2:
+            it["unit_price"] = f"{sym}{it['unit_price']}"
+        if random.random() < 0.15:
+            # Wrong amount (qty * unit_price ≠ amount)
+            it["amount"] = round(it["qty"] * float(str(it["unit_price"]).replace(sym, "")) + random.uniform(0.5, 5.0), 2)
+    return messy
+def _render_messy_batch(invoices: List[Dict[str, Any]]) -> str:
+    """Render a batch of messy invoices as labelled plain text."""
+    lines = ["=== INVOICE BATCH (requires cleaning) ===", ""]
+    for i, inv in enumerate(invoices):
+        lines.append(f"--- Invoice {i+1} ---")
+        lines.append(f"Vendor: {inv['vendor']}")
+        lines.append(f"Date: {inv['date']}")
+        lines.append(f"Currency: {inv.get('currency', 'N/A')}")
+        lines.append(f"Total: {inv.get('total', 'N/A')}")
+        lines.append("Items:")
+        for it in inv["line_items"]:
+            lines.append(f"  - {it['description']} | qty: {it.get('qty','?')} | price: {it.get('unit_price','?')} | amount: {it.get('amount','?')}")
+        lines.append("")
+    return "\n".join(lines)
+def _grade_medium(submitted: Dict[str, Any], ground_truths: List[Dict[str, Any]]) -> Tuple[float, str]:
+    """Grade batch cleaning. submitted should have 'invoices' key."""
+    sub_invoices = submitted.get("invoices", [])
+    if not isinstance(sub_invoices, list):
+        return 0.0, "Expected 'invoices' key with a list of cleaned invoices."
+    n_expected = len(ground_truths)
+    # A count mismatch is tolerated: missing invoices score 0, so partial credit is still possible.
+    total_score = 0.0
+    feedback_parts = []
+    for idx, gt in enumerate(ground_truths):
+        if idx < len(sub_invoices):
+            s, fb = _grade_easy(sub_invoices[idx], gt)
+            total_score += s
+            feedback_parts.append(f"Invoice {idx+1}: {s:.2f} ({fb})")
+        else:
+            feedback_parts.append(f"Invoice {idx+1}: missing")
+    # Report extra invoices in the feedback (no score penalty is applied)
+    if len(sub_invoices) > n_expected:
+        feedback_parts.append(f"Extra invoices submitted: {len(sub_invoices) - n_expected}")
+    avg = total_score / n_expected if n_expected > 0 else 0.0
+    return round(min(avg, 1.0), 4), "; ".join(feedback_parts)
+# ===================================================================
+# TASK: HARD — extraction + cleaning + reconciliation against POs
+# ===================================================================
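+def _generate_purchase_order(inv: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
+    """Generate a PO that mostly matches the invoice but may differ.
+    Mutates `inv` in place to inject discrepancies; returns (po, discrepancies).
+    """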
+    po = copy.deepcopy(inv)
+    po["po_id"] = f"PO-{random.randint(10000, 99999)}"
+    discrepancies = []
+    # Possibly change a price (overcharge)
+    if random.random() < 0.6 and po["line_items"]:
+        idx = random.randint(0, len(po["line_items"]) - 1)
+        original_price = po["line_items"][idx]["unit_price"]
+        # PO has the CORRECT price; invoice will be higher (overcharge)
+        overcharge = round(original_price * random.uniform(1.05, 1.25), 2)
+        discrepancies.append({
+            "type": "overcharge",
+            "item_description": po["line_items"][idx]["description"],
+            "po_price": original_price,
+            "invoice_price": overcharge,
+        })
+        # Apply the overcharge to the invoice; the PO keeps the original price
+        inv["line_items"][idx]["unit_price"] = overcharge
+        inv["line_items"][idx]["amount"] = round(inv["line_items"][idx]["qty"] * overcharge, 2)
+    # Possibly add an extra item to invoice (not in PO)
+    if random.random() < 0.4:
+        extra = _generate_line_items(1)[0]
+        inv["line_items"].append(extra)
+        discrepancies.append({
+            "type": "extra_item",
+            "item_description": extra["description"],
+            "detail": "Item on invoice but not on purchase order",
+        })
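+    # Possibly remove an item from the invoice (it stays on the PO)
+    flagged = {d["item_description"] for d in discrepancies}
+    candidates = [it for it in po["line_items"] if it["description"] not in flagged]
+    if random.random() < 0.3 and len(po["line_items"]) > 2 and candidates:
+        # Skip items already flagged so the earlier discrepancies stay valid
+        removed = random.choice(candidates)
+        inv["line_items"] = [it for it in inv["line_items"] if it["description"] != removed["description"]]
+        discrepancies.append({
+            "type": "missing_item",
+            "item_description": removed["description"],
+            "detail": "Item on purchase order but not on invoice",
+        })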
+    # Recalculate totals
+    inv["total"] = round(sum(it["amount"] for it in inv["line_items"]), 2)
+    po["total"] = round(sum(it["amount"] for it in po["line_items"]), 2)
+    return po, discrepancies
+def _render_po(po: Dict[str, Any]) -> str:
+    """Render purchase order text."""
+    lines = [
+        f"PURCHASE ORDER: {po['po_id']}",
+        f"Vendor: {po['vendor']}",
+        f"Date: {po['date']}",
+        f"Currency: {po['currency']}",
+        "",
+        "Ordered Items:",
+    ]
+    sym = CURRENCY_SYMBOLS.get(po["currency"], "$")
+    for it in po["line_items"]:
+        lines.append(f"  - {it['description']} x{it['qty']} @ {sym}{it['unit_price']:.2f} = {sym}{it['amount']:.2f}")
+    lines.append(f"PO Total: {sym}{po['total']:.2f}")
+    return "\n".join(lines)
+def _grade_hard(submitted: Dict[str, Any], ground_truths: List[Dict[str, Any]],
+                expected_discrepancies: List[List[Dict]]) -> Tuple[float, str]:
+    """Grade extraction + cleaning + reconciliation."""
+    # Extraction/cleaning portion (60%)
+    extraction_score, extraction_fb = _grade_medium(submitted, ground_truths)
+    # Discrepancy detection portion (40%)
+    sub_discrepancies = submitted.get("discrepancies", [])
+    if not isinstance(sub_discrepancies, list):
+        disc_score = 0.0
+        disc_fb = "No discrepancies list submitted"
+    else:
+        all_expected = []
+        for disc_list in expected_discrepancies:
+            all_expected.extend(disc_list)
+        if not all_expected:
+            disc_score = 1.0 if not sub_discrepancies else 0.5
+            disc_fb = "No discrepancies expected"
+        else:
+            matched = 0
+            used = set()  # a submitted discrepancy may satisfy only one expected entry
+            for exp in all_expected:
+                for i, sub in enumerate(sub_discrepancies):
+                    if i not in used and isinstance(sub, dict) and _discrepancy_match(sub, exp):
+                        matched += 1
+                        used.add(i)
+                        break
+            precision = matched / len(sub_discrepancies) if sub_discrepancies else 0.0
+            recall = matched / len(all_expected) if all_expected else 1.0
+            disc_score = (precision + recall) / 2  # arithmetic mean of precision and recall (not a true F1)
+            disc_fb = f"Discrepancies: {matched}/{len(all_expected)} found, precision={precision:.2f}, recall={recall:.2f}"
+    total = extraction_score * 0.60 + disc_score * 0.40
+    feedback = f"Extraction: {extraction_score:.2f}; {disc_fb}"
+    return round(min(total, 1.0), 4), feedback
+def _discrepancy_match(submitted: Dict, expected: Dict) -> bool:
+    """Check if a submitted discrepancy matches an expected one."""
+    # Type must match
+    sub_type = str(submitted.get("type") or "").lower().strip()
+    exp_type = expected.get("type", "").lower().strip()
+    if sub_type != exp_type:
+        return False
+    # Item description should roughly match
+    sub_desc = str(submitted.get("item_description") or "").lower().strip()
+    exp_desc = expected.get("item_description", "").lower().strip()
+    if sub_desc and exp_desc:
+        if sub_desc == exp_desc or sub_desc in exp_desc or exp_desc in sub_desc:
+            return True
+    return False
+# ===================================================================
+# Environment
+# ===================================================================
+class InvoiceEnvironment:
+    """Core invoice processing environment."""
+    TASKS = {
+        "easy": {
+            "description": (
+                "Extract structured data from a single invoice. "
+                "Return a JSON object with keys: vendor, date (YYYY-MM-DD), "
+                "currency (3-letter code), total (number), "
+                "line_items (list of {description, qty, unit_price, amount})."
+            ),
+            "max_attempts": 5,
+        },
+        "medium": {
+            "description": (
+                "Clean and normalise a batch of messy invoices. "
+                "Fix date formats to YYYY-MM-DD, correct vendor name typos, "
+                "standardise currency to 3-letter codes, ensure amounts are numbers, "
+                "and verify line item math (qty * unit_price = amount). "
+                "Return {invoices: [cleaned invoice objects]}."
+            ),
+            "max_attempts": 5,
+        },
+        "hard": {
+            "description": (
+                "Extract and clean invoice data, then reconcile against purchase orders. "
+                "Identify discrepancies: overcharges (invoice price > PO price), "
+                "extra items (on invoice but not PO), missing items (on PO but not invoice). "
+                "Return {invoices: [cleaned], discrepancies: [{invoice_idx, type, item_description, detail}]}."
+            ),
+            "max_attempts": 5,
+        },
+    }
+    def __init__(self):
+        self._state = InvoiceState()
+        self._ground_truth: Any = None
+        self._raw_text: str = ""
+        self._reference_data: str = ""
+        self._messy_invoices: List[Dict] = []
+        self._expected_discrepancies: List[List[Dict]] = []
+    def reset(self, task_id: str = "easy") -> Tuple[InvoiceObservation, float, bool, Dict]:
+        """Reset the environment for a new episode."""
+        if task_id not in self.TASKS:
+            task_id = "easy"
+        self._state = InvoiceState(
+            episode_id=str(uuid.uuid4()),
+            task_id=task_id,
+            step_count=0,
+            done=False,
+            last_reward=0.0,
+            best_reward=0.0,
+            rewards=[],
+        )
+        self._reference_data = ""
+        self._expected_discrepancies = []
+        if task_id == "easy":
+            inv = _generate_invoice()
+            self._ground_truth = inv
+            self._raw_text = _render_clean_invoice(inv)
+        elif task_id == "medium":
+            n = random.randint(3, 5)
+            clean_invoices = [_generate_invoice() for _ in range(n)]
+            self._ground_truth = clean_invoices
+            messy = [_make_messy_invoice(copy.deepcopy(inv)) for inv in clean_invoices]
+            self._messy_invoices = messy
+            self._raw_text = _render_messy_batch(messy)
+        elif task_id == "hard":
+            n = random.randint(2, 4)
+            clean_invoices = [_generate_invoice() for _ in range(n)]
+            self._expected_discrepancies = []
+            po_texts = []
+            for inv in clean_invoices:
+                po, discs = _generate_purchase_order(inv)
+                self._expected_discrepancies.append(discs)
+                po_texts.append(_render_po(po))
+            self._ground_truth = clean_invoices
+            messy = [_make_messy_invoice(copy.deepcopy(inv)) for inv in clean_invoices]
+            self._raw_text = _render_messy_batch(messy)
+            self._reference_data = "\n\n".join(po_texts)
+        task_info = self.TASKS[task_id]
+        obs = InvoiceObservation(
+            raw_text=self._raw_text,
+            task_id=task_id,
+            difficulty=task_id,
+            task_description=task_info["description"],
+            attempt_number=0,
+            max_attempts=task_info["max_attempts"],
+            feedback="",
+            hint="",
+            reference_data=self._reference_data,
+        )
+        return obs, 0.0, False, {"episode_id": self._state.episode_id}
+    def step(self, action: InvoiceAction) -> Tuple[InvoiceObservation, float, bool, Dict]:
+        """Process one agent action."""
+        self._state.step_count += 1
+        task_id = self._state.task_id
+        task_info = self.TASKS[task_id]
+        attempt = self._state.step_count
+        # Grade
+        if task_id == "easy":
+            score, feedback = _grade_easy(action.extracted_data, self._ground_truth)
+        elif task_id == "medium":
+            score, feedback = _grade_medium(action.extracted_data, self._ground_truth)
+        else:
+            score, feedback = _grade_hard(
+                action.extracted_data, self._ground_truth, self._expected_discrepancies
+            )
+        # Track best
+        self._state.best_reward = max(self._state.best_reward, score)
+        self._state.last_reward = score
+        self._state.rewards.append(score)
+        # Done conditions
+        done = score >= 0.95 or attempt >= task_info["max_attempts"]
+        self._state.done = done
+        # Attempt penalty for using all attempts
+        reward = score
+        if done and attempt >= task_info["max_attempts"] and score < 0.95:
+            reward = score * 0.85  # penalty
+        # Hint after 2 failed attempts
+        hint = ""
+        if attempt >= 2 and score < 0.7:
+            if task_id == "easy":
+                hint = "Make sure dates are YYYY-MM-DD, amounts are numbers, and all line items are included."
+            elif task_id == "medium":
+                hint = "Check for vendor name typos, mixed date formats, and currency symbols mixed into amounts."
+            else:
+                hint = "Compare each invoice line item against the PO. Look for price differences and items present in one but not the other."
+        obs = InvoiceObservation(
+            raw_text=self._raw_text,
+            task_id=task_id,
+            difficulty=task_id,
+            task_description=task_info["description"],
+            attempt_number=attempt,
+            max_attempts=task_info["max_attempts"],
+            feedback=feedback,
+            hint=hint,
+            reference_data=self._reference_data,
+        )
+        return obs, round(reward, 4), done, {
+            "episode_id": self._state.episode_id,
+            "best_reward": self._state.best_reward,
+        }
+    @property
+    def state(self) -> InvoiceState:
+        return self._state
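The scoring arithmetic in `_grade_hard` and `step` can be checked by hand. The sketch below restates it as two standalone functions: the 60/40 weighting of extraction and discrepancy detection, the mean of precision and recall, and the 0.85 multiplier applied when all attempts are used without reaching 0.95. The function names `hard_task_score` and `episode_reward` are hypothetical and only mirror the logic in this file. They take pre-computed counts rather than raw submissions.

```python
from typing import Tuple


def hard_task_score(extraction: float, matched: int,
                    n_submitted: int, n_expected: int) -> float:
    """Combine extraction and discrepancy scores with the 60/40 weighting."""
    if n_expected == 0:
        # Mirrors the grader: full marks for an empty list, half otherwise.
        disc = 1.0 if n_submitted == 0 else 0.5
    else:
        precision = matched / n_submitted if n_submitted else 0.0
        recall = matched / n_expected
        disc = (precision + recall) / 2  # arithmetic mean, not a true F1
    return round(min(extraction * 0.60 + disc * 0.40, 1.0), 4)


def episode_reward(score: float, attempt: int,
                   max_attempts: int = 5) -> Tuple[float, bool]:
    """Return (reward, done) for one step, including the final-attempt penalty."""
    done = score >= 0.95 or attempt >= max_attempts
    reward = score
    if done and attempt >= max_attempts and score < 0.95:
        reward = score * 0.85  # penalty for exhausting all attempts
    return round(reward, 4), done


# Perfect extraction, 2 of 2 expected discrepancies found, 1 false positive.
example_score = hard_task_score(1.0, matched=2, n_submitted=3, n_expected=2)
# A 0.8 score on the fifth and last attempt is scaled by 0.85.
example_reward = episode_reward(0.8, attempt=5)
```

With these inputs a single false positive lowers precision to 2/3, so the discrepancy part scores about 0.83 and the combined score falls just short of the 0.95 early-stop threshold.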