NeerajCodz committed
Commit df47251
0 Parent(s)

docs: update
docs/README.md ADDED
@@ -0,0 +1,28 @@
1
+ # Documentation Index
2
+
3
+ This documentation set supersedes `WebScraper_OpenEnv_SoftwareDoc.md`, expanding it into focused modules.
4
+
5
+ ## Core Docs
6
+
7
+ - `openenv.md` — enhanced OpenEnv spec, actions, observations, lifecycle
8
+ - `architecture.md` — system architecture, runtime, scheduling, scaling
9
+ - `agents.md` — multi-agent roles, strategies, HITL, explainability
10
+ - `rewards.md` — advanced reward function and signal breakdown
11
+
12
+ ## Platform Docs
13
+
14
+ - `api.md` — multi-model API system and routing/ensemble/cost tracking
15
+ - `mcp.md` — MCP integration, registry, lazy install, composition
16
+ - `search-engine.md` — search providers, query optimization, credibility scoring
17
+ - `html-processing.md` — semantic parsing, adaptive chunking, batch + diff processing
18
+ - `memory.md` — unified memory system (short/working/long/shared)
19
+
20
+ ## Operations Docs
21
+
22
+ - `settings.md` — dashboard settings and configuration controls
23
+ - `observability.md` — metrics, traces, thought stream, cost telemetry
24
+ - `features.md` — advanced capabilities and feature flags
25
+
26
+ ## Legacy
27
+
28
+ - `WebScraper_OpenEnv_SoftwareDoc.md` remains as the original monolithic source.
docs/WebScraper_OpenEnv_SoftwareDoc.md ADDED
@@ -0,0 +1,1654 @@
1
+ # WebScraper-OpenEnv: Software Design Document
2
+
3
+ **Project:** WebScraper-OpenEnv
4
+ **Version:** 1.0.0
5
+ **Hackathon:** OpenEnv — Round 1
6
+ **Author:** [Your Name]
7
+ **Date:** March 2026
8
+
9
+ ---
10
+
11
+ ## Table of Contents
12
+
13
+ 1. [Project Overview](#1-project-overview)
14
+ 2. [Real-World Motivation](#2-real-world-motivation)
15
+ 3. [System Architecture](#3-system-architecture)
16
+ 4. [OpenEnv Specification](#4-openenv-specification)
17
+ - 4.1 Observation Model
18
+ - 4.2 Action Model
19
+ - 4.3 Reward Model
20
+ - 4.4 Episode Lifecycle
21
+ 5. [Environment State Machine](#5-environment-state-machine)
22
+ 6. [Task Definitions](#6-task-definitions)
23
+ - Task 1: Static Page Field Extraction (Easy)
24
+ - Task 2: Paginated Catalog Scraping (Medium)
25
+ - Task 3: Deep Research with Search & Fact Verification (Hard)
26
+ 7. [Grader Design](#7-grader-design)
27
+ 8. [Reward Function Design](#8-reward-function-design)
28
+ 9. [Network Layer — VPN & Proxy](#9-network-layer--vpn--proxy)
29
+ - 9.1 Architecture
30
+ - 9.2 Proxy Configuration
31
+ - 9.3 VPN Configuration
32
+ - 9.4 Public Pool
33
+ - 9.5 Settings Persistence
34
+ 10. [API Endpoint Specification](#10-api-endpoint-specification)
35
+ 11. [Data Models (Pydantic Schemas)](#11-data-models-pydantic-schemas)
36
+ 12. [Simulated Web Environment](#12-simulated-web-environment)
37
+ 13. [Baseline Inference Script](#13-baseline-inference-script)
38
+ 14. [Project Structure](#14-project-structure)
39
+ 15. [Dockerfile & Deployment](#15-dockerfile--deployment)
40
+ 16. [openenv.yaml](#16-openenvyaml)
41
+ 17. [Testing Strategy](#17-testing-strategy)
42
+ 18. [Known Limitations & Future Work](#18-known-limitations--future-work)
43
+
44
+ ---
45
+
46
+ ## 1. Project Overview
47
+
48
+ **WebScraper-OpenEnv** is a reinforcement learning environment that challenges AI agents to perform structured **web data extraction** — a task humans and automated pipelines carry out every day for market research, competitive intelligence, lead generation, price monitoring, and data journalism.
49
+
50
+ The environment wraps a fully **self-contained simulated web server** (no external network calls required) that presents realistic HTML pages with varying structure, noise, pagination, and adversarial anti-scraping patterns. Agents must issue targeted extraction actions to retrieve structured data within budget and quality constraints.
51
+
52
+ This environment is designed to:
53
+ - Evaluate an agent's ability to **parse and reason about semi-structured HTML**
54
+ - Test **multi-step planning** across paginated or linked content
55
+ - Stress-test **robustness** when pages are noisy, misleading, or rate-limited
56
+ - Provide **dense reward signals** that guide learning rather than just measuring final output
57
+
58
+ ---
59
+
60
+ ## 2. Real-World Motivation
61
+
62
+ Web scraping is a core capability required across:
63
+
64
+ | Use Case | Example |
65
+ |---|---|
66
+ | E-commerce monitoring | Track competitor prices across 1,000 SKUs daily |
67
+ | Lead generation | Extract company names, emails, headcount from directories |
68
+ | Research automation | Aggregate paper titles, authors, abstracts from 5 sources |
69
+ | News intelligence | Collect headlines, dates, sources matching a keyword |
70
+ | Real estate | Pull property listings, prices, square footage from portals |
71
+
72
+ Current LLM agents struggle with scraping because it requires:
73
+ 1. Selecting the right CSS/XPath selector or field label from noisy HTML
74
+ 2. Knowing *when to stop* (pagination boundary detection)
75
+ 3. Deduplication and normalization of extracted values
76
+ 4. Graceful recovery from blocked or malformed pages
77
+
78
+ No existing OpenEnv environment covers this domain. **WebScraper-OpenEnv fills this gap.**
79
+
80
+ ---
81
+
82
+ ## 3. System Architecture
83
+
84
+ ```
85
+ ┌─────────────────────────────────────────────────────────────────┐
86
+ │ Single Docker Container (:7860) │
87
+ │ │
88
+ │ ┌───────────────────────────────────────────────────────────┐ │
89
+ │ │ Vite Frontend (React) │ │
90
+ │ │ TaskSelector │ EpisodeViewer │ RewardChart │ Baseline │ │
91
+ │ │ fetch("/api/...") │ │
92
+ │ └────────────────────────┬──────────────────────────────────┘ │
93
+ │ │ same origin │
94
+ │ ┌─────────────────────────▼────────────────────────────────┐ │
95
+ │ │ FastAPI Application │ │
96
+ │ │ │ │
97
+ │ │ /api/reset /api/step /api/state /api/tasks │ │
98
+ │ │ /api/grader /api/baseline │ │
99
+ │ │ /* → serves frontend/dist/index.html (SPA fallback) │ │
100
+ │ │ │ │
101
+ │ │ ┌──────────────────────┐ ┌──────────────────────────┐ │ │
102
+ │ │ │ WebScraperEnv │ │ SimulatedWebServer │ │ │
103
+ │ │ │ - episode state │◄►│ - HTML page generator │ │ │
104
+ │ │ │ - action dispatch │ │ - pagination engine │ │ │
105
+ │ │ │ - reward engine │ │ - noise injector │ │ │
106
+ │ │ │ - grader registry │ │ - anti-scrape simulator │ │ │
107
+ │ │ └──────────────────────┘ └──────────────────────────┘ │ │
108
+ │ └───────────────────────────────────────────────────────────┘ │
109
+ └─────────────────────────────────────────────────────────────────┘
110
+
111
+ │ HTTP JSON (agents / baseline script)
112
+
113
+ AI Agent / Baseline Script
114
+ ```
115
+
116
+ **Key design decisions:**
117
+ - The simulated web server is **seeded and deterministic** — same `task_id` + `seed` always produces the same pages, enabling reproducible evaluation.
118
+ - Pages are generated dynamically from Jinja2 templates with injected noise, not stored as static files, keeping the Docker image small.
119
+ - The API is **stateless across HTTP requests** from the client's perspective; the server keeps episode state in-memory, keyed by the `episode_id` sent with each call.
120
+ - The **Vite frontend** is compiled at Docker build time (Stage 1) and served as static files by FastAPI — no separate web server (nginx, etc.) needed. Single port, single process.
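The seeded-determinism guarantee can be sketched with a per-page RNG derived from a stable string key. The helper names (`page_rng`, `render_price`) are illustrative, not part of the actual codebase:

```python
import random

def page_rng(task_id: str, seed: int, url: str) -> random.Random:
    """Derive a per-page RNG so the same (task_id, seed, url) always
    renders identical HTML, regardless of visit order."""
    # A string key seeds random.Random deterministically across processes
    # (built-in hash() is salted per process, so it is avoided here).
    return random.Random(f"{task_id}:{seed}:{url}")

def render_price(task_id: str, seed: int, url: str) -> str:
    # Deterministic "random" price for a simulated product page
    rng = page_rng(task_id, seed, url)
    return f"${rng.randint(10, 200)}.{rng.randint(0, 99):02d}"
```

Because the key is re-derived from scratch on every call, page generation never depends on what the agent visited earlier, which is what makes evaluation reproducible.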
121
+
122
+ ---
123
+
124
+ ## 4. OpenEnv Specification
125
+
126
+ ### 4.1 Observation Model
127
+
128
+ An `Observation` is returned after every `reset()` and `step()` call.
129
+
130
+ ```python
131
+ class Observation(BaseModel):
132
+ episode_id: str # UUID for the current episode
133
+ task_id: str # Task identifier ("task_easy" | "task_medium" | "task_hard")
134
+ step_number: int # Current step count (0-indexed)
135
+ current_url: str # Simulated URL of the current page
136
+ page_html: str # Raw HTML content of the current page (trimmed to 8000 chars)
137
+ page_title: str # <title> tag value
138
+ available_actions: list[str] # High-level action types available at this step
139
+ extracted_so_far: dict # Fields extracted successfully in this episode so far
140
+ pages_visited: list[str] # Ordered list of URLs visited this episode
141
+ budget_remaining: int # Remaining step budget (starts at task max_steps)
142
+ task_description: str # Human-readable task goal
143
+ target_fields: list[str] # Names of fields the agent must extract
144
+ hints: list[str] # Contextual hints (empty in hard mode)
145
+ ```
146
+
147
+ **Design rationale:**
148
+ - `page_html` is included directly in the observation so agents can act without a separate fetch step. Truncated at 8,000 characters to simulate token budget pressure realistically.
149
+ - `extracted_so_far` gives the agent a running view of what it has already collected — critical for multi-page tasks.
150
+ - `hints` are populated for easy/medium tasks and empty for hard, creating a natural difficulty gradient.
151
+
152
+ ### 4.2 Action Model
153
+
154
+ An `Action` is submitted by the agent in each `step()` call.
155
+
156
+ ```python
157
+ class Action(BaseModel):
158
+ action_type: ActionType # Enum — see below
159
+ target_field: str | None # Field name to extract (for EXTRACT actions)
160
+ selector: str | None # CSS selector or field label hint
161
+ navigate_to: str | None # URL or "next_page" / "prev_page" keyword
162
+ submit_extraction: dict | None # Final field→value map (for SUBMIT action)
163
+ notes: str | None # Agent's internal reasoning note (not scored, logged)
164
+ ```
165
+
166
+ ```python
167
+ class ActionType(str, Enum):
168
+ EXTRACT_FIELD = "extract_field" # Extract one named field from current page
169
+ NAVIGATE = "navigate" # Go to a URL or next/prev page
170
+ SEARCH_PAGE = "search_page" # Regex/keyword search within current page HTML
171
+ INSPECT_ELEMENT = "inspect_element" # Get focused text around a CSS selector
172
+ SUBMIT = "submit" # Final answer — ends the episode
173
+ SKIP_PAGE = "skip_page" # Declare current page irrelevant, move on
174
+ # ── Task 3 / Hard mode only ─────────────────────────────────────────
175
+ SEARCH_ENGINE = "search_engine" # Issue a query to the configured search engine
176
+ VERIFY_FACT = "verify_fact" # Cross-check a field value against a second source
177
+ RESOLVE_CONFLICT = "resolve_conflict" # Declare which of two conflicting values is authoritative
178
+ FETCH_URL = "fetch_url" # Fetch an arbitrary URL (uses active proxy/VPN if set)
179
+ ```
180
+
181
+ **Extended `Action` model for new types:**
182
+
183
+ ```python
184
+ class Action(BaseModel):
185
+ action_type: ActionType
186
+ # --- Existing fields ---
187
+ target_field: str | None = None
188
+ selector: str | None = None
189
+ navigate_to: str | None = None
190
+ submit_extraction: dict | None = None
191
+ notes: str | None = None
192
+ # --- Search engine fields ---
193
+ query: str | None = None # Query string for SEARCH_ENGINE
194
+ search_engine: str | None = None # "google" | "bing" | "brave" | "ddg" (uses settings default if None)
195
+ result_limit: int = 5 # Max search results to return (1–10)
196
+ # --- Fact verification fields ---
197
+ field_name: str | None = None # Field to verify in VERIFY_FACT
198
+ claimed_value: str | None = None # Value to check
199
+ verification_source: str | None = None # URL to verify against
200
+ # --- Conflict resolution fields ---
201
+ conflicting_sources: list[str] | None = None # Two URLs with disagreeing values
202
+ chosen_source: str | None = None # URL the agent judges more authoritative
203
+ rationale: str | None = None # Agent's justification (logged, not scored)
204
+ ```
205
+
206
+ **Design rationale:**
207
+ - Actions are **higher-level than raw HTTP** — the agent doesn't manage cookies or headers, it focuses on extraction logic.
208
+ - `INSPECT_ELEMENT` gives the agent a focused window into the DOM, rewarding agents that learn to select precisely.
209
+ - `SEARCH_ENGINE` issues a query through whichever engine the user has configured in Settings (or the environment's default). Results are returned as a ranked list of `{title, url, snippet}` objects — the agent then navigates to the most promising URL.
210
+ - `VERIFY_FACT` instructs the environment to fetch a second source and check whether the claimed value appears there. Returns a `verified: bool` and a `confidence: float` — not a definitive answer, mirroring real-world uncertainty.
211
+ - `RESOLVE_CONFLICT` is scored by the grader: if the agent picks the more authoritative source it earns a bonus; if it picks the wrong one it earns a penalty.
212
+ - `SUBMIT` is the terminal action that triggers the grader.
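For illustration, here is how an agent might construct step payloads, using plain dataclasses as a dependency-free stand-in for the Pydantic `Action` model (only a subset of fields is shown):

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional

class ActionType(str, Enum):
    EXTRACT_FIELD = "extract_field"
    SEARCH_ENGINE = "search_engine"
    SUBMIT = "submit"

@dataclass
class Action:
    action_type: ActionType
    target_field: Optional[str] = None
    selector: Optional[str] = None
    query: Optional[str] = None
    submit_extraction: Optional[dict] = None

# Extract the price field, hinting at a CSS selector
extract = Action(ActionType.EXTRACT_FIELD, target_field="price",
                 selector=".price-tag")

# Hard mode: query the configured search engine
search = Action(ActionType.SEARCH_ENGINE,
                query="Acme Corp company profile")

payload = asdict(extract)  # dict shape POSTed to /api/step
```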
213
+
214
+ ### 4.3 Reward Model
215
+
216
+ ```python
217
+ class Reward(BaseModel):
218
+ value: float # Reward for this step (-1.0 to +1.0)
219
+ cumulative: float # Total reward accumulated this episode
220
+ breakdown: dict # Labeled sub-rewards (for interpretability)
221
+ message: str # Human-readable explanation
222
+ ```
223
+
224
+ ### 4.4 Episode Lifecycle
225
+
226
+ ```
227
+ reset(task_id, seed?)
228
+ → Observation (step 0, fresh page, budget = max_steps)
229
+
230
+ step(action: EXTRACT_FIELD | NAVIGATE | ...)
231
+ → Observation (updated state), Reward, done=False, info
232
+
233
+ step(action: SUBMIT)
234
+ → Observation (terminal), Reward (grader score * scale), done=True, info
235
+
236
+ state()
237
+ → Current episode state snapshot (same fields as Observation + internal metadata)
238
+ ```
239
+
240
+ An episode also ends automatically if:
241
+ - `budget_remaining` reaches 0 (budget exhaustion — scores whatever was extracted)
242
+ - The agent navigates to more than `max_pages` unique URLs
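The lifecycle above can be exercised with a small driver loop. The `reset` and `step` callables stand in for POSTs to `/api/reset` and `/api/step`; the fake environment below exists only to show the contract shapes:

```python
def run_episode(reset, step, policy, max_steps=10):
    """Drive one episode: reset, then step until done or budget runs out."""
    obs = reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = step(policy(obs))
        total += reward
        if done:
            break
    return obs, total

# Minimal fake environment illustrating the reset/step contract
def fake_reset():
    return {"step_number": 0, "budget_remaining": 3}

def fake_step(action):
    done = action == "submit"
    return {"step_number": 1}, (1.0 if done else 0.1), done

final_obs, total = run_episode(fake_reset, fake_step,
                               policy=lambda obs: "submit")
```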
243
+
244
+ ---
245
+
246
+ ## 5. Environment State Machine
247
+
248
+ ```
249
+ reset()
250
+
251
+
252
+ ┌──────────────┐
253
+ │ RUNNING │◄──────────────────────────────────────────┐
254
+ │ │ │
255
+ │ step(NAV) │──► fetch_page() ──► update_obs() ──────┤
256
+ │ step(EXT) │──► extract() ──► update_obs() ──────┤
257
+ │ step(SRCH) │──► search_html() ──► update_obs() ──────┤
258
+ │ step(SE) │──► search_engine() ──► ranked_results ────┤
259
+ │ step(VRF) │──► verify_fact() ──► confidence_score ──┤
260
+ │ step(RES) │──► resolve() ──► authoritative val ─┘
261
+ └──────┬───────┘
262
+
263
+ step(SUBMIT) or budget=0
264
+
265
+
266
+ ┌──────────────┐
267
+ │ TERMINAL │──► grader.score() ──► final Reward
268
+ └──────────────┘
269
+ ```
270
+
271
+ **State fields stored per episode:**
272
+
273
+ | Field | Type | Description |
274
+ |---|---|---|
275
+ | `episode_id` | str | UUID |
276
+ | `task_id` | str | Active task |
277
+ | `seed` | int | RNG seed for page generation |
278
+ | `step_number` | int | Steps taken |
279
+ | `current_url` | str | Active page URL |
280
+ | `pages_visited` | list | Navigation history |
281
+ | `extracted_data` | dict | Field→value map built up by agent |
282
+ | `ground_truth` | dict | Hidden correct field→value map |
283
+ | `budget` | int | Steps remaining |
284
+ | `status` | Enum | RUNNING / TERMINAL |
285
+ | `created_at` | datetime | Episode start time |
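A dataclass mirroring the per-episode state table might look like the following sketch (the field defaults are assumptions, not part of the spec):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    RUNNING = "running"
    TERMINAL = "terminal"

@dataclass
class EpisodeState:
    episode_id: str
    task_id: str
    seed: int
    step_number: int = 0
    current_url: str = ""
    pages_visited: list = field(default_factory=list)
    extracted_data: dict = field(default_factory=dict)  # built up by the agent
    ground_truth: dict = field(default_factory=dict)    # hidden from the agent
    budget: int = 10
    status: Status = Status.RUNNING
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

state = EpisodeState(episode_id="ep-1", task_id="task_easy", seed=42)
```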
286
+
287
+ ---
288
+
289
+ ## 6. Task Definitions
290
+
291
+ ### Task 1: Static Page Field Extraction (Easy)
292
+
293
+ **ID:** `task_easy`
294
+ **Max Steps:** 10
295
+ **Max Pages:** 1
296
+ **Hints:** Yes
297
+
298
+ **Scenario:**
299
+ The agent is given a single product listing page for an e-commerce store. The page contains a product name, price, SKU, star rating, and number of reviews. Minimal noise. Fields are labeled clearly.
300
+
301
+ **Target Fields:**
302
+ ```
303
+ product_name, price, sku, star_rating, review_count
304
+ ```
305
+
306
+ **Sample Page URL:** `sim://shop.example.com/product/42`
307
+
308
+ **Ground Truth (example, seeded):**
309
+ ```json
310
+ {
311
+ "product_name": "Wireless Noise-Cancelling Headphones",
312
+ "price": "$89.99",
313
+ "sku": "WNC-4421-BLK",
314
+ "star_rating": "4.3",
315
+ "review_count": "1,247"
316
+ }
317
+ ```
318
+
319
+ **Success Criteria:**
320
+ - Extract all 5 fields correctly → score 1.0
321
+ - Partial credit per field (0.2 per field)
322
+ - Normalized comparison (whitespace-stripped, case-insensitive)
323
+
324
+ **Difficulty Rationale:** A capable LLM can find labeled fields in clean HTML in 1–3 steps with direct CSS selectors or simple keyword search.
325
+
326
+ ---
327
+
328
+ ### Task 2: Paginated Catalog Scraping (Medium)
329
+
330
+ **ID:** `task_medium`
331
+ **Max Steps:** 25
332
+ **Max Pages:** 5
333
+ **Hints:** Partial (structure hint, no selector hint)
334
+
335
+ **Scenario:**
336
+ The agent must scrape a product catalog spread across 3 pages of pagination (20 items per page, 60 total items simulated). The agent must collect the **name and price of the 3 cheapest items** across all pages. Items are listed in random price order. The agent must decide whether to visit all pages or infer from partial data.
337
+
338
+ **Target Fields:**
339
+ ```
340
+ cheapest_item_1_name, cheapest_item_1_price,
341
+ cheapest_item_2_name, cheapest_item_2_price,
342
+ cheapest_item_3_name, cheapest_item_3_price
343
+ ```
344
+
345
+ **Complications introduced:**
346
+ - Prices use mixed formats: `$12.99`, `$12.990`, `12.99 USD` — normalization required
347
+ - One page contains a "Featured" item injected at the top that is actually overpriced
348
+ - Pagination links use non-obvious URL patterns (`?pg=2` vs `?offset=20`)
349
+
350
+ **Grader Logic:**
351
+ 1. Extract agent's top-3 cheapest items
352
+ 2. Compare to ground truth top-3 (computed by environment at episode start)
353
+ 3. Score = (# correctly identified items / 3) × quality bonus (if price values match within ±$0.01)
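Steps 1–3 above can be sketched as follows; `parse_price` treats `$12.990` as a three-decimal variant of 12.99, which is one reasonable reading of the mixed formats listed under the complications:

```python
import heapq
import re

def parse_price(raw: str) -> float:
    """Normalize mixed formats: '$12.99', '12.99 USD', '$12.990' -> 12.99"""
    digits = re.sub(r"[^0-9.]", "", raw)   # drop currency symbols and units
    return round(float(digits), 2)         # treat a trailing third decimal as noise

def three_cheapest(items):
    """items: (name, raw_price) pairs gathered across all catalog pages."""
    return heapq.nsmallest(3, ((parse_price(p), n) for n, p in items))

catalog = [("Mug", "$12.99"), ("Lamp", "4.50 USD"),
           ("Desk", "$89.00"), ("Pen", "$1.250")]
cheapest = three_cheapest(catalog)
```

Working in normalized floats also makes the ±$0.01 quality-bonus comparison a simple `abs(a - b) <= 0.01` check.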
354
+
355
+ **Difficulty Rationale:** Requires multi-page navigation planning, price normalization, and sorting logic — a significant step up from single-page extraction.
356
+
357
+ ---
358
+
359
+ ### Task 3: Deep Research with Search & Fact Verification (Hard)
360
+
361
+ **ID:** `task_hard`
362
+ **Max Steps:** 60
363
+ **Max Pages:** 20
364
+ **Hints:** None
365
+ **Search Engine:** Required (uses configured engine or environment default)
366
+ **Fact Verification:** Required for minimum 3 fields to achieve full score
367
+
368
+ ---
369
+
370
+ **Scenario:**
371
+ The agent is given a **target entity** (a mid-size private company, randomly selected per seed) and must build a fully sourced, verified intelligence profile. No starting URL is provided — the agent must begin by issuing search engine queries to discover relevant pages. Information is distributed across 6+ simulated domains, and some fields appear only on pages discoverable via search (not linked from the entry page). At least two fields will have conflicting values across sources, and the agent must explicitly resolve these conflicts to earn full credit.
372
+
373
+ ---
374
+
375
+ **Target Fields (14 total, grouped by difficulty tier):**
376
+
377
+ ```
378
+ ── Tier 1 — Basic Identity (weight 1.0x each) ──────────────────────────
379
+ company_name Full legal name of the company
380
+ headquarters_city City of primary HQ
381
+ headquarters_country Country of primary HQ
382
+ primary_industry Top-level industry category (e.g. "FinTech", "SaaS")
383
+
384
+ ── Tier 2 — Operational Data (weight 1.5x each) ────────────────────────
385
+ founding_year Year company was founded [CONFLICT present]
386
+ employee_count_range Bucketed range: "1-50" | "51-200" | "201-500" | "501-2000" | "2000+"
387
+ ceo_name Full name of current CEO [requires search to discover page]
388
+ product_count Number of distinct products/services listed [requires enumeration]
389
+
390
+ ── Tier 3 — Financial & Strategic (weight 2.0x each) ───────────────────
391
+ latest_funding_round_type Series A/B/C | Seed | Growth | IPO | Unknown
392
+ latest_funding_amount_usd Numeric USD value (normalize: "$12M" → 12000000)
393
+ total_funding_usd Cumulative raised (may require summing across rounds) [CONFLICT present]
394
+ lead_investor Name of lead investor in latest round [search-only page]
395
+
396
+ ── Tier 4 — Verification Required (weight 2.5x each) ───────────────────
397
+ founding_year_verified Must call VERIFY_FACT; score only awarded if verified
398
+ ceo_name_verified Must call VERIFY_FACT from a second independent source
399
+ ```
400
+
401
+ ---
402
+
403
+ **Complications introduced:**
404
+
405
+ **Search-first discovery**
406
+ No entry URL is provided. The agent must use `SEARCH_ENGINE` to find a homepage, news page, and financial data page. The simulated search engine returns ranked results with varying relevance — the top result is not always the most useful one.
407
+
408
+ **Cross-domain fragmentation**
409
+ Data is spread across 6 simulated domains. No single domain holds more than 4 fields. The agent must plan a visit sequence and track what it has found vs. what is still missing.
410
+
411
+ | Domain | Fields present |
412
+ |---|---|
413
+ | `sim://company.example.com` | company_name, headquarters_city/country, primary_industry |
414
+ | `sim://directory.example.com` | founding_year (version A), employee_count_range, ceo_name |
415
+ | `sim://news.example.com` | latest_funding_round_type, latest_funding_amount_usd, lead_investor |
416
+ | `sim://finance.example.com` | total_funding_usd, founding_year (version B — conflict), product_count |
417
+ | `sim://regulatory.example.com` | founding_year (authoritative — SEC-style filing, only discoverable via search) |
418
+ | `sim://linkedin-sim.example.com` | ceo_name (second independent source for verification) |
419
+
420
+ **Deliberate conflicts**
421
+ - `founding_year`: directory says 2011, finance page says 2013. The regulatory filing (search-only) says 2012 — this is the authoritative answer. Agent must issue `SEARCH_ENGINE` query to find it, then `RESOLVE_CONFLICT` naming it as authoritative.
422
+ - `total_funding_usd`: news page reports latest round only; finance page has cumulative. Agent must distinguish these and report cumulative.
423
+
424
+ **Prose extraction & normalization**
425
+ - `employee_count_range` appears as: "We have grown to over 800 people worldwide" → must map to `"501-2000"`
426
+ - `latest_funding_amount_usd` appears as: "raised $24.5 million in Series B" → must normalize to `24500000`
427
+ - `product_count` requires counting `<li>` items inside a specific section, not reading a single labeled field
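The two numeric normalizations above might be implemented roughly like this (the regexes and bucket edges are illustrative, not the environment's actual rules):

```python
import re

EMPLOYEE_BUCKETS = [(50, "1-50"), (200, "51-200"), (500, "201-500"),
                    (2000, "501-2000"), (float("inf"), "2000+")]

def bucket_employees(text: str) -> str:
    """'We have grown to over 800 people worldwide' -> '501-2000'"""
    count = int(re.search(r"(\d[\d,]*)\s*(?:people|employees)", text)
                .group(1).replace(",", ""))
    return next(label for limit, label in EMPLOYEE_BUCKETS if count <= limit)

USD_SCALE = {"million": 1_000_000, "m": 1_000_000,
             "billion": 1_000_000_000, "b": 1_000_000_000}

def parse_usd(text: str) -> int:
    """'raised $24.5 million in Series B' -> 24500000; '$12M' -> 12000000"""
    m = re.search(r"\$\s*([\d.]+)\s*(million|billion|m|b)?", text, re.I)
    unit = (m.group(2) or "").lower()
    return int(float(m.group(1)) * USD_SCALE.get(unit, 1))
```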
428
+
429
+ **Simulated anti-scraping**
430
+ - `sim://finance.example.com` returns a 429-like interstitial on the first visit; the agent must either retry (costs a step) or configure a proxy/VPN in settings to bypass it
431
+ - `sim://linkedin-sim.example.com` requires a `SEARCH_PAGE` keyword unlock before full content is accessible
432
+
433
+ **Verification gates**
434
+ Fields `founding_year_verified` and `ceo_name_verified` are only scoreable if the agent has issued a `VERIFY_FACT` action for them referencing a different domain than the one the value was originally extracted from. The grader checks the action log — extraction alone is not sufficient.
435
+
436
+ ---
437
+
438
+ **Search Engine Behavior in Task 3:**
439
+
440
+ When the agent calls `SEARCH_ENGINE`, the simulated engine returns results structured as:
441
+
442
+ ```json
443
+ {
444
+ "query": "Acme Corp company profile",
445
+ "results": [
446
+ {
447
+ "rank": 1,
448
+ "title": "Acme Corp — Official Website",
449
+ "url": "sim://company.example.com/about",
450
+ "snippet": "Acme Corp is a leading SaaS platform headquartered in Austin..."
451
+ },
452
+ {
453
+ "rank": 2,
454
+ "title": "Acme Corp on Business Directory",
455
+ "url": "sim://directory.example.com/acme-corp",
456
+ "snippet": "Founded in 2011. 820 employees. CEO: Jane Doe..."
457
+ }
458
+ ],
459
+ "total_results_simulated": 47,
460
+ "engine_used": "brave"
461
+ }
462
+ ```
463
+
464
+ The agent can call `SEARCH_ENGINE` up to **8 times** per episode without penalty. Beyond 8 calls, each additional search costs `-0.05` reward (diminishing returns signal).
465
+
466
+ ---
467
+
468
+ **Grader Logic:**
469
+
470
+ ```python
471
+ def score_task_hard(submission, ground_truth, episode_state):
472
+ score = 0.0
473
+ max_score = sum(FIELD_WEIGHTS.values()) # 23.0 weighted points: 4×1.0 + 4×1.5 + 4×2.0 + 2×2.5
474
+
475
+ for field, weight in FIELD_WEIGHTS.items():
476
+ agent_val = normalize(submission.get(field))
477
+ truth_val = normalize(ground_truth[field])
478
+
479
+ if field.endswith("_verified"):
480
+ # Only award if agent issued a VERIFY_FACT for this field
481
+ # referencing a different source than the extraction source
482
+ verify_actions = [a for a in episode_state.action_log
483
+ if a.action_type == "verify_fact"
484
+ and a.field_name == field.replace("_verified", "")]
485
+ cross_source = any(
486
+ a.verification_source != episode_state.primary_source_for[field]
487
+ for a in verify_actions
488
+ )
489
+ if agent_val == truth_val and cross_source:
490
+ score += weight
491
+ elif agent_val == truth_val:
492
+ score += weight * 0.5 # Partial: correct but unverified
493
+ elif field in CONFLICT_FIELDS:
494
+ # Check agent issued RESOLVE_CONFLICT with correct authoritative source
495
+ resolve_actions = [a for a in episode_state.action_log
496
+ if a.action_type == "resolve_conflict"
497
+ and field in str(a)]
498
+ resolved_correctly = any(
499
+ a.chosen_source == AUTHORITATIVE_SOURCE[field]
500
+ for a in resolve_actions
501
+ )
502
+ if agent_val == truth_val and resolved_correctly:
503
+ score += weight
504
+ elif agent_val == truth_val:
505
+ score += weight * 0.6 # Correct value but no explicit resolution
506
+ else:
507
+ if agent_val == truth_val:
508
+ score += weight
509
+ elif partial_match(agent_val, truth_val):
510
+ score += weight * 0.4
511
+
512
+ # Coverage bonus: +0.5 if all 14 fields present in submission (even if some wrong)
513
+ coverage_bonus = 0.5 if len(submission) >= 14 else len(submission) / 14 * 0.5
514
+
515
+ raw = (score + coverage_bonus) / (max_score + 0.5)
516
+ return min(raw, 1.0)
517
+ ```
518
+
519
+ **Expected baseline scores:**
520
+
521
+ | Agent | Expected Score | Bottleneck |
522
+ |---|---|---|
523
+ | gpt-4o-mini (no tools) | ~0.20 | Cannot discover search-only pages |
524
+ | gpt-4o-mini + search | ~0.45 | Struggles with conflict resolution |
525
+ | gpt-4o (ReAct loop) | ~0.62 | Verification gate compliance |
526
+ | Human (manual) | ~0.90 | Benchmark ceiling |
527
+
528
+ **Difficulty Rationale:** This task is genuinely hard for frontier models because it requires: (1) search-first discovery with no entry URL, (2) multi-domain planning across 6 sources, (3) fact verification as a mandatory action class (not just extracting a value), (4) explicit conflict resolution with source authority reasoning, and (5) normalization of numeric and prose values. No single capability is sufficient — the agent must exercise all of them in one episode.
529
+
530
+ ---
531
+
532
+ ## 7. Grader Design
533
+
534
+ Each task has a dedicated `Grader` class implementing the following interface:
535
+
536
+ ```python
537
+ class BaseGrader(ABC):
538
+ def score(
539
+ self,
540
+ agent_submission: dict, # The agent's SUBMIT payload
541
+ ground_truth: dict, # Hidden correct values
542
+ episode_state: EpisodeState
543
+ ) -> GraderResult:
544
+ ...
545
+
546
+ class GraderResult(BaseModel):
547
+ score: float # 0.0 – 1.0
548
+ field_scores: dict[str, float] # Per-field breakdown
549
+ feedback: str # Human-readable explanation
550
+ penalty_applied: bool # True if penalties were triggered
551
+ penalty_reason: str | None
552
+ ```
553
+
554
+ **Normalization Rules applied before comparison:**
555
+
556
+ | Field Type | Normalization |
557
+ |---|---|
558
+ | Price | Strip currency symbols, commas → float |
559
+ | Text | Strip whitespace, lowercase, remove punctuation |
560
+ | Number with commas | `"1,247"` → `1247` |
561
+ | Range | `"500-999"` bucketed comparison |
562
+ | Year | Integer comparison |
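A minimal `normalize` dispatcher implementing these rules could look like the sketch below; the `field_type` tags are assumptions, and the real grader may infer types differently:

```python
import re

def normalize(value: str, field_type: str):
    """Apply the normalization rules above before comparing to ground truth."""
    v = value.strip()
    if field_type == "price":
        return float(re.sub(r"[^0-9.]", "", v))   # "$1,247.00" -> 1247.0
    if field_type == "number":
        return int(v.replace(",", ""))            # "1,247" -> 1247
    if field_type == "year":
        return int(v)                             # integer comparison
    if field_type == "text":
        return re.sub(r"[^\w\s]", "", v.lower()).strip()
    return v  # ranges are compared via their bucket labels elsewhere
```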
563
+
564
+ **Penalties:**
565
+ - If `step_number > max_steps * 0.8` and fewer than 50% of fields extracted → efficiency penalty of -0.1
566
+ - If agent submits more than 3 times (SUBMIT + reset-less re-attempts) → repeat penalty of -0.05 per extra submit
567
+
568
+ **Determinism guarantee:** All graders use only the seeded `ground_truth` dict and the submitted dict. No randomness at score time.

---

## 8. Reward Function Design

The reward function provides **dense signal across the full trajectory**, not just a terminal reward.

```
R_total = R_extraction + R_efficiency + R_navigation + R_terminal - R_penalty
```

### Per-Step Rewards

| Event | Reward | Rationale |
|---|---|---|
| `EXTRACT_FIELD` → correct value | +0.15 | Core task success signal |
| `EXTRACT_FIELD` → partially correct (wrong format, right content) | +0.05 | Encourages normalization learning |
| `EXTRACT_FIELD` → wrong value | -0.05 | Penalizes overconfident extraction |
| `EXTRACT_FIELD` → field already extracted | -0.10 | Penalizes redundant actions |
| `NAVIGATE` → new relevant page | +0.05 | Rewards exploration |
| `NAVIGATE` → already-visited page | -0.08 | Penalizes loops |
| `NAVIGATE` → irrelevant page (no target fields) | -0.03 | Soft penalty for bad routing |
| `SEARCH_PAGE` → finds target field hint | +0.03 | Rewards intelligent search |
| `SEARCH_PAGE` → no results | -0.01 | Small cost for wasted action |
| `INSPECT_ELEMENT` → valid selector hit | +0.02 | Rewards precision |
| `SKIP_PAGE` → page is actually irrelevant | +0.05 | Rewards correct relevance judgment |
| `SKIP_PAGE` → page contained target fields | -0.15 | Penalizes incorrect dismissal |
| `SEARCH_ENGINE` → query within 8-call budget | 0.00 | Neutral — search is a tool, not scored |
| `SEARCH_ENGINE` → discovers a new relevant domain | +0.08 | Rewards effective query formulation |
| `SEARCH_ENGINE` → call #9+ (over budget) | -0.05 | Diminishing returns signal |
| `VERIFY_FACT` → claimed value confirmed | +0.12 | Rewards verification behavior |
| `VERIFY_FACT` → claimed value contradicted | +0.08 | Still rewards checking (good epistemic practice) |
| `VERIFY_FACT` → verifying already-verified field | -0.05 | Penalizes redundant verification |
| `RESOLVE_CONFLICT` → correct authoritative source | +0.20 | High reward for correct reasoning |
| `RESOLVE_CONFLICT` → wrong authoritative source | -0.10 | Penalizes incorrect conflict resolution |
| `FETCH_URL` → returns useful content | +0.02 | Small reward for productive fetch |
| `FETCH_URL` → blocked (anti-scrape, no proxy set) | -0.03 | Mild penalty — should configure proxy |
| `FETCH_URL` → blocked (proxy active, retry succeeds) | +0.05 | Rewards using proxy correctly |
| Budget exhaustion (no SUBMIT) | -0.20 | Penalizes running out of budget |

### Terminal Reward (on SUBMIT)

```
R_terminal = grader_score × 2.0
```

This scales the terminal reward to dominate the trajectory reward, ensuring the agent optimizes for final output quality.

### Reward Range

- Minimum possible (all wrong, loops, budget exhausted): approximately -2.5
- Maximum possible (all correct, efficient path): approximately +2.5
- Typical good agent trajectory: +1.0 to +1.8
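
Putting the pieces together, the episode-level computation reduces to a sum of per-step rewards plus the scaled grader score (a minimal sketch; the real `RewardEngine` also tracks a per-event breakdown):

```python
# Sketch of the total-reward computation described above.
def total_reward(step_rewards: list[float], grader_score: float) -> float:
    # Terminal reward = grader_score * 2.0, scaled to dominate the trajectory sum.
    return sum(step_rewards) + grader_score * 2.0

# Example: three correct extractions, one wrong extraction, then a 0.9 grade.
episode_total = total_reward([0.15, 0.15, 0.15, -0.05], grader_score=0.9)
```

With these example values the trajectory contributes 0.40 and the terminal reward 1.80, landing squarely in the "typical good agent" range above.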

---

## 9. Network Layer — VPN & Proxy

The network layer is an optional but impactful system component. When active, all `NAVIGATE`, `FETCH_URL`, and `SEARCH_ENGINE` actions route outbound requests through the configured proxy or VPN. In simulation mode (default), the layer gates which simulated domains respond with 200 vs. 429 — giving agents a realistic incentive to configure networking.

---

### 9.1 Architecture

```
Agent Action (FETCH_URL / NAVIGATE / SEARCH_ENGINE)
        │
        ▼
┌───────────────────────┐
│     NetworkRouter     │
│                       │
│  active_proxy? ──────►│──► requests.Session(proxies={...})
│  active_vpn?   ──────►│──► subprocess → wireguard/openvpn tunnel
│  neither       ──────►│──► direct (or blocked by anti-scrape sim)
└───────────────────────┘
        │
        ▼
SimulatedWebServer / Real HTTP (if live mode enabled)
```

**Two operating modes:**

| Mode | Description | When used |
|---|---|---|
| `simulation` (default) | No real network; proxy/VPN settings control which simulated domains unblock | Always safe, deterministic, no credentials needed |
| `live` | Real HTTP requests routed through configured proxy/VPN | Optional; requires user-supplied credentials or public pool selection |

Mode is set in `Settings → Network → Mode`. `live` mode is off by default and requires explicit opt-in.

---

### 9.2 Proxy Configuration

Proxies can be configured three ways: user-supplied credentials, a pre-tested public proxy pool, or disabled.

**Settings model:**

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class ProxyConfig(BaseModel):
    enabled: bool = False
    mode: Literal["custom", "public_pool", "rotating"] = "custom"

    # ── Custom proxy (user-supplied) ──────────────────────────────
    host: str | None = None                # e.g. "proxy.mycompany.com"
    port: int | None = None                # e.g. 8080
    protocol: Literal["http", "https", "socks4", "socks5"] = "http"
    username: str | None = None            # Optional auth
    password: str | None = None            # Stored encrypted at rest (Fernet)
    auth_scheme: Literal["basic", "digest", "ntlm"] = "basic"

    # ── Public pool (no credentials required) ────────────────────
    public_pool_provider: str | None = None       # "webshare" | "proxyscrape" | "openproxy"
    public_pool_country_filter: str | None = None # ISO 3166-1 e.g. "US", "DE"

    # ── Rotating proxy ────────────────────────────────────────────
    rotating_endpoint: str | None = None   # e.g. "rotate.proxyservice.io:8080"
    rotate_every_n_requests: int = 10

    # ── Validation ────────────────────────────────────────────────
    test_url: str = "http://httpbin.org/ip"
    last_test_result: str | None = None    # "ok" | "timeout" | "auth_failed"
    last_tested_at: datetime | None = None
```

**Proxy URL construction (internal):**

```python
def build_proxy_url(cfg: ProxyConfig) -> str:
    if cfg.username and cfg.password:
        return f"{cfg.protocol}://{cfg.username}:{cfg.password}@{cfg.host}:{cfg.port}"
    return f"{cfg.protocol}://{cfg.host}:{cfg.port}"
```

**Public pool providers (pre-configured, no credentials):**

| Provider key | Type | Notes |
|---|---|---|
| `webshare` | HTTP rotating | 10 free proxies on free tier |
| `proxyscrape` | HTTP/SOCKS5 scraped list | Refreshed every 15 min |
| `openproxy` | HTTP/HTTPS | Community maintained |

The environment ships with a static list of ~50 pre-validated public proxies for simulation mode. In live mode, lists are fetched fresh from provider APIs.

---

### 9.3 VPN Configuration

VPN integration supports **WireGuard** and **OpenVPN** protocols. Users paste their config file content or fill individual fields in the Settings UI.

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class VPNConfig(BaseModel):
    enabled: bool = False
    protocol: Literal["wireguard", "openvpn"] = "wireguard"

    # ── WireGuard ─────────────────────────────────────────────────
    wg_config_content: str | None = None   # Full .conf file content (pasted in UI)
    wg_interface_name: str = "wg0"

    # ── OpenVPN ───────────────────────────────────────────────────
    ovpn_config_content: str | None = None # Full .ovpn file content
    ovpn_username: str | None = None
    ovpn_password: str | None = None       # Encrypted at rest

    # ── Common ────────────────────────────────────────────────────
    server_label: str | None = None        # Human label e.g. "US East — NordVPN"
    kill_switch: bool = True               # Block requests if tunnel drops
    last_test_result: str | None = None
    last_connected_at: datetime | None = None
```

**VPN lifecycle (live mode):**

```
POST /api/settings/vpn/connect
  → writes temp config file
  → subprocess: wg-quick up wg0  OR  openvpn --daemon --config temp.ovpn
  → polls interface for IP change
  → stores connected_ip in session

POST /api/settings/vpn/disconnect
  → subprocess: wg-quick down wg0  OR  killall openvpn
  → clears connected_ip
```

In **simulation mode**, VPN is purely logical — activating it marks the session as "VPN active", which causes the simulated anti-scrape layer to allow all domain requests.

> **Docker note:** WireGuard and OpenVPN require `NET_ADMIN` and `SYS_MODULE` capabilities. The Dockerfile exposes these only if `ENABLE_LIVE_NETWORK=true` is set. HF Spaces deployment runs in simulation mode only (capabilities not available).

---

### 9.4 Public Pool (Quick Start)

For users who don't have their own proxy or VPN, the Settings UI offers a **Public Pool** tab that requires zero configuration:

| Pool name | Protocol | Speed | Reliability | Notes |
|---|---|---|---|---|
| WebShare Free | HTTP rotating | Medium | High | Registration required (free) |
| ProxyScrape | HTTP/SOCKS5 | Variable | Medium | No registration |
| OpenProxy Space | HTTP/HTTPS | Slow | Low | Community pool, use as fallback |
| Simulation Bypass | Simulated | N/A | 100% | Always available; simulation only |

Selecting "Simulation Bypass" is the recommended option for evaluation runs — it unlocks all simulated anti-scrape gates without needing real network credentials.

---

### 9.5 Settings Persistence

All network settings are stored server-side in a lightweight JSON config file (`config/network_settings.json`). Passwords and VPN configs are encrypted using **Fernet symmetric encryption** with a key derived from a server-side secret (`SETTINGS_SECRET` env var).

```python
from typing import Literal

from pydantic import BaseModel

# ProxyConfig and VPNConfig as defined in Sections 9.2 and 9.3

class NetworkSettings(BaseModel):
    proxy: ProxyConfig = ProxyConfig()
    vpn: VPNConfig = VPNConfig()
    default_search_engine: Literal["google", "bing", "brave", "ddg"] = "brave"
    live_mode_enabled: bool = False
    request_timeout_seconds: int = 10
    max_retries: int = 3
    retry_backoff_factor: float = 1.5
    user_agent: str = "WebScraperOpenEnv/1.0"
```

The Settings UI reads from `GET /api/settings` and writes via `PUT /api/settings`. Passwords are never returned in GET responses — they are write-only from the UI's perspective.
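
The write-only password behavior can be sketched as a recursive redaction pass over the settings dict before it is serialized into a GET response (a minimal sketch; the field names in `SECRET_FIELDS` are illustrative):

```python
# Sketch: GET responses null out secret fields so passwords stay write-only.
SECRET_FIELDS = {"password", "ovpn_password", "wg_config_content"}

def redact(settings: dict) -> dict:
    out = {}
    for key, value in settings.items():
        if isinstance(value, dict):
            out[key] = redact(value)          # recurse into nested models
        else:
            out[key] = None if key in SECRET_FIELDS else value
    return out

safe = redact({"proxy": {"host": "proxy.example.com", "password": "secret"}})
# safe["proxy"]["password"] is None; non-secret fields pass through unchanged
```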

---

## 10. API Endpoint Specification

All endpoints accept and return `application/json`.

### `POST /api/reset`

Initialize or restart an episode.

**Request:**
```json
{ "task_id": "task_easy", "seed": 42 }
```
**Response:** `Observation` model

---

### `POST /api/step`

Advance the episode by one action.

**Request:**
```json
{
  "episode_id": "uuid-...",
  "action": {
    "action_type": "extract_field",
    "target_field": "price",
    "selector": ".product-price"
  }
}
```
**Response:**
```json
{
  "observation": { "..." : "..." },
  "reward": { "value": 0.15, "cumulative": 0.15, "breakdown": {}, "message": "..." },
  "done": false,
  "info": { "step": 1, "budget_remaining": 9 }
}
```

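A minimal client sketch for the step call above, using only the standard library (the base URL and payload values are examples taken from the request shape shown):

```python
# Sketch: building the POST /api/step request from the spec above.
import json
import urllib.request

BASE = "http://localhost:7860"

def build_step_request(episode_id: str, action: dict) -> urllib.request.Request:
    # POST /api/step with a JSON body, as specified above.
    payload = {"episode_id": episode_id, "action": action}
    return urllib.request.Request(
        BASE + "/api/step",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_step_request(
    "uuid-...",
    {"action_type": "extract_field", "target_field": "price", "selector": ".product-price"},
)
# urllib.request.urlopen(req) would return the observation/reward JSON shown above.
```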
835
+ ---
836
+
837
+ ### `GET /api/state`
838
+
839
+ Return current episode state. **Query param:** `episode_id=uuid-...`
840
+
841
+ ---
842
+
843
+ ### `GET /api/tasks`
844
+
845
+ Return all task definitions and their action schemas.
846
+
847
+ ---
848
+
849
+ ### `POST /api/grader`
850
+
851
+ Score a completed episode.
852
+
853
+ **Request:**
854
+ ```json
855
+ {
856
+ "episode_id": "uuid-...",
857
+ "submission": { "product_name": "...", "price": "..." }
858
+ }
859
+ ```
860
+ **Response:** `GraderResult` model
861
+
862
+ ---
863
+
864
+ ### `POST /api/baseline`
865
+
866
+ Trigger the built-in baseline inference script against all 3 tasks and return scores.
867
+
868
+ **Response:**
869
+ ```json
870
+ {
871
+ "baseline_model": "gpt-4o-mini",
872
+ "results": {
873
+ "task_easy": { "score": 0.92, "steps": 4, "fields_correct": 5 },
874
+ "task_medium": { "score": 0.67, "steps": 18, "fields_correct": 4 },
875
+ "task_hard": { "score": 0.38, "steps": 54, "fields_correct": 8 }
876
+ },
877
+ "aggregate_score": 0.66,
878
+ "run_id": "baseline-seed42"
879
+ }
880
+ ```

---

### `GET /api/settings`

Return current network settings. **Passwords are never returned** — password fields are always `null` in the response.

**Response:** `NetworkSettings` model (with password fields nulled)

---

### `PUT /api/settings`

Update network settings (full or partial).

**Request:** Partial `NetworkSettings` object — only provided fields are updated.

```json
{
  "proxy": {
    "enabled": true,
    "mode": "custom",
    "host": "proxy.example.com",
    "port": 8080,
    "protocol": "http",
    "username": "user",
    "password": "secret"
  }
}
```

---

### `POST /api/settings/proxy/test`

Test the current proxy configuration by making a request to `test_url`.

**Response:**
```json
{
  "success": true,
  "exit_ip": "45.33.32.156",
  "latency_ms": 312,
  "error": null
}
```

---

### `POST /api/settings/vpn/connect`

Activate the configured VPN tunnel (live mode only; simulation mode returns immediate success).

**Response:**
```json
{
  "connected": true,
  "tunnel_ip": "10.8.0.2",
  "exit_ip": "185.220.101.45",
  "protocol": "wireguard",
  "error": null
}
```

---

### `POST /api/settings/vpn/disconnect`

Tear down the active VPN tunnel.

---

### `GET /api/settings/network/status`

Returns the current active network configuration — what proxy/VPN is live right now.

**Response:**
```json
{
  "proxy_active": true,
  "proxy_host": "proxy.example.com:8080",
  "vpn_active": false,
  "vpn_server": null,
  "exit_ip": "45.33.32.156",
  "live_mode": false,
  "default_search_engine": "brave"
}
```

---

### `GET /api/settings/public-pool`

Returns the list of available public proxy/VPN pool options with current availability status.

**Response:**
```json
{
  "pools": [
    { "key": "simulation_bypass", "name": "Simulation Bypass", "available": true, "requires_auth": false },
    { "key": "webshare", "name": "WebShare Free", "available": true, "requires_auth": true },
    { "key": "proxyscrape", "name": "ProxyScrape", "available": true, "requires_auth": false },
    { "key": "openproxy", "name": "OpenProxy Space", "available": true, "requires_auth": false }
  ]
}
```

---

## 11. Data Models (Pydantic Schemas)

```python
# env/models.py

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional
import uuid

class ActionType(str, Enum):
    EXTRACT_FIELD = "extract_field"
    NAVIGATE = "navigate"
    SEARCH_PAGE = "search_page"
    INSPECT_ELEMENT = "inspect_element"
    SUBMIT = "submit"
    SKIP_PAGE = "skip_page"
    # Task 3 / network action classes (see Sections 8 and 9)
    SEARCH_ENGINE = "search_engine"
    FETCH_URL = "fetch_url"
    VERIFY_FACT = "verify_fact"
    RESOLVE_CONFLICT = "resolve_conflict"

class Action(BaseModel):
    action_type: ActionType
    target_field: Optional[str] = None
    selector: Optional[str] = None
    navigate_to: Optional[str] = None
    submit_extraction: Optional[dict] = None
    notes: Optional[str] = None

class Observation(BaseModel):
    episode_id: str
    task_id: str
    step_number: int
    current_url: str
    page_html: str
    page_title: str
    available_actions: list[str]
    extracted_so_far: dict
    pages_visited: list[str]
    budget_remaining: int
    task_description: str
    target_fields: list[str]
    hints: list[str]

class Reward(BaseModel):
    value: float
    cumulative: float
    breakdown: dict[str, float]
    message: str

class GraderResult(BaseModel):
    score: float = Field(ge=0.0, le=1.0)
    field_scores: dict[str, float]
    feedback: str
    penalty_applied: bool
    penalty_reason: Optional[str] = None

class EpisodeState(BaseModel):
    episode_id: str
    task_id: str
    seed: int
    step_number: int
    current_url: str
    pages_visited: list[str]
    extracted_data: dict
    budget_remaining: int
    status: str                     # "running" | "terminal"
    cumulative_reward: float
    created_at: str
    # Task 3 extras
    action_log: list[dict] = []     # Full action history for grader inspection
    search_calls_used: int = 0      # Track against 8-call free budget
    verified_fields: list[str] = [] # Fields that have passed VERIFY_FACT
    resolved_conflicts: list[str] = [] # Fields where RESOLVE_CONFLICT was issued

class SearchResult(BaseModel):
    rank: int
    title: str
    url: str
    snippet: str

class SearchEngineResponse(BaseModel):
    query: str
    results: list[SearchResult]
    total_results_simulated: int
    engine_used: str
    calls_remaining: int            # Free budget remaining (8 - used)

class VerifyFactResponse(BaseModel):
    field_name: str
    claimed_value: str
    verification_source: str
    verified: bool
    confidence: float               # 0.0 – 1.0
    supporting_text: str | None     # Excerpt from verification source
    contradicting_text: str | None

class NetworkStatus(BaseModel):
    proxy_active: bool
    proxy_host: Optional[str]
    vpn_active: bool
    vpn_server: Optional[str]
    exit_ip: Optional[str]
    live_mode: bool
    default_search_engine: str
```

---

## 12. Simulated Web Environment

The `SimulatedWebServer` class generates HTML pages on-the-fly using Jinja2 templates seeded by a deterministic RNG.

### Page Generator Pipeline

```
seed + task_id + url
        │
        ▼
RNG (random.Random)
        │
        ▼
Template Selector ──► Jinja2 template
        │
        ▼
Data Populator (products / company profiles / etc.)
        │
        ▼
Noise Injector ──► adds decoy elements, broken tags, ads
        │
        ▼
Anti-Scrape Layer ──► conditionally adds interstitials (task_hard)
        │
        ▼
HTML string (max 8,000 chars)
```

### Noise Types by Task

| Noise Type | Easy | Medium | Hard |
|---|---|---|---|
| Decoy fields with similar labels | ❌ | ✅ | ✅ |
| Inconsistent price formatting | ❌ | ✅ | ✅ |
| Broken/unclosed HTML tags | ❌ | ❌ | ✅ |
| Interstitial blocking page | ❌ | ❌ | ✅ |
| Contradictory values across pages | ❌ | ❌ | ✅ |
| JavaScript-only content (noscript fallback) | ❌ | ❌ | ✅ |
| Paginated content (multi-page) | ❌ | ✅ | ✅ |

### URL Scheme

Simulated URLs follow the pattern `sim://<domain>/<path>`. The environment maps these to page generators internally — no DNS or network calls occur.

```
sim://shop.example.com/product/42            → product page (task_easy)
sim://catalog.example.com/products?pg=1      → catalog page 1 of 3 (task_medium)
sim://company.example.com/about              → company homepage (task_hard)
sim://directory.example.com/org/acme         → directory listing (task_hard)
sim://news.example.com/search?q=acme         → news aggregator (task_hard)
sim://finance.example.com/ticker/ACME        → financial data (task_hard) ← 429 gate
sim://regulatory.example.com/filings/ACME    → SEC-style filing (task_hard, search-only)
sim://linkedin-sim.example.com/company/acme  → LinkedIn-style profile (task_hard, keyword gate)
```

**Anti-scrape simulation by domain:**

| Domain | Block type | Bypass method |
|---|---|---|
| `finance.example.com` | 429 Rate-limit on first visit | Retry after 1 step, or activate proxy |
| `linkedin-sim.example.com` | Keyword gate | `SEARCH_PAGE` with keyword "view_profile" |
| `regulatory.example.com` | Not linked — only discoverable via search | `SEARCH_ENGINE` with relevant query |

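The determinism of the pipeline can be sketched as follows (`make_rng` is a hypothetical helper, not the actual generator code; the point is that seeding on the `(seed, task_id, url)` triple makes page content reproducible):

```python
# Sketch: deterministic page generation keyed on (seed, task_id, url).
import random

def make_rng(seed: int, task_id: str, url: str) -> random.Random:
    # Identical (seed, task_id, url) triples always yield the same RNG stream,
    # and therefore the same generated page content.
    return random.Random(f"{seed}:{task_id}:{url}")

a = make_rng(42, "task_easy", "sim://shop.example.com/product/42")
b = make_rng(42, "task_easy", "sim://shop.example.com/product/42")
assert a.random() == b.random()   # same stream → same page
```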
---

## 13. Baseline Inference Script

`scripts/baseline.py` uses the OpenAI API to run a ReAct-style loop against the environment.

### Agent Strategy

```
System Prompt:
  You are a web scraping agent. You will be given an HTML page and a list
  of fields to extract. Use the available actions to extract all target
  fields as efficiently as possible and then submit your findings.

Loop:
  1. Call /reset with task_id and seed=42
  2. While not done:
     a. Format observation as: current URL, page HTML (truncated),
        fields still needed, steps remaining
     b. Prompt LLM for next action in JSON format
     c. Parse action → POST /step
     d. If done: record score
  3. Report all 3 task scores
```
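
Steps 2b–2c hinge on pulling a JSON action out of the model's free-form reply. A minimal sketch (the "bare JSON object after the thought" convention is an assumption about the prompt format, not confirmed implementation detail):

```python
# Sketch: parsing the LLM's JSON action from a ReAct-style reply.
import json
import re

def parse_action(reply: str) -> dict:
    # Grab the outermost {...} span from a reply like
    # "Thought: ...\n{\"action_type\": ...}" and parse it.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON action found in model reply")
    return json.loads(match.group(0))

action = parse_action(
    'Thought: the price is visible.\n'
    '{"action_type": "extract_field", "target_field": "price"}'
)
```

The parsed dict is then posted to `/api/step` as the `action` field of the request body.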

### Configuration

Read from environment variables:
```
OPENAI_API_KEY=...
BASELINE_MODEL=gpt-4o-mini   # default
BASELINE_SEED=42
BASELINE_MAX_RETRIES=3
```

### Reproducibility

- Fixed seed=42 for all tasks
- Deterministic page generation
- Temperature=0 for LLM calls
- Results logged to `results/baseline_<timestamp>.json`

### Expected Baseline Scores (gpt-4o-mini)

| Task | Expected Score | Notes |
|---|---|---|
| task_easy | ~0.90 | Near-perfect on clean pages |
| task_medium | ~0.60 | Pagination handling is tricky |
| task_hard | ~0.35 | Multi-source coordination challenges |
| **Aggregate** | **~0.62** | |

---

## 14. Project Structure

```
webscraper-openenv/
├── README.md
├── openenv.yaml
├── Dockerfile
├── requirements.txt
│
├── frontend/                          # Vite + React app
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── main.tsx
│       ├── App.tsx
│       ├── components/
│       │   ├── TaskSelector.tsx       # Pick task_easy / task_medium / task_hard
│       │   ├── EpisodeViewer.tsx      # Live observation display
│       │   ├── ActionPanel.tsx        # Manual action builder (for debugging)
│       │   ├── RewardChart.tsx        # Cumulative reward over steps
│       │   ├── BaselineRunner.tsx     # Trigger /api/baseline and show scores
│       │   └── settings/
│       │       ├── SettingsPage.tsx           # Top-level settings shell (tabbed layout)
│       │       ├── ProxySettings.tsx          # Proxy config form (custom / public pool / rotating)
│       │       ├── VPNSettings.tsx            # VPN config form (WireGuard / OpenVPN file paste)
│       │       ├── PublicPoolPicker.tsx       # Zero-config public proxy/VPN picker
│       │       ├── NetworkStatus.tsx          # Live badge: proxy active, VPN active, exit IP
│       │       └── SearchEngineSelector.tsx   # Default search engine picker
│       ├── hooks/
│       │   ├── useEpisode.ts          # Manages episode state via REST
│       │   ├── useNetworkSettings.ts  # Read/write /api/settings
│       │   └── useNetworkStatus.ts    # Polls /api/settings/network/status
│       └── api/
│           ├── client.ts              # Typed fetch wrappers for all endpoints
│           └── settingsClient.ts      # Settings-specific API calls
│
├── env/
│   ├── __init__.py
│   ├── environment.py                 # WebScraperEnv (step/reset/state)
│   ├── models.py                      # All Pydantic models
│   ├── reward.py                      # RewardEngine
│   ├── state.py                       # EpisodeState management
│   ├── tasks/
│   │   ├── task_easy.py
│   │   ├── task_medium.py
│   │   └── task_hard.py               # Includes search engine + verify + resolve logic
│   └── simulator/
│       ├── web_server.py
│       ├── page_generator.py
│       ├── search_engine.py           # SimulatedSearchEngine (ranked results by seed)
│       ├── fact_verifier.py           # FactVerifier (cross-source consistency check)
│       ├── noise_injector.py
│       └── templates/
│           ├── product.html
│           ├── catalog.html
│           ├── company.html
│           ├── directory.html
│           ├── news.html
│           ├── finance.html
│           ├── regulatory.html        # New: SEC-style filing page
│           └── linkedin_sim.html      # New: LinkedIn-style profile page
│
├── network/
│   ├── __init__.py
│   ├── router.py                      # NetworkRouter (proxy/VPN dispatch)
│   ├── proxy_manager.py               # ProxyManager (build URL, test, rotate)
│   ├── vpn_manager.py                 # VPNManager (wg-quick / openvpn subprocess)
│   ├── public_pool.py                 # PublicPoolFetcher (webshare, proxyscrape, openproxy)
│   └── settings_store.py              # Encrypted read/write of network_settings.json
│
├── config/
│   └── network_settings.json          # Persisted settings (passwords Fernet-encrypted)
│
├── api/
│   ├── __init__.py
│   ├── main.py                        # FastAPI app + static file mount
│   ├── routes/
│   │   ├── env_routes.py              # /api/reset, /api/step, /api/state, etc.
│   │   └── settings_routes.py         # /api/settings/*, /api/settings/vpn/*, etc.
│   └── schemas.py
│
├── scripts/
│   ├── baseline.py
│   └── validate.py
│
├── tests/
│   ├── test_environment.py
│   ├── test_graders.py
│   ├── test_reward.py
│   ├── test_task3_search.py           # Search engine + verify + resolve tests
│   ├── test_network.py                # Proxy/VPN config + routing tests
│   └── test_api.py
│
└── results/
    └── baseline_seed42.json
```

---

## 15. Dockerfile & Deployment

Everything ships in a **single Docker container**. The build is a two-stage process: Stage 1 compiles the Vite frontend into static files; Stage 2 installs the Python backend and copies the compiled frontend in. FastAPI then serves both the API and the frontend from port 7860.

### Request Routing (single port)

```
Port 7860
│
├── /api/*     → FastAPI routes (all OpenEnv endpoints)
├── /assets/*  → Vite static assets (JS, CSS, chunks)
└── /*         → index.html (SPA catch-all, handled by FastAPI StaticFiles)
```

FastAPI mounts the Vite build output (`frontend/dist/`) as a `StaticFiles` directory and adds a catch-all `GET /{full_path}` route that returns `index.html` so client-side routing works correctly.

```python
# api/main.py (relevant additions)
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

app.mount("/assets", StaticFiles(directory="frontend/dist/assets"), name="assets")

@app.get("/{full_path:path}", include_in_schema=False)
async def spa_fallback(full_path: str):
    return FileResponse("frontend/dist/index.html")
```

All API routes are prefixed with `/api` to avoid collisions with the SPA router:
```
POST /api/reset
POST /api/step
GET  /api/state
GET  /api/tasks
POST /api/grader
POST /api/baseline
```

The Vite frontend calls `fetch("/api/...")` — no base URL configuration needed in production since everything is on the same origin.

---

### Dockerfile (multi-stage)

```dockerfile
# ── Stage 1: Build Vite frontend ──────────────────────────────────────
FROM node:20-slim AS frontend-builder

WORKDIR /frontend

COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci

COPY frontend/ ./
RUN npm run build
# Output: /frontend/dist/


# ── Stage 2: Python backend + compiled frontend ────────────────────────
FROM python:3.11-slim

WORKDIR /app

# System packages:
#   wireguard-tools + iproute2 → wg-quick (live VPN, only used if ENABLE_LIVE_NETWORK=true)
#   openvpn                    → OpenVPN tunnel (same gate)
#   curl                       → proxy connectivity tests
RUN apt-get update && apt-get install -y --no-install-recommends \
    wireguard-tools \
    iproute2 \
    openvpn \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy backend source
COPY env/ ./env/
COPY network/ ./network/
COPY api/ ./api/
COPY scripts/ ./scripts/
COPY results/ ./results/
COPY config/ ./config/
COPY openenv.yaml .

# Copy compiled frontend from stage 1
COPY --from=frontend-builder /frontend/dist ./frontend/dist

ENV PYTHONUNBUFFERED=1
ENV PORT=7860
# ENABLE_LIVE_NETWORK=false → simulation mode (safe default, no NET_ADMIN needed)
# ENABLE_LIVE_NETWORK=true  → real proxy/VPN (requires --cap-add NET_ADMIN SYS_MODULE)
ENV ENABLE_LIVE_NETWORK=false
ENV SETTINGS_SECRET=changeme_generate_a_real_key_in_production

EXPOSE 7860

CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "7860"]
```

**Live network mode (local only, not for HF Spaces):**
```bash
docker run -p 7860:7860 \
  --cap-add NET_ADMIN \
  --cap-add SYS_MODULE \
  --sysctl net.ipv4.conf.all.src_valid_mark=1 \
  -e ENABLE_LIVE_NETWORK=true \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e SETTINGS_SECRET=$(openssl rand -hex 32) \
  webscraper-openenv
```

---

### requirements.txt

```
fastapi>=0.110.0
uvicorn[standard]>=0.29.0
pydantic>=2.6.0
jinja2>=3.1.3
openai>=1.20.0
pytest>=8.1.0
httpx>=0.27.0
aiofiles>=23.2.1       # FastAPI StaticFiles
cryptography>=42.0.0   # Fernet encryption for stored credentials
requests[socks]>=2.31.0  # SOCKS4/5 proxy support
```

During local development, Vite's dev server runs on `:5173` and the FastAPI backend runs on `:8000`. The dev-server proxy forwards all `/api` calls to the backend to avoid CORS issues:

```typescript
// vite.config.ts
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      '/api': {
        target: 'http://localhost:8000',
        changeOrigin: true,
      }
    }
  }
})
```

In production (inside Docker), no proxy is needed — both frontend and backend are on port 7860.

---

### Local Development Workflow

```bash
# Option A: Full Docker (production-identical)
docker build -t webscraper-openenv .
docker run -p 7860:7860 -e OPENAI_API_KEY=$OPENAI_API_KEY webscraper-openenv
# Visit: http://localhost:7860

# Option B: Split dev servers (fast iteration)
# Terminal 1 — backend
uvicorn api.main:app --reload --port 8000

# Terminal 2 — frontend
cd frontend && npm run dev
# Visit: http://localhost:5173 (proxies API to :8000)
```

1498
+ ### Build & Smoke Test
1499
+
1500
+ ```bash
1501
+ docker build -t webscraper-openenv .
1502
+
+ # Run the container so the endpoints below are live
+ docker run -d -p 7860:7860 -e OPENAI_API_KEY=$OPENAI_API_KEY webscraper-openenv
+
1503
+ # Smoke test the API
1504
+ curl http://localhost:7860/api/tasks
1505
+
1506
+ # Smoke test the frontend is served
1507
+ curl -s http://localhost:7860 | grep -q "<div id=\"root\">" && echo "Frontend OK"
1508
+
1509
+ # Full reset/step cycle
1510
+ curl -X POST http://localhost:7860/api/reset \
1511
+ -H "Content-Type: application/json" \
1512
+ -d '{"task_id": "task_easy", "seed": 42}'
1513
+ ```
1514
+
1515
+ ### Hugging Face Spaces Deployment
1516
+
1517
+ The Space will be tagged with `openenv` and configured as:
1518
+ - **SDK:** Docker
1519
+ - **App port:** 7860
1520
+ - **Secrets:** `OPENAI_API_KEY` set via HF Secrets UI
1521
+ - No extra build steps needed — the Dockerfile handles `npm ci && npm run build` internally in Stage 1
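The list above maps onto the Space's README front matter, which is where Spaces reads the SDK and port settings. A minimal sketch (the `title` value is illustrative; only `sdk`, `app_port`, and `tags` come from the configuration above):

```yaml
---
title: WebScraper OpenEnv
sdk: docker
app_port: 7860
tags:
  - openenv
---
```

The `OPENAI_API_KEY` secret is still set through the Spaces Secrets UI, not in this file.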
1522
+
1523
+ ---
1524
+
1525
+ ## 15. openenv.yaml
1526
+
1527
+ ```yaml
1528
+ name: webscraper-openenv
1529
+ version: "1.0.0"
1530
+ description: >
1531
+ A web scraping environment where AI agents extract structured data
1532
+ from simulated HTML pages with varying complexity, pagination,
1533
+ and adversarial noise patterns.
1534
+
1535
+ author: "[Your Name]"
1536
+ license: MIT
1537
+
1538
+ tags:
1539
+ - openenv
1540
+ - web-scraping
1541
+ - information-extraction
1542
+ - nlp
1543
+ - real-world
1544
+
1545
+ tasks:
1546
+ - id: task_easy
1547
+ name: "Static Page Field Extraction"
1548
+ difficulty: easy
1549
+ max_steps: 10
1550
+ description: "Extract 5 product fields from a single clean product page."
1551
+
1552
+ - id: task_medium
1553
+ name: "Paginated Catalog Scraping"
1554
+ difficulty: medium
1555
+ max_steps: 25
1556
+ description: "Find the 3 cheapest items across 3 pages of a product catalog."
1557
+
1558
+ - id: task_hard
1559
+ name: "Multi-Source Research Aggregation"
1560
+ difficulty: hard
1561
+ max_steps: 40
1562
+ description: "Aggregate a company profile from 4 different simulated web sources."
1563
+
1564
+ api:
1565
+ reset: POST /reset
1566
+ step: POST /step
1567
+ state: GET /state
1568
+ tasks: GET /tasks
1569
+ grader: POST /grader
1570
+ baseline: POST /baseline
1571
+
1572
+ observation_space:
1573
+ type: structured
1574
+ fields:
1575
+ - page_html: string
1576
+ - current_url: string
1577
+ - extracted_so_far: object
1578
+ - budget_remaining: integer
1579
+ - target_fields: array
1580
+
1581
+ action_space:
1582
+ type: structured
1583
+ action_types:
1584
+ - extract_field
1585
+ - navigate
1586
+ - search_page
1587
+ - inspect_element
1588
+ - submit
1589
+ - skip_page
1590
+
1591
+ reward_range: [-2.5, 2.5]
1592
+ episode_termination:
1593
+ - "SUBMIT action called"
1594
+ - "budget_remaining reaches 0"
1595
+ ```
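A client can enforce the declared `action_space` before calling `/step`; a minimal sketch (the validator and payload shape are illustrative, not part of the spec):

```python
# Action types declared in openenv.yaml's action_space
ACTION_TYPES = {
    "extract_field", "navigate", "search_page",
    "inspect_element", "submit", "skip_page",
}

def validate_action(action: dict) -> dict:
    """Reject payloads whose action_type is not in the declared action space."""
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return action

validate_action({"action_type": "extract_field", "field": "price"})  # passes
```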
1596
+
1597
+ ---
1598
+
1599
+ ## 16. Testing Strategy
1600
+
1601
+ ### Unit Tests
1602
+
1603
+ **`test_graders.py`**
1604
+ - Test each grader with perfect submission → expect score = 1.0
1605
+ - Test each grader with empty submission → expect score = 0.0
1606
+ - Test partial submissions → expect intermediate scores
1607
+ - Test normalization edge cases (price formats, whitespace, encoding)
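The normalization cases lend themselves to table-driven tests; `normalize_price` below is an illustrative helper, not the grader's actual function:

```python
def normalize_price(raw: str) -> float:
    """Illustrative normalizer: strip currency symbols, separators, and spacing."""
    cleaned = raw.strip().replace("$", "").replace(",", "").replace("\u00a0", "")
    return float(cleaned)

# Table-driven edge cases mirroring the bullet list above
CASES = [
    ("$49.99", 49.99),
    (" 1,299.00 ", 1299.00),
    ("$\u00a03.50", 3.50),  # non-breaking space copied from HTML
]

def test_price_normalization():
    for raw, expected in CASES:
        assert normalize_price(raw) == expected
```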
1608
+
1609
+ **`test_reward.py`**
1610
+ - Correct extraction event → reward > 0
1611
+ - Redundant extraction → reward < 0
1612
+ - Navigation loop → cumulative negative reward
1613
+ - SUBMIT with perfect answer → large positive reward
1614
+
1615
+ **`test_environment.py`**
1616
+ - `reset()` returns clean state with step_number=0
1617
+ - `state()` after 3 steps returns step_number=3
1618
+ - Budget exhaustion terminates episode
1619
+ - Same seed produces identical HTML
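The determinism check ("same seed produces identical HTML") can be written as below; `render_page` here is a stand-in for the environment's seeded generator, not its real implementation:

```python
import random

def render_page(seed: int) -> str:
    """Stand-in for the environment's seeded page generator."""
    rng = random.Random(seed)
    price = rng.randint(10, 99)
    return f"<div class='product'><span class='price'>${price}.00</span></div>"

def test_same_seed_same_html():
    # Re-seeding must reproduce the page byte-for-byte
    assert render_page(42) == render_page(42)

def test_different_seeds_can_differ():
    pages = {render_page(s) for s in range(20)}
    assert len(pages) > 1
```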
1620
+
1621
+ ### Integration Tests
1622
+
1623
+ **`test_api.py`**
1624
+ - Full episode run via HTTP for each task
1625
+ - `/baseline` endpoint completes without error
1626
+ - `/grader` returns score in [0.0, 1.0]
1627
+ - Invalid episode_id returns 404
1628
+
1629
+ ### Validation
1630
+
1631
+ ```bash
1632
+ openenv validate .
1633
+ ```
1634
+
1635
+ Expected: All checks pass, spec compliance confirmed.
1636
+
1637
+ ---
1638
+
1639
+ ## 17. Known Limitations & Future Work
1640
+
1641
+ | Limitation | Impact | Future Fix |
1642
+ |---|---|---|
1643
+ | HTML truncated to 8,000 chars | Very long pages lose content | Configurable window + scrolling action |
1644
+ | No JavaScript rendering simulation | JS-heavy sites not fully modeled | Add iframe/shadow DOM simulation |
1645
+ | Single in-memory episode store | Not horizontally scalable | Redis-backed episode store |
1646
+ | English-only pages | Non-English scraping not tested | Multilingual page templates |
1647
+ | Fixed set of 3 tasks | Limited evaluation breadth | Procedural task generation with task_level param |
1648
+ | No rate limiting simulation in easy/medium | Less realistic for those tiers | Progressive rate limiting across difficulty |
1649
+
1650
+ ---
1651
+
1652
+ *End of Software Design Document*
1653
+
1654
+ *WebScraper-OpenEnv — OpenEnv Round 1 Submission*
docs/agents.md ADDED
@@ -0,0 +1,204 @@
1
+ # Agents System Design
2
+
3
+ ## Overview
4
+
5
+ The agent runtime is a multi-agent, memory-aware RL orchestration layer for web extraction tasks. It supports:
6
+
7
+ - Single-agent and multi-agent execution modes
8
+ - Strategy selection (`search-first`, `direct-extraction`, `multi-hop-reasoning`)
9
+ - Human-in-the-loop intervention
10
+ - Explainable decision traces
11
+ - Self-improvement from past episodes
12
+
13
+ ## Agent Roles
14
+
15
+ ### 1. Planner Agent
16
+
17
+ Builds a plan before action:
18
+
19
+ - Goal decomposition
20
+ - Tool selection plan
21
+ - Risk and fallback path
22
+
23
+ ### 2. Navigator Agent
24
+
25
+ Explores pages and search results:
26
+
27
+ - URL prioritization
28
+ - Link traversal policy
29
+ - Page relevance scoring
30
+
31
+ ### 3. Extractor Agent
32
+
33
+ Extracts structured fields:
34
+
35
+ - Selector and schema inference
36
+ - Adaptive chunk extraction
37
+ - Long-page batch processing
38
+
39
+ ### 4. Verifier Agent
40
+
41
+ Checks consistency and trust:
42
+
43
+ - Cross-source verification
44
+ - Conflict resolution
45
+ - Confidence calibration
46
+
47
+ ### 5. Memory Agent
48
+
49
+ Manages memory write/read/search:
50
+
51
+ - Episode summaries
52
+ - Pattern persistence
53
+ - Retrieval ranking and pruning
54
+
55
+ ## Execution Modes
56
+
57
+ ### Single-Agent
58
+
59
+ One policy handles all actions.
60
+
61
+ Pros: low overhead, simple.
62
+ Cons: weaker specialization.
63
+
64
+ ### Multi-Agent
65
+
66
+ Coordinator delegates work:
67
+
68
+ 1. Planner emits execution graph
69
+ 2. Navigator discovers candidate pages
70
+ 3. Extractor parses and emits data
71
+ 4. Verifier validates outputs
72
+ 5. Memory Agent stores reusable patterns
73
+
74
+ Pros: modular, robust, scalable.
75
+ Cons: coordination overhead.
76
+
77
+ ## Agent Communication
78
+
79
+ Shared channels:
80
+
81
+ - `agent_messages`: async inter-agent messages
82
+ - `task_state`: current objective and progress
83
+ - `global_knowledge`: reusable facts and patterns
84
+
85
+ Message schema:
86
+
87
+ ```json
88
+ {
89
+ "message_id": "msg_123",
90
+ "from": "navigator",
91
+ "to": "extractor",
92
+ "type": "page_candidate",
93
+ "payload": {
94
+ "url": "https://site.com/p/123",
95
+ "relevance": 0.91
96
+ },
97
+ "timestamp": "2026-03-27T00:00:00Z"
98
+ }
99
+ ```
100
+
101
+ ## Decision Policy
102
+
103
+ Policy input includes:
104
+
105
+ - Observation
106
+ - Working memory context
107
+ - Retrieved long-term memory hits
108
+ - Tool registry availability
109
+ - Budget and constraints
110
+
111
+ Policy output includes:
112
+
113
+ - Next action
114
+ - Confidence
115
+ - Rationale
116
+ - Fallback action (optional)
117
+
118
+ ## Strategy Library
119
+
120
+ Built-in strategy templates:
121
+
122
+ - `search-first`: broad discovery then narrow extraction
123
+ - `direct-extraction`: immediate field extraction from target page
124
+ - `multi-hop-reasoning`: iterative search and verification
125
+ - `table-centric`: table-first parsing
126
+ - `form-centric`: forms and input structures prioritized
127
+
128
+ Strategy selection can be:
129
+
130
+ - Manual (user setting)
131
+ - Automatic (router based on task signature)
132
+
133
+ ## Self-Improving Agent Loop
134
+
135
+ After each episode:
136
+
137
+ 1. Compute reward breakdown
138
+ 2. Extract failed and successful patterns
139
+ 3. Update strategy performance table
140
+ 4. Store high-confidence selectors in long-term memory
141
+ 5. Penalize redundant navigation patterns
142
+
143
+ ## Explainable AI Mode
144
+
145
+ Each action can emit:
146
+
147
+ - Why this action was chosen
148
+ - Why alternatives were rejected
149
+ - Which memory/tool evidence was used
150
+
151
+ Example trace:
152
+
153
+ ```text
154
+ Action: EXTRACT_FIELD(price)
155
+ Why: Pattern "span.product-price" had 0.93 historical confidence on similar domains.
156
+ Alternatives rejected: ".price-box .value" (lower confidence 0.58), regex-only extraction (unstable on this layout).
157
+ ```
158
+
159
+ ## Human-in-the-Loop
160
+
161
+ Optional checkpoints:
162
+
163
+ - Approve/reject planned action
164
+ - Override selector/tool/model
165
+ - Force verification before submit
166
+
167
+ Intervention modes:
168
+
169
+ - `off`: fully autonomous
170
+ - `review`: pause on low-confidence steps
171
+ - `strict`: require approval on all submit/fetch/verify actions
172
+
173
+ ## Scenario Simulator Hooks
174
+
175
+ Agents can be tested against:
176
+
177
+ - Noisy HTML
178
+ - Missing fields
179
+ - Broken pagination
180
+ - Adversarial layouts
181
+ - Dynamic content with delayed rendering
182
+
183
+ Simulation metrics:
184
+
185
+ - Completion
186
+ - Recovery score
187
+ - Generalization score
188
+ - Cost and latency
189
+
190
+ ## APIs
191
+
192
+ - `POST /api/agents/run`
193
+ - `POST /api/agents/plan`
194
+ - `POST /api/agents/override`
195
+ - `GET /api/agents/state/{episode_id}`
196
+ - `GET /api/agents/trace/{episode_id}`
197
+
198
+ ## Dashboard Widgets
199
+
200
+ - Live thought stream
201
+ - Agent role timeline
202
+ - Inter-agent message feed
203
+ - Strategy performance chart
204
+ - Confidence and override panel
docs/api.md ADDED
@@ -0,0 +1,901 @@
1
+ # 🤖 Multi-Model API System
2
+
3
+ ## Table of Contents
4
+ 1. [Overview](#overview)
5
+ 2. [Supported Providers](#supported-providers)
6
+ 3. [Smart Model Router](#smart-model-router)
7
+ 4. [Model Ensemble](#model-ensemble)
8
+ 5. [Cost & Token Tracking](#cost--token-tracking)
9
+ 6. [Prompt Management](#prompt-management)
10
+ 7. [Configuration](#configuration)
11
+ 8. [API Reference](#api-reference)
12
+
13
+ ---
14
+
15
+ ## Overview
16
+
17
+ The **Multi-Model API System** provides a unified interface for interacting with multiple LLM providers (OpenAI, Anthropic, Google, Groq, etc.), enabling:
18
+
19
+ - **Flexibility:** Switch between models without code changes
20
+ - **Optimization:** Auto-route requests to the best model for each task
21
+ - **Cost Control:** Track spending and enforce budgets
22
+ - **Reliability:** Fallback to alternative models on failure
23
+ - **Experimentation:** A/B test prompts and models
24
+
25
+ ### Architecture
26
+
27
+ ```
28
+ ┌────────────────────────────────────────────────────────────────┐
29
+ │ Agent Request │
30
+ │ "Extract product price" │
31
+ └────────────────────────┬───────────────────────────────────────┘
32
+
33
+
34
+ ┌────────────────────────────────────────────────────────────────┐
35
+ │ Smart Model Router │
36
+ │ ┌──────────────────────────────────────────────────────────┐ │
37
+ │ │ Task Classifier: │ │
38
+ │ │ • Reasoning → GPT-4 / Claude │ │
39
+ │ │ • Fast extraction → Groq / Gemini Flash │ │
40
+ │ │ • Long context → Claude / GPT-4-32k │ │
41
+ │ │ • Cost-sensitive → Gemini / Groq │ │
42
+ │ └──────────────────────────────────────────────────────────┘ │
43
+ └────────────────────────┬───────────────────────────────────────┘
44
+
45
+ ┌───────────────┼───────────────┬───────────────┐
46
+ │ │ │ │
47
+ ▼ ▼ ▼ ▼
48
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
49
+ │ OpenAI │ │ Anthropic │ │ Google │ │ Groq │
50
+ │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │
51
+ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
52
+ │ │ │ │
53
+ ▼ ▼ ▼ ▼
54
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
55
+ │ gpt-4-turbo │ │ claude-3.5 │ │ gemini-pro │ │ llama-3-70b │
56
+ │ gpt-4o-mini │ │ claude-3 │ │ gemini-flash│ │ mixtral-8x7b│
57
+ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
58
+ ```
59
+
60
+ ---
61
+
62
+ ## Supported Providers
63
+
64
+ ### 1. OpenAI
65
+
66
+ **Models:**
67
+ - `gpt-4-turbo` - Best reasoning, multimodal
68
+ - `gpt-4o` - Fast GPT-4 variant
69
+ - `gpt-4o-mini` - Cost-effective, fast
70
+ - `gpt-3.5-turbo` - Legacy, cheap
71
+
72
+ **Capabilities:**
73
+ - Function calling
74
+ - JSON mode
75
+ - Vision (gpt-4-turbo, gpt-4o)
76
+ - 128k context (gpt-4-turbo)
77
+
78
+ **Configuration:**
79
+ ```python
80
+ {
81
+ "provider": "openai",
82
+ "api_key": "sk-...",
83
+ "organization": "org-...", # Optional
84
+ "models": {
85
+ "default": "gpt-4o-mini",
86
+ "reasoning": "gpt-4-turbo",
87
+ "fast": "gpt-4o-mini"
88
+ },
89
+ "parameters": {
90
+ "temperature": 0.7,
91
+ "max_tokens": 4096,
92
+ "timeout": 60
93
+ }
94
+ }
95
+ ```
96
+
97
+ ### 2. Anthropic (Claude)
98
+
99
+ **Models:**
100
+ - `claude-3-opus-20240229` - Most capable
101
+ - `claude-3-sonnet-20240229` - Balanced
102
+ - `claude-3-haiku-20240307` - Fast and cheap
103
+ - `claude-3-5-sonnet-20240620` - Latest, best
104
+
105
+ **Capabilities:**
106
+ - 200k context window
107
+ - Strong reasoning
108
+ - Excellent instruction following
109
+ - Tool use (function calling)
110
+
111
+ **Configuration:**
112
+ ```python
113
+ {
114
+ "provider": "anthropic",
115
+ "api_key": "sk-ant-...",
116
+ "models": {
117
+ "default": "claude-3-5-sonnet-20240620",
118
+ "reasoning": "claude-3-opus-20240229",
119
+ "fast": "claude-3-haiku-20240307"
120
+ },
121
+ "parameters": {
122
+ "temperature": 0.7,
123
+ "max_tokens": 4096,
124
+ "timeout": 90
125
+ }
126
+ }
127
+ ```
128
+
129
+ ### 3. Google (Gemini)
130
+
131
+ **Models:**
132
+ - `gemini-1.5-pro` - Best quality, 2M context
133
+ - `gemini-1.5-flash` - Fast, 1M context
134
+ - `gemini-1.0-pro` - Legacy
135
+
136
+ **Capabilities:**
137
+ - Massive context (1M-2M tokens)
138
+ - Multimodal (text, image, video, audio)
139
+ - Extremely cost-effective
140
+ - Function calling
141
+
142
+ **Configuration:**
143
+ ```python
144
+ {
145
+ "provider": "google",
146
+ "api_key": "AIza...",
147
+ "models": {
148
+ "default": "gemini-1.5-flash",
149
+ "reasoning": "gemini-1.5-pro",
150
+ "fast": "gemini-1.5-flash"
151
+ },
152
+ "parameters": {
153
+ "temperature": 0.7,
154
+ "max_output_tokens": 8192,
155
+ "timeout": 60
156
+ }
157
+ }
158
+ ```
159
+
160
+ ### 4. Groq
161
+
162
+ **Models:**
163
+ - `llama-3.1-405b` - Largest Llama
164
+ - `llama-3.1-70b-versatile` - Balanced
165
+ - `llama-3.1-8b-instant` - Ultra-fast
166
+ - `mixtral-8x7b-32768` - Good reasoning
167
+
168
+ **Capabilities:**
169
+ - **Extremely fast inference** (500+ tokens/sec)
170
+ - Free tier available
171
+ - Open-source models
172
+ - JSON mode
173
+
174
+ **Configuration:**
175
+ ```python
176
+ {
177
+ "provider": "groq",
178
+ "api_key": "gsk_...",
179
+ "models": {
180
+ "default": "llama-3.1-70b-versatile",
181
+ "reasoning": "llama-3.1-405b",
182
+ "fast": "llama-3.1-8b-instant"
183
+ },
184
+ "parameters": {
185
+ "temperature": 0.7,
186
+ "max_tokens": 8192,
187
+ "timeout": 30
188
+ }
189
+ }
190
+ ```
191
+
192
+ ### 5. Mistral AI
193
+
194
+ **Models:**
195
+ - `mistral-large-latest` - Best quality
196
+ - `mistral-medium-latest` - Balanced
197
+ - `mistral-small-latest` - Fast and cheap
198
+ - `mixtral-8x22b` - Open-source, strong
199
+
200
+ **Configuration:**
201
+ ```python
202
+ {
203
+ "provider": "mistral",
204
+ "api_key": "...",
205
+ "models": {
206
+ "default": "mistral-medium-latest",
207
+ "reasoning": "mistral-large-latest",
208
+ "fast": "mistral-small-latest"
209
+ }
210
+ }
211
+ ```
212
+
213
+ ### 6. Cohere
214
+
215
+ **Models:**
216
+ - `command-r-plus` - Best for RAG
217
+ - `command-r` - Balanced
218
+ - `command-light` - Fast
219
+
220
+ **Specialization:** RAG, embeddings, reranking
221
+
222
+ ### 7. Perplexity
223
+
224
+ **Models:**
225
+ - `pplx-70b-online` - Web-connected
226
+ - `pplx-7b-online` - Fast, web-connected
227
+
228
+ **Specialization:** Real-time web search and citations
229
+
230
+ ### 8. Together AI
231
+
232
+ **Models:** 50+ open-source models
233
+ - Llama variants
234
+ - Mistral variants
235
+ - Code models (CodeLlama, StarCoder)
236
+
237
+ **Use Case:** Access to latest open-source models
238
+
239
+ ### 9. Custom / Self-Hosted
240
+
241
+ **Supported:**
242
+ - **Ollama** (local models)
243
+ - **vLLM** (self-hosted inference)
244
+ - **LM Studio** (local GUI)
245
+ - **LocalAI** (OpenAI-compatible local server)
246
+
247
+ **Configuration:**
248
+ ```python
249
+ {
250
+ "provider": "custom",
251
+ "base_url": "http://localhost:11434/v1", # Ollama
252
+ "api_key": "not-needed",
253
+ "models": {
254
+ "default": "llama3:70b",
255
+ "fast": "llama3:8b"
256
+ }
257
+ }
258
+ ```
259
+
260
+ ---
261
+
262
+ ## Smart Model Router
263
+
264
+ The **Smart Model Router** automatically selects the best model for each request based on task characteristics.
265
+
266
+ ### Routing Strategy
267
+
268
+ ```python
269
+ class ModelRouter:
270
+ def route(self, task: Task, context: Dict) -> ModelConfig:
271
+ """Select the best model for this task."""
272
+
273
+ # 1. Explicit user preference
274
+ if context.get("preferred_model"):
275
+ return self.get_model(context["preferred_model"])
276
+
277
+ # 2. Task-based routing
278
+ if task.type == "reasoning":
279
+ return self.route_reasoning(task, context)
280
+ elif task.type == "extraction":
281
+ return self.route_extraction(task, context)
282
+ elif task.type == "classification":
283
+ return self.route_classification(task, context)
284
+
285
+ # 3. Fallback to default
286
+ return self.default_model
287
+
288
+ def route_reasoning(self, task: Task, context: Dict) -> ModelConfig:
289
+ """Route complex reasoning tasks."""
290
+ # Long context? Use Claude or Gemini
291
+ if context.get("input_tokens", 0) > 50000:
292
+ return self.get_model("claude-3-5-sonnet") # 200k context
293
+
294
+ # Need reliability? Use GPT-4 or Claude
295
+ if task.importance == "high":
296
+ return self.get_model("gpt-4-turbo")
297
+
298
+ # Cost-sensitive? Use Gemini or Groq
299
+ if context.get("budget_mode"):
300
+ return self.get_model("gemini-1.5-flash")
301
+
302
+ return self.get_model("claude-3-5-sonnet") # Default for reasoning
303
+
304
+ def route_extraction(self, task: Task, context: Dict) -> ModelConfig:
305
+ """Route simple extraction tasks."""
306
+ # Speed critical? Use Groq
307
+ if context.get("latency_critical"):
308
+ return self.get_model("llama-3.1-70b-versatile", provider="groq")
309
+
310
+ # Cost-sensitive? Use Gemini Flash or Groq
311
+ return self.get_model("gemini-1.5-flash")
312
+ ```
313
+
314
+ ### Routing Rules
315
+
316
+ | Task Type | Input Size | Priority | Recommended Model | Reason |
317
+ |-----------|-----------|----------|-------------------|--------|
318
+ | Reasoning | Any | High | `gpt-4-turbo` | Best quality |
319
+ | Reasoning | >50k tokens | Any | `claude-3-5-sonnet` | 200k context |
320
+ | Reasoning | Any | Budget | `gemini-1.5-flash` | Cheap, good quality |
321
+ | Extraction | <10k tokens | Speed | `groq/llama-3.1-70b` | 500+ tok/sec |
322
+ | Extraction | Any | Budget | `gpt-4o-mini` | $0.15/1M tokens |
323
+ | Classification | <5k tokens | Any | `groq/llama-3.1-8b` | Ultra-fast |
324
+ | Long Context | >100k tokens | Any | `gemini-1.5-pro` | 2M context |
325
+ | Vision | Images | Any | `gpt-4o` | Best multimodal |
326
+ | Web Search | Any | Any | `perplexity` | Web-connected |
327
+
328
+ ### Configuration
329
+
330
+ ```python
331
+ class RouterConfig(BaseModel):
332
+ enabled: bool = True
333
+ strategy: Literal["task_based", "cost_optimized", "speed_optimized", "quality_optimized"]
334
+
335
+ # Task-based routing rules
336
+ routing_rules: Dict[str, str] = {
337
+ "reasoning_high_priority": "gpt-4-turbo",
338
+ "reasoning_budget": "gemini-1.5-flash",
339
+ "extraction_fast": "groq/llama-3.1-70b",
340
+ "extraction_accurate": "claude-3-5-sonnet",
341
+ "long_context": "gemini-1.5-pro",
342
+ "vision": "gpt-4o"
343
+ }
344
+
345
+ # Fallback chain
346
+ fallback_order: List[str] = [
347
+ "claude-3-5-sonnet",
348
+ "gpt-4o-mini",
349
+ "gemini-1.5-flash",
350
+ "groq/llama-3.1-70b"
351
+ ]
352
+
353
+ # Auto-retry on failure
354
+ auto_retry: bool = True
355
+ max_retries: int = 3
356
+ ```
357
+
358
+ ---
359
+
360
+ ## Model Ensemble
361
+
362
+ **Model Ensemble** runs multiple models in parallel and merges their outputs for higher quality or consensus.
363
+
364
+ ### Ensemble Strategies
365
+
366
+ #### 1. Voting (Classification/Extraction)
367
+
368
+ Run 3+ models, take majority vote.
369
+
370
+ ```python
371
+ class VotingEnsemble:
372
+ async def predict(self, prompt: str, models: List[str]) -> Any:
373
+ """Run multiple models and vote on result."""
374
+ tasks = [self.call_model(model, prompt) for model in models]
375
+ results = await asyncio.gather(*tasks)
376
+
377
+ # Count votes
378
+ from collections import Counter
379
+ votes = Counter(results)
380
+ winner, count = votes.most_common(1)[0]
381
+
382
+ confidence = count / len(results)
383
+ return {
384
+ "result": winner,
385
+ "confidence": confidence,
386
+ "votes": dict(votes)
387
+ }
388
+
389
+ # Example: Extract price with 3 models
390
+ ensemble = VotingEnsemble()
391
+ result = await ensemble.predict(
392
+ prompt="Extract the product price: <html>...",
393
+ models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
394
+ )
395
+ # Result: {"result": "$49.99", "confidence": 1.0, "votes": {"$49.99": 3}}
396
+ ```
397
+
398
+ #### 2. Ranking (Quality Assessment)
399
+
400
+ Run multiple models, rank outputs by quality.
401
+
402
+ ```python
403
+ class RankingEnsemble:
404
+ async def generate(self, prompt: str, models: List[str]) -> List[Dict]:
405
+ """Generate with multiple models and rank by quality."""
406
+ tasks = [self.call_model(model, prompt) for model in models]
407
+ results = await asyncio.gather(*tasks)
408
+
409
+ # Score each result
410
+ scored_results = []
411
+ for model, output in zip(models, results):
412
+ score = self.quality_scorer.score(output, prompt)
413
+ scored_results.append({
414
+ "model": model,
415
+ "output": output,
416
+ "quality_score": score
417
+ })
418
+
419
+ # Sort by score
420
+ scored_results.sort(key=lambda x: x["quality_score"], reverse=True)
421
+ return scored_results
422
+
423
+ # Example: Generate reasoning with ranking
424
+ ensemble = RankingEnsemble()
425
+ results = await ensemble.generate(
426
+ prompt="Explain how to extract a price from HTML",
427
+ models=["gpt-4-turbo", "claude-3-5-sonnet", "gemini-1.5-pro"]
428
+ )
429
+ best_result = results[0] # Highest quality
430
+ ```
431
+
432
+ #### 3. Fusion (Merging Outputs)
433
+
434
+ Merge complementary outputs from multiple models.
435
+
436
+ ```python
437
+ class FusionEnsemble:
438
+ async def extract_structured(self, prompt: str, models: List[str]) -> Dict:
439
+ """Extract structured data with multiple models and merge."""
440
+ tasks = [self.call_model(model, prompt) for model in models]
441
+ results = await asyncio.gather(*tasks)
442
+
443
+ # Merge fields with confidence weighting
444
+ merged = {}
445
+ for field in self.extract_fields(results):
446
+ values = [r.get(field) for r in results if r.get(field)]
447
+ if not values:
448
+ continue
449
+
450
+ # Use most common value, or highest-confidence model's value
451
+ from collections import Counter
452
+ counts = Counter(values)
453
+ merged[field] = counts.most_common(1)[0][0]
454
+
455
+ return merged
456
+
457
+ # Example: Extract product data with fusion
458
+ ensemble = FusionEnsemble()
459
+ product = await ensemble.extract_structured(
460
+ prompt="Extract product details: <html>...",
461
+ models=["gpt-4o-mini", "gemini-1.5-flash", "claude-3-haiku"]
462
+ )
463
+ # Merges: {name: "...", price: "$X", rating: "Y" } from all models
464
+ ```
465
+
466
+ #### 4. Verification (Primary + Validator)
467
+
468
+ One model generates, another validates.
469
+
470
+ ```python
471
+ class VerificationEnsemble:
472
+ async def generate_and_verify(
473
+ self,
474
+ prompt: str,
475
+ generator_model: str,
476
+ validator_model: str
477
+ ) -> Dict:
478
+ """Generate with one model, verify with another."""
479
+ # Generate
480
+ output = await self.call_model(generator_model, prompt)
481
+
482
+ # Verify
483
+ verification_prompt = f"""
484
+ Original task: {prompt}
485
+ Generated output: {output}
486
+
487
+ Is this output correct and complete? Explain any issues.
488
+ """
489
+ verification = await self.call_model(validator_model, verification_prompt)
490
+
491
+ return {
492
+ "output": output,
493
+ "verification": verification,
494
+ "confidence": self.parse_confidence(verification)
495
+ }
496
+
497
+ # Example: Generate with Groq (fast), verify with Claude (accurate)
498
+ ensemble = VerificationEnsemble()
499
+ result = await ensemble.generate_and_verify(
500
+ prompt="Extract all product prices from this catalog page",
501
+ generator_model="groq/llama-3.1-70b",
502
+ validator_model="claude-3-5-sonnet"
503
+ )
504
+ ```
505
+
506
+ ### Ensemble Configuration
507
+
508
+ ```python
509
+ class EnsembleConfig(BaseModel):
510
+ enabled: bool = False # Off by default (costs more)
511
+ strategy: Literal["voting", "ranking", "fusion", "verification"]
512
+
513
+ # Model selection
514
+ models: List[str] = [] # If empty, router selects
515
+
516
+ # Voting settings
517
+ min_agreement: float = 0.67 # Require 67% agreement
518
+
519
+ # Ranking settings
520
+ quality_metric: Literal["coherence", "accuracy", "completeness"]
521
+
522
+ # Verification settings
523
+ generator_model: Optional[str] = None
524
+ validator_model: Optional[str] = None
525
+ ```
526
+
527
+ ---
528
+
529
+ ## Cost & Token Tracking
530
+
531
+ Track spending and token usage across all models.
532
+
533
+ ### Cost Tracker
534
+
535
+ ```python
536
+ class CostTracker:
537
+ # Pricing (as of March 2026, per 1M tokens)
538
+ PRICING = {
539
+ "gpt-4-turbo": {"input": 10.00, "output": 30.00},
540
+ "gpt-4o": {"input": 5.00, "output": 15.00},
541
+ "gpt-4o-mini": {"input": 0.15, "output": 0.60},
542
+ "claude-3-opus": {"input": 15.00, "output": 75.00},
543
+ "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
544
+ "claude-3-haiku": {"input": 0.25, "output": 1.25},
545
+ "gemini-1.5-pro": {"input": 3.50, "output": 10.50},
546
+ "gemini-1.5-flash": {"input": 0.35, "output": 1.05},
547
+ "groq/llama-3.1-70b": {"input": 0.59, "output": 0.79},
548
+ "groq/llama-3.1-8b": {"input": 0.05, "output": 0.08},
549
+ }
550
+
551
+ def calculate_cost(
552
+ self,
553
+ model: str,
554
+ input_tokens: int,
555
+ output_tokens: int
556
+ ) -> float:
557
+ """Calculate cost for this request."""
558
+ pricing = self.PRICING.get(model, {"input": 0, "output": 0})
559
+ cost = (
560
+ (input_tokens / 1_000_000) * pricing["input"] +
561
+ (output_tokens / 1_000_000) * pricing["output"]
562
+ )
563
+ return cost
564
+
565
+ def track_request(self, request: ModelRequest, response: ModelResponse):
566
+ """Track a model request."""
567
+ cost = self.calculate_cost(
568
+ model=request.model,
569
+ input_tokens=response.usage.prompt_tokens,
570
+ output_tokens=response.usage.completion_tokens
571
+ )
572
+
573
+ self.db.insert({
574
+ "timestamp": datetime.now(),
575
+ "model": request.model,
576
+ "input_tokens": response.usage.prompt_tokens,
577
+ "output_tokens": response.usage.completion_tokens,
578
+ "total_tokens": response.usage.total_tokens,
579
+ "cost_usd": cost,
580
+ "latency_ms": response.latency_ms,
581
+ "task_type": request.task_type,
582
+ "success": response.success
583
+ })
584
+ ```
585
+
586
+ ### Budget Enforcement
587
+
588
+ ```python
589
+ class BudgetEnforcer:
590
+ def __init__(self, daily_budget_usd: float):
591
+ self.daily_budget = daily_budget_usd
592
+ self.cost_tracker = CostTracker()
593
+
594
+ def check_budget(self) -> bool:
595
+ """Check if budget allows this request."""
596
+ today_cost = self.cost_tracker.get_today_cost()
597
+ return today_cost < self.daily_budget
598
+
599
+ async def call_with_budget(self, request: ModelRequest) -> ModelResponse:
600
+ """Make request only if budget allows."""
601
+ if not self.check_budget():
602
+ # Fallback to cheapest model
603
+ request.model = "groq/llama-3.1-8b-instant"
604
+ logger.warning(f"Budget exceeded, downgrading to {request.model}")
605
+
606
+ response = await self.call_model(request)
607
+ self.cost_tracker.track_request(request, response)
608
+ return response
609
+ ```
610
+
611
+ ### Token Usage Dashboard
612
+
613
+ **UI Display:**
614
+ ```
615
+ ┌──��───────────────────────────────────────────────────────────┐
616
+ │ Token Usage & Cost (Last 24h) │
617
+ ├──────────────────────────────────────────────────────────────┤
618
+ │ │
619
+ │ Total Tokens: 1,234,567 │
620
+ │ Total Cost: $12.34 │
621
+ │ Requests: 456 │
622
+ │ Avg Latency: 1.2s │
623
+ │ │
624
+ │ ┌────────────────────────────────────────────────────────┐ │
625
+ │ │ Cost by Model │ │
626
+ │ │ ████████████████████ gpt-4-turbo $6.50 (53%) │ │
627
+ │ │ ██████████ claude-3-5-sonnet $3.20 (26%) │ │
628
+ │ │ █████ gemini-1.5-flash $1.80 (15%) │ │
629
+ │ │ ██ groq/llama-3.1-70b $0.84 (6%) │ │
630
+ │ └────────────────────────────────────────────────────────┘ │
631
+ │ │
632
+ │ ┌────────────────────────────────────────────────────────┐ │
633
+ │ │ Token Usage by Model │ │
634
+ │ │ Model Input Output Total Cost │ │
635
+ │ │ gpt-4-turbo 123K 45K 168K $6.50 │ │
636
+ │ │ claude-3-5-sonnet 456K 89K 545K $3.20 │ │
637
+ │ │ gemini-1.5-flash 890K 234K 1124K $1.80 │ │
638
+ │ └────────────────────────────────────────────────────────┘ │
639
+ │ │
640
+ │ Budget: $12.34 / $20.00 (62% used) │
641
+ │ [█████████████████░░░░░░░░░░] │
642
+ │ │
643
+ │ ⚠️ Budget 80% threshold: Alert enabled │
644
+ │ │
645
+ └──────────────────────────────────────────────────────────────┘
646
+ ```
647

---

## Prompt Management

Manage, version, and A/B test prompts.

### Prompt Templates

```python
class PromptTemplate(BaseModel):
    template_id: str
    name: str
    template: str
    variables: List[str]
    version: int
    created_at: datetime
    performance_score: Optional[float] = None

class PromptManager:
    def get_template(self, template_id: str, version: Optional[int] = None) -> PromptTemplate:
        """Get prompt template by ID and version."""
        if version is None:
            return self.get_latest_version(template_id)
        return self.db.get(template_id, version)

    def render(self, template_id: str, variables: Dict, version: Optional[int] = None) -> str:
        """Render a template (optionally a specific version) with variables."""
        template = self.get_template(template_id, version)
        return template.template.format(**variables)

    def create_version(self, template_id: str, new_template: str) -> int:
        """Create new version of template."""
        current = self.get_template(template_id)
        new_version = current.version + 1

        self.db.insert(PromptTemplate(
            template_id=template_id,
            name=current.name,
            template=new_template,
            variables=current.variables,
            version=new_version,
            created_at=datetime.now()
        ))

        return new_version
```
694

### Example Templates

```python
# Extraction prompt
EXTRACTION_PROMPT = """
You are a web scraping agent. Extract the following fields from the HTML:

Target fields: {target_fields}

HTML content:
{html_content}

Return a JSON object with the extracted values. If a field is not found, use null.

Example output format:
{{
    "field1": "value1",
    "field2": "value2"
}}
"""

# Reasoning prompt
REASONING_PROMPT = """
You are analyzing a web page to plan your next extraction action.

Current goal: {goal}
Page URL: {url}
Available actions: {actions}
Previous attempts: {history}

Think step by step:
1. What information is most important for the goal?
2. What patterns do you see in the HTML structure?
3. Which action is most likely to succeed?
4. What could go wrong?

Provide your reasoning and then choose an action.
"""

# Register templates
prompt_manager = PromptManager()
prompt_manager.register("extraction_v1", EXTRACTION_PROMPT, ["target_fields", "html_content"])
prompt_manager.register("reasoning_v1", REASONING_PROMPT, ["goal", "url", "actions", "history"])
```
739

### A/B Testing

```python
import random

import numpy as np

class PromptABTest:
    def __init__(self, template_id: str, variants: List[int]):
        self.template_id = template_id
        self.variants = variants  # Version numbers
        self.results = {v: [] for v in variants}

    def get_variant(self) -> int:
        """Select variant (round-robin or random)."""
        return random.choice(self.variants)

    def track_result(self, variant: int, success: bool, score: float):
        """Track performance of a variant."""
        self.results[variant].append({"success": success, "score": score})

    def get_winner(self) -> int:
        """Determine which variant performs best."""
        avg_scores = {
            v: np.mean([r["score"] for r in results])
            for v, results in self.results.items()
            if results
        }
        return max(avg_scores, key=avg_scores.get)

# Run A/B test
test = PromptABTest("extraction_v1", variants=[1, 2, 3])

for episode in episodes:
    variant = test.get_variant()
    prompt = prompt_manager.render("extraction_v1", variables, version=variant)
    result = await model.generate(prompt)
    test.track_result(variant, result.success, result.score)

winner = test.get_winner()
print(f"Best variant: v{winner}")
```
778

---

## Configuration

### Settings Panel

```python
class APISettings(BaseModel):
    # Provider configurations
    providers: Dict[str, ProviderConfig] = {}

    # Default model
    default_model: str = "gpt-4o-mini"

    # Smart routing
    router: RouterConfig = RouterConfig()

    # Ensemble
    ensemble: EnsembleConfig = EnsembleConfig()

    # Cost control
    daily_budget_usd: float = 20.00
    alert_threshold: float = 0.8  # Alert at 80% of budget

    # Rate limiting
    max_requests_per_minute: int = 60

    # Retry policy
    max_retries: int = 3
    retry_delay_seconds: int = 2

    # Prompt management
    prompt_templates: Dict[str, str] = {}
```
813

**UI Example:**
```
┌────────────────────────────────────────────────────────────┐
│ API Settings                                               │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Model Providers:                                           │
│ ┌─────────────────────────────────────────────────────┐    │
│ │ ☑ OpenAI                                            │    │
│ │   API Key: [sk-proj-••••••••••••••••]  [Test]       │    │
│ │   Default: [gpt-4o-mini ▼]                          │    │
│ │                                                     │    │
│ │ ☑ Anthropic                                         │    │
│ │   API Key: [sk-ant-••••••••••••••••]  [Test]        │    │
│ │   Default: [claude-3-5-sonnet ▼]                    │    │
│ │                                                     │    │
│ │ ☑ Google                                            │    │
│ │   API Key: [AIza••••••••••••••••••••]  [Test]       │    │
│ │   Default: [gemini-1.5-flash ▼]                     │    │
│ │                                                     │    │
│ │ ☑ Groq                                              │    │
│ │   API Key: [gsk_••••••••••••••••••••]  [Test]       │    │
│ │   Default: [llama-3.1-70b-versatile ▼]              │    │
│ │                                                     │    │
│ │ ☐ Mistral   [Configure]                             │    │
│ │ ☐ Cohere    [Configure]                             │    │
│ │ ☐ Custom    [Configure]                             │    │
│ └─────────────────────────────────────────────────────┘    │
│                                                            │
│ Smart Routing:                                             │
│   ☑ Enabled                                                │
│   Strategy: [Task-Based ▼]                                 │
│   Fallback: [claude → gpt-4o-mini → gemini → groq]         │
│                                                            │
│ Model Ensemble:                                            │
│   ☐ Enabled (increases cost)                               │
│   Strategy: [Voting ▼]                                     │
│   Models: [gpt-4o-mini, gemini-flash, groq/llama ▼]        │
│                                                            │
│ Cost Control:                                              │
│   Daily Budget: [$20.00]                                   │
│   Alert at: [80%] of budget                                │
│   Current Usage: $12.34 / $20.00 (62%)                     │
│                                                            │
│ [Save Settings]  [Reset to Defaults]                       │
└────────────────────────────────────────────────────────────┘
```
861

---

## API Reference

### Python Client

```python
from webscraper_env import MultiModelAPI

# Initialize with config
api = MultiModelAPI(settings=APISettings())

# Simple generation
response = await api.generate(
    prompt="Extract product price from: <html>...",
    model="gpt-4o-mini"  # Optional, uses router if omitted
)

# With routing
response = await api.generate(
    prompt="Complex reasoning task...",
    task_type="reasoning",  # Router selects best model
    priority="high"
)

# With ensemble
response = await api.generate_ensemble(
    prompt="Extract all prices",
    strategy="voting",
    models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
)

# Streaming
async for chunk in api.generate_stream(prompt="...", model="claude-3-5-sonnet"):
    print(chunk.text, end="", flush=True)
```

---

**Next:** See [mcp.md](./mcp.md) for MCP server integration.
docs/architecture.md ADDED
@@ -0,0 +1,168 @@
# System Architecture

## Overview

WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing.

## High-Level Topology

```text
Frontend Dashboard (React/Vite)
        |
        v
FastAPI Control Plane
  - episode lifecycle
  - action dispatch
  - reward engine
  - tool registry API
  - settings + policy
        |
        +--> Agent Runtime
        |      - planner/navigator/extractor/verifier
        |      - memory manager
        |      - model router
        |
        +--> MCP Gateway
        |      - tool discovery
        |      - lazy install/load
        |      - schema + timeout + retries
        |
        +--> Search Layer
        |      - provider routing
        |      - query optimization
        |      - credibility scoring
        |
        +--> Memory Layer
        |      - short/working/long/shared
        |      - vector index + persistent storage
        |
        +--> Observability
               - traces/logs/metrics/cost dashboard
```

## Core Subsystems

### 1. Control Plane

Responsibilities:

- reset/step/state APIs
- request validation
- action authorization and policy checks
- deterministic episode management

### 2. Agent Runtime

Responsibilities:

- policy inference
- strategy execution
- fallback handling
- action explainability

### 3. Tooling Plane (MCP)

Responsibilities:

- dynamic tool registry
- server health checks
- lazy installation
- composition workflows

### 4. Data Plane

Responsibilities:

- HTML ingestion and chunking
- extraction and normalization
- verification and reconciliation
- output persistence

### 5. Analytics Plane

Responsibilities:

- reward component logging
- model/token/cost accounting
- tool usage telemetry
- memory quality analytics

## Processing Pipeline

1. `reset(task_id, seed)`
2. observation emitted
3. policy selects action
4. action executes (native/MCP/search/memory)
5. reward computed and logged
6. done check
7. repeat until terminal

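The pipeline above can be sketched as a driver loop. This is an illustrative sketch only: `env`, `policy`, and the step return shape are assumptions standing in for the control-plane API, not its actual signatures.

```python
def run_episode(env, policy, task_id: str, seed: int = 0, max_steps: int = 50):
    """Drive one episode through the reset/step pipeline (illustrative sketch)."""
    obs = env.reset(task_id, seed)                   # 1. reset -> 2. observation emitted
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy.select(obs)                  # 3. policy selects action
        obs, reward, done, info = env.step(action)   # 4. execute, 5. reward computed/logged
        total_reward += reward
        steps += 1
        if done:                                     # 6. done check
            break                                    # 7. otherwise repeat until terminal
    return total_reward, steps
```

The `max_steps` cap mirrors the per-step safety budget described under Reliability below.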
## Batch and Parallel Design

### Batch

- large HTML split into semantic chunks
- chunk extraction batched with bounded size
- merge + dedupe + confidence rank

### Parallel

- independent chunk tasks run concurrently
- search and verification can run in parallel branches
- configurable worker limits and queue priorities

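A minimal sketch of bounded-concurrency chunk processing with `asyncio`; the `extract_chunk` coroutine and the worker limit are illustrative assumptions, not the runtime's real API.

```python
import asyncio

async def process_chunks(chunks, extract_chunk, max_workers: int = 4):
    """Run independent chunk tasks concurrently, capped by a worker limit."""
    sem = asyncio.Semaphore(max_workers)   # configurable worker limit

    async def bounded(chunk):
        async with sem:                    # at most max_workers tasks in flight
            return await extract_chunk(chunk)

    # gather preserves input order, so results line up with chunks
    return await asyncio.gather(*(bounded(c) for c in chunks))
```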
## Queue and Scheduler

The task queue supports:

- priority classes (`high`, `normal`, `low`)
- cancellation tokens
- retry policy with backoff
- dead-letter queue for repeated failures

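A stdlib sketch of those four properties combined (minus cancellation tokens, omitted for brevity); class and field names here are illustrative, not the actual scheduler.

```python
import heapq
import itertools

PRIORITY = {"high": 0, "normal": 1, "low": 2}

class TaskQueue:
    """Priority queue with retry/backoff and a dead-letter queue (sketch)."""

    def __init__(self, max_retries: int = 3):
        self._heap, self._seq = [], itertools.count()  # seq breaks priority ties FIFO
        self.max_retries = max_retries
        self.dead_letter = []

    def put(self, task, priority: str = "normal", attempt: int = 0):
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._seq), attempt, task))

    def run(self, worker):
        while self._heap:
            _prio, _, attempt, task = heapq.heappop(self._heap)
            try:
                worker(task)
            except Exception:
                if attempt + 1 >= self.max_retries:
                    self.dead_letter.append(task)      # repeated failures end here
                else:
                    delay = 2 ** attempt               # exponential backoff; a real
                    del delay                          # scheduler would wait before requeue
                    self.put(task, "low", attempt + 1)
```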
## Storage Architecture

- Episode state: in-memory + optional persistence
- Long-term memory: vector DB + metadata store
- Logs/metrics: append-only time-series-friendly sink
- Exports: JSON/CSV trace packs

## Reliability

- per-tool timeout and retry
- per-step safety budget
- circuit breaker for failing providers
- deterministic fallback chains

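The circuit-breaker behavior can be sketched as follows; the threshold/cooldown values and method names are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Trip after N consecutive failures; allow a retry after a cooldown (sketch)."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                # circuit closed: call through
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0    # half-open: probe the provider
            return True
        return False                                   # circuit open: use fallback chain

    def record(self, success: bool):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()      # open the circuit
```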
## Security

- API key vaulting via env/config secrets
- MCP allowlist
- output sanitization
- redaction of sensitive tokens in logs

## Deployment

Single-container baseline:

- frontend static build served by the API backend
- optional sidecars for DB/vector/MCP infra

Scale-out profile:

- separate API and worker pools
- managed vector DB
- queue-backed distributed execution
- central observability backend

## Compatibility Goals

- local dev mode with minimal dependencies
- cloud mode with managed infra
- optional self-hosted LLM endpoints

## Future Architecture Extensions

- distributed multi-agent graph execution
- adaptive autoscaling by queue pressure
- global memory federation across projects
docs/features.md ADDED
@@ -0,0 +1,104 @@
# Advanced Features

## Overview

This document captures high-end platform capabilities beyond baseline extraction.

## 1) Self-Improving Agent

Post-episode learning loop:

- classify failures by root cause
- update selector/tool strategy priors
- persist successful patterns with confidence
- penalize repeated failure paths

## 2) Strategy Library

Built-in strategies:

- Search-first
- Direct extraction
- Multi-hop reasoning
- Verification-first
- Table-first

Each strategy tracks:

- win rate
- cost per success
- average latency
- domain affinity

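The per-strategy bookkeeping could be as simple as the sketch below; the field and method names are illustrative assumptions, not the library's real schema.

```python
from dataclasses import dataclass, field

@dataclass
class StrategyStats:
    """Per-strategy tracking sketch: win rate, cost per success, latency, domain affinity."""
    episodes: int = 0
    wins: int = 0
    total_cost_usd: float = 0.0
    total_latency_s: float = 0.0
    domain_wins: dict = field(default_factory=dict)   # domain -> win count (affinity signal)

    def record(self, domain: str, win: bool, cost_usd: float, latency_s: float):
        self.episodes += 1
        self.wins += int(win)
        self.total_cost_usd += cost_usd
        self.total_latency_s += latency_s
        if win:
            self.domain_wins[domain] = self.domain_wins.get(domain, 0) + 1

    @property
    def win_rate(self) -> float:
        return self.wins / self.episodes if self.episodes else 0.0

    @property
    def cost_per_success(self) -> float:
        return self.total_cost_usd / self.wins if self.wins else float("inf")
```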
## 3) Explainable AI Mode

For every decision, provide:

- selected action and confidence
- top alternatives considered
- evidence from memory/tools/search
- expected reward impact

## 4) Human-in-the-Loop

Intervention controls:

- approve/reject action
- force tool/model switch
- enforce verification before submit
- set hard constraints during runtime

## 5) Scenario Simulator

Stress-testing scenarios:

- noisy HTML
- broken DOM
- pagination traps
- conflicting facts
- anti-scraping patterns

Outputs:

- robustness score
- recovery score
- strategy suitability map

## 6) Context Compression

- rolling summaries
- salience-based pruning
- token-aware context packing
- differential memory refresh

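Token-aware context packing can be sketched as a greedy fill: keep the most salient items that still fit the budget. The whitespace token counter below is a stand-in assumption for a real tokenizer.

```python
def pack_context(items, max_tokens: int, count_tokens=lambda s: len(s.split())):
    """Greedy token-aware packing: most salient items first, skip what doesn't fit.

    `items` are (salience, text) pairs; higher salience wins budget first.
    """
    packed, used = [], 0
    for salience, text in sorted(items, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            packed.append(text)
            used += cost
    return packed
```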
## 7) Batch + Parallel Runtime

- task queue with priorities
- parallel extraction workers
- bounded concurrency
- idempotent retry handling

## 8) Prompt Versioning and Evaluation

- versioned prompt templates
- A/B testing by task type
- reward/cost comparison dashboards
- rollout and rollback controls

## 9) MCP Toolchain Composition

Composable flow examples:

- Browser MCP -> Parser MCP -> Validator MCP -> DB MCP
- Search MCP -> Fetch MCP -> Extract MCP -> Verify MCP

## 10) Governance and Safety

- tool allowlist/denylist
- PII redaction in logs
- budget and rate guardrails
- provenance tracking for extracted facts

## Feature Flags

All advanced features should be toggleable from Settings and safely disabled by default where cost/latency impact is high.
docs/html-processing.md ADDED
@@ -0,0 +1,739 @@
# 🌐 HTML Processing Engine

## Table of Contents
1. [Overview](#overview)
2. [Semantic Understanding](#semantic-understanding)
3. [Content Classification](#content-classification)
4. [Smart Extraction](#smart-extraction)
5. [Adaptive Chunking](#adaptive-chunking)
6. [Batch Processing](#batch-processing)
7. [Diff-Based Updates](#diff-based-updates)
8. [Schema Detection](#schema-detection)

---

## Overview

The **HTML Processing Engine** provides advanced capabilities for understanding, parsing, and extracting data from complex web pages.

### Challenges

Modern web pages are challenging:
- **Size:** 1MB+ of HTML
- **Complexity:** Nested divs, dynamic IDs, inline styles
- **Noise:** Ads, tracking scripts, navigation repeated on every page
- **Inconsistency:** The same site uses different structures across pages
- **Obfuscation:** Anti-scraping measures (randomized classes, etc.)

### Solution

Our engine provides:
- ✅ **Semantic understanding** of page structure
- ✅ **Content classification** (main content vs noise)
- ✅ **Smart extraction** with pattern recognition
- ✅ **Adaptive chunking** for large pages
- ✅ **Batch processing** with deduplication
- ✅ **Diff-based updates** for paginated content

---

## Semantic Understanding

### Architecture

```python
class SemanticHTMLAnalyzer:
    """Understands page structure at a semantic level."""

    def analyze(self, html: str) -> SemanticStructure:
        """Analyze HTML and identify semantic regions."""
        soup = BeautifulSoup(html, 'lxml')

        structure = SemanticStructure()
        structure.header = self.detect_header(soup)
        structure.navigation = self.detect_navigation(soup)
        structure.main_content = self.detect_main_content(soup)
        structure.sidebar = self.detect_sidebar(soup)
        structure.footer = self.detect_footer(soup)
        structure.ads = self.detect_ads(soup)
        structure.forms = self.detect_forms(soup)
        structure.tables = self.detect_tables(soup)
        structure.lists = self.detect_lists(soup)
        structure.product_cards = self.detect_product_cards(soup)

        return structure
```

### Semantic Regions

#### 1. Header Detection

```python
def detect_header(self, soup: BeautifulSoup) -> Optional[Tag]:
    """Detect the page header."""
    # Try semantic tags first
    header = soup.find('header')
    if header:
        return header

    # Try common patterns
    candidates = soup.find_all(['div', 'section'], class_=re.compile(r'header|top|banner', re.I))
    if candidates:
        # Pick the topmost element
        return min(candidates, key=lambda el: self.get_vertical_position(el))

    # Fallback: first div with logo + navigation
    for div in soup.find_all('div'):
        has_logo = div.find(['img', 'svg'], class_=re.compile(r'logo', re.I))
        has_nav = div.find(['nav', 'ul'], class_=re.compile(r'menu|nav', re.I))
        if has_logo and has_nav:
            return div

    return None
```

#### 2. Main Content Detection

```python
def detect_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
    """Detect the main content area (most important for extraction)."""
    # Try semantic tags
    main = soup.find('main')
    if main:
        return main

    article = soup.find('article')
    if article:
        return article

    # Content-scoring approach
    candidates = soup.find_all(['div', 'section'])
    scored = []

    for candidate in candidates:
        score = 0

        # More text = higher score
        text_length = len(candidate.get_text(strip=True))
        score += text_length * 0.1

        # Has article/main role
        if candidate.get('role') in ['main', 'article']:
            score += 100

        # Common content class names
        if candidate.get('class'):
            classes = ' '.join(candidate.get('class'))
            if re.search(r'content|main|article|post|product', classes, re.I):
                score += 50

        # Penalize if it contains nav/aside
        if candidate.find(['nav', 'aside']):
            score -= 30

        scored.append((candidate, score))

    if scored:
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[0][0]

    return None
```

#### 3. Product Card Detection

```python
def detect_product_cards(self, soup: BeautifulSoup) -> List[Tag]:
    """Detect product cards in e-commerce listings."""
    cards = []

    # Pattern 1: Schema.org markup
    cards.extend(soup.find_all(itemtype=re.compile(r'schema\.org/Product')))

    # Pattern 2: Common class patterns
    class_patterns = [
        r'product[-_]card',
        r'product[-_]item',
        r'product[-_]box',
        r'item[-_]card',
        r'listing[-_]item'
    ]

    for pattern in class_patterns:
        cards.extend(soup.find_all(class_=re.compile(pattern, re.I)))

    # Pattern 3: Structural detection
    # Look for repeated elements with image + title + price
    candidates = soup.find_all(['div', 'article', 'li'])

    for candidate in candidates:
        has_image = candidate.find('img')
        has_title = candidate.find(['h1', 'h2', 'h3', 'h4'], class_=re.compile(r'title|name', re.I))
        has_price = candidate.find(class_=re.compile(r'price', re.I))

        if has_image and has_title and has_price:
            cards.append(candidate)

    # Deduplicate while preserving document order
    seen_ids, unique = set(), []
    for card in cards:
        if id(card) not in seen_ids:
            seen_ids.add(id(card))
            unique.append(card)
    return unique
```

---

## Content Classification

### Classifier

```python
class ContentClassifier:
    """Classify HTML elements by type."""

    CATEGORIES = [
        'navigation',
        'header',
        'footer',
        'sidebar',
        'main_content',
        'advertisement',
        'product_listing',
        'product_detail',
        'form',
        'table',
        'pagination',
        'breadcrumb',
        'comment_section',
        'related_items'
    ]

    def classify_element(self, element: Tag) -> str:
        """Classify a single element."""
        features = self.extract_features(element)
        return self.model.predict(features)

    def extract_features(self, element: Tag) -> Dict:
        """Extract features for classification."""
        return {
            'tag_name': element.name,
            'class_names': element.get('class', []),
            'id': element.get('id', ''),
            'role': element.get('role', ''),
            'text_length': len(element.get_text(strip=True)),
            'link_density': self.calculate_link_density(element),
            'has_images': bool(element.find('img')),
            'has_forms': bool(element.find('form')),
            'position': self.get_vertical_position(element),
            'parent_classes': element.parent.get('class', []) if element.parent else [],
            'children_count': len(element.find_all(recursive=False)),
            'schema_type': element.get('itemtype', '')
        }
```
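
The `link_density` feature above relies on a helper that is not shown. A minimal stdlib sketch of what such a helper might look like follows; it operates on a raw HTML fragment rather than a parsed element, and the 0–1 "link text over total text" contract is an assumption.

```python
from html.parser import HTMLParser

class _LinkDensity(HTMLParser):
    """Accumulate total text length and the portion of it inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth_in_a = 0
        self.total = 0
        self.in_links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth_in_a += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth_in_a:
            self.depth_in_a -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.depth_in_a:
            self.in_links += n

def calculate_link_density(html_fragment: str) -> float:
    """Ratio of link text to all text: 0.0 = no links, 1.0 = all text is links."""
    parser = _LinkDensity()
    parser.feed(html_fragment)
    return parser.in_links / parser.total if parser.total else 0.0
```

High link density is a strong navigation/footer signal; main content tends to score low.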

### Classification Rules

```python
def classify_by_rules(self, element: Tag) -> Optional[str]:
    """Rule-based classification (fast, deterministic)."""

    # Navigation
    if element.name == 'nav':
        return 'navigation'

    if any('nav' in str(c) for c in element.get('class', [])):
        return 'navigation'

    # Header
    if element.name == 'header':
        return 'header'

    # Footer
    if element.name == 'footer':
        return 'footer'

    # Advertisement (common patterns)
    ad_patterns = ['ad', 'advertisement', 'sponsored', 'promo']
    classes = ' '.join(element.get('class', []))
    if any(pattern in classes.lower() for pattern in ad_patterns):
        return 'advertisement'

    # Product detail
    if element.get('itemtype') == 'http://schema.org/Product':
        return 'product_detail'

    # Form
    if element.name == 'form' or element.find('form'):
        return 'form'

    # Table
    if element.name == 'table':
        return 'table'

    return None
```

---

## Smart Extraction

### Pattern-Based Extraction

```python
class SmartExtractor:
    """Intelligently extract data based on field semantics."""

    def extract(self, html: str, field_name: str) -> ExtractionResult:
        """Extract a field using multiple strategies, most reliable first."""
        soup = BeautifulSoup(html, 'lxml')

        # Strategy 1: Schema.org markup
        result = self.extract_from_schema(soup, field_name)
        if result:
            return result

        # Strategy 2: OpenGraph / meta tags
        result = self.extract_from_meta(soup, field_name)
        if result:
            return result

        # Strategy 3: Pattern matching
        result = self.extract_by_pattern(soup, field_name)
        if result:
            return result

        # Strategy 4: ML-based extraction
        result = self.extract_by_ml(soup, field_name)
        if result:
            return result

        return ExtractionResult(value=None, confidence=0.0)
```

### Field-Specific Patterns

```python
EXTRACTION_PATTERNS = {
    'price': {
        'regexes': [
            r'\$\s*\d+[.,]\d{2}',   # $49.99
            r'€\s*\d+[.,]\d{2}',    # €49,99
            r'£\s*\d+[.,]\d{2}',    # £49.99
            r'\d+[.,]\d{2}\s*USD',  # 49.99 USD
        ],
        'css_selectors': [
            '[itemprop="price"]',
            '.price',
            '.product-price',
            'span.sale-price',
            'div.price-box span',
        ],
        'class_keywords': ['price', 'cost', 'sale', 'amount'],
        'text_indicators': ['$', '€', '£', 'USD', 'EUR', 'GBP']
    },

    'product_name': {
        'css_selectors': [
            '[itemprop="name"]',
            'h1.product-title',
            'h1.product-name',
            'div.product-info h1',
        ],
        'class_keywords': ['title', 'name', 'product-name'],
        'heading_tags': ['h1', 'h2']
    },

    'rating': {
        'regexes': [
            r'(\d+\.?\d*)\s*out of\s*5',
            r'(\d+\.?\d*)\s*/\s*5',
            r'(\d+\.?\d*)\s*stars?',
        ],
        'css_selectors': [
            '[itemprop="ratingValue"]',
            '.rating',
            '.star-rating',
            'span.rating-value',
        ],
        'class_keywords': ['rating', 'stars', 'score'],
    },

    'email': {
        'regexes': [
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        ],
        'css_selectors': [
            '[href^="mailto:"]',
            '[itemprop="email"]',
        ]
    },

    'phone': {
        'regexes': [
            r'\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}',              # US format
            r'\+\d{1,3}\s?\(?\d{1,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{3,4}',    # International
        ],
        'css_selectors': [
            '[href^="tel:"]',
            '[itemprop="telephone"]',
        ]
    }
}
```

### Confidence Scoring

```python
def score_extraction(self, value: Any, field_name: str, method: str) -> float:
    """Score extraction confidence."""
    confidence = 0.0

    # Method confidence
    method_confidence = {
        'schema.org': 0.95,
        'meta_tag': 0.90,
        'pattern_match': 0.70,
        'ml_model': 0.80,
        'class_name': 0.60
    }
    confidence += method_confidence.get(method, 0.5)

    # Value validation
    if field_name == 'price':
        if self.is_valid_price(value):
            confidence += 0.1
        else:
            confidence -= 0.3

    elif field_name == 'email':
        if self.is_valid_email(value):
            confidence += 0.1
        else:
            confidence = 0.0  # Invalid email

    # Context validation
    parent_context = self.get_parent_context(value)
    if field_name in parent_context:
        confidence += 0.1

    return min(confidence, 1.0)
```
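
The validators `is_valid_price` and `is_valid_email` are referenced but not defined. A minimal stdlib sketch of what they might look like, under the assumption that they perform purely syntactic checks:

```python
import re

def is_valid_price(value: str) -> bool:
    """Accept strings like '$49.99', '€49,99', '1,234.56', or '49.99 USD' (sketch)."""
    return bool(re.fullmatch(
        r'[\$€£]?\s*\d{1,3}(?:[ ,.]\d{3})*[.,]\d{2}(?:\s*(?:USD|EUR|GBP))?',
        value.strip()
    ))

def is_valid_email(value: str) -> bool:
    """Lightweight syntactic check; deliberately not a full RFC 5322 validator."""
    return bool(re.fullmatch(
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',
        value.strip()
    ))
```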

---

## Adaptive Chunking

### Chunking Strategy

```python
class AdaptiveChunker:
    """Split large HTML into processable chunks."""

    def chunk(self, html: str, max_size: int = 50000) -> List[Chunk]:
        """Split HTML intelligently."""
        soup = BeautifulSoup(html, 'lxml')

        if len(html) <= max_size:
            return [Chunk(html=html, type='full', index=0)]

        # Strategy 1: Split by semantic sections
        chunks = self.chunk_by_sections(soup, max_size)
        if chunks:
            return chunks

        # Strategy 2: Split by repeated elements (product cards)
        chunks = self.chunk_by_repeated_elements(soup, max_size)
        if chunks:
            return chunks

        # Strategy 3: Sliding window with overlap
        chunks = self.chunk_by_sliding_window(html, max_size, overlap=5000)
        return chunks

    def chunk_by_sections(self, soup: BeautifulSoup, max_size: int) -> List[Chunk]:
        """Split by major sections."""
        sections = soup.find_all(['article', 'section', 'div'], class_=re.compile(r'section|container', re.I))

        chunks = []
        current_chunk = ""
        current_index = 0

        for section in sections:
            section_html = str(section)

            if len(current_chunk) + len(section_html) > max_size:
                # Save current chunk
                if current_chunk:
                    chunks.append(Chunk(
                        html=current_chunk,
                        type='section',
                        index=current_index
                    ))
                    current_index += 1

                # Start new chunk
                current_chunk = section_html
            else:
                current_chunk += section_html

        # Add final chunk
        if current_chunk:
            chunks.append(Chunk(html=current_chunk, type='section', index=current_index))

        return chunks

    def chunk_by_repeated_elements(self, soup: BeautifulSoup, max_size: int) -> List[Chunk]:
        """Split by repeated elements (e.g., product cards)."""
        # Detect repeated pattern
        repeated = self.detect_repeated_elements(soup)

        if not repeated:
            return []

        chunks = []
        current_chunk = ""
        current_items = []
        current_index = 0

        for element in repeated:
            element_html = str(element)

            if len(current_chunk) + len(element_html) > max_size:
                # Save current chunk
                if current_chunk:
                    chunks.append(Chunk(
                        html=current_chunk,
                        type='repeated',
                        index=current_index,
                        item_count=len(current_items)
                    ))
                    current_index += 1

                # Start new chunk
                current_chunk = element_html
                current_items = [element]
            else:
                current_chunk += element_html
                current_items.append(element)

        # Add final chunk
        if current_chunk:
            chunks.append(Chunk(
                html=current_chunk,
                type='repeated',
                index=current_index,
                item_count=len(current_items)
            ))

        return chunks
```

---

## Batch Processing

### Parallel Processing

```python
class BatchProcessor:
    """Process large pages in parallel batches."""

    async def process_large_page(
        self,
        html: str,
        extraction_task: ExtractionTask
    ) -> List[Dict]:
        """Process a large page in parallel."""
        # 1. Chunk the HTML
        chunks = self.chunker.chunk(html)

        # 2. Process chunks in parallel
        tasks = [
            self.process_chunk(chunk, extraction_task)
            for chunk in chunks
        ]

        chunk_results = await asyncio.gather(*tasks)

        # 3. Merge and deduplicate results
        merged = self.merge_results(chunk_results)

        # 4. Cross-chunk validation
        validated = self.validate_across_chunks(merged, chunks)

        return validated

    async def process_chunk(
        self,
        chunk: Chunk,
        task: ExtractionTask
    ) -> List[Dict]:
        """Process a single chunk."""
        extractor = SmartExtractor()
        results = []

        for field in task.fields:
            result = extractor.extract(chunk.html, field)
            if result.value:
                results.append({
                    'field': field,
                    'value': result.value,
                    'confidence': result.confidence,
                    'chunk_index': chunk.index
                })

        return results

    def merge_results(self, chunk_results: List[List[Dict]]) -> List[Dict]:
        """Merge and deduplicate results from chunks."""
        merged = {}

        for chunk_result in chunk_results:
            for item in chunk_result:
                key = (item['field'], item['value'])

                if key in merged:
                    # Increase confidence if found in multiple chunks
                    merged[key]['confidence'] = max(
                        merged[key]['confidence'],
                        item['confidence']
                    )
                    merged[key]['chunk_count'] += 1
                else:
                    merged[key] = {
                        **item,
                        'chunk_count': 1
                    }

        return list(merged.values())
```
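
The dedup-and-boost logic in `merge_results` can be exercised on its own. A minimal standalone sketch with plain dicts (no `BatchProcessor` or extractor dependencies) shows how a value seen in two chunks is collapsed to one item that keeps the higher confidence:

```python
def merge_results(chunk_results):
    """Merge per-chunk extractions, keyed by (field, value).

    Duplicates found in several chunks keep the highest confidence
    and track how many chunks contained them.
    """
    merged = {}
    for chunk_result in chunk_results:
        for item in chunk_result:
            key = (item['field'], item['value'])
            if key in merged:
                merged[key]['confidence'] = max(
                    merged[key]['confidence'], item['confidence']
                )
                merged[key]['chunk_count'] += 1
            else:
                merged[key] = {**item, 'chunk_count': 1}
    return list(merged.values())

chunk_a = [{'field': 'price', 'value': '$49.99', 'confidence': 0.8, 'chunk_index': 0}]
chunk_b = [{'field': 'price', 'value': '$49.99', 'confidence': 0.9, 'chunk_index': 1}]
merged = merge_results([chunk_a, chunk_b])
# One deduplicated item with confidence 0.9, seen in 2 chunks
```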

---

## Diff-Based Updates

### Incremental Processing

```python
class DiffProcessor:
    """Process only changed content between page loads."""

    def __init__(self):
        self.page_cache = {}

    def process_with_diff(
        self,
        url: str,
        current_html: str,
        extraction_task: ExtractionTask
    ) -> Dict:
        """Process only the diff from the last visit."""
        previous_html = self.page_cache.get(url)

        if not previous_html:
            # First visit: process the full page and cache both the HTML
            # and the result, so a near-identical revisit can reuse it
            result = self.process_full(current_html, extraction_task)
            self.page_cache[url] = current_html
            self.page_cache[f"{url}_result"] = result
            return result

        # Calculate diff
        diff = self.calculate_diff(previous_html, current_html)

        if diff.similarity > 0.95:
            # Page barely changed, use cached results
            return self.page_cache.get(f"{url}_result")

        # Process only changed regions
        result = self.process_diff(diff, extraction_task)

        # Update cache
        self.page_cache[url] = current_html
        self.page_cache[f"{url}_result"] = result

        return result

    def calculate_diff(self, html1: str, html2: str) -> Diff:
        """Calculate structural diff between two HTML documents."""
        soup1 = BeautifulSoup(html1, 'lxml')
        soup2 = BeautifulSoup(html2, 'lxml')

        # Find added, removed, and modified elements
        diff = Diff()
        diff.added = self.find_added_elements(soup1, soup2)
        diff.removed = self.find_removed_elements(soup1, soup2)
        diff.modified = self.find_modified_elements(soup1, soup2)
        diff.similarity = self.calculate_similarity(soup1, soup2)

        return diff
```
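
`calculate_similarity` is left abstract above. As one possible stand-in (an assumption, not the project's actual metric), the tag sequences of two documents can be compared with the stdlib `difflib`; this version takes raw HTML strings rather than parsed soups:

```python
import difflib
from html.parser import HTMLParser

class _TagCollector(HTMLParser):
    """Collect the sequence of opening tag names in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def calculate_similarity(html1: str, html2: str) -> float:
    """Structural similarity in [0, 1] based on tag sequences."""
    seqs = []
    for html in (html1, html2):
        collector = _TagCollector()
        collector.feed(html)
        seqs.append(collector.tags)
    return difflib.SequenceMatcher(None, seqs[0], seqs[1]).ratio()

same = calculate_similarity("<div><p>a</p></div>", "<div><p>b</p></div>")
# Identical tag structure, different text, so similarity is 1.0
```

Because only tag names are compared, text-only edits (new prices, new timestamps) score as highly similar, which is the behavior the `> 0.95` short-circuit above wants.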

---

## Schema Detection

### Auto-Detect Data Schemas

```python
class SchemaDetector:
    """Automatically detect data schemas in HTML."""

    def detect_schema(self, html: str) -> Schema:
        """Detect the implicit schema of the page."""
        soup = BeautifulSoup(html, 'lxml')

        # 1. Check for schema.org markup
        schema_org = self.detect_schema_org(soup)
        if schema_org:
            return schema_org

        # 2. Detect repeated patterns
        repeated = self.detect_repeated_pattern(soup)
        if repeated:
            return self.infer_schema_from_pattern(repeated)

        # 3. Detect tables
        tables = soup.find_all('table')
        if tables:
            return self.infer_schema_from_table(tables[0])

        return Schema()

    def infer_schema_from_pattern(self, elements: List[Tag]) -> Schema:
        """Infer schema from repeated elements."""
        # Analyze first few elements
        sample = elements[:5]

        field_candidates = {}

        for element in sample:
            # Find all text-bearing children
            children = element.find_all(string=True, recursive=True)

            for child in children:
                # Classify by parent tag/class
                parent = child.parent
                key = (parent.name, ' '.join(parent.get('class', [])))

                if key not in field_candidates:
                    field_candidates[key] = []

                field_candidates[key].append(child.strip())

        # Build schema
        schema = Schema()

        for (tag, class_name), values in field_candidates.items():
            # Infer field type from values
            field_type = self.infer_type(values)
            field_name = self.guess_field_name(class_name, values)

            schema.add_field(Field(
                name=field_name,
                type=field_type,
                selector=f"{tag}.{class_name}" if class_name else tag,
                sample_values=values
            ))

        return schema
```
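
`infer_type` and `guess_field_name` are likewise left abstract. A heuristic sketch of `infer_type` (the patterns below are illustrative assumptions, not the shipped rules) might classify sample values by regex:

```python
import re

def infer_type(values):
    """Guess a field type from sample string values (heuristic sketch)."""
    cleaned = [v for v in values if v]
    if not cleaned:
        return 'string'
    # Currency symbol or bare number on every sample
    if all(re.fullmatch(r'[$€£]?\s?\d[\d,]*(\.\d+)?', v.strip()) for v in cleaned):
        return 'price' if any(v.strip()[0] in '$€£' for v in cleaned) else 'number'
    if all(re.fullmatch(r'https?://\S+', v.strip()) for v in cleaned):
        return 'url'
    return 'string'

t1 = infer_type(['$49.99', '$39.99'])     # price-like samples
t2 = infer_type(['12', '1,204'])          # numeric samples
t3 = infer_type(['https://a.example/x'])  # URL samples
```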

---

**Next:** See [search-engine.md](./search-engine.md) for search optimization.
docs/mcp.md ADDED
@@ -0,0 +1,977 @@
# 🔌 MCP Server Integration

## Table of Contents
1. [Overview](#overview)
2. [Available MCP Servers](#available-mcp-servers)
3. [Tool Registry & Discovery](#tool-registry--discovery)
4. [HTML Processing MCPs](#html-processing-mcps)
5. [Lazy Loading System](#lazy-loading-system)
6. [MCP Composition](#mcp-composition)
7. [Testing Panel](#testing-panel)
8. [Configuration](#configuration)

---

## Overview

The **Model Context Protocol (MCP)** enables the WebScraper agent to interact with external tools, databases, and services through a standardized interface. MCP servers expose **tools** that the agent can discover and use dynamically.

### Why MCP?

**Without MCP:**
- Agent is limited to built-in capabilities
- Cannot access external databases, APIs, or specialized libraries
- Difficult to extend without code changes

**With MCP:**
- ✅ Dynamically discover and use 100+ community tools
- ✅ Access databases (PostgreSQL, MongoDB, etc.)
- ✅ Use specialized libraries (BeautifulSoup, Selenium, Playwright)
- ✅ Integrate with external APIs (Google, GitHub, etc.)
- ✅ Extend agent capabilities without code changes

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      WebScraper Agent                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │               MCP Tool Registry                    │     │
│  │  - Discovers available tools from all MCP servers  │     │
│  │  - Provides tool metadata to agent                 │     │
│  │  - Routes tool calls to appropriate server         │     │
│  └────────────────┬───────────────────────────────────┘     │
│                   │                                         │
└───────────────────┼─────────────────────────────────────────┘
                    │
        ┌───────────┼────────────┬──────────────┬─────────────┐
        │           │            │              │             │
        ▼           ▼            ▼              ▼             ▼
┌───────────────┐ ┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ HTML Parser   │ │ Browser    │ │ Database │ │ File     │ │ Custom   │
│ MCP           │ │ MCP        │ │ MCP      │ │ System   │ │ MCP      │
│               │ │            │ │          │ │ MCP      │ │          │
│• BeautifulSoup│ │• Puppeteer │ │• Postgres│ │• Read    │ │• Your    │
│• lxml         │ │• Playwright│ │• MongoDB │ │• Write   │ │  tools   │
│• html5lib     │ │• Selenium  │ │• Redis   │ │• Search  │ │          │
└───────────────┘ └────────────┘ └──────────┘ └──────────┘ └──────────┘
```

---

## Available MCP Servers

### 1. HTML Processing & Parsing

#### **beautifulsoup-mcp**
Advanced HTML parsing and extraction.

**Tools:**
- `parse_html(html: str, parser: str = "html.parser")` → Parse HTML into DOM tree
- `find_all(html: str, selector: str)` → CSS selector search
- `extract_text(html: str, selector: str)` → Extract text content
- `extract_attributes(html: str, selector: str, attrs: List[str])` → Get element attributes
- `clean_html(html: str)` → Remove scripts, styles, comments
- `extract_tables(html: str)` → Parse all tables into structured data

**Configuration:**
```json
{
  "mcpServers": {
    "beautifulsoup": {
      "command": "python",
      "args": ["-m", "mcp_beautifulsoup"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "default_parser": "lxml",
        "encodings": ["utf-8", "latin-1"]
      }
    }
  }
}
```

**Example Usage:**
```python
# Agent action
action = Action(
    action_type="MCP_TOOL_CALL",
    tool_name="beautifulsoup.find_all",
    tool_params={
        "html": observation.page_html,
        "selector": "div.product-card"
    }
)

# Response
{
    "products": [
        {"name": "Widget", "price": "$49.99"},
        {"name": "Gadget", "price": "$39.99"}
    ]
}
```

#### **lxml-mcp**
Fast XML/HTML parsing with XPath support.

**Tools:**
- `xpath_query(html: str, xpath: str)` → XPath extraction
- `css_select(html: str, css: str)` → CSS selector (fast)
- `validate_html(html: str)` → Check well-formedness
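For well-formed markup, the idea behind `xpath_query` can be illustrated with the limited XPath subset in the stdlib `xml.etree.ElementTree` (a rough stand-in; the server itself presumably uses lxml's full XPath engine):

```python
import xml.etree.ElementTree as ET

markup = "<root><div class='price'>$10</div><div class='price'>$20</div><div>x</div></root>"
tree = ET.fromstring(markup)

# ElementTree supports a limited XPath subset, enough to show the shape
# of an xpath_query call such as //div[@class='price']
prices = [el.text for el in tree.findall(".//div[@class='price']")]
# → ['$10', '$20']
```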

#### **html5lib-mcp**
Standards-compliant HTML5 parsing.

**Tools:**
- `parse_html5(html: str)` → Parse like a browser would
- `sanitize_html(html: str, allowed_tags: List[str])` → Safe HTML cleaning

### 2. Browser Automation

#### **playwright-mcp**
Full browser automation with JavaScript rendering.

**Tools:**
- `navigate(url: str, wait_for: str = "networkidle")` → Load page with JS
- `click(selector: str)` → Click element
- `fill_form(selector: str, value: str)` → Fill input
- `screenshot(selector: str = None)` → Capture screenshot
- `wait_for_selector(selector: str, timeout: int = 5000)` → Wait for element
- `execute_script(script: str)` → Run custom JavaScript

**Use Cases:**
- Pages with client-side rendering (React, Vue, Angular)
- Infinite scroll / lazy loading
- Forms and interactions
- Captcha handling

**Configuration:**
```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp-server"],
      "enabled": false,  // Only enable when needed (heavy)
      "autoDownload": true,
      "config": {
        "browser": "chromium",
        "headless": true,
        "viewport": {"width": 1920, "height": 1080}
      }
    }
  }
}
```

#### **puppeteer-mcp**
Lightweight browser automation (Chrome DevTools Protocol).

Similar to Playwright but lighter weight.

#### **selenium-mcp**
Legacy browser automation (more compatible, slower).

### 3. Database Access

#### **postgresql-mcp**
Access PostgreSQL databases.

**Tools:**
- `query(sql: str, params: List = [])` → Execute SELECT
- `execute(sql: str, params: List = [])` → Execute INSERT/UPDATE/DELETE
- `list_tables()` → Get schema

**Use Case:** Store scraped data directly to a production database.

#### **mongodb-mcp**
Access MongoDB collections.

**Tools:**
- `find(collection: str, query: dict)` → Query documents
- `insert(collection: str, document: dict)` → Insert document
- `aggregate(collection: str, pipeline: List)` → Aggregation pipeline

#### **redis-mcp**
Fast cache and pub/sub.

**Tools:**
- `get(key: str)` → Retrieve cached value
- `set(key: str, value: str, ttl: int)` → Cache value
- `publish(channel: str, message: str)` → Pub/sub

**Use Case:** Cache parsed HTML, share state between agents.

### 4. File System

#### **filesystem-mcp**
Read/write local files.

**Tools:**
- `read_file(path: str)` → Read text/binary file
- `write_file(path: str, content: str)` → Write file
- `list_directory(path: str)` → List files
- `search_files(pattern: str)` → Glob search

**Use Case:** Save scraped data to CSV/JSON, read configuration files.

### 5. Search Engines

#### **google-search-mcp**
Google Search API integration.

**Tools:**
- `search(query: str, num: int = 10)` → Google Search results
- `search_images(query: str)` → Image search

**Configuration:**
```json
{
  "mcpServers": {
    "google-search": {
      "command": "python",
      "args": ["-m", "mcp_google_search"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_SEARCH_ENGINE_ID"
      }
    }
  }
}
```

#### **bing-search-mcp**
Bing Search API.

#### **brave-search-mcp**
Privacy-focused search (Brave Search API).

#### **duckduckgo-mcp**
Free, no-API-key search.

**Tools:**
- `search(query: str, max_results: int = 10)` → DDG results

### 6. Data Extraction

#### **readability-mcp**
Extract main article content (removes ads, navigation, etc.).

**Tools:**
- `extract_article(html: str)` → Returns clean article text + metadata

**Use Case:** Extract blog posts, news articles, documentation.

#### **trafilatura-mcp**
Advanced web scraping and text extraction.

**Tools:**
- `extract(url: str)` → Extract main content
- `extract_metadata(html: str)` → Get title, author, date, etc.

#### **newspaper-mcp**
News article extraction and NLP.

**Tools:**
- `parse_article(url: str)` → Full article data
- `extract_keywords(text: str)` → Keyword extraction
- `summarize(text: str)` → Auto-summarization

### 7. Data Validation

#### **cerberus-mcp**
Schema validation for extracted data.

**Tools:**
- `validate(data: dict, schema: dict)` → Validate against schema

**Example:**
```python
# Define schema
schema = {
    "product_name": {"type": "string", "required": True, "minlength": 1},
    "price": {"type": "float", "required": True, "min": 0},
    "rating": {"type": "float", "min": 0, "max": 5}
}

# Validate extracted data
result = mcp.call("cerberus.validate", data=extracted_data, schema=schema)
if not result["valid"]:
    print("Validation errors:", result["errors"])
```

#### **pydantic-mcp**
Pydantic model validation.

### 8. Computer Vision

#### **ocr-mcp**
Extract text from images (Tesseract OCR).

**Tools:**
- `extract_text(image_path: str, lang: str = "eng")` → OCR text

**Use Case:** Extract prices from product images, read captchas (where legal).

#### **image-analysis-mcp**
Vision AI (GPT-4 Vision, Claude Vision).

**Tools:**
- `describe_image(image_path: str)` → Natural language description
- `extract_structured(image_path: str, schema: dict)` → Extract structured data from images

### 9. HTTP & Networking

#### **requests-mcp**
HTTP client with retry and session management.

**Tools:**
- `get(url: str, headers: dict = {})` → HTTP GET
- `post(url: str, data: dict = {})` → HTTP POST

#### **proxy-manager-mcp**
Manage proxy rotation and IP reputation.

**Tools:**
- `get_proxy()` → Get next proxy from pool
- `report_dead_proxy(proxy: str)` → Mark proxy as failed

### 10. Utility

#### **regex-mcp**
Advanced regex operations.

**Tools:**
- `find_all(pattern: str, text: str)` → Find all matches
- `replace(pattern: str, replacement: str, text: str)` → Regex replace
- `validate(pattern: str)` → Check if regex is valid

#### **datetime-mcp**
Parse and normalize dates.

**Tools:**
- `parse_date(text: str)` → Parse natural language dates
- `normalize_timezone(date: str, tz: str)` → Convert timezone

#### **currency-mcp**
Currency parsing and conversion.

**Tools:**
- `parse_price(text: str)` → Extract price and currency
- `convert(amount: float, from_currency: str, to_currency: str)` → Convert

---

## Tool Registry & Discovery

The **Tool Registry** automatically discovers all available tools from enabled MCP servers.

### Architecture

```python
class MCPToolRegistry:
    def __init__(self):
        self.servers: Dict[str, MCPServer] = {}
        self.tools: Dict[str, Tool] = {}  # tool_name → Tool

    def discover_servers(self, config: MCPConfig):
        """Load and connect to all enabled MCP servers."""
        for server_name, server_config in config.mcpServers.items():
            if not server_config.enabled:
                continue

            # Auto-download if needed
            if server_config.autoDownload and not self.is_installed(server_config):
                self.download_and_install(server_name, server_config)

            # Connect to server
            server = self.connect_server(server_name, server_config)
            self.servers[server_name] = server

            # Discover tools
            for tool in server.list_tools():
                full_name = f"{server_name}.{tool.name}"
                self.tools[full_name] = tool

    def get_tool(self, tool_name: str) -> Tool:
        """Get tool by fully qualified name (server.tool)."""
        return self.tools.get(tool_name)

    def search_tools(self, query: str, category: str = None) -> List[Tool]:
        """Search tools by natural language query."""
        # Semantic search using tool descriptions
        candidates = list(self.tools.values())

        if category:
            candidates = [t for t in candidates if t.category == category]

        # Embed query and tools, rank by similarity
        scored = []
        for tool in candidates:
            score = self.semantic_similarity(query, tool.description)
            scored.append((tool, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [tool for tool, score in scored[:10]]
```
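
`semantic_similarity` is pluggable. A dependency-free stand-in (an illustrative assumption; production would use embeddings) is plain Jaccard overlap of the word sets:

```python
def semantic_similarity(query: str, description: str) -> float:
    """Jaccard overlap of lowercase word sets, a cheap embedding stand-in."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

score = semantic_similarity(
    "parse HTML and extract elements by CSS selector",
    "Find all HTML elements matching a CSS selector",
)
# Shared words (html, elements, css, selector) give a nonzero score
```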

### Tool Metadata

Each tool exposes rich metadata:

```python
class Tool(BaseModel):
    name: str                        # e.g., "find_all"
    full_name: str                   # e.g., "beautifulsoup.find_all"
    server: str                      # Server name
    description: str                 # Human-readable description
    category: str                    # "parsing" | "browser" | "database" | ...
    input_schema: Dict[str, Any]     # JSON Schema for parameters
    output_schema: Dict[str, Any]    # JSON Schema for return value
    examples: List[ToolExample]      # Usage examples
    cost: ToolCost                   # Time/resource cost estimate
    requires_auth: bool              # Needs API keys?
    rate_limit: Optional[RateLimit]  # Rate limiting info
```

**Example:**
```python
Tool(
    name="find_all",
    full_name="beautifulsoup.find_all",
    server="beautifulsoup",
    description="Find all HTML elements matching a CSS selector",
    category="parsing",
    input_schema={
        "type": "object",
        "properties": {
            "html": {"type": "string", "description": "HTML content to search"},
            "selector": {"type": "string", "description": "CSS selector"}
        },
        "required": ["html", "selector"]
    },
    output_schema={
        "type": "array",
        "items": {"type": "object"}
    },
    examples=[
        ToolExample(
            input={"html": "<div class='item'>A</div>", "selector": ".item"},
            output=[{"tag": "div", "text": "A", "class": "item"}]
        )
    ],
    cost=ToolCost(time_ms=10, cpu_intensive=False),
    requires_auth=False
)
```

### Auto Tool Discovery by Agent

The agent can query the registry to find relevant tools:

```python
# Agent needs to parse HTML
available_tools = tool_registry.search_tools(
    query="parse HTML and extract elements by CSS selector",
    category="parsing"
)

# Top result: beautifulsoup.find_all
tool = available_tools[0]

# Agent calls the tool
action = Action(
    action_type="MCP_TOOL_CALL",
    tool_name=tool.full_name,
    tool_params={
        "html": observation.page_html,
        "selector": "div.product-price"
    }
)
```

---

## HTML Processing MCPs

### BeautifulSoup MCP (Detailed)

**Installation:**
```bash
pip install mcp-beautifulsoup
```

**Tools:**

#### 1. `find_all(html, selector, limit=None)`
Find all elements matching a CSS selector.

```python
result = mcp.call("beautifulsoup.find_all", {
    "html": "<div class='price'>$10</div><div class='price'>$20</div>",
    "selector": "div.price"
})
# Returns: [{"text": "$10"}, {"text": "$20"}]
```

#### 2. `find_one(html, selector)`
Find the first matching element.

```python
result = mcp.call("beautifulsoup.find_one", {
    "html": obs.page_html,
    "selector": "h1.product-title"
})
# Returns: {"text": "Widget Pro", "tag": "h1"}
```

#### 3. `extract_tables(html)`
Parse all `<table>` elements into structured data.

```python
result = mcp.call("beautifulsoup.extract_tables", {"html": obs.page_html})
# Returns:
[
    {
        "headers": ["Product", "Price", "Stock"],
        "rows": [
            ["Widget", "$49.99", "In Stock"],
            ["Gadget", "$39.99", "Out of Stock"]
        ]
    }
]
```

#### 4. `extract_links(html, base_url=None)`
Extract all links from the page.

```python
result = mcp.call("beautifulsoup.extract_links", {
    "html": obs.page_html,
    "base_url": "https://example.com"
})
# Returns:
[
    {"url": "https://example.com/product/123", "text": "View Product"},
    {"url": "https://example.com/category/widgets", "text": "Widgets"}
]
```
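
Under the hood, `extract_links` amounts to walking anchor tags and resolving `href` values against the base URL. A minimal stdlib sketch (an illustration, not the server's code) using `html.parser` and `urljoin`:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect <a href> targets, resolving them against a base URL."""
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = urljoin(self.base_url, href)
                self._text_parts = []

    def handle_data(self, data):
        if self._current_href:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href:
            self.links.append({
                "url": self._current_href,
                "text": "".join(self._text_parts).strip(),
            })
            self._current_href = None

parser = LinkExtractor(base_url="https://example.com")
parser.feed('<a href="/product/123">View Product</a>')
# parser.links → [{'url': 'https://example.com/product/123', 'text': 'View Product'}]
```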

#### 5. `clean_html(html, remove=['script', 'style', 'noscript'])`
Remove unwanted elements.

```python
result = mcp.call("beautifulsoup.clean_html", {
    "html": obs.page_html,
    "remove": ["script", "style", "footer", "nav"]
})
# Returns: Clean HTML without ads, scripts, navigation
```

#### 6. `smart_extract(html, field_name)`
Intelligent extraction based on field name.

```python
# Agent wants to extract "price"
result = mcp.call("beautifulsoup.smart_extract", {
    "html": obs.page_html,
    "field_name": "price"
})
# MCP searches for:
# - Elements with class/id containing "price"
# - Text matching price patterns ($X.XX, €X,XX)
# - Schema.org markup (itemprop="price")
# Returns: {"value": "$49.99", "confidence": 0.92, "selector": "span.product-price"}
```
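
The price-pattern step of `smart_extract` can be sketched as a single regex pass (the pattern below is a simplified assumption, not the server's actual rules):

```python
import re

# Matches $1,299.99 / €49,99 / £10: a symbol, then digits with separators
PRICE_RE = re.compile(r'[$€£]\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?')

def find_prices(text: str):
    """Return all price-looking substrings in the text."""
    return PRICE_RE.findall(text)

prices = find_prices("Was $1,299.99, now $999.00 (about €949,00)")
# → ['$1,299.99', '$999.00', '€949,00']
```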

### Batch Processing for Long Content

When HTML is too large (> 100 KB), process it in batches:

```python
class HTMLBatchProcessor:
    def __init__(self, mcp_client, chunk_size: int = 50000):
        self.mcp = mcp_client
        self.chunk_size = chunk_size

    def process_large_html(self, html: str, selector: str) -> List[Dict]:
        """Process large HTML in chunks."""
        # Split HTML into meaningful chunks (by sections, not mid-tag)
        chunks = self.split_html_intelligently(html)

        results = []
        for i, chunk in enumerate(chunks):
            # Process each chunk
            chunk_results = self.mcp.call("beautifulsoup.find_all", {
                "html": chunk,
                "selector": selector
            })

            # Deduplicate across chunk boundaries
            results.extend(self.deduplicate(chunk_results, results))

        return results

    def split_html_intelligently(self, html: str) -> List[str]:
        """Split HTML at section boundaries, not mid-tag."""
        soup = BeautifulSoup(html, 'lxml')

        # Split by major sections (article, section, main)
        sections = soup.find_all(['article', 'section', 'main'])

        chunks = []
        current_chunk = ""

        for section in sections:
            section_html = str(section)

            if len(current_chunk) + len(section_html) > self.chunk_size:
                chunks.append(current_chunk)
                current_chunk = section_html
            else:
                current_chunk += section_html

        if current_chunk:
            chunks.append(current_chunk)

        return chunks
```

---

## Lazy Loading System

MCP servers are **not downloaded by default**. They are installed on demand when first used.

### Download-on-Demand Flow

```
Agent wants to use a tool
            │
            ▼
Is the MCP server installed?
            │
       ┌────┴────┐
      No         Yes
       │          │
       ▼          ▼
 Show dialog   Execute tool
 "Download
  server X?"
       │
   ┌───┴───┐
  No       Yes
   │        │
 Skip   Download & Install
            │
            ▼
   Cache for future use
            │
            ▼
       Execute tool
```

### Implementation

```python
class LazyMCPLoader:
    def __init__(self):
        self.installed_servers: Set[str] = set()
        self.download_queue: Queue[str] = Queue()

    def ensure_server(self, server_name: str, config: MCPServerConfig) -> bool:
        """Ensure the MCP server is installed; download if needed."""
        if server_name in self.installed_servers:
            return True

        if not config.autoDownload:
            # Prompt user
            if not self.prompt_user_download(server_name):
                return False

        # Download and install
        return self.download_server(server_name, config)

    def download_server(self, server_name: str, config: MCPServerConfig) -> bool:
        """Download and install an MCP server."""
        try:
            logger.info(f"Downloading MCP server: {server_name}")

            if config.command == "npx":
                # NPM package: the package name is the last arg
                subprocess.run([
                    "npm", "install", "-g", config.args[-1]
                ], check=True)

            elif config.command == "python":
                # Server is launched as `python -m <module>`; map the module
                # name to its pip package (e.g. mcp_beautifulsoup → mcp-beautifulsoup)
                module_name = config.args[-1]
                package_name = module_name.replace("_", "-")
                subprocess.run([
                    "pip", "install", package_name
                ], check=True)

            self.installed_servers.add(server_name)
            logger.info(f"✓ Installed {server_name}")
            return True

        except Exception as e:
            logger.error(f"Failed to install {server_name}: {e}")
            return False

    def prompt_user_download(self, server_name: str) -> bool:
        """Ask the user whether to download the server."""
        # In the UI, show a dialog:
        # "Tool X requires MCP server Y. Download and install? (50MB) [Yes] [No]"
        return self.show_download_dialog(server_name)
```
733
+
734
+ ### UI Dialog
735
+
736
+ ```
737
+ ┌──────────────────────────────────────────────────────────┐
738
+ │ MCP Server Required │
739
+ ├──────────────────────────────────────────────────────────┤
740
+ │ │
741
+ │ The tool "beautifulsoup.find_all" requires the MCP │
742
+ │ server "beautifulsoup" which is not installed. │
743
+ │ │
744
+ │ Package: mcp-beautifulsoup │
745
+ │ Size: ~5 MB │
746
+ │ │
747
+ │ Would you like to download and install it now? │
748
+ │ │
749
+ │ [Download & Install] [Skip] │
750
+ │ │
751
+ │ ☑ Remember my choice for this server │
752
+ └──────────────────────────────────────────────────────────┘
753
+ ```
754
+
755
+ ---
756
+
757
+ ## MCP Composition
758
+
759
+ Combine multiple MCP tools to create powerful workflows.
760
+
761
+ ### Example 1: Parse HTML → Extract Tables → Save to Database
762
+
763
+ ```python
764
+ # Step 1: Clean HTML
765
+ cleaned = mcp.call("beautifulsoup.clean_html", {
766
+ "html": observation.page_html
767
+ })
768
+
769
+ # Step 2: Extract tables
770
+ tables = mcp.call("beautifulsoup.extract_tables", {
771
+ "html": cleaned["html"]
772
+ })
773
+
774
+ # Step 3: Save to PostgreSQL
775
+ for table in tables:
776
+ mcp.call("postgresql.execute", {
777
+ "sql": "INSERT INTO scraped_data (data) VALUES (%s)",
778
+ "params": [json.dumps(table)]
779
+ })
780
+ ```
781
+
782
+ ### Example 2: Search Google → Navigate → Parse Article → Summarize
783
+
784
+ ```python
785
+ # Step 1: Search
786
+ results = mcp.call("google-search.search", {
787
+ "query": "best widgets 2026",
788
+ "num": 5
789
+ })
790
+
791
+ # Step 2: Navigate to top result
792
+ mcp.call("playwright.navigate", {
793
+ "url": results[0]["url"]
794
+ })
795
+
796
+ # Step 3: Extract article
797
+ article = mcp.call("readability.extract_article", {
798
+ "html": mcp.call("playwright.get_html", {})
799
+ })
800
+
801
+ # Step 4: Summarize
802
+ summary = mcp.call("llm.summarize", {
803
+ "text": article["text"],
804
+ "max_length": 200
805
+ })
806
+ ```
807
+
808
+ ### Composition DSL
809
+
810
+ Define reusable workflows:
811
+
812
+ ```python
813
+ from dataclasses import dataclass
+ from typing import Callable, Dict, List
+
+ @dataclass
+ class WorkflowStep:
+     tool: str                       # fully-qualified MCP tool name
+     params: Callable[[Dict], Dict]  # builds tool params from the workflow context
+     output_var: str                 # context key that receives the result
+
+ class MCPWorkflow:
814
+ def __init__(self, name: str, steps: List[WorkflowStep]):
815
+ self.name = name
816
+ self.steps = steps
817
+
818
+ async def execute(self, initial_input: Dict) -> Dict:
819
+ """Execute workflow steps sequentially."""
820
+ context = initial_input
821
+
822
+ for step in self.steps:
823
+ result = await mcp.call(step.tool, step.params(context))
824
+ context[step.output_var] = result
825
+
826
+ return context
827
+
828
+ # Define workflow
829
+ extract_and_save = MCPWorkflow(
830
+ name="extract_and_save",
831
+ steps=[
832
+ WorkflowStep(
833
+ tool="beautifulsoup.find_all",
834
+ params=lambda ctx: {"html": ctx["html"], "selector": ctx["selector"]},
835
+ output_var="extracted"
836
+ ),
837
+ WorkflowStep(
838
+ tool="cerberus.validate",
839
+ params=lambda ctx: {"data": ctx["extracted"], "schema": ctx["schema"]},
840
+ output_var="validated"
841
+ ),
842
+ WorkflowStep(
843
+ tool="postgresql.execute",
844
+ params=lambda ctx: {"sql": "INSERT INTO items ...", "params": ctx["validated"]},
845
+ output_var="saved"
846
+ )
847
+ ]
848
+ )
849
+
850
+ # Execute
851
+ result = await extract_and_save.execute({
852
+ "html": obs.page_html,
853
+ "selector": "div.product",
854
+ "schema": PRODUCT_SCHEMA
855
+ })
856
+ ```
857
+
858
+ ---
859
+
860
+ ## Testing Panel
861
+
862
+ Test MCP tools manually before using them in agent workflows.
863
+
864
+ ### UI
865
+
866
+ ```
867
+ ┌─────────────────────────────────────────────────────────────┐
868
+ │ MCP Testing Panel │
869
+ ├─────────────────────────────────────────────────────────────┤
870
+ │ │
871
+ │ Server: [beautifulsoup ▼] │
872
+ │ Tool: [find_all ▼] │
873
+ │ │
874
+ │ ┌──────────────────────────────────────────────────────┐ │
875
+ │ │ Input Parameters: │ │
876
+ │ │ │ │
877
+ │ │ html: │ │
878
+ │ │ ┌───────────────────────────────────────────────┐ │ │
879
+ │ │ │ <div class="item">Item 1</div> │ │ │
880
+ │ │ │ <div class="item">Item 2</div> │ │ │
881
+ │ │ └───────────────────────────────────────────────┘ │ │
882
+ │ │ │ │
883
+ │ │ selector: [div.item ] │ │
884
+ │ │ │ │
885
+ │ └──────────────────────────────────────────────────────┘ │
886
+ │ │
887
+ │ [Execute Tool] [Clear] │
888
+ │ │
889
+ │ ┌──────────────────────────────────────────────────────┐ │
890
+ │ │ Output: │ │
891
+ │ │ │ │
892
+ │ │ [ │ │
893
+ │ │ {"tag": "div", "class": "item", "text": "Item 1"}, │ │
894
+ │ │ {"tag": "div", "class": "item", "text": "Item 2"} │ │
895
+ │ │ ] │ │
896
+ │ │ │ │
897
+ │ │ Execution time: 12ms │ │
898
+ │ │ Status: ✓ Success │ │
899
+ │ └──────────────────────────────────────────────────────┘ │
900
+ │ │
901
+ │ [Save as Example] │
902
+ └─────────────────────────────────────────────────────────────┘
903
+ ```
904
+
905
+ ---
906
+
907
+ ## Configuration
908
+
909
+ ### Full MCP Configuration Example
910
+
911
+ ```json
912
+ {
913
+ "mcpServers": {
914
+ "beautifulsoup": {
915
+ "command": "python",
916
+ "args": ["-m", "mcp_beautifulsoup"],
917
+ "enabled": true,
918
+ "autoDownload": true,
919
+ "config": {
920
+ "default_parser": "lxml"
921
+ }
922
+ },
923
+ "playwright": {
924
+ "command": "npx",
925
+ "args": ["@playwright/mcp-server"],
926
+ "enabled": false,
927
+ "autoDownload": false,
928
+ "config": {
929
+ "browser": "chromium",
930
+ "headless": true
931
+ }
932
+ },
933
+ "postgresql": {
934
+ "command": "python",
935
+ "args": ["-m", "mcp_postgresql"],
936
+ "enabled": false,
937
+ "autoDownload": false,
938
+ "config": {
939
+ "host": "localhost",
940
+ "port": 5432,
941
+ "database": "scraper_db",
942
+ "user": "postgres",
943
+ "password": "${PG_PASSWORD}"
944
+ }
945
+ },
946
+ "google-search": {
947
+ "command": "python",
948
+ "args": ["-m", "mcp_google_search"],
949
+ "enabled": true,
950
+ "autoDownload": true,
951
+ "config": {
952
+ "api_key": "${GOOGLE_API_KEY}",
953
+ "search_engine_id": "${GOOGLE_SE_ID}"
954
+ }
955
+ },
956
+ "filesystem": {
957
+ "command": "npx",
958
+ "args": ["-y", "@modelcontextprotocol/server-filesystem", "./scraped_data"],
959
+ "enabled": true,
960
+ "autoDownload": true
961
+ }
962
+ },
963
+
964
+ "mcpSettings": {
965
+ "autoDiscoverTools": true,
966
+ "toolTimeout": 30,
967
+ "maxConcurrentCalls": 5,
968
+ "retryFailedCalls": true,
969
+ "cacheToolResults": true,
970
+ "cacheTTL": 3600
971
+ }
972
+ }
973
+ ```
974
+
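Values such as `${PG_PASSWORD}` suggest environment-variable substitution when the configuration is loaded. A minimal sketch of how those placeholders might be expanded (the `expand_env` helper is an illustrative assumption, not part of the spec):

```python
import os
import re

def expand_env(value):
    """Recursively replace ${VAR} placeholders with environment values (empty if unset)."""
    if isinstance(value, str):
        return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value
```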
975
+ ---
976
+
977
+ **Next:** See [settings.md](./settings.md) for complete dashboard settings.
docs/memory.md ADDED
@@ -0,0 +1,786 @@
1
+ # 🧠 Unified Memory System
2
+
3
+ ## Table of Contents
4
+ 1. [Overview](#overview)
5
+ 2. [Memory Architecture](#memory-architecture)
6
+ 3. [Memory Layers](#memory-layers)
7
+ 4. [Memory Operations](#memory-operations)
8
+ 5. [Implementation Details](#implementation-details)
9
+ 6. [Configuration](#configuration)
10
+ 7. [Best Practices](#best-practices)
11
+
12
+ ---
13
+
14
+ ## Overview
15
+
16
+ The **Unified Memory System** is the most critical upgrade for the WebScraper-OpenEnv agent. It provides persistent, contextual, and hierarchical memory across episodes, enabling the agent to learn from past experiences, maintain reasoning context, and share knowledge across multiple agents.
17
+
18
+ ### Why Memory Matters
19
+
20
+ Without memory:
21
+ - Agents repeat the same mistakes across episodes
22
+ - No learning from successful extraction patterns
23
+ - Cannot maintain context across long scraping sessions
24
+ - Unable to share knowledge between multiple agents
25
+ - Limited by context window size
26
+
27
+ With unified memory:
28
+ - ✅ Learn successful extraction strategies
29
+ - ✅ Remember failed approaches to avoid repetition
30
+ - ✅ Maintain reasoning context across steps
31
+ - ✅ Share discoveries across agent instances
32
+ - ✅ Overcome context window limitations
33
+
34
+ ---
35
+
36
+ ## Memory Architecture
37
+
38
+ ```
39
+ ┌─────────────────────────────────────────────────────────────────┐
40
+ │ Unified Memory System │
41
+ ├─────────────────────────────────────────────────────────────────┤
42
+ │ │
43
+ │ ┌────────────────┐ ┌────────────────┐ ┌──────────────────┐ │
44
+ │ │ Short-Term │ │ Working │ │ Long-Term │ │
45
+ │ │ Memory │ │ Memory │ │ Memory │ │
46
+ │ │ (Episode) │ │ (Reasoning) │ │ (Persistent) │ │
47
+ │ └────────┬───────┘ └───────┬────────┘ └────────┬─────────┘ │
48
+ │ │ │ │ │
49
+ │ └──────────────────┼─────────────────────┘ │
50
+ │ │ │
51
+ │ ┌─────────▼──────────┐ │
52
+ │ │ Memory Router │ │
53
+ │ │ - Query planner │ │
54
+ │ │ - Context builder │ │
55
+ │ │ - Summarizer │ │
56
+ │ └─────────┬──────────┘ │
57
+ │ │ │
58
+ │ ┌──────────────────┼──────────────────┐ │
59
+ │ │ │ │ │
60
+ │ ┌────────▼────────┐ ┌──────▼─────────┐ ┌───▼──────────┐ │
61
+ │ │ Shared Memory │ │ Vector Index │ │ MCP Storage │ │
62
+ │ │ (Multi-Agent) │ │ (FAISS/Qdrant)│ │ (File/DB) │ │
63
+ │ └─────────────────┘ └────────────────┘ └──────────────┘ │
64
+ │ │
65
+ └─────────────────────────────────────────────────────────────────┘
66
+ ```
67
+
68
+ ---
69
+
70
+ ## Memory Layers
71
+
72
+ ### 1. 🟢 Short-Term Memory (Per Episode)
73
+
74
+ **Purpose:** Tracks the current scraping session state.
75
+
76
+ **Lifecycle:** Exists for one episode, cleared on `reset()`.
77
+
78
+ **Data Structure:**
79
+ ```python
80
+ class EpisodeMemory(BaseModel):
81
+ episode_id: str
82
+ task_id: str
83
+ visited_urls: List[str] # Navigation history
84
+ extracted_data: Dict[str, Any] # Field → value mappings
85
+ actions_history: List[Action] # All actions taken
86
+ intermediate_notes: List[str] # Agent's reasoning notes
87
+ observations: List[Observation] # All observations received
88
+ page_summaries: Dict[str, str] # URL → content summary
89
+ extraction_attempts: Dict[str, List[Any]] # Field → list of attempts
90
+ timestamp_created: datetime
91
+ timestamp_updated: datetime
92
+ ```
93
+
94
+ **Use Cases:**
95
+ - Track which pages have been visited to avoid cycles
96
+ - Remember what data has been extracted
97
+ - Maintain action history for debugging
98
+ - Store intermediate reasoning
99
+
100
+ **Example:**
101
+ ```python
102
+ # Agent navigating a multi-page catalog
103
+ episode_memory = {
104
+ "visited_urls": [
105
+ "/catalog/page/1",
106
+ "/catalog/page/2",
107
+ "/product/12345"
108
+ ],
109
+ "extracted_data": {
110
+ "product_name": "Widget Pro",
111
+ "price": "$49.99"
112
+ },
113
+ "intermediate_notes": [
114
+ "Price found in span.product-price",
115
+ "Next page link present, continuing pagination"
116
+ ]
117
+ }
118
+ ```
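The `visited_urls` list above is what lets the agent avoid revisiting pages during pagination. A minimal sketch of that check (the helper names are assumptions):

```python
def should_visit(episode_memory: dict, url: str) -> bool:
    """True if the URL has not been seen in this episode (avoids pagination cycles)."""
    return url not in episode_memory["visited_urls"]

def record_visit(episode_memory: dict, url: str) -> None:
    """Append the URL to the navigation history exactly once."""
    if should_visit(episode_memory, url):
        episode_memory["visited_urls"].append(url)
```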
119
+
120
+ ### 2. 🔵 Working Memory (Agent Thinking)
121
+
122
+ **Purpose:** Temporary reasoning buffer for active decision-making.
123
+
124
+ **Lifecycle:** Cleared after each action decision, or kept for multi-step reasoning.
125
+
126
+ **Data Structure:**
127
+ ```python
128
+ class WorkingMemory(BaseModel):
129
+ current_goal: str # Active objective
130
+ reasoning_steps: List[str] # Chain of thought
131
+ considered_actions: List[Action] # Actions being evaluated
132
+ scratchpad: Dict[str, Any] # Temporary calculations
133
+ active_hypotheses: List[str] # Predictions to test
134
+ context_window: List[str] # Relevant memory chunks
135
+ attention_focus: Optional[str] # Current DOM element/area of focus
136
+ ```
137
+
138
+ **Use Cases:**
139
+ - Chain-of-thought reasoning before action selection
140
+ - Evaluate multiple action candidates
141
+ - Maintain focus during complex extraction
142
+ - Store temporary parsing results
143
+
144
+ **Example:**
145
+ ```python
146
+ working_memory = {
147
+ "current_goal": "Extract product price from listing",
148
+ "reasoning_steps": [
149
+ "Step 1: Search HTML for price indicators ($, €, price)",
150
+ "Step 2: Found 3 candidates: $49.99, $39.99 (strikethrough), $5.99 (shipping)",
151
+ "Step 3: $49.99 is in <span class='product-price'>, most likely correct",
152
+ "Step 4: Extract using selector span.product-price"
153
+ ],
154
+ "considered_actions": [
155
+ Action(action_type="EXTRACT_FIELD", selector="span.price"),
156
+ Action(action_type="EXTRACT_FIELD", selector="span.product-price"),
157
+ Action(action_type="SEARCH_PAGE", query="price.*\\$\\d+")
158
+ ],
159
+ "attention_focus": "div.product-details"
160
+ }
161
+ ```
162
+
163
+ ### 3. 🟡 Long-Term Memory (Persistent)
164
+
165
+ **Purpose:** Store learned patterns, strategies, and historical data across all episodes.
166
+
167
+ **Lifecycle:** Persists indefinitely via MCP storage and vector database.
168
+
169
+ **Data Structure:**
170
+ ```python
171
+ class LongTermMemory(BaseModel):
172
+ # Vector embeddings for semantic search
173
+ embeddings_index: VectorIndex # FAISS, Qdrant, or Pinecone
174
+
175
+ # Successful extraction patterns
176
+ learned_patterns: List[ExtractionPattern]
177
+
178
+ # Historical performance data
179
+ past_episodes: List[EpisodeSummary]
180
+
181
+ # Failed attempts (to avoid repetition)
182
+ failed_patterns: List[FailedPattern]
183
+
184
+ # Domain knowledge
185
+ website_schemas: Dict[str, WebsiteSchema] # domain → common patterns
186
+
187
+ # Selector library
188
+ selector_success_rate: Dict[str, float] # selector → success rate
189
+ ```
190
+
191
+ **Extraction Pattern:**
192
+ ```python
193
+ class ExtractionPattern(BaseModel):
194
+ pattern_id: str
195
+ field_name: str # e.g., "price"
196
+ selector: str # e.g., "span.product-price"
197
+ selector_type: str # "css" | "xpath" | "label"
198
+ success_count: int # How many times it worked
199
+ failure_count: int # How many times it failed
200
+ domains: List[str] # Which websites it works on
201
+ confidence: float # 0.0 to 1.0
202
+ examples: List[str] # Sample extracted values
203
+ created_at: datetime
204
+ last_used: datetime
205
+ ```
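The `confidence` field can be derived from the success/failure counters. One plausible formula — a Laplace-smoothed success rate, which is an assumption here rather than something the schema mandates — keeps unproven patterns near 0.5 instead of jumping straight to 1.0:

```python
def pattern_confidence(success_count: int, failure_count: int) -> float:
    """Laplace-smoothed success rate: (s + 1) / (s + f + 2)."""
    return (success_count + 1) / (success_count + failure_count + 2)
```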
206
+
207
+ **Use Cases:**
208
+ - Retrieve successful selectors for similar tasks
209
+ - Avoid repeating failed extraction attempts
210
+ - Learn website-specific patterns
211
+ - Build a library of proven strategies
212
+
213
+ **Example Query:**
214
+ ```python
215
+ # Agent needs to extract "price" from a new e-commerce page
216
+ similar_patterns = long_term_memory.search(
217
+ query="price extraction e-commerce",
218
+ filters={"field_name": "price", "confidence": ">0.8"},
219
+ limit=5
220
+ )
221
+
222
+ # Returns:
223
+ [
224
+ ExtractionPattern(
225
+ selector="span.product-price",
226
+ success_count=42,
227
+ confidence=0.95,
228
+ domains=["shop.example.com", "store.example.org"]
229
+ ),
230
+ ExtractionPattern(
231
+ selector="div.price-box span[itemprop='price']",
232
+ success_count=38,
233
+ confidence=0.92,
234
+ domains=["ecommerce.example.net"]
235
+ ),
236
+ ...
237
+ ]
238
+ ```
239
+
240
+ ### 4. 🔴 Shared Memory (Multi-Agent)
241
+
242
+ **Purpose:** Enable knowledge sharing across multiple agent instances.
243
+
244
+ **Lifecycle:** Persistent, synchronized across all agents.
245
+
246
+ **Data Structure:**
247
+ ```python
248
+ class SharedMemory(BaseModel):
249
+ global_knowledge_base: Dict[str, Any] # Shared facts and patterns
250
+ agent_messages: List[AgentMessage] # Inter-agent communication
251
+ task_state: Dict[str, TaskState] # Collaborative task status
252
+ distributed_discoveries: List[Discovery] # Findings from all agents
253
+ consensus_data: Dict[str, ConsensusValue] # Voted/validated facts
254
+ ```
255
+
256
+ **Use Cases:**
257
+ - Multiple agents scraping different sections of a large site
258
+ - Collaborative fact verification
259
+ - Distributed catalog scraping
260
+ - Consensus-based data validation
261
+
262
+ **Example:**
263
+ ```python
264
+ # Agent A discovers a pattern
265
+ agent_a.shared_memory.broadcast(
266
+ AgentMessage(
267
+ sender="agent_a",
268
+ message_type="PATTERN_DISCOVERED",
269
+ data={
270
+ "pattern": "Product SKU always in span.sku-code",
271
+ "confidence": 0.89,
272
+ "domain": "shop.example.com"
273
+ }
274
+ )
275
+ )
276
+
277
+ # Agent B receives and applies the pattern
278
+ agent_b_discovers = agent_b.shared_memory.receive_messages(
279
+ message_type="PATTERN_DISCOVERED"
280
+ )
281
+ # Agent B can now use this selector without rediscovering it
282
+ ```
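The `consensus_data` field suggests majority voting over values reported by different agents. A minimal sketch — the quorum rule below is an assumption:

```python
from collections import Counter

def consensus_value(reports, quorum: int = 2):
    """Return the most-reported value once at least `quorum` agents agree, else None."""
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= quorum else None
```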
283
+
284
+ ---
285
+
286
+ ## Memory Operations
287
+
288
+ ### Core Actions
289
+
290
+ The memory system exposes the following actions to the agent:
291
+
292
+ #### 1. WRITE_MEMORY
293
+ Store information in the appropriate memory layer.
294
+
295
+ ```python
296
+ class WriteMemoryAction(Action):
297
+ action_type: Literal["WRITE_MEMORY"]
298
+ memory_layer: Literal["short_term", "working", "long_term", "shared"]
299
+ key: str
300
+ value: Any
301
+ metadata: Optional[Dict[str, Any]] = None
302
+ ttl: Optional[int] = None # Time-to-live in seconds (for working memory)
303
+ ```
304
+
305
+ **Example:**
306
+ ```python
307
+ # Store a successful extraction pattern
308
+ Action(
309
+ action_type="WRITE_MEMORY",
310
+ memory_layer="long_term",
311
+ key="pattern:price:span.product-price",
312
+ value={
313
+ "selector": "span.product-price",
314
+ "field": "price",
315
+ "success_count": 1,
316
+ "domain": "shop.example.com"
317
+ },
318
+ metadata={"task_id": "task_medium", "episode_id": "ep_123"}
319
+ )
320
+ ```
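The optional `ttl` field above implies that working-memory entries can expire. A minimal in-process sketch (the class and method names are assumptions):

```python
import time

class TTLStore:
    """Working-memory store whose entries can expire after `ttl` seconds."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def write(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def read(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazy eviction on read
            return None
        return value
```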
321
+
322
+ #### 2. READ_MEMORY
323
+ Retrieve information from memory.
324
+
325
+ ```python
326
+ class ReadMemoryAction(Action):
327
+ action_type: Literal["READ_MEMORY"]
328
+ memory_layer: Literal["short_term", "working", "long_term", "shared"]
329
+ key: Optional[str] = None # Specific key (exact match)
330
+ query: Optional[str] = None # Semantic search query
331
+ filters: Optional[Dict] = None # Metadata filters
332
+ limit: int = 10 # Max results
333
+ ```
334
+
335
+ **Example:**
336
+ ```python
337
+ # Semantic search for price extraction patterns
338
+ Action(
339
+ action_type="READ_MEMORY",
340
+ memory_layer="long_term",
341
+ query="how to extract price from e-commerce product page",
342
+ filters={"field_name": "price", "confidence": ">0.7"},
343
+ limit=5
344
+ )
345
+ ```
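Filters such as `{"confidence": ">0.7"}` mix exact matches with string-encoded comparisons. One way they could be evaluated — the parser below is an illustrative assumption, not the documented implementation:

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def matches(item: dict, filters: dict) -> bool:
    """True if `item` satisfies every filter; strings like '>0.7' compare numerically."""
    for field, cond in filters.items():
        value = item.get(field)
        m = re.match(r"(>=|<=|>|<)\s*(.+)", cond) if isinstance(cond, str) else None
        if m:
            if value is None or not _OPS[m.group(1)](float(value), float(m.group(2))):
                return False
        elif value != cond:
            return False
    return True
```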
346
+
347
+ #### 3. SEARCH_MEMORY
348
+ Advanced semantic search across memory layers.
349
+
350
+ ```python
351
+ class SearchMemoryAction(Action):
352
+ action_type: Literal["SEARCH_MEMORY"]
353
+ query: str # Natural language query
354
+ memory_layers: List[str] # Which layers to search
355
+ search_mode: Literal["semantic", "keyword", "hybrid"]
356
+ time_range: Optional[TimeRange] # Filter by recency
357
+ min_relevance: float = 0.5 # Minimum similarity score
358
+ ```
359
+
360
+ **Example:**
361
+ ```python
362
+ # Find all successful pagination strategies
363
+ Action(
364
+ action_type="SEARCH_MEMORY",
365
+ query="successful pagination next page navigation strategies",
366
+ memory_layers=["long_term", "shared"],
367
+ search_mode="semantic",
368
+ min_relevance=0.7
369
+ )
370
+ ```
371
+
372
+ #### 4. SUMMARIZE_MEMORY
373
+ Compress and summarize memory to manage context window.
374
+
375
+ ```python
376
+ class SummarizeMemoryAction(Action):
377
+ action_type: Literal["SUMMARIZE_MEMORY"]
378
+ memory_layer: str
379
+ summarization_strategy: Literal["importance", "recency", "relevance"]
380
+ target_size: int # Target summary size in tokens
381
+ preserve_keys: List[str] # Never summarize these
382
+ ```
383
+
384
+ #### 5. PRUNE_MEMORY
385
+ Remove low-value or outdated memories.
386
+
387
+ ```python
388
+ class PruneMemoryAction(Action):
389
+ action_type: Literal["PRUNE_MEMORY"]
390
+ memory_layer: str
391
+ pruning_strategy: Literal["lru", "low_confidence", "old_age"]
392
+ threshold: float # Confidence/age threshold
393
+ ```
394
+
395
+ ---
396
+
397
+ ## Implementation Details
398
+
399
+ ### Vector Database Integration
400
+
401
+ **Supported Backends:**
402
+ - **FAISS** (default, local, no external dependencies)
403
+ - **Qdrant** (distributed, production-ready)
404
+ - **Pinecone** (managed, cloud-based)
405
+ - **Weaviate** (open-source, GraphQL API)
406
+
407
+ **Configuration:**
408
+ ```python
409
+ class VectorDBConfig(BaseModel):
410
+ provider: Literal["faiss", "qdrant", "pinecone", "weaviate"]
411
+ embedding_model: str = "text-embedding-3-small" # OpenAI
412
+ dimension: int = 1536
413
+ similarity_metric: Literal["cosine", "euclidean", "dot_product"] = "cosine"
414
+ index_type: str = "IVF" # FAISS-specific
415
+ connection_params: Dict[str, Any] # Provider-specific
416
+ ```
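For reference, the default `cosine` similarity metric is just the dot product of the two vectors divided by their norms. A minimal sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```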
417
+
418
+ **Embedding Pipeline:**
419
+ ```python
420
+ class MemoryEmbedder:
421
+ def embed_pattern(self, pattern: ExtractionPattern) -> np.ndarray:
422
+ """Convert extraction pattern to embedding."""
423
+ text = f"""
424
+ Field: {pattern.field_name}
425
+ Selector: {pattern.selector}
426
+ Type: {pattern.selector_type}
427
+ Context: {' '.join(pattern.examples[:3])}
428
+ """
429
+ return self.embedding_model.encode(text)
430
+
431
+ def embed_query(self, query: str) -> np.ndarray:
432
+ """Convert search query to embedding."""
433
+ return self.embedding_model.encode(query)
434
+ ```
435
+
436
+ ### MCP Storage Integration
437
+
438
+ **Storage Backends:**
439
+ - **File System MCP** (local JSON/SQLite files)
440
+ - **PostgreSQL MCP** (relational storage)
441
+ - **MongoDB MCP** (document storage)
442
+ - **Redis MCP** (fast cache + pub/sub for shared memory)
443
+
444
+ **Example MCP Configuration:**
445
+ ```json
446
+ {
447
+ "mcpServers": {
448
+ "memory-storage": {
449
+ "command": "npx",
450
+ "args": ["-y", "@modelcontextprotocol/server-filesystem", "./memory_data"],
451
+ "enabled": true,
452
+ "autoDownload": false
453
+ },
454
+ "memory-cache": {
455
+ "command": "redis-mcp-server",
456
+ "args": ["--host", "localhost", "--port", "6379"],
457
+ "enabled": true,
458
+ "autoDownload": true
459
+ }
460
+ }
461
+ }
462
+ ```
463
+
464
+ ### Memory Router
465
+
466
+ The **Memory Router** intelligently decides which memory layer to query based on the request:
467
+
468
+ ```python
469
+ class MemoryRouter:
470
+ def route_query(self, query: str, context: Dict) -> List[str]:
471
+ """Determine which memory layers to search."""
472
+ layers = []
473
+
474
+ # Recent action history → short-term
475
+ if "last few" in query or "current episode" in query:
476
+ layers.append("short_term")
477
+
478
+ # Active reasoning → working
479
+ if "consider" in query or "evaluate" in query:
480
+ layers.append("working")
481
+
482
+ # Historical patterns → long-term
483
+ if "similar" in query or "previously" in query or "learned" in query:
484
+ layers.append("long_term")
485
+
486
+ # Other agents' discoveries → shared
487
+ if "other agents" in query or "consensus" in query:
488
+ layers.append("shared")
489
+
490
+ return layers if layers else ["long_term"] # Default
491
+ ```
492
+
493
+ ### Context Window Optimization
494
+
495
+ **Problem:** LLMs have limited context windows. Memory must be compressed.
496
+
497
+ **Solutions:**
498
+
499
+ 1. **Hierarchical Summarization:**
500
+ ```python
501
+ class MemorySummarizer:
502
+ def summarize_episode(self, episode_memory: EpisodeMemory) -> str:
503
+ """Compress episode into key points."""
504
+ summary = f"Episode {episode_memory.episode_id} ({episode_memory.task_id}):\n"
505
+ summary += f"- Visited {len(episode_memory.visited_urls)} pages\n"
506
+ summary += f"- Extracted {len(episode_memory.extracted_data)} fields\n"
507
+ summary += f"- {len(episode_memory.actions_history)} actions taken\n"
508
+
509
+ # Highlight key discoveries
510
+ if episode_memory.intermediate_notes:
511
+ summary += f"\nKey findings:\n"
512
+ for note in episode_memory.intermediate_notes[-3:]: # Last 3 notes
513
+ summary += f" • {note}\n"
514
+
515
+ return summary
516
+ ```
517
+
518
+ 2. **Importance Scoring:**
519
+ ```python
520
+ class MemoryImportanceScorer:
521
+ def score(self, memory_item: Any) -> float:
522
+ """Rate importance of memory (0.0 to 1.0)."""
523
+ score = 0.0
524
+
525
+ # Recency bonus
526
+ age_days = (datetime.now() - memory_item.created_at).days
527
+ score += max(0, 1.0 - age_days / 30) * 0.3
528
+
529
+ # Success rate bonus
530
+ if hasattr(memory_item, 'success_count'):
531
+ score += memory_item.confidence * 0.4
532
+
533
+ # Usage frequency bonus
534
+ if hasattr(memory_item, 'last_used'):
535
+ days_since_use = (datetime.now() - memory_item.last_used).days
536
+ score += max(0, 1.0 - days_since_use / 7) * 0.3
537
+
538
+ return min(score, 1.0)
539
+ ```
540
+
541
+ 3. **Automatic Pruning:**
542
+ ```python
543
+ class MemoryPruner:
544
+ def prune_low_value(self, memory_store: Dict, threshold: float = 0.3):
545
+ """Remove memories below importance threshold."""
546
+ scorer = MemoryImportanceScorer()
547
+ to_remove = []
548
+
549
+ for key, item in memory_store.items():
550
+ if scorer.score(item) < threshold:
551
+ to_remove.append(key)
552
+
553
+ for key in to_remove:
554
+ del memory_store[key]
555
+
556
+ return len(to_remove)
557
+ ```
558
+
559
+ ---
560
+
561
+ ## Configuration
562
+
563
+ ### Settings Panel
564
+
565
+ **Memory Settings Tab:**
566
+ ```python
567
+ class MemorySettings(BaseModel):
568
+ # Enable/disable layers
569
+ enable_short_term: bool = True
570
+ enable_working: bool = True
571
+ enable_long_term: bool = True
572
+ enable_shared: bool = False # Off by default (multi-agent)
573
+
574
+ # Size limits
575
+ max_episode_memory_mb: int = 10
576
+ max_working_memory_items: int = 50
577
+ max_long_term_patterns: int = 10000
578
+
579
+ # Vector DB settings
580
+ vector_db_provider: str = "faiss"
581
+ embedding_model: str = "text-embedding-3-small"
582
+
583
+ # MCP storage settings
584
+ storage_backend: str = "filesystem"
585
+ storage_path: str = "./memory_data"
586
+
587
+ # Pruning settings
588
+ auto_prune: bool = True
589
+ prune_threshold: float = 0.3
590
+ prune_interval_hours: int = 24
591
+
592
+ # Context window optimization
593
+ auto_summarize: bool = True
594
+ max_context_tokens: int = 4000
595
+ ```
596
+
597
+ **UI Example:**
598
+ ```
599
+ ┌─────────────────────────────────────────────────────────────┐
600
+ │ Memory Settings │
601
+ ├─────────────────────────────────────────────────────────────┤
602
+ │ │
603
+ │ ☑ Enable Short-Term Memory (Episode) │
604
+ │ ☑ Enable Working Memory (Reasoning) │
605
+ │ ☑ Enable Long-Term Memory (Persistent) │
606
+ │ ☐ Enable Shared Memory (Multi-Agent) │
607
+ │ │
608
+ │ Memory Size Limits: │
609
+ │ Short-Term: [10] MB per episode │
610
+ │ Working: [50] items max │
611
+ │ Long-Term: [10000] patterns max │
612
+ │ │
613
+ │ Vector Database: │
614
+ │ Provider: [FAISS ▼] │
615
+ │ Embedding: [text-embedding-3-small ▼] │
616
+ │ │
617
+ │ Storage Backend: │
618
+ │ Type: [Filesystem ▼] │
619
+ │ Path: [./memory_data ] [Browse] │
620
+ │ │
621
+ │ Auto-Pruning: │
622
+ │ ☑ Enabled │
623
+ │ Threshold: [0.3] (0.0 = keep all, 1.0 = keep only best) │
624
+ │ Interval: [24] hours │
625
+ │ │
626
+ │ [Save Settings] [Reset to Defaults] │
627
+ └─────────────────────────────────────────────────────────────┘
628
+ ```
629
+
630
+ ---
631
+
632
+ ## Best Practices
633
+
634
+ ### 1. Memory Hygiene
635
+ ✅ **Do:**
636
+ - Summarize episode memory before storing in long-term
637
+ - Prune low-confidence patterns regularly
638
+ - Validate patterns before adding to long-term memory
639
+ - Tag memories with metadata (task_id, domain, confidence)
640
+
641
+ ❌ **Don't:**
642
+ - Store raw HTML in long-term memory (use summaries)
643
+ - Keep failed patterns without analysis
644
+ - Allow unbounded memory growth
645
+ - Store sensitive data without encryption
646
+
647
+ ### 2. Query Optimization
648
+ ✅ **Do:**
649
+ - Use semantic search for conceptual queries ("how to extract price")
650
+ - Use exact key lookup for known patterns
651
+ - Apply filters to narrow search space
652
+ - Limit results to top-K most relevant
653
+
654
+ ❌ **Don't:**
655
+ - Search all layers for every query (route intelligently)
656
+ - Ignore relevance scores (filter low scores)
657
+ - Retrieve full objects when summaries suffice
658
+
659
+ ### 3. Context Window Management
660
+ ✅ **Do:**
661
+ - Prioritize recent and high-confidence memories
662
+ - Summarize old episodes aggressively
663
+ - Use hierarchical memory retrieval (summary → details on demand)
664
+ - Monitor token usage and trigger summarization proactively
665
+
666
+ ❌ **Don't:**
667
+ - Include entire memory in every agent call
668
+ - Ignore context window limits
669
+ - Retrieve memories without relevance ranking
670
+
671
+ ### 4. Multi-Agent Coordination
672
+ ✅ **Do:**
673
+ - Broadcast significant discoveries to shared memory
674
+ - Implement consensus mechanisms for conflicting data
675
+ - Use message queues for asynchronous updates
676
+ - Version shared knowledge to handle conflicts
677
+
678
+ ❌ **Don't:**
679
+ - Allow race conditions on shared writes
680
+ - Broadcast every minor action (creates noise)
681
+ - Trust shared data without validation
682
+
683
+ ---
684
+
685
+ ## Performance Metrics
686
+
687
+ Track these metrics to evaluate memory system effectiveness:
688
+
689
+ ```python
690
+ class MemoryMetrics(BaseModel):
691
+ # Retrieval performance
692
+ avg_retrieval_time_ms: float
693
+ cache_hit_rate: float
694
+
695
+ # Effectiveness
696
+ pattern_reuse_rate: float # % of times learned patterns helped
697
+ memory_assisted_success_rate: float # Success with vs without memory
698
+
699
+ # Efficiency
700
+ memory_size_mb: float
701
+ pruned_items_count: int
702
+ summarization_ratio: float # Compressed size / original size
703
+
704
+ # Quality
705
+ avg_pattern_confidence: float
706
+ false_positive_rate: float # Patterns that failed when reused
707
+ ```
708
+
709
+ ---
710
+
711
+ ## Example Usage
712
+
713
+ ### Full Episode with Memory
714
+
715
+ ```python
716
+ # Initialize environment with memory
717
+ env = WebScraperEnv(memory_config=MemorySettings())
718
+
719
+ # Reset episode
720
+ obs = env.reset(task_id="task_medium", seed=42)
721
+
722
+ # Agent checks long-term memory for similar tasks
723
+ memory_query = Action(
724
+ action_type="SEARCH_MEMORY",
725
+ query=f"successful extraction patterns for {obs.task_description}",
726
+ memory_layers=["long_term"],
727
+ search_mode="semantic",
728
+ limit=5
729
+ )
730
+ obs, _, _, _ = env.step(memory_query)
+ similar_patterns = obs.memory_results  # hypothetical field carrying retrieved patterns
731
+
732
+ # Agent reasons using working memory
733
+ working_memory = {
734
+ "current_goal": "Extract product price",
735
+ "reasoning_steps": [
736
+ f"Retrieved {len(similar_patterns)} similar patterns",
737
+ f"Top pattern: {similar_patterns[0].selector} (confidence: {similar_patterns[0].confidence})",
738
+ "Will try this selector first"
739
+ ],
740
+ "considered_actions": [...]
741
+ }
742
+
743
+ # Agent extracts using learned pattern
744
+ extract_action = Action(
745
+ action_type="EXTRACT_FIELD",
746
+ target_field="price",
747
+ selector=similar_patterns[0].selector
748
+ )
749
+ obs, reward, done, info = env.step(extract_action)
750
+
751
+ # If successful, reinforce the pattern
752
+ if reward.value > 0:
753
+ env.step(Action(
754
+ action_type="WRITE_MEMORY",
755
+ memory_layer="long_term",
756
+ key=f"pattern:price:{similar_patterns[0].selector}",
757
+ value={
758
+ **similar_patterns[0].dict(),
759
+ "success_count": similar_patterns[0].success_count + 1,
760
+ "last_used": datetime.now()
761
+ }
762
+ ))
763
+
764
+ # Store episode summary
765
+ if done:
766
+ env.step(Action(
767
+ action_type="WRITE_MEMORY",
768
+ memory_layer="long_term",
769
+ key=f"episode:{obs.episode_id}",
770
+ value=env.summarize_episode()
771
+ ))
772
+ ```
773
+
774
+ ---
775
+
776
+ ## Future Enhancements
777
+
778
+ - **Active Learning:** Agent can request human labeling for ambiguous patterns
779
+ - **Federated Memory:** Share memory across organizations without revealing raw data
780
+ - **Memory Replay:** Train on stored episodes for offline RL
781
+ - **Causal Memory:** Track cause-effect relationships between actions and outcomes
782
+ - **Memory Debugging:** Visualize which memories influenced each decision
783
+
784
+ ---
785
+
786
+ **Next:** See [api.md](./api.md) for multi-model API integration.
docs/observability.md ADDED
@@ -0,0 +1,147 @@
# Observability and Dashboard

## Overview

Observability provides deep insight into runtime behavior, model usage, tool execution, memory quality, and rewards.

## Dashboard Sections

### 1. Live Thought Stream

- chronological reasoning notes
- model/router choice trace
- action confidence timeline
- override events

### 2. Navigation Map

Graph of visited pages:

- nodes = URLs
- edges = transitions
- node color = relevance/confidence
- revisit highlighting

### 3. MCP Usage Panel

- tool call count by server
- avg latency by tool
- error rate and retries
- top successful tool chains

### 4. Memory Viewer

- inspect short/working/long/shared memory
- filter by task/domain/confidence
- edit/delete entries
- prune previews

### 5. Reward Analytics

- per-step reward breakdown
- component contribution trends
- penalty heatmap
- episode comparison

### 6. Cost and Token Monitor

- per-provider usage
- per-model token counts
- cumulative cost vs budget
- forecasted burn rate

## Core Metrics

### Agent Metrics

- task completion rate
- avg steps to completion
- recovery score
- generalization score
- exploration ratio

### Tool Metrics

- tool success rate
- timeout ratio
- fallback frequency
- schema validation failures

### Memory Metrics

- retrieval hit rate
- relevance score distribution
- prune rate
- memory-assisted success ratio

### Search Metrics

- query success rate
- multi-hop depth distribution
- credibility score average
- duplicate result ratio

## Logging Model

Structured logs (JSON):

```json
{
  "timestamp": "2026-03-27T00:00:00Z",
  "episode_id": "ep_123",
  "step": 7,
  "event": "tool_call",
  "tool": "beautifulsoup.find_all",
  "latency_ms": 54,
  "success": true,
  "reward_delta": 0.08
}
```
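Records of this shape can be emitted with the standard library alone. A minimal sketch; the helper name is illustrative and the field set mirrors the example record above:

```python
import json
from datetime import datetime, timezone

def log_event(episode_id: str, step: int, event: str, **fields) -> str:
    """Serialize one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "episode_id": episode_id,
        "step": step,
        "event": event,
        **fields,  # free-form event-specific fields, e.g. tool, latency_ms
    }
    return json.dumps(record)

line = log_event("ep_123", 7, "tool_call", tool="beautifulsoup.find_all",
                 latency_ms=54, success=True, reward_delta=0.08)
```

One JSON object per line keeps the logs greppable and trivially ingestible by most log pipelines.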

## Tracing

Per-episode trace includes:

- observations
- actions
- rewards
- tool calls
- memory operations
- final submission and grader results

## Alerts

Configurable alerts:

- budget threshold crossed
- error spike
- tool outage
- memory bloat
- anomalous low reward streak

## APIs

- `GET /api/metrics/summary`
- `GET /api/metrics/timeseries`
- `GET /api/traces/{episode_id}`
- `GET /api/costs`
- `GET /api/memory/stats`
- `GET /api/tools/stats`

## Recommended Dashboard Layout

1. Top row: completion, cost, latency, error rate
2. Mid row: thought stream + navigation graph
3. Lower row: reward breakdown + MCP usage + memory viewer
4. Bottom row: raw trace and export controls

## Export and Audit

Exports:

- JSON trace
- CSV metrics
- reward analysis report
- model usage report

All exports include episode and configuration fingerprints for reproducibility.
docs/openenv.md ADDED
@@ -0,0 +1,220 @@
# OpenEnv Specification (Enhanced)

## Overview

This document defines the OpenEnv contract for WebScraper-OpenEnv with advanced memory, MCP tooling, multi-model routing, and long-page batch handling.

## Core Interfaces

### Observation

```python
class Observation(BaseModel):
    episode_id: str
    task_id: str
    step_number: int
    current_url: str
    page_html: str
    page_title: str
    available_actions: list[str]
    extracted_so_far: dict
    pages_visited: list[str]
    budget_remaining: int
    task_description: str
    target_fields: list[str]
    hints: list[str]

    # Enhanced
    memory_context: dict | None
    tool_registry_snapshot: list[dict] | None
    search_results: list[dict] | None
    page_chunks: list[dict] | None
```

### Action

```python
class Action(BaseModel):
    action_type: str

    # Existing
    target_field: str | None = None
    selector: str | None = None
    navigate_to: str | None = None
    submit_extraction: dict | None = None
    notes: str | None = None

    # Search
    query: str | None = None
    search_engine: str | None = None
    result_limit: int = 5

    # Verification
    field_name: str | None = None
    claimed_value: str | None = None
    verification_source: str | None = None

    # Conflict resolution
    conflicting_sources: list[str] | None = None
    chosen_source: str | None = None
    rationale: str | None = None

    # MCP + Memory
    tool_name: str | None = None
    tool_params: dict | None = None
    memory_layer: str | None = None
    memory_key: str | None = None
    memory_query: str | None = None
```

### Action Types

- `EXTRACT_FIELD`
- `NAVIGATE`
- `SEARCH_PAGE`
- `INSPECT_ELEMENT`
- `SUBMIT`
- `SKIP_PAGE`
- `SEARCH_ENGINE`
- `VERIFY_FACT`
- `RESOLVE_CONFLICT`
- `FETCH_URL`
- `MCP_TOOL_CALL`
- `WRITE_MEMORY`
- `READ_MEMORY`
- `SEARCH_MEMORY`
- `SUMMARIZE_MEMORY`
- `PRUNE_MEMORY`

### Reward

```python
class Reward(BaseModel):
    value: float
    cumulative: float
    breakdown: dict
    message: str
```

## Episode Lifecycle

```text
reset(task_id, seed?)
    -> observation(step=0)

step(action)
    -> observation, reward, done, info

state(episode_id)
    -> current snapshot
```

Terminal conditions:

- `SUBMIT` called
- budget exhausted
- max page limit reached
- fatal policy error

## State Machine

```text
RESET -> RUNNING -> TERMINAL
            |
            +-- NAVIGATE / EXTRACT / SEARCH / VERIFY / MCP / MEMORY
```

## Task Profiles

### Easy

- single-page extraction
- low noise
- hints enabled

### Medium

- pagination
- moderate noise
- partial hints

### Hard

- multi-hop search
- conflicting sources
- verification required
- no hints

## Long Page Handling

When HTML exceeds token/size thresholds:

1. Semantic segmentation
2. Adaptive chunking
3. Batch extraction
4. Merge + dedupe + confidence rank
5. Optional diff-based incremental update
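Step 4 (merge + dedupe + confidence rank) can be sketched as follows; the per-chunk result shape is an assumption for illustration:

```python
def merge_chunk_results(chunk_results: list[dict]) -> dict:
    """Merge per-chunk extractions, keeping the highest-confidence value per field.

    Each item is assumed to look like:
    {"field": str, "value": str, "confidence": float}
    """
    best: dict[str, dict] = {}
    for item in chunk_results:
        field = item["field"]
        # Dedupe: later chunks only win if they are more confident
        if field not in best or item["confidence"] > best[field]["confidence"]:
            best[field] = item

    # Rank fields by confidence for downstream reporting
    ranked = sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
    return {r["field"]: r for r in ranked}
```

The diff-based incremental update in step 5 would re-run this merge only over chunks whose content hash changed since the last fetch.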

## MCP Integration Contract

On each step, the environment may expose:

- tool registry snapshot
- per-tool input/output schema
- timeout and retry policy

Tool calls are evaluated for:

- correctness
- efficiency
- safety constraints

## Search Engine Contract

Search action supports provider routing:

- Google
- Bing
- Brave
- DuckDuckGo
- Perplexity
- custom providers

The environment stores query and result metadata for observability.

## Memory Contract

Layers:

- short-term (episode)
- working (reasoning)
- long-term (persistent)
- shared (multi-agent)

Mandatory metadata for write operations:

- `episode_id`
- `task_id`
- `confidence`
- `source`
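The mandatory-metadata rule can be enforced with a small validator at the WRITE_MEMORY boundary. A sketch; the function name and the confidence-range check are illustrative choices:

```python
REQUIRED_METADATA = {"episode_id", "task_id", "confidence", "source"}

def validate_memory_write(metadata: dict) -> None:
    """Reject WRITE_MEMORY payloads missing mandatory metadata."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"memory write missing metadata: {sorted(missing)}")
    # Confidence is assumed to be a probability-like score in [0, 1]
    if not 0.0 <= metadata["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
```

Rejecting writes early keeps the long-term and shared layers queryable by episode, task, and confidence later on.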

## API Surface

- `POST /api/reset`
- `POST /api/step`
- `GET /api/state/{episode_id}`
- `GET /api/tasks`
- `GET /api/reward/{episode_id}`
- `GET /api/tool-registry`
- `POST /api/tool-test`

## Determinism

Given `task_id + seed + config`, the environment should be reproducible for grading and benchmarking.
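One way to obtain this reproducibility is to derive every random stream from a stable digest of the identifying inputs. A sketch; the exact derivation scheme is an assumption, not part of the contract:

```python
import hashlib
import json
import random

def derive_rng(task_id: str, seed: int, config: dict) -> random.Random:
    """Build a deterministic RNG from the episode's identifying inputs."""
    # sort_keys makes the digest independent of dict insertion order
    payload = json.dumps({"task_id": task_id, "seed": seed, "config": config},
                         sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return random.Random(int(digest, 16))

a = derive_rng("task_001", 42, {"noise": "low"}).random()
b = derive_rng("task_001", 42, {"noise": "low"}).random()
```

Identical inputs yield identical streams, while any change to the task, seed, or config produces an independent stream.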

## Safety and Guardrails

- enforce max steps and request budgets
- enforce MCP tool allowlist/denylist
- prevent secret leakage from tool outputs
- sanitize logs and traces
docs/rewards.md ADDED
@@ -0,0 +1,637 @@
# 🎯 Advanced Reward Function

## Table of Contents

1. [Overview](#overview)
2. [Reward Components](#reward-components)
3. [Planning Quality](#planning-quality)
4. [Recovery Ability](#recovery-ability)
5. [Exploration Bonus](#exploration-bonus)
6. [Redundancy Penalty](#redundancy-penalty)
7. [Generalization Score](#generalization-score)
8. [Tool Usage Efficiency](#tool-usage-efficiency)
9. [Memory Utilization](#memory-utilization)
10. [Final Reward Formula](#final-reward-formula)
11. [Configuration](#configuration)

---

## Overview

The **Advanced Reward Function** provides dense, interpretable signals that guide the agent toward intelligent, efficient, and generalizable web scraping strategies.

### Design Principles

1. **Dense Rewards:** Provide feedback at every step, not just terminal states
2. **Interpretable:** Each component has a clear purpose agents (and humans) can understand
3. **Balanced:** Prevent reward hacking by balancing conflicting objectives
4. **Adaptive:** Adjust weights based on task difficulty and agent progress

### Basic vs Advanced

**Basic Reward (existing):**

```python
reward = task_completion_score  # 0.0 to 1.0
```

**Advanced Reward:**

```python
reward = (
    w1 * task_completion +
    w2 * efficiency +
    w3 * planning_quality +
    w4 * recovery_ability +
    w5 * exploration_bonus +
    w6 * tool_usage +
    w7 * memory_usage +
    w8 * generalization
) - penalties
```

---

## Reward Components

### 1. Task Completion (w1 = 0.40)

**Purpose:** Measure how much of the task is complete.

**Calculation:**

```python
def task_completion_score(extracted: Dict, ground_truth: Dict) -> float:
    """Score based on field completeness and accuracy."""
    if not ground_truth:
        return 0.0

    total_fields = len(ground_truth)
    correct_fields = 0
    partial_fields = 0

    for field, true_value in ground_truth.items():
        extracted_value = extracted.get(field)

        if extracted_value is None:
            continue  # Missing field, 0 points

        # Exact match
        if normalize(extracted_value) == normalize(true_value):
            correct_fields += 1
        # Partial match (fuzzy)
        elif similarity(extracted_value, true_value) > 0.7:
            partial_fields += 1

    score = (correct_fields + 0.5 * partial_fields) / total_fields
    return score
```

**Example:**

```python
# Task: Extract name, price, rating
ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}

# Agent extracted 2/3 correctly
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}
task_completion = 2/3  # = 0.67
```
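`normalize` and `similarity` are referenced above but not defined here. A minimal standard-library sketch, matching the 0.7 fuzzy threshold used in the scoring code; the real implementations may differ:

```python
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't fail a match."""
    return " ".join(str(value).lower().split())

def similarity(a: str, b: str) -> float:
    """Fuzzy match ratio in [0, 1] between two normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For domain-specific fields (prices, dates) a field-aware normalizer would be more robust than string similarity alone.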

---

### 2. Efficiency (w2 = 0.15)

**Purpose:** Reward completing tasks quickly with fewer actions.

**Calculation:**

```python
def efficiency_score(steps_taken: int, max_steps: int, pages_visited: int, task: Task) -> float:
    """Lower steps and pages = higher efficiency."""
    # Step efficiency
    step_efficiency = 1.0 - (steps_taken / max_steps)

    # Page efficiency (prefer fewer page visits)
    ideal_pages = estimate_ideal_page_count(task)
    page_efficiency = 1.0 - abs(pages_visited - ideal_pages) / ideal_pages
    page_efficiency = max(0.0, page_efficiency)

    return 0.7 * step_efficiency + 0.3 * page_efficiency
```

**Example:**

```python
# Task with max 20 steps
steps_taken = 8
step_efficiency = 1.0 - (8/20)   # = 0.60, good!

steps_taken = 18
step_efficiency = 1.0 - (18/20)  # = 0.10, inefficient
```

---

## Planning Quality

### 3. Planning Quality Score (w3 = 0.10)

**Purpose:** Reward agents that plan before acting.

**Signals:**

- Used WRITE_MEMORY with reasoning notes
- Actions follow a coherent strategy
- Fewer backtracking actions

**Calculation:**

```python
def planning_quality_score(episode_history: List[Action]) -> float:
    """Measure planning behavior."""
    score = 0.0

    # 1. Did agent write reasoning notes?
    reasoning_actions = [a for a in episode_history if a.notes]
    if reasoning_actions:
        score += 0.3

    # 2. Action coherence: Do actions follow a logical sequence?
    coherence = measure_action_coherence(episode_history)
    score += 0.4 * coherence

    # 3. Backtracking penalty: Visiting same page multiple times
    unique_pages = len(set(a.navigate_to for a in episode_history if a.navigate_to))
    total_navigations = len([a for a in episode_history if a.action_type == "NAVIGATE"])
    if total_navigations > 0:
        backtrack_ratio = 1.0 - (unique_pages / total_navigations)
        score += 0.3 * (1.0 - backtrack_ratio)  # Lower backtracking = higher score
    else:
        score += 0.3  # No navigation at all means no backtracking

    return min(score, 1.0)

def measure_action_coherence(actions: List[Action]) -> float:
    """Are actions logically connected?"""
    coherence_patterns = [
        # Good patterns
        ("SEARCH_PAGE", "EXTRACT_FIELD"),  # Search then extract
        ("NAVIGATE", "EXTRACT_FIELD"),     # Navigate then extract
        ("EXTRACT_FIELD", "VERIFY_FACT"),  # Extract then verify
        ("SEARCH_ENGINE", "NAVIGATE"),     # Search then visit
    ]

    coherent_pairs = 0
    total_pairs = len(actions) - 1

    for i in range(total_pairs):
        pair = (actions[i].action_type, actions[i+1].action_type)
        if pair in coherence_patterns:
            coherent_pairs += 1

    return coherent_pairs / total_pairs if total_pairs > 0 else 0.0
```

**Example:**

```python
# Good planning:
actions = [
    Action(type="SEARCH_PAGE", notes="Looking for price pattern"),
    Action(type="EXTRACT_FIELD", target="price"),
    Action(type="VERIFY_FACT", field="price")
]
planning_score = 0.3 (notes) + 0.4*1.0 (fully coherent) + 0.3 (no backtracking) = 1.0

# Poor planning:
actions = [
    Action(type="NAVIGATE", navigate_to="/page1"),
    Action(type="NAVIGATE", navigate_to="/page2"),
    Action(type="NAVIGATE", navigate_to="/page1"),  # Backtrack!
    Action(type="EXTRACT_FIELD")
]
planning_score = 0.0 (no notes) + 0.4*0.33 (weak coherence) + 0.3*0.67 (backtracking) = 0.33
```

---

## Recovery Ability

### 4. Recovery Ability Score (w4 = 0.08)

**Purpose:** Reward agents that recover from failures.

**Signals:**

- Action failed → Agent tried alternative approach
- Extraction returned empty → Agent searched with different selector
- Page blocked → Agent switched proxy/VPN

**Calculation:**

```python
def recovery_ability_score(episode_history: List[Tuple[Action, Reward]]) -> float:
    """Measure ability to recover from failures."""
    recoveries = 0
    failures = 0

    for i in range(len(episode_history) - 1):
        action, reward = episode_history[i]
        next_action, next_reward = episode_history[i + 1]

        # Detect failure (negative reward or empty result)
        if reward.value < 0 or "failed" in reward.message.lower():
            failures += 1

            # Check if next action was a recovery attempt
            if is_recovery_action(action, next_action):
                if next_reward.value > reward.value:  # Recovery succeeded
                    recoveries += 1

    return recoveries / failures if failures > 0 else 0.0

def is_recovery_action(failed_action: Action, next_action: Action) -> bool:
    """Is next_action a recovery attempt for failed_action?"""
    # Same action type with different parameters
    if failed_action.action_type == next_action.action_type:
        if failed_action.selector != next_action.selector:
            return True  # Tried different selector

    # Switched to alternative action type
    recovery_alternatives = {
        "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
        "NAVIGATE": ["FETCH_URL"],      # Try direct fetch if navigate blocked
        "SEARCH_ENGINE": ["NAVIGATE"],  # Try direct URL if search fails
    }

    if next_action.action_type in recovery_alternatives.get(failed_action.action_type, []):
        return True

    return False
```

**Example:**

```python
# Good recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1, message="Not found")),
    (Action(type="SEARCH_PAGE", query="price"), Reward(value=0.2, message="Found price pattern")),
    (Action(type="EXTRACT_FIELD", selector="span.product-price"), Reward(value=0.5, message="Extracted"))
]
recovery_score = 1/1 = 1.0  # 1 failure, 1 successful recovery

# No recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),  # Repeated same failed action!
    (Action(type="SUBMIT"), Reward(value=0.0))
]
recovery_score = 0/2 = 0.0  # 2 failures, 0 recoveries
```

---

## Exploration Bonus

### 5. Exploration Bonus (w5 = 0.05)

**Purpose:** Encourage discovering new pages and patterns early in training.

**Calculation:**

```python
import math

def exploration_bonus(
    pages_visited: List[str],
    known_pages: Set[str],  # From long-term memory
    episode_number: int
) -> float:
    """Bonus for discovering new pages/patterns."""
    new_pages = set(pages_visited) - known_pages

    # Bonus decreases over time (we want agent to eventually exploit)
    decay_factor = math.exp(-0.01 * episode_number)

    # Bonus per new page discovered
    bonus_per_page = 0.1

    return min(len(new_pages) * bonus_per_page * decay_factor, 1.0)
```

**Example:**

```python
# Episode 10: Agent discovers 3 new pages
exploration_bonus = 3 * 0.1 * exp(-0.01*10) = 0.3 * 0.90 = 0.27

# Episode 500: Same discovery
exploration_bonus = 3 * 0.1 * exp(-0.01*500) = 0.3 * 0.007 = 0.002  # Minimal bonus now
```

---

## Redundancy Penalty

### 6. Redundancy Penalty (penalty, not bonus)

**Purpose:** Penalize visiting the same page repeatedly without progress.

**Calculation:**

```python
from collections import Counter

def redundancy_penalty(pages_visited: List[str]) -> float:
    """Penalty for revisiting pages."""
    visit_counts = Counter(pages_visited)

    penalty = 0.0
    for page, count in visit_counts.items():
        if count > 1:
            # Exponential penalty for repeat visits
            penalty += 0.05 * (count - 1) ** 1.5

    return min(penalty, 1.0)
```

**Example:**

```python
pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
# page1 visited 3 times
redundancy_penalty = 0.05 * (3-1)**1.5 = 0.05 * 2.83 = 0.14
```

---

## Generalization Score

### 7. Generalization Score (w8 = 0.07)

**Purpose:** Reward strategies that work across different page layouts.

**Measurement:** After training, evaluate agent on unseen task variations.

**Calculation:**

```python
import numpy as np

def generalization_score(
    agent: Agent,
    test_tasks: List[Task],
    training_tasks: List[Task]
) -> float:
    """Test agent on unseen variations of trained tasks."""
    test_results = []

    for task in test_tasks:
        # Ensure task is not in training set
        if task.id in [t.id for t in training_tasks]:
            continue

        result = agent.run(task)
        test_results.append(result.completion_score)

    # Average performance on unseen tasks
    return np.mean(test_results) if test_results else 0.0
```

---

## Tool Usage Efficiency

### 8. Tool Usage (w6 = 0.05)

**Purpose:** Reward using the right tools at the right time.

**Calculation:**

```python
def tool_usage_score(actions: List[Action]) -> float:
    """Reward appropriate tool usage."""
    score = 0.0

    # 1. Used memory appropriately
    memory_actions = [a for a in actions if a.action_type in ["READ_MEMORY", "WRITE_MEMORY"]]
    if memory_actions:
        score += 0.3

    # 2. Used MCP tools when appropriate
    mcp_actions = [a for a in actions if a.action_type == "MCP_TOOL_CALL"]
    if mcp_actions:
        score += 0.3

    # 3. Verified important extractions
    verify_actions = [a for a in actions if a.action_type == "VERIFY_FACT"]
    extract_actions = [a for a in actions if a.action_type == "EXTRACT_FIELD"]
    if verify_actions and extract_actions:
        verification_ratio = len(verify_actions) / len(extract_actions)
        score += 0.4 * min(verification_ratio, 1.0)

    return min(score, 1.0)
```

---

## Memory Utilization

### 9. Memory Usage (w7 = 0.05)

**Purpose:** Reward effective use of the memory system.

**Calculation:**

```python
def memory_usage_score(episode: Episode) -> float:
    """Reward effective memory usage."""
    score = 0.0

    # 1. Did agent query long-term memory for similar patterns?
    if episode.memory_queries > 0:
        score += 0.4

    # 2. Did agent write successful patterns to long-term memory?
    if episode.memory_writes > 0:
        score += 0.3

    # 3. Did memory queries lead to successful actions?
    memory_assisted_success = episode.memory_assisted_actions / episode.total_actions
    score += 0.3 * memory_assisted_success

    return min(score, 1.0)
```

---

## Final Reward Formula

### Complete Formula

```python
def calculate_reward(episode: Episode, config: RewardConfig) -> Reward:
    """Calculate comprehensive reward."""

    # Positive components
    R_completion = task_completion_score(episode.extracted, episode.ground_truth)
    R_efficiency = efficiency_score(episode.steps, episode.max_steps, len(episode.pages), episode.task)
    R_planning = planning_quality_score(episode.actions)
    R_recovery = recovery_ability_score(episode.history)
    R_exploration = exploration_bonus(episode.pages, episode.memory.known_pages, episode.number)
    R_tools = tool_usage_score(episode.actions)
    R_memory = memory_usage_score(episode)
    R_generalization = generalization_score(episode.agent, episode.test_tasks, episode.training_tasks)

    # Penalties
    P_redundancy = redundancy_penalty(episode.pages)
    P_timeout = 1.0 if episode.timed_out else 0.0
    P_invalid = sum(1 for a in episode.actions if not a.valid) * 0.1

    # Weighted sum
    w = config.weights
    reward_value = (
        w.completion * R_completion +
        w.efficiency * R_efficiency +
        w.planning * R_planning +
        w.recovery * R_recovery +
        w.exploration * R_exploration +
        w.tools * R_tools +
        w.memory * R_memory +
        w.generalization * R_generalization
    ) - (P_redundancy + P_timeout + P_invalid)

    # Clamp to [-1, 1]
    reward_value = max(-1.0, min(1.0, reward_value))

    # Build breakdown for interpretability
    breakdown = {
        "task_completion": R_completion,
        "efficiency": R_efficiency,
        "planning_quality": R_planning,
        "recovery_ability": R_recovery,
        "exploration_bonus": R_exploration,
        "tool_usage": R_tools,
        "memory_usage": R_memory,
        "generalization": R_generalization,
        "redundancy_penalty": -P_redundancy,
        "timeout_penalty": -P_timeout,
        "invalid_action_penalty": -P_invalid
    }

    # Generate explanation
    message = generate_reward_explanation(breakdown, reward_value)

    return Reward(
        value=reward_value,
        cumulative=episode.cumulative_reward + reward_value,
        breakdown=breakdown,
        message=message
    )
```
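`generate_reward_explanation` is referenced above but not defined here. A minimal sketch; the thresholds, wording, and helper name are illustrative:

```python
def generate_reward_explanation(breakdown: dict, total: float) -> str:
    """Turn the component breakdown into a short human-readable summary."""
    lines = []
    for component, value in breakdown.items():
        label = component.replace("_", " ")
        if value >= 0.7:
            lines.append(f"✓ Strong {label} ({value:.2f})")
        elif value < 0:
            lines.append(f"⚠ {label} ({value:.2f})")
    verdict = "Strong performance!" if total >= 0.6 else "Needs improvement."
    lines.append(f"→ Overall: {verdict}")
    return "\n".join(lines)
```

A production version would also mention middling components and cite the underlying counts (steps used, failures recovered) rather than only the scores.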

### Default Weights

```python
class RewardWeights(BaseModel):
    completion: float = 0.40      # Most important
    efficiency: float = 0.15      # Moderate importance
    planning: float = 0.10        # Encourages good habits
    recovery: float = 0.08        # Resilience
    exploration: float = 0.05     # Early training
    tools: float = 0.05           # Appropriate tool use
    memory: float = 0.05          # Effective memory
    generalization: float = 0.07  # Transfer learning
    # Total: 0.95, leaves room for penalties
```

---

## Configuration

### Settings

```typescript
interface RewardConfig {
  weights: RewardWeights;

  // Component toggles
  enablePlanningReward: boolean;
  enableRecoveryReward: boolean;
  enableExplorationBonus: boolean;
  enableGeneralizationTest: boolean;

  // Penalty settings
  redundancyThreshold: number;   // Penalize after N visits to same page
  timeoutPenalty: number;        // Penalty for exceeding time limit
  invalidActionPenalty: number;  // Penalty per invalid action

  // Exploration decay
  explorationDecayRate: number;  // Default: 0.01

  // Generalization
  testTaskCount: number;         // Number of unseen tasks to test on
}
```

### UI Component

```jsx
<RewardSettings>
  <Section title="Component Weights">
    <Slider label="Task Completion" value={weights.completion} min={0} max={1} step={0.05} />
    <Slider label="Efficiency" value={weights.efficiency} min={0} max={1} step={0.05} />
    <Slider label="Planning Quality" value={weights.planning} min={0} max={1} step={0.05} />
    <Slider label="Recovery Ability" value={weights.recovery} min={0} max={1} step={0.05} />
    <Slider label="Exploration Bonus" value={weights.exploration} min={0} max={1} step={0.05} />
    <Slider label="Tool Usage" value={weights.tools} min={0} max={1} step={0.05} />
    <Slider label="Memory Usage" value={weights.memory} min={0} max={1} step={0.05} />
    <Slider label="Generalization" value={weights.generalization} min={0} max={1} step={0.05} />

    <TotalWeight value={Object.values(weights).reduce((a,b) => a+b, 0)} max={1.0} />
  </Section>

  <Section title="Penalties">
    <NumberInput label="Redundancy Threshold (page visits)" value={redundancyThreshold} />
    <NumberInput label="Timeout Penalty" value={timeoutPenalty} min={0} max={1} step={0.1} />
    <NumberInput label="Invalid Action Penalty" value={invalidActionPenalty} min={0} max={1} step={0.1} />
  </Section>

  <Section title="Exploration">
    <NumberInput label="Decay Rate" value={explorationDecayRate} min={0} max={0.1} step={0.001} />
    <HelpText>How quickly exploration bonus decreases over episodes</HelpText>
  </Section>

  <Section title="Presets">
    <Button onClick={() => loadPreset('balanced')}>Balanced (Default)</Button>
    <Button onClick={() => loadPreset('efficiency_focused')}>Efficiency Focused</Button>
    <Button onClick={() => loadPreset('quality_focused')}>Quality Focused</Button>
    <Button onClick={() => loadPreset('exploration')}>Exploration Mode</Button>
  </Section>
</RewardSettings>
```

---

## Reward Visualization

```jsx
<RewardBreakdown>
  <BarChart>
    {Object.entries(breakdown).map(([component, value]) => (
      <Bar
        key={component}
        label={component}
        value={value}
        color={value >= 0 ? 'green' : 'red'}
      />
    ))}
  </BarChart>

  <TotalReward value={reward.value} />

  <Explanation>{reward.message}</Explanation>
</RewardBreakdown>
```

**Example Output:**

```
Reward Breakdown (Total: 0.72)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Task Completion:     ████████████████████ 0.85
Efficiency:          ████████████░░░░░░░░ 0.65
Planning Quality:    ███████████████░░░░░ 0.78
Recovery Ability:    ██████████████████░░ 0.90
Exploration:         ████░░░░░░░░░░░░░░░░ 0.20
Tool Usage:          ███████████████████░ 0.95
Memory Usage:        ████████░░░░░░░░░░░░ 0.40
Generalization:      ██████████████░░░░░░ 0.72
Redundancy Penalty:  ░░░░░░░░░░░░░░░░░░░░ -0.15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Explanation:
✓ Excellent task completion (85% of fields extracted correctly)
✓ Good efficiency (completed in 8/20 steps)
✓ Strong recovery ability (recovered from 2/2 failures)
⚠ Moderate redundancy (visited homepage 3 times)
→ Overall: Strong performance!
```

---

**Next:** See [html-processing.md](./html-processing.md) for advanced HTML handling.
docs/search-engine.md ADDED
@@ -0,0 +1,782 @@
# 🔍 Search Engine Layer

## Table of Contents
1. [Overview](#overview)
2. [Supported Search Engines](#supported-search-engines)
3. [Query Optimization](#query-optimization)
4. [Multi-Hop Search](#multi-hop-search)
5. [Source Credibility Scoring](#source-credibility-scoring)
6. [Result Ranking](#result-ranking)
7. [Caching & Deduplication](#caching--deduplication)
8. [Configuration](#configuration)

---

## Overview

The **Search Engine Layer** enables agents to search the web intelligently, optimize queries, perform multi-hop searches, and evaluate source credibility.

### Capabilities

- ✅ Multiple search engine APIs (Google, Bing, Brave, DuckDuckGo, Perplexity)
- ✅ Query optimization and rewriting
- ✅ Multi-hop search (search → refine → search again)
- ✅ Source credibility scoring
- ✅ Result ranking and filtering
- ✅ Caching and deduplication
- ✅ Cost tracking

---

## Supported Search Engines

### 1. Google Search API

**Pros:**
- Most comprehensive results
- High quality
- Advanced search operator support

**Cons:**
- Requires an API key + Custom Search Engine ID
- Costs $5 per 1,000 queries after the free tier

**Configuration:**
```python
{
    "google": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_CSE_ID",
        "region": "us",
        "safe_search": True,
        "num_results": 10
    }
}
```

**Usage:**
```python
results = search_engine.search(
    query="product reviews for Widget Pro",
    engine="google",
    num_results=10
)
```

### 2. Bing Search API

**Pros:**
- Good quality results
- Competitive pricing ($7 per 1,000 queries)
- News search included

**Cons:**
- Smaller index than Google
- Fewer advanced operators

**Configuration:**
```python
{
    "bing": {
        "api_key": "YOUR_BING_API_KEY",
        "market": "en-US",
        "safe_search": "Moderate",
        "freshness": None  # "Day", "Week", "Month"
    }
}
```

### 3. Brave Search API

**Pros:**
- Privacy-focused
- Independent index
- Good pricing ($5 per 1,000 queries)
- No tracking

**Cons:**
- Smaller index
- Newer service

**Configuration:**
```python
{
    "brave": {
        "api_key": "YOUR_BRAVE_API_KEY",
        "country": "US",
        "safe_search": "moderate",
        "freshness": None
    }
}
```

### 4. DuckDuckGo (Free, No API Key)

**Pros:**
- Completely free
- No API key required
- Privacy-focused
- Good for testing

**Cons:**
- Rate limited
- Less control over results
- Smaller result set

**Usage:**
```python
from duckduckgo_search import DDGS

results = DDGS().text(
    keywords="web scraping tools",
    max_results=10
)
```

### 5. Perplexity AI (AI-Powered Search)

**Pros:**
- Returns AI-summarized answers with citations
- Real-time web access
- Conversational queries

**Cons:**
- More expensive
- Designed for Q&A, not traditional search

**Configuration:**
```python
{
    "perplexity": {
        "api_key": "YOUR_PERPLEXITY_API_KEY",
        "model": "pplx-70b-online",
        "include_citations": True
    }
}
```

---

## Query Optimization

### Query Rewriter

```python
import re
from typing import Dict


class QueryOptimizer:
    """Optimize search queries for better results."""

    def optimize(self, query: str, context: Dict = None) -> str:
        """Optimize a search query."""
        optimized = query

        # 1. Expand abbreviations
        optimized = self.expand_abbreviations(optimized)

        # 2. Add context keywords
        if context:
            optimized = self.add_context(optimized, context)

        # 3. Remove stop words (optional)
        # optimized = self.remove_stop_words(optimized)

        # 4. Add search operators
        optimized = self.add_operators(optimized)

        return optimized

    def expand_abbreviations(self, query: str) -> str:
        """Expand common abbreviations."""
        expansions = {
            "AI": "artificial intelligence",
            "ML": "machine learning",
            "API": "application programming interface",
            "UI": "user interface",
            "UX": "user experience",
        }

        for abbr, full in expansions.items():
            # Only expand if the abbreviation stands alone
            query = re.sub(rf'\b{abbr}\b', full, query)

        return query

    def add_context(self, query: str, context: Dict) -> str:
        """Add contextual keywords."""
        if context.get('domain'):
            query = f"{query} site:{context['domain']}"

        if context.get('year'):
            query = f"{query} {context['year']}"

        if context.get('location'):
            query = f"{query} {context['location']}"

        return query

    def add_operators(self, query: str) -> str:
        """Add search operators for precision."""
        # If the query has multiple important terms, wrap them in quotes
        important_terms = self.extract_important_terms(query)

        if len(important_terms) > 1:
            # Exact-phrase search for multi-word key terms
            for term in important_terms:
                if len(term.split()) > 1:
                    query = query.replace(term, f'"{term}"')

        return query
```
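The word-boundary anchors in the abbreviation step matter: `\b` prevents partial matches inside longer tokens. A condensed, standalone illustration (function-level rather than the class method above):

```python
import re

EXPANSIONS = {"AI": "artificial intelligence", "ML": "machine learning"}


def expand_abbreviations(query: str) -> str:
    """Expand standalone abbreviations using word boundaries."""
    for abbr, full in EXPANSIONS.items():
        # \b ensures "AI" matches, but the "AI" inside "MAINTENANCE" does not
        query = re.sub(rf'\b{abbr}\b', full, query)
    return query


print(expand_abbreviations("AI tools for MAINTENANCE"))
# → artificial intelligence tools for MAINTENANCE
```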

### Query Expansion

```python
from typing import List


class QueryExpander:
    """Expand queries with synonyms and related terms."""

    def expand(self, query: str) -> List[str]:
        """Generate query variations."""
        variations = [query]

        # 1. Synonym replacement
        synonyms = self.get_synonyms(query)
        for synonym_set in synonyms:
            for term, synonym in synonym_set:
                varied = query.replace(term, synonym)
                variations.append(varied)

        # 2. Add modifiers
        modifiers = ["best", "top", "review", "comparison", "guide"]
        for modifier in modifiers:
            variations.append(f"{modifier} {query}")

        # 3. Question forms
        variations.extend([
            f"what is {query}",
            f"how to {query}",
            f"why {query}"
        ])

        return variations[:5]  # Limit to top 5
```

### Bad Query Detection

```python
import re


def is_bad_query(query: str) -> bool:
    """Detect poorly formed queries."""
    # Too short
    if len(query.split()) < 2:
        return True

    # All stop words
    stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be'}
    words = set(query.lower().split())
    if words.issubset(stop_words):
        return True

    # No meaningful content
    if not re.search(r'[a-zA-Z]{3,}', query):
        return True

    return False
```
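These heuristics catch three failure modes: single tokens, stop-word-only strings, and symbol noise. A condensed, self-contained version showing each case (logic as above, collapsed into one expression):

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be'}


def is_bad_query(query: str) -> bool:
    """True if the query is unlikely to return useful results."""
    words = query.lower().split()
    return (
        len(words) < 2                            # too short
        or set(words).issubset(STOP_WORDS)        # only stop words
        or not re.search(r'[a-zA-Z]{3,}', query)  # no real content
    )


for q in ("scraping", "the a is", "best Python scrapers", "?? !!"):
    print(q, "->", is_bad_query(q))
```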

---

## Multi-Hop Search

### Multi-Hop Strategy

```python
from typing import List


class MultiHopSearch:
    """Perform multi-hop search with refinement."""

    async def search_multi_hop(
        self,
        initial_query: str,
        max_hops: int = 3
    ) -> MultiHopResult:
        """Perform a multi-hop search."""
        results_by_hop = []
        current_query = initial_query

        for hop in range(max_hops):
            # Execute search
            results = await self.search(current_query)
            results_by_hop.append(results)

            # Analyze results
            analysis = self.analyze_results(results)

            # Check if we found what we need
            if analysis.is_satisfactory:
                break

            # Refine the query for the next hop
            current_query = self.refine_query(
                current_query,
                results,
                analysis
            )

        return MultiHopResult(
            hops=results_by_hop,
            final_query=current_query,
            best_results=self.rank_all_results(results_by_hop)
        )

    def refine_query(
        self,
        original_query: str,
        results: List[SearchResult],
        analysis: ResultAnalysis
    ) -> str:
        """Refine the query based on previous results."""
        # Extract new keywords from the top results
        new_keywords = self.extract_keywords_from_results(results[:3])

        # If results were too broad, add specificity
        if analysis.too_broad:
            specific_terms = [kw for kw in new_keywords if len(kw.split()) > 1]
            if specific_terms:
                return f"{original_query} {specific_terms[0]}"

        # If results were off-topic, add negative keywords
        if analysis.off_topic_terms:
            negative = ' '.join(f"-{term}" for term in analysis.off_topic_terms)
            return f"{original_query} {negative}"

        # If there were no results, try synonyms
        if analysis.no_results:
            return self.query_expander.expand(original_query)[0]

        return original_query
```

### Example Multi-Hop Flow

```python
# Hop 1: Initial broad search
query_1 = "best web scraping tools"
results_1 = search(query_1)
# Results: General articles about scraping tools

# Hop 2: Refine to a specific use case
query_2 = "best web scraping tools for e-commerce Python"
results_2 = search(query_2)
# Results: More specific, Python-focused

# Hop 3: Add a recency constraint
query_3 = "best web scraping tools for e-commerce Python 2026"
results_3 = search(query_3)
# Results: Latest tools with recent reviews
```

---

## Source Credibility Scoring

### Credibility Scorer

```python
import re
from datetime import datetime
from typing import Optional


class SourceCredibilityScorer:
    """Score the credibility of search result sources."""

    def score(self, url: str, domain: str, result: SearchResult) -> float:
        """Calculate a credibility score (0.0 to 1.0)."""
        score = 0.5  # Base score

        # 1. Domain reputation
        score += self.domain_reputation_score(domain) * 0.3

        # 2. Domain age
        score += self.domain_age_score(domain) * 0.1

        # 3. HTTPS
        if url.startswith('https://'):
            score += 0.05

        # 4. TLD credibility
        score += self.tld_score(domain) * 0.1

        # 5. Snippet quality
        score += self.snippet_quality_score(result.snippet) * 0.15

        # 6. Backlinks (if available)
        score += self.backlink_score(domain) * 0.2

        # 7. Freshness
        score += self.freshness_score(result.date_published) * 0.1

        return min(max(score, 0.0), 1.0)

    def domain_reputation_score(self, domain: str) -> float:
        """Score based on known domain reputation."""
        # Trusted domains
        trusted = {
            'wikipedia.org': 1.0,
            'github.com': 0.95,
            'stackoverflow.com': 0.95,
            'nytimes.com': 0.9,
            'bbc.com': 0.9,
            'reuters.com': 0.9,
            'arxiv.org': 0.95,
            'nature.com': 0.95,
            'sciencedirect.com': 0.9,
        }

        # Known spammy/low-quality domains
        untrusted = {
            'contentvilla.com': 0.1,
            'ehow.com': 0.3,
        }

        if domain in trusted:
            return trusted[domain]

        if domain in untrusted:
            return untrusted[domain]

        # Medium trust for unknown domains
        return 0.5

    def tld_score(self, domain: str) -> float:
        """Score based on top-level domain."""
        tld = domain.split('.')[-1]

        tld_scores = {
            'edu': 0.9,    # Educational institutions
            'gov': 0.95,   # Government
            'org': 0.8,    # Organizations
            'com': 0.6,    # Commercial (neutral)
            'net': 0.6,
            'io': 0.6,
            'info': 0.4,   # Often spammy
            'xyz': 0.3,    # Cheap, often spam
        }

        return tld_scores.get(tld, 0.5)

    def snippet_quality_score(self, snippet: str) -> float:
        """Score snippet quality."""
        score = 0.5

        # Penalize clickbait patterns
        clickbait_patterns = [
            r'you won\'t believe',
            r'shocking',
            r'one weird trick',
            r'\d+ reasons why',
        ]

        for pattern in clickbait_patterns:
            if re.search(pattern, snippet, re.I):
                score -= 0.2

        # Reward factual language
        if re.search(r'according to|research|study|data|analysis', snippet, re.I):
            score += 0.2

        return max(0.0, score)

    def freshness_score(self, date_published: Optional[datetime]) -> float:
        """Score based on content freshness."""
        if not date_published:
            return 0.3  # Unknown date

        age_days = (datetime.now() - date_published).days

        # Decay function: fresh content scores higher
        if age_days < 30:
            return 1.0
        elif age_days < 90:
            return 0.8
        elif age_days < 365:
            return 0.6
        elif age_days < 730:
            return 0.4
        else:
            return 0.2
```

### Domain Blacklist

```python
from urllib.parse import urlparse

DOMAIN_BLACKLIST = [
    'contentvilla.com',
    'pastebin.com',       # Often scraped/duplicated content
    'scam-detector.com',
    'pinterest.com',      # Image aggregator, not original content
    # Add more as needed
]


def is_blacklisted(url: str) -> bool:
    """Check if a URL is blacklisted."""
    domain = urlparse(url).netloc
    return any(blocked in domain for blocked in DOMAIN_BLACKLIST)
```
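A quick runnable check of the substring-based blacklist, reproduced standalone with a shortened list for illustration:

```python
from urllib.parse import urlparse

DOMAIN_BLACKLIST = ['pinterest.com', 'pastebin.com']


def is_blacklisted(url: str) -> bool:
    """True if the URL's host contains a blacklisted domain."""
    domain = urlparse(url).netloc
    return any(blocked in domain for blocked in DOMAIN_BLACKLIST)


print(is_blacklisted("https://www.pinterest.com/pin/123"))  # → True
print(is_blacklisted("https://docs.python.org/3/"))         # → False
```

Note a side effect of substring matching: it also flags subdomains (`www.pinterest.com`), which is usually wanted, but it would equally flag an unrelated host like `notpinterest.com`; a suffix match (`domain == blocked or domain.endswith('.' + blocked)`) is stricter if that matters.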

---

## Result Ranking

### Ranking Algorithm

```python
from typing import Dict, List


class ResultRanker:
    """Rank search results by relevance and quality."""

    def rank(
        self,
        results: List[SearchResult],
        query: str,
        context: Dict = None
    ) -> List[RankedResult]:
        """Rank results by multiple factors."""
        ranked = []

        for result in results:
            score = self.calculate_score(result, query, context)
            ranked.append(RankedResult(
                result=result,
                score=score
            ))

        # Sort by score (highest first)
        ranked.sort(key=lambda x: x.score, reverse=True)

        return ranked

    def calculate_score(
        self,
        result: SearchResult,
        query: str,
        context: Dict
    ) -> float:
        """Calculate the ranking score."""
        score = 0.0

        # 1. Credibility (40%)
        credibility = self.credibility_scorer.score(
            result.url,
            result.domain,
            result
        )
        score += credibility * 0.4

        # 2. Relevance (35%)
        relevance = self.calculate_relevance(result, query)
        score += relevance * 0.35

        # 3. Freshness (10%)
        freshness = self.credibility_scorer.freshness_score(result.date_published)
        score += freshness * 0.1

        # 4. Engagement signals (10%)
        # (If available: click-through rate, dwell time, etc.)
        score += result.engagement_score * 0.1

        # 5. Diversity bonus (5%)
        # Prefer results from different domains
        if context and context.get('seen_domains'):
            if result.domain not in context['seen_domains']:
                score += 0.05

        return score

    def calculate_relevance(self, result: SearchResult, query: str) -> float:
        """Calculate query-result relevance."""
        # Simple keyword matching (can be enhanced with embeddings)
        query_terms = set(query.lower().split())

        # Check the title
        title_terms = set(result.title.lower().split())
        title_overlap = len(query_terms & title_terms) / len(query_terms)

        # Check the snippet
        snippet_terms = set(result.snippet.lower().split())
        snippet_overlap = len(query_terms & snippet_terms) / len(query_terms)

        # Weighted average: titles matter more than snippets
        relevance = 0.6 * title_overlap + 0.4 * snippet_overlap

        return relevance
```
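The relevance term is just the fraction of query words found in the title and snippet, weighted 60/40. A standalone sketch of that calculation (strings, not `SearchResult` objects, for illustration):

```python
def keyword_relevance(query: str, title: str, snippet: str) -> float:
    """Fraction of query terms present in title/snippet, weighted 60/40."""
    q = set(query.lower().split())
    title_overlap = len(q & set(title.lower().split())) / len(q)
    snippet_overlap = len(q & set(snippet.lower().split())) / len(q)
    return 0.6 * title_overlap + 0.4 * snippet_overlap


score = keyword_relevance(
    "python web scraping",
    "Web Scraping in Python Guide",                      # all 3 terms → 1.0
    "Learn scraping with requests and BeautifulSoup.",   # 1 of 3 terms → 1/3
)
```

Because tokenization is whitespace-only, punctuation sticks to words ("Python:" would not match "python"); embeddings or a proper tokenizer avoid that, as the comment in the ranker notes.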

---

## Caching & Deduplication

### Search Result Cache

```python
from datetime import datetime
from typing import List, Optional


class SearchCache:
    """Cache search results to reduce API calls."""

    def __init__(self, ttl_seconds: int = 3600):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, query: str, engine: str) -> Optional[List[SearchResult]]:
        """Get cached results."""
        key = self.make_key(query, engine)

        if key in self.cache:
            cached, timestamp = self.cache[key]

            # Check if the entry is still valid
            age = (datetime.now() - timestamp).total_seconds()
            if age < self.ttl:
                return cached
            else:
                # Expired, remove it
                del self.cache[key]

        return None

    def set(self, query: str, engine: str, results: List[SearchResult]):
        """Cache results."""
        key = self.make_key(query, engine)
        self.cache[key] = (results, datetime.now())

    def make_key(self, query: str, engine: str) -> str:
        """Generate a cache key."""
        normalized = query.lower().strip()
        return f"{engine}:{normalized}"
```
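The cache's core behavior is lazy expiry: entries are only checked (and evicted) on read. A minimal, runnable sketch of the same get/set logic with a generic key (the `TTLCache` name is illustrative):

```python
from datetime import datetime, timedelta


class TTLCache:
    """Minimal TTL cache mirroring SearchCache's get/set logic."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}

    def set(self, key, value):
        self.store[key] = (value, datetime.now())

    def get(self, key):
        if key in self.store:
            value, ts = self.store[key]
            if (datetime.now() - ts).total_seconds() < self.ttl:
                return value
            del self.store[key]  # expired: evict lazily on read
        return None


cache = TTLCache(ttl_seconds=3600)
cache.set("google:web scraping", ["result1", "result2"])
print(cache.get("google:web scraping"))  # → ['result1', 'result2']
print(cache.get("google:unknown"))       # → None
```

Lazy expiry keeps writes O(1); the trade-off is that dead entries linger until the next read, so a periodic sweep may be worth adding for long-running processes.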

### Result Deduplication

```python
from difflib import SequenceMatcher
from typing import List, Set
from urllib.parse import urlparse


class ResultDeduplicator:
    """Remove duplicate results across multiple searches."""

    def deduplicate(self, results: List[SearchResult]) -> List[SearchResult]:
        """Remove duplicates."""
        seen_urls = set()
        seen_titles = set()
        unique = []

        for result in results:
            # Normalize the URL (remove query params, fragments)
            normalized_url = self.normalize_url(result.url)

            # Normalize the title
            normalized_title = result.title.lower().strip()

            # Skip exact URL duplicates
            if normalized_url in seen_urls:
                continue

            # Skip near-duplicate titles
            if self.is_near_duplicate_title(normalized_title, seen_titles):
                continue

            # Add to the unique set
            unique.append(result)
            seen_urls.add(normalized_url)
            seen_titles.add(normalized_title)

        return unique

    def normalize_url(self, url: str) -> str:
        """Normalize a URL for comparison."""
        parsed = urlparse(url)
        # Remove query params and fragment
        normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
        # Remove trailing slash
        return normalized.rstrip('/')

    def is_near_duplicate_title(self, title: str, seen_titles: Set[str]) -> bool:
        """Check whether a title is a near-duplicate of any seen title."""
        for seen in seen_titles:
            similarity = SequenceMatcher(None, title, seen).ratio()
            if similarity > 0.85:  # 85% similar
                return True

        return False
```
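Both checks can be seen in isolation: URL normalization makes tracking-parameter variants compare equal, and `SequenceMatcher.ratio()` catches titles that differ only cosmetically. A standalone sketch:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Drop query string, fragment, and trailing slash."""
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}{p.path}".rstrip('/')


a = normalize_url("https://example.com/tools/?utm_source=x#top")
b = normalize_url("https://example.com/tools")
print(a == b)  # → True

ratio = SequenceMatcher(
    None,
    "10 best web scraping tools",
    "10 best web scraping tools!",
).ratio()
print(ratio > 0.85)  # → True: flagged as a near-duplicate title
```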

---

## Configuration

### Search Engine Settings

```typescript
interface SearchEngineConfig {
  default: 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

  providers: {
    google?: GoogleConfig;
    bing?: BingConfig;
    brave?: BraveConfig;
    duckduckgo?: DuckDuckGoConfig;
    perplexity?: PerplexityConfig;
  };

  // Global settings
  maxResults: number;            // Default: 10
  timeout: number;               // Seconds
  cacheResults: boolean;         // Default: true
  cacheTTL: number;              // Seconds

  // Query optimization
  optimizeQueries: boolean;      // Default: true
  expandQueries: boolean;        // Default: false

  // Multi-hop
  enableMultiHop: boolean;       // Default: false
  maxHops: number;               // Default: 3

  // Filtering
  filterByCredibility: boolean;  // Default: true
  minCredibilityScore: number;   // Default: 0.4
  blacklistedDomains: string[];

  // Cost tracking
  trackCosts: boolean;           // Default: true
  dailyQueryLimit: number;       // Default: 1000
}
```

### Usage Example

```python
# Initialize the search engine
search = SearchEngine(config)

# Simple search
results = await search.search(
    query="best Python web scraping libraries",
    engine="google",
    num_results=10
)

# Optimized search
results = await search.search_optimized(
    query="web scraping",
    context={"domain": "python.org", "year": 2026},
    optimize=True,
    filter_credibility=True
)

# Multi-hop search
multi_hop_results = await search.search_multi_hop(
    initial_query="web scraping tools",
    max_hops=3
)

# Get ranked results
ranked = search.rank_results(
    results,
    query="web scraping tools",
    context={"seen_domains": ["github.com"]}
)
```

---

**Next:** See [agents.md](./agents.md) for agent architecture.
docs/settings.md ADDED
@@ -0,0 +1,750 @@
# ⚙️ Dashboard Settings

## Table of Contents
1. [Overview](#overview)
2. [Memory Settings](#memory-settings)
3. [API & Model Settings](#api--model-settings)
4. [MCP Server Management](#mcp-server-management)
5. [Agent Behavior](#agent-behavior)
6. [Search Engine Configuration](#search-engine-configuration)
7. [Network & Proxy](#network--proxy)
8. [Cost Control](#cost-control)
9. [Performance Tuning](#performance-tuning)
10. [Import/Export](#importexport)

---

## Overview

The **Settings Dashboard** provides comprehensive configuration for all aspects of the WebScraper environment, models, MCPs, agents, and observability.

### Settings Structure

```
Settings
├── Memory
│   ├── Short-Term Memory
│   ├── Working Memory
│   ├── Long-Term Memory
│   └── Shared Memory
├── API & Models
│   ├── OpenAI
│   ├── Anthropic
│   ├── Google
│   ├── Groq
│   ├── Custom Providers
│   └── Model Routing
├── MCP Servers
│   ├── Installed Servers
│   ├── Available Servers
│   └── Custom Servers
├── Agent Behavior
│   ├── Exploration vs Exploitation
│   ├── Retry Strategy
│   ├── Planning Depth
│   └── Risk Tolerance
├── Search Engines
│   ├── Google Search
│   ├── Bing Search
│   ├── Brave Search
│   └── DuckDuckGo
├── Network & Proxy
│   ├── Proxy Pool
│   ├── VPN Configuration
│   ├── Rate Limiting
│   └── User Agent Rotation
├── Cost Control
│   ├── Daily Budget
│   ├── Model Costs
│   └── Alerts
└── Performance
    ├── Batch Processing
    ├── Parallel Execution
    ├── Caching
    └── Context Optimization
```

---

## Memory Settings

### Configuration

```typescript
interface MemorySettings {
  // Layer toggles
  enableShortTerm: boolean;        // Episode memory
  enableWorking: boolean;          // Reasoning buffer
  enableLongTerm: boolean;         // Persistent patterns
  enableShared: boolean;           // Multi-agent memory

  // Size limits
  maxEpisodeMemoryMB: number;      // Default: 10
  maxWorkingMemoryItems: number;   // Default: 50
  maxLongTermPatterns: number;     // Default: 10000

  // Vector database
  vectorDB: {
    provider: 'faiss' | 'qdrant' | 'pinecone' | 'weaviate';
    embeddingModel: string;        // Default: 'text-embedding-3-small'
    dimension: number;             // Default: 1536
    similarityMetric: 'cosine' | 'euclidean' | 'dot_product';
  };

  // Storage backend
  storage: {
    backend: 'filesystem' | 'postgresql' | 'mongodb' | 'redis';
    path: string;                  // For filesystem
    connectionString?: string;     // For databases
  };

  // Optimization
  autoPrune: boolean;              // Default: true
  pruneThreshold: number;          // Default: 0.3 (keep if score > 0.3)
  pruneIntervalHours: number;      // Default: 24
  autoSummarize: boolean;          // Default: true
  maxContextTokens: number;        // Default: 4000
}
```
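The `autoPrune`/`pruneThreshold` pair means the memory store periodically drops entries whose value score falls at or below the threshold. A minimal sketch of that rule (the `score` field name is an assumption for illustration):

```python
def prune_memories(memories, threshold=0.3):
    """Keep only memories whose value score exceeds the prune threshold."""
    # Entries scoring at or below the threshold are dropped
    return [m for m in memories if m["score"] > threshold]


kept = prune_memories([
    {"id": "m1", "score": 0.9},
    {"id": "m2", "score": 0.3},   # at the threshold → pruned
    {"id": "m3", "score": 0.1},
])
print([m["id"] for m in kept])  # → ['m1']
```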
109
+
110
+ ### UI Component
111
+
112
+ ```jsx
113
+ <MemorySettings>
114
+ <Section title="Memory Layers">
115
+ <Toggle label="Short-Term Memory (Episode)" value={enableShortTerm} />
116
+ <Toggle label="Working Memory (Reasoning)" value={enableWorking} />
117
+ <Toggle label="Long-Term Memory (Persistent)" value={enableLongTerm} />
118
+ <Toggle label="Shared Memory (Multi-Agent)" value={enableShared} />
119
+ </Section>
120
+
121
+ <Section title="Size Limits">
122
+ <NumberInput label="Episode Memory (MB)" value={maxEpisodeMemoryMB} min={1} max={100} />
123
+ <NumberInput label="Working Memory Items" value={maxWorkingMemoryItems} min={10} max={500} />
124
+ <NumberInput label="Long-Term Patterns" value={maxLongTermPatterns} min={100} max={100000} />
125
+ </Section>
126
+
127
+ <Section title="Vector Database">
128
+ <Select label="Provider" options={['FAISS', 'Qdrant', 'Pinecone', 'Weaviate']} />
129
+ <Select label="Embedding Model" options={['text-embedding-3-small', 'text-embedding-3-large', 'custom']} />
130
+ <NumberInput label="Dimension" value={dimension} disabled />
131
+ <Select label="Similarity Metric" options={['Cosine', 'Euclidean', 'Dot Product']} />
132
+ </Section>
133
+
134
+ <Section title="Auto-Optimization">
135
+ <Toggle label="Auto-Prune Low-Value Memories" value={autoPrune} />
136
+ <Slider label="Prune Threshold" value={pruneThreshold} min={0} max={1} step={0.1} />
137
+ <NumberInput label="Prune Interval (hours)" value={pruneIntervalHours} />
138
+ <Toggle label="Auto-Summarize Episodes" value={autoSummarize} />
139
+ <NumberInput label="Max Context Tokens" value={maxContextTokens} />
140
+ </Section>
141
+ </MemorySettings>
142
+ ```
143
+
144
+ ---
145
+
146
+ ## API & Model Settings
147
+
148
+ ### Multi-Provider Configuration
149
+
150
+ ```typescript
151
+ interface APISettings {
152
+ providers: {
153
+ openai?: {
154
+ apiKey: string;
155
+ organization?: string;
156
+ models: {
157
+ default: string;
158
+ reasoning: string;
159
+ fast: string;
160
+ };
161
+ temperature: number;
162
+ maxTokens: number;
163
+ };
164
+
165
+ anthropic?: {
166
+ apiKey: string;
167
+ models: {
168
+ default: string;
169
+ reasoning: string;
170
+ fast: string;
171
+ };
172
+ temperature: number;
173
+ maxTokens: number;
174
+ };
175
+
176
+ google?: {
177
+ apiKey: string;
178
+ models: {
179
+ default: string;
180
+ reasoning: string;
181
+ fast: string;
182
+ };
183
+ temperature: number;
184
+ maxOutputTokens: number;
185
+ };
186
+
187
+ groq?: {
188
+ apiKey: string;
189
+ models: {
190
+ default: string;
191
+ reasoning: string;
192
+ fast: string;
193
+ };
194
+ temperature: number;
195
+ maxTokens: number;
196
+ };
197
+
198
+ custom?: {
199
+ baseURL: string;
200
+ apiKey: string;
201
+ models: Record<string, string>;
202
+ };
203
+ };
204
+
205
+ // Smart routing
206
+ router: {
207
+ enabled: boolean;
208
+ strategy: 'task_based' | 'cost_optimized' | 'speed_optimized' | 'quality_optimized';
209
+ fallbackOrder: string[];
210
+ autoRetry: boolean;
211
+ maxRetries: number;
212
+ };
213
+
214
+ // Ensemble
215
+ ensemble: {
216
+ enabled: boolean;
217
+ strategy: 'voting' | 'ranking' | 'fusion' | 'verification';
218
+ models: string[];
219
+ minAgreement: number;
220
+ };
221
+ }
222
+ ```
223
+
224
+ ### UI Component
225
+
226
+ ```jsx
227
+ <APISettings>
228
+ <Tabs>
229
+ <Tab label="OpenAI">
230
+ <TextInput label="API Key" type="password" value={openaiKey} />
231
      <TextInput label="Organization (optional)" value={openaiOrg} />
      <Select label="Default Model" options={['gpt-4o-mini', 'gpt-4-turbo', 'gpt-4o']} />
      <Select label="Reasoning Model" options={['gpt-4-turbo', 'gpt-4o']} />
      <Select label="Fast Model" options={['gpt-4o-mini', 'gpt-3.5-turbo']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Anthropic">
      <TextInput label="API Key" type="password" value={anthropicKey} />
      <Select label="Default Model" options={['claude-3-5-sonnet', 'claude-3-opus']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Google">
      <TextInput label="API Key" type="password" value={googleKey} />
      <Select label="Default Model" options={['gemini-1.5-flash', 'gemini-1.5-pro']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Groq">
      <TextInput label="API Key" type="password" value={groqKey} />
      <Select label="Default Model" options={['llama-3.1-70b-versatile', 'llama-3.1-405b']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Custom">
      <TextInput label="Base URL" value={customBaseURL} placeholder="http://localhost:11434/v1" />
      <TextInput label="API Key" type="password" value={customKey} />
      <DynamicList label="Models" items={customModels} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>
  </Tabs>

  <Section title="Smart Model Routing">
    <Toggle label="Enable Smart Routing" value={routerEnabled} />
    <Select label="Strategy" options={['Task-Based', 'Cost Optimized', 'Speed Optimized', 'Quality Optimized']} />
    <SortableList label="Fallback Order" items={fallbackOrder} />
    <Toggle label="Auto-Retry on Failure" value={autoRetry} />
    <NumberInput label="Max Retries" value={maxRetries} min={1} max={10} />
  </Section>

  <Section title="Model Ensemble">
    <Toggle label="Enable Ensemble (⚠️ Increases Cost)" value={ensembleEnabled} />
    <Select label="Strategy" options={['Voting', 'Ranking', 'Fusion', 'Verification']} />
    <MultiSelect label="Models" options={allModels} selected={ensembleModels} />
    <Slider label="Min Agreement (%)" value={minAgreement} min={50} max={100} />
  </Section>
</APISettings>
```
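The fallback order and retry settings above imply a simple routing loop: try each model in `fallbackOrder` up to `maxRetries` times before moving on. A minimal sketch, assuming a provider client behind the hypothetical `callModel` (all names here are illustrative, not part of the spec):

```typescript
// Illustrative sketch of the smart-routing fallback loop.
// `callModel` is a stand-in for the real provider client.
type ModelCall = (model: string, prompt: string) => Promise<string>;

async function routeWithFallback(
  fallbackOrder: string[],
  maxRetries: number,
  prompt: string,
  callModel: ModelCall,
): Promise<string> {
  let lastError: unknown;
  for (const model of fallbackOrder) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await callModel(model, prompt);
      } catch (err) {
        lastError = err; // retry this model, then fall through to the next one
      }
    }
  }
  // Every model in the fallback chain failed.
  throw lastError;
}
```

For example, with two models and `maxRetries = 2`, a provider that is down is attempted twice before the router tries the next entry in the list.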

---

## MCP Server Management

### Configuration

```typescript
interface MCPSettings {
  servers: Record<string, MCPServerConfig>;

  autoDiscoverTools: boolean;
  toolTimeout: number; // Seconds
  maxConcurrentCalls: number;
  retryFailedCalls: boolean;
  cacheToolResults: boolean;
  cacheTTL: number; // Seconds
}

interface MCPServerConfig {
  command: string;
  args: string[];
  enabled: boolean;
  autoDownload: boolean;
  config: Record<string, any>;

  // Metadata
  name: string;
  description: string;
  category: string;
  installSize: string;
  status: 'installed' | 'not_installed' | 'downloading' | 'error';
}
```
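How `toolTimeout`, `cacheToolResults`, and `cacheTTL` interact can be sketched as a wrapper around the raw tool call. This is an assumption about the intended semantics, not the actual implementation; `invoke` is a placeholder for the real MCP client call:

```typescript
// Illustrative sketch: cache fresh tool results, and time out slow calls.
type Invoke = (tool: string, args: string) => Promise<string>;

interface CacheEntry { value: string; expiresAt: number; }

function makeCachedCaller(invoke: Invoke, toolTimeoutMs: number, cacheTTLMs: number) {
  const cache = new Map<string, CacheEntry>();

  return async function call(tool: string, args: string): Promise<string> {
    const key = `${tool}:${args}`;
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cached and still fresh

    // Run the tool call with a timeout; clear the timer on settle
    // so it cannot fire after a successful call.
    const value = await new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => reject(new Error(`${tool} timed out`)), toolTimeoutMs);
      invoke(tool, args).then(
        v => { clearTimeout(timer); resolve(v); },
        e => { clearTimeout(timer); reject(e); },
      );
    });

    cache.set(key, { value, expiresAt: Date.now() + cacheTTLMs });
    return value;
  };
}
```

A second call with the same tool and arguments inside the TTL window would return the cached value without re-invoking the server.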

### UI Component

```jsx
<MCPServerManagement>
  <Tabs>
    <Tab label="Installed">
      <ServerList>
        {installedServers.map(server => (
          <ServerCard key={server.name}>
            <ServerIcon category={server.category} />
            <ServerInfo>
              <Title>{server.name}</Title>
              <Description>{server.description}</Description>
              <Meta>
                <Badge>{server.category}</Badge>
                <Badge>{server.toolCount} tools</Badge>
              </Meta>
            </ServerInfo>
            <Actions>
              <Toggle value={server.enabled} onChange={toggleServer} />
              <Button onClick={() => testServer(server)}>Test</Button>
              <Button onClick={() => uninstallServer(server)}>Uninstall</Button>
            </Actions>
          </ServerCard>
        ))}
      </ServerList>
    </Tab>

    <Tab label="Available">
      <SearchInput placeholder="Search MCP servers..." />
      <FilterBar>
        <Filter label="All" />
        <Filter label="Parsing" />
        <Filter label="Browser" />
        <Filter label="Database" />
        <Filter label="Search" />
      </FilterBar>

      <ServerGallery>
        {availableServers.map(server => (
          <ServerCard key={server.name}>
            <ServerIcon category={server.category} />
            <Title>{server.name}</Title>
            <Description>{server.description}</Description>
            <InstallSize>{server.installSize}</InstallSize>
            <Button onClick={() => installServer(server)}>
              Install
            </Button>
          </ServerCard>
        ))}
      </ServerGallery>
    </Tab>

    <Tab label="Custom">
      <Form onSubmit={addCustomServer}>
        <TextInput label="Server Name" required />
        <TextInput label="Command" placeholder="python" required />
        <TextInput label="Arguments" placeholder="-m mcp_server" required />
        <JsonEditor label="Config (JSON)" />
        <Button type="submit">Add Server</Button>
      </Form>
    </Tab>
  </Tabs>

  <Section title="Global MCP Settings">
    <Toggle label="Auto-Discover Tools on Startup" value={autoDiscoverTools} />
    <NumberInput label="Tool Timeout (seconds)" value={toolTimeout} />
    <NumberInput label="Max Concurrent Calls" value={maxConcurrentCalls} />
    <Toggle label="Retry Failed Calls" value={retryFailedCalls} />
    <Toggle label="Cache Tool Results" value={cacheToolResults} />
    <NumberInput label="Cache TTL (seconds)" value={cacheTTL} />
  </Section>
</MCPServerManagement>
```

---

## Agent Behavior

### Configuration

```typescript
interface AgentBehaviorSettings {
  // Exploration vs Exploitation
  explorationRate: number; // 0.0 = exploit only, 1.0 = explore only
  explorationDecay: number; // Decay rate per episode

  // Planning
  planningDepth: number; // How many steps ahead to plan
  replanThreshold: number; // Replan if reward drops by X%

  // Retry strategy
  maxRetries: number; // Per action
  retryDelay: number; // Seconds
  adaptiveRetry: boolean; // Increase delay after each failure

  // Risk tolerance
  riskTolerance: 'conservative' | 'balanced' | 'aggressive';

  // Memory usage
  memoryIntensity: 'low' | 'medium' | 'high';

  // Learning
  learningRate: number;
  enableOnlineLearning: boolean;
  updateMemoryAfterEpisode: boolean;
}
```
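The `explorationRate` and `explorationDecay` fields describe an epsilon-greedy trade-off. One plausible reading (a sketch only; the function names are illustrative) is multiplicative decay per episode plus a random explore-or-exploit choice:

```typescript
// Illustrative epsilon-greedy sketch of explorationRate / explorationDecay.
function decayedRate(initialRate: number, decay: number, episode: number): number {
  // Multiplicative decay per episode, never below 0.
  return Math.max(0, initialRate * Math.pow(1 - decay, episode));
}

function chooseStrategy(
  rate: number,
  explore: () => string,
  exploit: () => string,
  rand: () => number = Math.random, // injectable for deterministic tests
): string {
  return rand() < rate ? explore() : exploit();
}
```

With `explorationRate = 0.5` and `explorationDecay = 0.1`, the effective rate after two episodes would be 0.5 × 0.9² = 0.405.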

### UI Component

```jsx
<AgentBehaviorSettings>
  <Section title="Exploration Strategy">
    <Slider label="Exploration Rate" value={explorationRate} min={0} max={1} step={0.1} />
    <HelpText>
      0.0 = Always use best known strategy (exploit)
      1.0 = Always try new approaches (explore)
    </HelpText>
    <Slider label="Exploration Decay" value={explorationDecay} min={0} max={0.1} step={0.01} />
  </Section>

  <Section title="Planning">
    <Slider label="Planning Depth" value={planningDepth} min={1} max={10} />
    <HelpText>How many steps ahead the agent plans (higher = slower but smarter)</HelpText>
    <NumberInput label="Replan Threshold (%)" value={replanThreshold} min={0} max={100} />
  </Section>

  <Section title="Retry Strategy">
    <NumberInput label="Max Retries" value={maxRetries} min={0} max={10} />
    <NumberInput label="Retry Delay (seconds)" value={retryDelay} min={0} max={60} />
    <Toggle label="Adaptive Retry (exponential backoff)" value={adaptiveRetry} />
  </Section>

  <Section title="Risk Tolerance">
    <RadioGroup value={riskTolerance}>
      <Radio value="conservative">
        <Label>Conservative</Label>
        <Description>Prefer proven patterns, avoid risky actions</Description>
      </Radio>
      <Radio value="balanced">
        <Label>Balanced</Label>
        <Description>Balance exploration and exploitation</Description>
      </Radio>
      <Radio value="aggressive">
        <Label>Aggressive</Label>
        <Description>Try new approaches quickly, higher failure tolerance</Description>
      </Radio>
    </RadioGroup>
  </Section>

  <Section title="Memory & Learning">
    <Select label="Memory Intensity" options={['Low', 'Medium', 'High']} />
    <Toggle label="Enable Online Learning" value={enableOnlineLearning} />
    <Toggle label="Update Memory After Each Episode" value={updateMemoryAfterEpisode} />
  </Section>
</AgentBehaviorSettings>
```

---

## Search Engine Configuration

### Configuration

```typescript
interface SearchEngineSettings {
  default: 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

  google?: {
    apiKey: string;
    searchEngineId: string;
    region: string;
    safeSearch: boolean;
  };

  bing?: {
    apiKey: string;
    market: string;
  };

  brave?: {
    apiKey: string;
    country: string;
  };

  duckduckgo?: {
    region: string;
    safeSearch: boolean;
  };

  perplexity?: {
    apiKey: string;
    model: string;
  };

  // Global settings
  maxResults: number;
  timeout: number;
  cacheResults: boolean;
  cacheTTL: number;
}
```
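Because every provider block except DuckDuckGo is optional and keyed, a resolver has to handle a default engine that was never configured. A minimal sketch of one reasonable policy (an assumption, not specified above): fall back to the keyless DuckDuckGo provider when the chosen engine has no API key.

```typescript
// Illustrative resolver over the optional provider blocks above.
type Engine = 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

function resolveEngine(settings: {
  default: Engine;
  google?: { apiKey: string };
  bing?: { apiKey: string };
  brave?: { apiKey: string };
  perplexity?: { apiKey: string };
}): Engine {
  const chosen = settings.default;
  if (chosen === 'duckduckgo') return chosen; // no key required
  const cfg = settings[chosen];
  // Configured with a key? Use it; otherwise degrade to the keyless engine.
  return cfg && cfg.apiKey ? chosen : 'duckduckgo';
}
```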

### UI Component

```jsx
<SearchEngineSettings>
  <Select label="Default Search Engine" options={['Google', 'Bing', 'Brave', 'DuckDuckGo', 'Perplexity']} />

  <Tabs>
    <Tab label="Google">
      <TextInput label="API Key" type="password" />
      <TextInput label="Search Engine ID" />
      <Select label="Region" options={regions} />
      <Toggle label="Safe Search" />
      <Button onClick={testGoogle}>Test</Button>
    </Tab>

    <Tab label="Bing">
      <TextInput label="API Key" type="password" />
      <Select label="Market" options={markets} />
      <Button onClick={testBing}>Test</Button>
    </Tab>

    <Tab label="Brave">
      <TextInput label="API Key" type="password" />
      <Select label="Country" options={countries} />
      <Button onClick={testBrave}>Test</Button>
    </Tab>

    <Tab label="DuckDuckGo">
      <Info>No API key required (free)</Info>
      <Select label="Region" options={regions} />
      <Toggle label="Safe Search" />
    </Tab>

    <Tab label="Perplexity">
      <TextInput label="API Key" type="password" />
      <Select label="Model" options={['pplx-70b-online', 'pplx-7b-online']} />
      <Button onClick={testPerplexity}>Test</Button>
    </Tab>
  </Tabs>

  <Section title="Global Settings">
    <NumberInput label="Max Results" value={maxResults} min={1} max={100} />
    <NumberInput label="Timeout (seconds)" value={timeout} min={5} max={60} />
    <Toggle label="Cache Results" value={cacheResults} />
    <NumberInput label="Cache TTL (seconds)" value={cacheTTL} />
  </Section>
</SearchEngineSettings>
```

---

## Network & Proxy

### Configuration

```typescript
interface NetworkSettings {
  proxy: {
    enabled: boolean;
    pools: ProxyPool[];
    rotationStrategy: 'round_robin' | 'random' | 'health_based';
    maxRetries: number;
  };

  vpn: {
    enabled: boolean;
    provider: string;
    server: string;
    credentials: {
      username: string;
      password: string;
    };
  };

  rateLimiting: {
    enabled: boolean;
    requestsPerSecond: number;
    burstSize: number;
  };

  userAgent: {
    rotationEnabled: boolean;
    customUserAgents: string[];
  };

  timeout: {
    connect: number;
    read: number;
  };
}
```
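The `rateLimiting` block maps naturally onto a token bucket: `requestsPerSecond` is the refill rate and `burstSize` the bucket capacity. A self-contained sketch of that reading (the class name and injectable clock are illustrative, not part of the spec):

```typescript
// Illustrative token-bucket reading of the rateLimiting settings above.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private requestsPerSecond: number,
    private burstSize: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {
    this.tokens = burstSize; // start full, allowing an initial burst
    this.lastRefill = now();
  }

  /** Returns true if a request may proceed, false if it should be throttled. */
  tryAcquire(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at burstSize.
    this.tokens = Math.min(this.burstSize, this.tokens + elapsedSec * this.requestsPerSecond);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With `requestsPerSecond = 1` and `burstSize = 2`, two back-to-back requests pass, a third is throttled, and one more token accrues per second of idle time.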

### UI

See the [Network Layer – VPN & Proxy](./WebScraper_OpenEnv_SoftwareDoc.md#9-network-layer--vpn--proxy) section of the legacy document for full details.

---

## Cost Control

### Configuration

```typescript
interface CostControlSettings {
  dailyBudget: number; // USD
  monthlyBudget: number; // USD
  alertThresholds: number[]; // [0.5, 0.8, 0.9] = 50%, 80%, 90%

  modelCosts: Record<string, { input: number; output: number }>;

  enforcements: {
    stopOnBudgetExceeded: boolean;
    downgradeToCheaperModel: boolean;
    notifyOnHighCost: boolean;
  };
}
```
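The `enforcements` flags suggest a simple decision order once spend reaches the daily budget: stop takes precedence over downgrade. A sketch of that reading (function and type names are illustrative assumptions):

```typescript
// Illustrative enforcement decision over CostControlSettings.enforcements.
type BudgetDecision = 'proceed' | 'downgrade' | 'stop';

function enforceBudget(
  spentToday: number,
  dailyBudget: number,
  enforcements: { stopOnBudgetExceeded: boolean; downgradeToCheaperModel: boolean },
): BudgetDecision {
  if (spentToday < dailyBudget) return 'proceed'; // still within budget
  if (enforcements.stopOnBudgetExceeded) return 'stop';
  if (enforcements.downgradeToCheaperModel) return 'downgrade';
  return 'proceed'; // budget exceeded but no enforcement enabled
}
```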

### UI Component

```jsx
<CostControlSettings>
  <Section title="Budgets">
    <NumberInput label="Daily Budget (USD)" value={dailyBudget} prefix="$" />
    <NumberInput label="Monthly Budget (USD)" value={monthlyBudget} prefix="$" />

    <CurrentUsage>
      <Stat>
        <Label>Today</Label>
        <Value>${todayCost.toFixed(2)} / ${dailyBudget.toFixed(2)}</Value>
        <ProgressBar value={todayCost / dailyBudget} />
      </Stat>
      <Stat>
        <Label>This Month</Label>
        <Value>${monthCost.toFixed(2)} / ${monthlyBudget.toFixed(2)}</Value>
        <ProgressBar value={monthCost / monthlyBudget} />
      </Stat>
    </CurrentUsage>
  </Section>

  <Section title="Alerts">
    <TagInput label="Alert Thresholds (%)" values={alertThresholds} />
    <HelpText>Get notified at these percentages of budget</HelpText>
  </Section>

  <Section title="Enforcement">
    <Toggle label="Stop Execution on Budget Exceeded" value={stopOnBudgetExceeded} />
    <Toggle label="Auto-Downgrade to Cheaper Model" value={downgradeToCheaperModel} />
    <Toggle label="Notify on High-Cost Requests" value={notifyOnHighCost} />
  </Section>

  <Section title="Model Costs">
    <Table>
      <thead>
        <tr>
          <th>Model</th>
          <th>Input (per 1M tokens)</th>
          <th>Output (per 1M tokens)</th>
          <th>Estimated Cost/Episode</th>
        </tr>
      </thead>
      <tbody>
        {models.map(model => (
          <tr key={model.name}>
            <td>{model.name}</td>
            <td>${model.inputCost.toFixed(2)}</td>
            <td>${model.outputCost.toFixed(2)}</td>
            <td>${model.estimatedCostPerEpisode.toFixed(4)}</td>
          </tr>
        ))}
      </tbody>
    </Table>
  </Section>
</CostControlSettings>
```

---

## Performance Tuning

### Configuration

```typescript
interface PerformanceSettings {
  batchProcessing: {
    enabled: boolean;
    batchSize: number;
    maxConcurrent: number;
  };

  parallelExecution: {
    enabled: boolean;
    maxWorkers: number;
  };

  caching: {
    enabled: boolean;
    cacheHTML: boolean;
    cacheAPIResponses: boolean;
    cacheDuration: number; // Seconds
    maxCacheSize: number; // MB
  };

  contextOptimization: {
    enabled: boolean;
    summarizeOldObservations: boolean;
    pruneThreshold: number;
    maxContextTokens: number;
  };
}
```
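The `batchProcessing.batchSize` setting implies splitting work items into fixed-size groups before dispatch. A minimal sketch of that step (the helper name is illustrative):

```typescript
// Illustrative batching helper for batchProcessing.batchSize above:
// split items into consecutive groups of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error('batchSize must be >= 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch could then be processed with up to `maxConcurrent` in-flight requests; the final batch may be smaller than `batchSize`.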

---

## Import/Export

```jsx
<ImportExportSettings>
  <Section title="Export Settings">
    <Button onClick={exportAll}>Export All Settings (JSON)</Button>
    <Button onClick={exportMemory}>Export Memory Database</Button>
    <Button onClick={exportLogs}>Export Logs</Button>
  </Section>

  <Section title="Import Settings">
    <FileUpload label="Import Settings (JSON)" accept=".json" onChange={importSettings} />
    <Button onClick={resetToDefaults}>Reset to Defaults</Button>
  </Section>
</ImportExportSettings>
```

---

**Next:** See [rewards.md](./rewards.md) for advanced reward function design.