NeerajCodz committed
Commit df47251
0 Parent(s)

docs: update
docs/README.md ADDED
@@ -0,0 +1,28 @@
1
+ # Documentation Index
2
+
3
+ This documentation set supersedes `WebScraper_OpenEnv_SoftwareDoc.md`, expanding it into focused modules.
4
+
5
+ ## Core Docs
6
+
7
+ - `openenv.md` — enhanced OpenEnv spec, actions, observations, lifecycle
8
+ - `architecture.md` — system architecture, runtime, scheduling, scaling
9
+ - `agents.md` — multi-agent roles, strategies, HITL, explainability
10
+ - `rewards.md` — advanced reward function and signal breakdown
11
+
12
+ ## Platform Docs
13
+
14
+ - `api.md` — multi-model API system and routing/ensemble/cost tracking
15
+ - `mcp.md` — MCP integration, registry, lazy install, composition
16
+ - `search-engine.md` — search providers, query optimization, credibility scoring
17
+ - `html-processing.md` — semantic parsing, adaptive chunking, batch + diff processing
18
+ - `memory.md` — unified memory system (short/working/long/shared)
19
+
20
+ ## Operations Docs
21
+
22
+ - `settings.md` — dashboard settings and configuration controls
23
+ - `observability.md` — metrics, traces, thought stream, cost telemetry
24
+ - `features.md` — advanced capabilities and feature flags
25
+
26
+ ## Legacy
27
+
28
+ - `WebScraper_OpenEnv_SoftwareDoc.md` remains as the original monolithic source.
docs/WebScraper_OpenEnv_SoftwareDoc.md ADDED
@@ -0,0 +1,1654 @@
1
+ # WebScraper-OpenEnv: Software Design Document
2
+
3
+ **Project:** WebScraper-OpenEnv
4
+ **Version:** 1.0.0
5
+ **Hackathon:** OpenEnv — Round 1
6
+ **Author:** [Your Name]
7
+ **Date:** March 2026
8
+
9
+ ---
10
+
11
+ ## Table of Contents
12
+
13
+ 1. [Project Overview](#1-project-overview)
14
+ 2. [Real-World Motivation](#2-real-world-motivation)
15
+ 3. [System Architecture](#3-system-architecture)
16
+ 4. [OpenEnv Specification](#4-openenv-specification)
17
+ - 4.1 Observation Model
18
+ - 4.2 Action Model
19
+ - 4.3 Reward Model
20
+ - 4.4 Episode Lifecycle
21
+ 5. [Environment State Machine](#5-environment-state-machine)
22
+ 6. [Task Definitions](#6-task-definitions)
23
+ - Task 1: Static Page Field Extraction (Easy)
24
+ - Task 2: Paginated Catalog Scraping (Medium)
25
+ - Task 3: Deep Research with Search & Fact Verification (Hard)
26
+ 7. [Grader Design](#7-grader-design)
27
+ 8. [Reward Function Design](#8-reward-function-design)
28
+ 9. [Network Layer — VPN & Proxy](#9-network-layer--vpn--proxy)
29
+ - 9.1 Architecture
30
+ - 9.2 Proxy Configuration
31
+ - 9.3 VPN Configuration
32
+ - 9.4 Public Pool
33
+ - 9.5 Settings Persistence
34
+ 10. [API Endpoint Specification](#10-api-endpoint-specification)
35
+ 11. [Data Models (Pydantic Schemas)](#11-data-models-pydantic-schemas)
36
+ 12. [Simulated Web Environment](#12-simulated-web-environment)
37
+ 13. [Baseline Inference Script](#13-baseline-inference-script)
38
+ 14. [Project Structure](#14-project-structure)
39
+ 15. [Dockerfile & Deployment](#15-dockerfile--deployment)
40
+ 16. [openenv.yaml](#16-openenvyaml)
41
+ 17. [Testing Strategy](#17-testing-strategy)
42
+ 18. [Known Limitations & Future Work](#18-known-limitations--future-work)
43
+
44
+ ---
45
+
46
+ ## 1. Project Overview
47
+
48
+ **WebScraper-OpenEnv** is a reinforcement learning environment that challenges AI agents to perform structured **web data extraction** — a task humans and automated pipelines carry out every day for market research, competitive intelligence, lead generation, price monitoring, and data journalism.
49
+
50
+ The environment wraps a fully **self-contained simulated web server** (no external network calls required) that presents realistic HTML pages with varying structure, noise, pagination, and adversarial anti-scraping patterns. Agents must issue targeted extraction actions to retrieve structured data within budget and quality constraints.
51
+
52
+ This environment is designed to:
53
+ - Evaluate an agent's ability to **parse and reason about semi-structured HTML**
54
+ - Test **multi-step planning** across paginated or linked content
55
+ - Stress-test **robustness** when pages are noisy, misleading, or rate-limited
56
+ - Provide **dense reward signals** that guide learning rather than just measuring final output
57
+
58
+ ---
59
+
60
+ ## 2. Real-World Motivation
61
+
62
+ Web scraping is a core capability required across:
63
+
64
+ | Use Case | Example |
65
+ |---|---|
66
+ | E-commerce monitoring | Track competitor prices across 1,000 SKUs daily |
67
+ | Lead generation | Extract company names, emails, headcount from directories |
68
+ | Research automation | Aggregate paper titles, authors, abstracts from 5 sources |
69
+ | News intelligence | Collect headlines, dates, sources matching a keyword |
70
+ | Real estate | Pull property listings, prices, square footage from portals |
71
+
72
+ Current LLM agents struggle with scraping because it requires:
73
+ 1. Selecting the right CSS/XPath selector or field label from noisy HTML
74
+ 2. Knowing *when to stop* (pagination boundary detection)
75
+ 3. Deduplication and normalization of extracted values
76
+ 4. Graceful recovery from blocked or malformed pages
77
+
78
+ No existing OpenEnv environment covers this domain. **WebScraper-OpenEnv fills this gap.**
79
+
80
+ ---
81
+
82
+ ## 3. System Architecture
83
+
84
+ ```
85
+ ┌─────────────────────────────────────────────────────────────────┐
86
+ │ Single Docker Container (:7860) │
87
+ │ │
88
+ │ ┌───────────────────────────────────────────────────────────┐ │
89
+ │ │ Vite Frontend (React) │ │
90
+ │ │ TaskSelector │ EpisodeViewer │ RewardChart │ Baseline │ │
91
+ │ │ fetch("/api/...") │ │
92
+ │ └────────────────────────┬──────────────────────────────────┘ │
93
+ │ │ same origin │
94
+ │ ┌─────────────────────────▼────────────────────────────────┐ │
95
+ │ │ FastAPI Application │ │
96
+ │ │ │ │
97
+ │ │ /api/reset /api/step /api/state /api/tasks │ │
98
+ │ │ /api/grader /api/baseline │ │
99
+ │ │ /* → serves frontend/dist/index.html (SPA fallback) │ │
100
+ │ │ │ │
101
+ │ │ ┌──────────────────────┐ ┌──────────────────────────┐ │ │
102
+ │ │ │ WebScraperEnv │ │ SimulatedWebServer │ │ │
103
+ │ │ │ - episode state │◄►│ - HTML page generator │ │ │
104
+ │ │ │ - action dispatch │ │ - pagination engine │ │ │
105
+ │ │ │ - reward engine │ │ - noise injector │ │ │
106
+ │ │ │ - grader registry │ │ - anti-scrape simulator │ │ │
107
+ │ │ └──────────────────────┘ └──────────────────────────┘ │ │
108
+ │ └───────────────────────────────────────────────────────────┘ │
109
+ └─────────────────────────────────────────────────────────────────┘
110
+
111
+ │ HTTP JSON (agents / baseline script)
112
+
113
+ AI Agent / Baseline Script
114
+ ```
115
+
116
+ **Key design decisions:**
117
+ - The simulated web server is **seeded and deterministic** — same `task_id` + `seed` always produces the same pages, enabling reproducible evaluation.
118
+ - Pages are generated dynamically from Jinja2 templates with injected noise, not stored as static files, keeping the Docker image small.
119
+ - The API is **stateless across HTTP requests** from the client's perspective; the server keeps episode state in-memory, keyed by the `episode_id` sent with each call.
120
+ - The **Vite frontend** is compiled at Docker build time (Stage 1) and served as static files by FastAPI — no separate web server (nginx, etc.) needed. Single port, single process.
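The seeded-determinism guarantee can be sketched with a per-page RNG derived from a stable string key. The helper names (`page_rng`, `render_price`) are illustrative, not part of the actual codebase:

```python
import random

def page_rng(task_id: str, seed: int, url: str) -> random.Random:
    """Derive a per-page RNG so the same (task_id, seed, url) always
    renders identical HTML, regardless of visit order."""
    # A string key seeds random.Random deterministically across processes
    # (built-in hash() is salted per process, so it is avoided here).
    return random.Random(f"{task_id}:{seed}:{url}")

def render_price(task_id: str, seed: int, url: str) -> str:
    # Deterministic "random" price for a simulated product page
    rng = page_rng(task_id, seed, url)
    return f"${rng.randint(10, 200)}.{rng.randint(0, 99):02d}"
```

Because the key is re-derived from scratch on every call, page generation never depends on what the agent visited earlier, which is what makes evaluation reproducible.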
121
+
122
+ ---
123
+
124
+ ## 4. OpenEnv Specification
125
+
126
+ ### 4.1 Observation Model
127
+
128
+ An `Observation` is returned after every `reset()` and `step()` call.
129
+
130
+ ```python
131
+ class Observation(BaseModel):
132
+ episode_id: str # UUID for the current episode
133
+ task_id: str # Task identifier ("task_easy" | "task_medium" | "task_hard")
134
+ step_number: int # Current step count (0-indexed)
135
+ current_url: str # Simulated URL of the current page
136
+ page_html: str # Raw HTML content of the current page (trimmed to 8000 chars)
137
+ page_title: str # <title> tag value
138
+ available_actions: list[str] # High-level action types available at this step
139
+ extracted_so_far: dict # Fields extracted successfully in this episode so far
140
+ pages_visited: list[str] # Ordered list of URLs visited this episode
141
+ budget_remaining: int # Remaining step budget (starts at task max_steps)
142
+ task_description: str # Human-readable task goal
143
+ target_fields: list[str] # Names of fields the agent must extract
144
+ hints: list[str] # Contextual hints (empty in hard mode)
145
+ ```
146
+
147
+ **Design rationale:**
148
+ - `page_html` is included directly in the observation so agents can act without a separate fetch step. Truncated at 8,000 characters to simulate token budget pressure realistically.
149
+ - `extracted_so_far` gives the agent a running view of what it has already collected — critical for multi-page tasks.
150
+ - `hints` are populated for easy/medium tasks and empty for hard, creating a natural difficulty gradient.
151
+
152
+ ### 4.2 Action Model
153
+
154
+ An `Action` is submitted by the agent in each `step()` call.
155
+
156
+ ```python
157
+ class Action(BaseModel):
158
+ action_type: ActionType # Enum — see below
159
+ target_field: str | None # Field name to extract (for EXTRACT actions)
160
+ selector: str | None # CSS selector or field label hint
161
+ navigate_to: str | None # URL or "next_page" / "prev_page" keyword
162
+ submit_extraction: dict | None # Final field→value map (for SUBMIT action)
163
+ notes: str | None # Agent's internal reasoning note (not scored, logged)
164
+ ```
165
+
166
+ ```python
167
+ class ActionType(str, Enum):
168
+ EXTRACT_FIELD = "extract_field" # Extract one named field from current page
169
+ NAVIGATE = "navigate" # Go to a URL or next/prev page
170
+ SEARCH_PAGE = "search_page" # Regex/keyword search within current page HTML
171
+ INSPECT_ELEMENT = "inspect_element" # Get focused text around a CSS selector
172
+ SUBMIT = "submit" # Final answer — ends the episode
173
+ SKIP_PAGE = "skip_page" # Declare current page irrelevant, move on
174
+ # ── Task 3 / Hard mode only ─────────────────────────────────────────
175
+ SEARCH_ENGINE = "search_engine" # Issue a query to the configured search engine
176
+ VERIFY_FACT = "verify_fact" # Cross-check a field value against a second source
177
+ RESOLVE_CONFLICT = "resolve_conflict" # Declare which of two conflicting values is authoritative
178
+ FETCH_URL = "fetch_url" # Fetch an arbitrary URL (uses active proxy/VPN if set)
179
+ ```
180
+
181
+ **Extended `Action` model for new types:**
182
+
183
+ ```python
184
+ class Action(BaseModel):
185
+ action_type: ActionType
186
+ # --- Existing fields ---
187
+ target_field: str | None = None
188
+ selector: str | None = None
189
+ navigate_to: str | None = None
190
+ submit_extraction: dict | None = None
191
+ notes: str | None = None
192
+ # --- Search engine fields ---
193
+ query: str | None = None # Query string for SEARCH_ENGINE
194
+ search_engine: str | None = None # "google" | "bing" | "brave" | "ddg" (uses settings default if None)
195
+ result_limit: int = 5 # Max search results to return (1–10)
196
+ # --- Fact verification fields ---
197
+ field_name: str | None = None # Field to verify in VERIFY_FACT
198
+ claimed_value: str | None = None # Value to check
199
+ verification_source: str | None = None # URL to verify against
200
+ # --- Conflict resolution fields ---
201
+ conflicting_sources: list[str] | None = None # Two URLs with disagreeing values
202
+ chosen_source: str | None = None # URL the agent judges more authoritative
203
+ rationale: str | None = None # Agent's justification (logged, not scored)
204
+ ```
205
+
206
+ **Design rationale:**
207
+ - Actions are **higher-level than raw HTTP** — the agent doesn't manage cookies or headers, it focuses on extraction logic.
208
+ - `INSPECT_ELEMENT` gives the agent a focused window into the DOM, rewarding agents that learn to select precisely.
209
+ - `SEARCH_ENGINE` issues a query through whichever engine the user has configured in Settings (or the environment's default). Results are returned as a ranked list of `{title, url, snippet}` objects — the agent then navigates to the most promising URL.
210
+ - `VERIFY_FACT` instructs the environment to fetch a second source and check whether the claimed value appears there. Returns a `verified: bool` and a `confidence: float` — not a definitive answer, mirroring real-world uncertainty.
211
+ - `RESOLVE_CONFLICT` is scored by the grader: if the agent picks the more authoritative source it earns a bonus; if it picks the wrong one it earns a penalty.
212
+ - `SUBMIT` is the terminal action that triggers the grader.
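For illustration, here is how an agent might construct step payloads, using plain dataclasses as a dependency-free stand-in for the Pydantic `Action` model (only a subset of fields is shown):

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional

class ActionType(str, Enum):
    EXTRACT_FIELD = "extract_field"
    SEARCH_ENGINE = "search_engine"
    SUBMIT = "submit"

@dataclass
class Action:
    action_type: ActionType
    target_field: Optional[str] = None
    selector: Optional[str] = None
    query: Optional[str] = None
    submit_extraction: Optional[dict] = None

# Extract the price field, hinting at a CSS selector
extract = Action(ActionType.EXTRACT_FIELD, target_field="price",
                 selector=".price-tag")

# Hard mode: query the configured search engine
search = Action(ActionType.SEARCH_ENGINE,
                query="Acme Corp company profile")

payload = asdict(extract)  # dict shape POSTed to /api/step
```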
213
+
214
+ ### 4.3 Reward Model
215
+
216
+ ```python
217
+ class Reward(BaseModel):
218
+ value: float # Reward for this step (-1.0 to +1.0)
219
+ cumulative: float # Total reward accumulated this episode
220
+ breakdown: dict # Labeled sub-rewards (for interpretability)
221
+ message: str # Human-readable explanation
222
+ ```
223
+
224
+ ### 4.4 Episode Lifecycle
225
+
226
+ ```
227
+ reset(task_id, seed?)
228
+ → Observation (step 0, fresh page, budget = max_steps)
229
+
230
+ step(action: EXTRACT_FIELD | NAVIGATE | ...)
231
+ → Observation (updated state), Reward, done=False, info
232
+
233
+ step(action: SUBMIT)
234
+ → Observation (terminal), Reward (grader score * scale), done=True, info
235
+
236
+ state()
237
+ → Current episode state snapshot (same fields as Observation + internal metadata)
238
+ ```
239
+
240
+ An episode also ends automatically if:
241
+ - `budget_remaining` reaches 0 (budget exhaustion — scores whatever was extracted)
242
+ - The agent navigates to more than `max_pages` unique URLs
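The lifecycle above can be exercised with a small driver loop. The `reset` and `step` callables stand in for POSTs to `/api/reset` and `/api/step`; the fake environment below exists only to show the contract shapes:

```python
def run_episode(reset, step, policy, max_steps=10):
    """Drive one episode: reset, then step until done or budget runs out."""
    obs = reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = step(policy(obs))
        total += reward
        if done:
            break
    return obs, total

# Minimal fake environment illustrating the reset/step contract
def fake_reset():
    return {"step_number": 0, "budget_remaining": 3}

def fake_step(action):
    done = action == "submit"
    return {"step_number": 1}, (1.0 if done else 0.1), done

final_obs, total = run_episode(fake_reset, fake_step,
                               policy=lambda obs: "submit")
```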
243
+
244
+ ---
245
+
246
+ ## 5. Environment State Machine
247
+
248
+ ```
249
+ reset()
250
+
251
+
252
+ ┌──────────────┐
253
+ │ RUNNING │◄──────────────────────────────────────────┐
254
+ │ │ │
255
+ │ step(NAV) │──► fetch_page() ──► update_obs() ──────┤
256
+ │ step(EXT) │──► extract() ──► update_obs() ──────┤
257
+ │ step(SRCH) │──► search_html() ──► update_obs() ──────┤
258
+ │ step(SE) │──► search_engine() ──► ranked_results ────┤
259
+ │ step(VRF) │──► verify_fact() ──► confidence_score ──┤
260
+ │ step(RES) │──► resolve() ──► authoritative val ─┘
261
+ └──────┬───────┘
262
+
263
+ step(SUBMIT) or budget=0
264
+
265
+
266
+ ┌──────────────┐
267
+ │ TERMINAL │──► grader.score() ──► final Reward
268
+ └──────────────┘
269
+ ```
270
+
271
+ **State fields stored per episode:**
272
+
273
+ | Field | Type | Description |
274
+ |---|---|---|
275
+ | `episode_id` | str | UUID |
276
+ | `task_id` | str | Active task |
277
+ | `seed` | int | RNG seed for page generation |
278
+ | `step_number` | int | Steps taken |
279
+ | `current_url` | str | Active page URL |
280
+ | `pages_visited` | list | Navigation history |
281
+ | `extracted_data` | dict | Field→value map built up by agent |
282
+ | `ground_truth` | dict | Hidden correct field→value map |
283
+ | `budget` | int | Steps remaining |
284
+ | `status` | Enum | RUNNING / TERMINAL |
285
+ | `created_at` | datetime | Episode start time |
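A dataclass mirroring the per-episode state table might look like the following sketch (the field defaults are assumptions, not part of the spec):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    RUNNING = "running"
    TERMINAL = "terminal"

@dataclass
class EpisodeState:
    episode_id: str
    task_id: str
    seed: int
    step_number: int = 0
    current_url: str = ""
    pages_visited: list = field(default_factory=list)
    extracted_data: dict = field(default_factory=dict)  # built up by the agent
    ground_truth: dict = field(default_factory=dict)    # hidden from the agent
    budget: int = 10
    status: Status = Status.RUNNING
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

state = EpisodeState(episode_id="ep-1", task_id="task_easy", seed=42)
```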
286
+
287
+ ---
288
+
289
+ ## 6. Task Definitions
290
+
291
+ ### Task 1: Static Page Field Extraction (Easy)
292
+
293
+ **ID:** `task_easy`
294
+ **Max Steps:** 10
295
+ **Max Pages:** 1
296
+ **Hints:** Yes
297
+
298
+ **Scenario:**
299
+ The agent is given a single product listing page for an e-commerce store. The page contains a product name, price, SKU, star rating, and number of reviews. Minimal noise. Fields are labeled clearly.
300
+
301
+ **Target Fields:**
302
+ ```
303
+ product_name, price, sku, star_rating, review_count
304
+ ```
305
+
306
+ **Sample Page URL:** `sim://shop.example.com/product/42`
307
+
308
+ **Ground Truth (example, seeded):**
309
+ ```json
310
+ {
311
+ "product_name": "Wireless Noise-Cancelling Headphones",
312
+ "price": "$89.99",
313
+ "sku": "WNC-4421-BLK",
314
+ "star_rating": "4.3",
315
+ "review_count": "1,247"
316
+ }
317
+ ```
318
+
319
+ **Success Criteria:**
320
+ - Extract all 5 fields correctly → score 1.0
321
+ - Partial credit per field (0.2 per field)
322
+ - Normalized comparison (whitespace-stripped, case-insensitive)
323
+
324
+ **Difficulty Rationale:** A capable LLM can find labeled fields in clean HTML in 1–3 steps with direct CSS selectors or simple keyword search.
325
+
326
+ ---
327
+
328
+ ### Task 2: Paginated Catalog Scraping (Medium)
329
+
330
+ **ID:** `task_medium`
331
+ **Max Steps:** 25
332
+ **Max Pages:** 5
333
+ **Hints:** Partial (structure hint, no selector hint)
334
+
335
+ **Scenario:**
336
+ The agent must scrape a product catalog spread across 3 pages of pagination (20 items per page, 60 total items simulated). The agent must collect the **name and price of the 3 cheapest items** across all pages. Items are listed in random price order. The agent must decide whether to visit all pages or infer from partial data.
337
+
338
+ **Target Fields:**
339
+ ```
340
+ cheapest_item_1_name, cheapest_item_1_price,
341
+ cheapest_item_2_name, cheapest_item_2_price,
342
+ cheapest_item_3_name, cheapest_item_3_price
343
+ ```
344
+
345
+ **Complications introduced:**
346
+ - Prices use mixed formats: `$12.99`, `$12.990`, `12.99 USD` — normalization required
347
+ - One page contains a "Featured" item injected at the top that is actually overpriced
348
+ - Pagination links use non-obvious URL patterns (`?pg=2` vs `?offset=20`)
349
+
350
+ **Grader Logic:**
351
+ 1. Extract agent's top-3 cheapest items
352
+ 2. Compare to ground truth top-3 (computed by environment at episode start)
353
+ 3. Score = (# correctly identified items / 3) × quality bonus (if price values match within ±$0.01)
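Steps 1–3 above can be sketched as follows; `parse_price` treats `$12.990` as a three-decimal variant of 12.99, which is one reasonable reading of the mixed formats listed under the complications:

```python
import heapq
import re

def parse_price(raw: str) -> float:
    """Normalize mixed formats: '$12.99', '12.99 USD', '$12.990' -> 12.99"""
    digits = re.sub(r"[^0-9.]", "", raw)   # drop currency symbols and units
    return round(float(digits), 2)         # treat a trailing third decimal as noise

def three_cheapest(items):
    """items: (name, raw_price) pairs gathered across all catalog pages."""
    return heapq.nsmallest(3, ((parse_price(p), n) for n, p in items))

catalog = [("Mug", "$12.99"), ("Lamp", "4.50 USD"),
           ("Desk", "$89.00"), ("Pen", "$1.250")]
cheapest = three_cheapest(catalog)
```

Working in normalized floats also makes the ±$0.01 quality-bonus comparison a simple `abs(a - b) <= 0.01` check.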
354
+
355
+ **Difficulty Rationale:** Requires multi-page navigation planning, price normalization, and sorting logic — a significant step up from single-page extraction.
356
+
357
+ ---
358
+
359
+ ### Task 3: Deep Research with Search & Fact Verification (Hard)
360
+
361
+ **ID:** `task_hard`
362
+ **Max Steps:** 60
363
+ **Max Pages:** 20
364
+ **Hints:** None
365
+ **Search Engine:** Required (uses configured engine or environment default)
366
+ **Fact Verification:** Required for minimum 3 fields to achieve full score
367
+
368
+ ---
369
+
370
+ **Scenario:**
371
+ The agent is given a **target entity** (a mid-size private company, randomly selected per seed) and must build a fully sourced, verified intelligence profile. No starting URL is provided — the agent must begin by issuing search engine queries to discover relevant pages. Information is distributed across 6+ simulated domains, and some fields appear only on pages discoverable via search (not linked from the entry page). At least two fields will have conflicting values across sources, and the agent must explicitly resolve these conflicts to earn full credit.
372
+
373
+ ---
374
+
375
+ **Target Fields (14 total, grouped by difficulty tier):**
376
+
377
+ ```
378
+ ── Tier 1 — Basic Identity (weight 1.0x each) ──────────────────────────
379
+ company_name Full legal name of the company
380
+ headquarters_city City of primary HQ
381
+ headquarters_country Country of primary HQ
382
+ primary_industry Top-level industry category (e.g. "FinTech", "SaaS")
383
+
384
+ ── Tier 2 — Operational Data (weight 1.5x each) ────────────────────────
385
+ founding_year Year company was founded [CONFLICT present]
386
+ employee_count_range Bucketed range: "1-50" | "51-200" | "201-500" | "501-2000" | "2000+"
387
+ ceo_name Full name of current CEO [requires search to discover page]
388
+ product_count Number of distinct products/services listed [requires enumeration]
389
+
390
+ ── Tier 3 — Financial & Strategic (weight 2.0x each) ───────────────────
391
+ latest_funding_round_type Series A/B/C | Seed | Growth | IPO | Unknown
392
+ latest_funding_amount_usd Numeric USD value (normalize: "$12M" → 12000000)
393
+ total_funding_usd Cumulative raised (may require summing across rounds) [CONFLICT present]
394
+ lead_investor Name of lead investor in latest round [search-only page]
395
+
396
+ ── Tier 4 — Verification Required (weight 2.5x each) ───────────────────
397
+ founding_year_verified Must call VERIFY_FACT; score only awarded if verified
398
+ ceo_name_verified Must call VERIFY_FACT from a second independent source
399
+ ```
400
+
401
+ ---
402
+
403
+ **Complications introduced:**
404
+
405
+ **Search-first discovery**
406
+ No entry URL is provided. The agent must use `SEARCH_ENGINE` to find a homepage, news page, and financial data page. The simulated search engine returns ranked results with varying relevance — the top result is not always the most useful one.
407
+
408
+ **Cross-domain fragmentation**
409
+ Data is spread across 6 simulated domains. No single domain holds more than 4 fields. The agent must plan a visit sequence and track what it has found vs. what is still missing.
410
+
411
+ | Domain | Fields present |
412
+ |---|---|
413
+ | `sim://company.example.com` | company_name, headquarters_city/country, primary_industry |
414
+ | `sim://directory.example.com` | founding_year (version A), employee_count_range, ceo_name |
415
+ | `sim://news.example.com` | latest_funding_round_type, latest_funding_amount_usd, lead_investor |
416
+ | `sim://finance.example.com` | total_funding_usd, founding_year (version B — conflict), product_count |
417
+ | `sim://regulatory.example.com` | founding_year (authoritative — SEC-style filing, only discoverable via search) |
418
+ | `sim://linkedin-sim.example.com` | ceo_name (second independent source for verification) |
419
+
420
+ **Deliberate conflicts**
421
+ - `founding_year`: directory says 2011, finance page says 2013. The regulatory filing (search-only) says 2012 — this is the authoritative answer. Agent must issue `SEARCH_ENGINE` query to find it, then `RESOLVE_CONFLICT` naming it as authoritative.
422
+ - `total_funding_usd`: news page reports latest round only; finance page has cumulative. Agent must distinguish these and report cumulative.
423
+
424
+ **Prose extraction & normalization**
425
+ - `employee_count_range` appears as: "We have grown to over 800 people worldwide" → must map to `"501-2000"`
426
+ - `latest_funding_amount_usd` appears as: "raised $24.5 million in Series B" → must normalize to `24500000`
427
+ - `product_count` requires counting `<li>` items inside a specific section, not reading a single labeled field
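The two numeric normalizations above might be implemented roughly like this (the regexes and bucket edges are illustrative, not the environment's actual rules):

```python
import re

EMPLOYEE_BUCKETS = [(50, "1-50"), (200, "51-200"), (500, "201-500"),
                    (2000, "501-2000"), (float("inf"), "2000+")]

def bucket_employees(text: str) -> str:
    """'We have grown to over 800 people worldwide' -> '501-2000'"""
    count = int(re.search(r"(\d[\d,]*)\s*(?:people|employees)", text)
                .group(1).replace(",", ""))
    return next(label for limit, label in EMPLOYEE_BUCKETS if count <= limit)

USD_SCALE = {"million": 1_000_000, "m": 1_000_000,
             "billion": 1_000_000_000, "b": 1_000_000_000}

def parse_usd(text: str) -> int:
    """'raised $24.5 million in Series B' -> 24500000; '$12M' -> 12000000"""
    m = re.search(r"\$\s*([\d.]+)\s*(million|billion|m|b)?", text, re.I)
    unit = (m.group(2) or "").lower()
    return int(float(m.group(1)) * USD_SCALE.get(unit, 1))
```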
428
+
429
+ **Simulated anti-scraping**
430
+ - `sim://finance.example.com` returns a 429-like interstitial on the first visit; the agent must either retry (costs a step) or configure a proxy/VPN in settings to bypass it
431
+ - `sim://linkedin-sim.example.com` requires a `SEARCH_PAGE` keyword unlock before full content is accessible
432
+
433
+ **Verification gates**
434
+ Fields `founding_year_verified` and `ceo_name_verified` are only scoreable if the agent has issued a `VERIFY_FACT` action for them referencing a different domain than the one the value was originally extracted from. The grader checks the action log — extraction alone is not sufficient.
435
+
436
+ ---
437
+
438
+ **Search Engine Behavior in Task 3:**
439
+
440
+ When the agent calls `SEARCH_ENGINE`, the simulated engine returns results structured as:
441
+
442
+ ```json
443
+ {
444
+ "query": "Acme Corp company profile",
445
+ "results": [
446
+ {
447
+ "rank": 1,
448
+ "title": "Acme Corp — Official Website",
449
+ "url": "sim://company.example.com/about",
450
+ "snippet": "Acme Corp is a leading SaaS platform headquartered in Austin..."
451
+ },
452
+ {
453
+ "rank": 2,
454
+ "title": "Acme Corp on Business Directory",
455
+ "url": "sim://directory.example.com/acme-corp",
456
+ "snippet": "Founded in 2011. 820 employees. CEO: Jane Doe..."
457
+ }
458
+ ],
459
+ "total_results_simulated": 47,
460
+ "engine_used": "brave"
461
+ }
462
+ ```
463
+
464
+ The agent can call `SEARCH_ENGINE` up to **8 times** per episode without penalty. Beyond 8 calls, each additional search costs `-0.05` reward (diminishing returns signal).
465
+
466
+ ---
467
+
468
+ **Grader Logic:**
469
+
470
+ ```python
471
+ def score_task_hard(submission, ground_truth, episode_state):
472
+ score = 0.0
473
+ max_score = sum(FIELD_WEIGHTS.values()) # 23.0 weighted points: 4×1.0 + 4×1.5 + 4×2.0 + 2×2.5
474
+
475
+ for field, weight in FIELD_WEIGHTS.items():
476
+ agent_val = normalize(submission.get(field))
477
+ truth_val = normalize(ground_truth[field])
478
+
479
+ if field.endswith("_verified"):
480
+ # Only award if agent issued a VERIFY_FACT for this field
481
+ # referencing a different source than the extraction source
482
+ verify_actions = [a for a in episode_state.action_log
483
+ if a.action_type == "verify_fact"
484
+ and a.field_name == field.replace("_verified", "")]
485
+ cross_source = any(
486
+ a.verification_source != episode_state.primary_source_for[field]
487
+ for a in verify_actions
488
+ )
489
+ if agent_val == truth_val and cross_source:
490
+ score += weight
491
+ elif agent_val == truth_val:
492
+ score += weight * 0.5 # Partial: correct but unverified
493
+ elif field in CONFLICT_FIELDS:
494
+ # Check agent issued RESOLVE_CONFLICT with correct authoritative source
495
+ resolve_actions = [a for a in episode_state.action_log
496
+ if a.action_type == "resolve_conflict"
497
+ and field in str(a)]
498
+ resolved_correctly = any(
499
+ a.chosen_source == AUTHORITATIVE_SOURCE[field]
500
+ for a in resolve_actions
501
+ )
502
+ if agent_val == truth_val and resolved_correctly:
503
+ score += weight
504
+ elif agent_val == truth_val:
505
+ score += weight * 0.6 # Correct value but no explicit resolution
506
+ else:
507
+ if agent_val == truth_val:
508
+ score += weight
509
+ elif partial_match(agent_val, truth_val):
510
+ score += weight * 0.4
511
+
512
+ # Coverage bonus: +0.5 if all 14 fields present in submission (even if some wrong)
513
+ coverage_bonus = 0.5 if len(submission) >= 14 else len(submission) / 14 * 0.5
514
+
515
+ raw = (score + coverage_bonus) / (max_score + 0.5)
516
+ return min(raw, 1.0)
517
+ ```
518
+
519
+ **Expected baseline scores:**
520
+
521
+ | Agent | Expected Score | Bottleneck |
522
+ |---|---|---|
523
+ | gpt-4o-mini (no tools) | ~0.20 | Cannot discover search-only pages |
524
+ | gpt-4o-mini + search | ~0.45 | Struggles with conflict resolution |
525
+ | gpt-4o (ReAct loop) | ~0.62 | Verification gate compliance |
526
+ | Human (manual) | ~0.90 | Benchmark ceiling |
527
+
528
+ **Difficulty Rationale:** This task is genuinely hard for frontier models because it requires: (1) search-first discovery with no entry URL, (2) multi-domain planning across 6 sources, (3) fact verification as a mandatory action class (not just extracting a value), (4) explicit conflict resolution with source authority reasoning, and (5) normalization of numeric and prose values. No single capability is sufficient — the agent must exercise all of them in one episode.
529
+
530
+ ---
531
+
532
+ ## 7. Grader Design
533
+
534
+ Each task has a dedicated `Grader` class implementing the following interface:
535
+
536
+ ```python
537
+ class BaseGrader(ABC):
538
+ def score(
539
+ self,
540
+ agent_submission: dict, # The agent's SUBMIT payload
541
+ ground_truth: dict, # Hidden correct values
542
+ episode_state: EpisodeState
543
+ ) -> GraderResult:
544
+ ...
545
+
546
+ class GraderResult(BaseModel):
547
+ score: float # 0.0 – 1.0
548
+ field_scores: dict[str, float] # Per-field breakdown
549
+ feedback: str # Human-readable explanation
550
+ penalty_applied: bool # True if penalties were triggered
551
+ penalty_reason: str | None
552
+ ```
553
+
554
+ **Normalization Rules applied before comparison:**
555
+
556
+ | Field Type | Normalization |
557
+ |---|---|
558
+ | Price | Strip currency symbols, commas → float |
559
+ | Text | Strip whitespace, lowercase, remove punctuation |
560
+ | Number with commas | `"1,247"` → `1247` |
561
+ | Range | `"500-999"` bucketed comparison |
562
+ | Year | Integer comparison |
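A minimal `normalize` dispatcher implementing these rules could look like the sketch below; the `field_type` tags are assumptions, and the real grader may infer types differently:

```python
import re

def normalize(value: str, field_type: str):
    """Apply the normalization rules above before comparing to ground truth."""
    v = value.strip()
    if field_type == "price":
        return float(re.sub(r"[^0-9.]", "", v))   # "$1,247.00" -> 1247.0
    if field_type == "number":
        return int(v.replace(",", ""))            # "1,247" -> 1247
    if field_type == "year":
        return int(v)                             # integer comparison
    if field_type == "text":
        return re.sub(r"[^\w\s]", "", v.lower()).strip()
    return v  # ranges are compared via their bucket labels elsewhere
```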
563
+
564
+ **Penalties:**
565
+ - If `step_number > max_steps * 0.8` and fewer than 50% of fields extracted → efficiency penalty of -0.1
566
+ - If agent submits more than 3 times (SUBMIT + reset-less re-attempts) → repeat penalty of -0.05 per extra submit
567
+
568
+ **Determinism guarantee:** All graders use only the seeded `ground_truth` dict and the submitted dict. No randomness at score time.

---

## 8. Reward Function Design

The reward function provides **dense signal across the full trajectory**, not just a terminal reward.

```
R_total = R_extraction + R_efficiency + R_navigation + R_terminal - R_penalty
```

### Per-Step Rewards

| Event | Reward | Rationale |
|---|---|---|
| `EXTRACT_FIELD` → correct value | +0.15 | Core task success signal |
| `EXTRACT_FIELD` → partially correct (wrong format, right content) | +0.05 | Encourages normalization learning |
| `EXTRACT_FIELD` → wrong value | -0.05 | Penalizes overconfident extraction |
| `EXTRACT_FIELD` → field already extracted | -0.10 | Penalizes redundant actions |
| `NAVIGATE` → new relevant page | +0.05 | Rewards exploration |
| `NAVIGATE` → already-visited page | -0.08 | Penalizes loops |
| `NAVIGATE` → irrelevant page (no target fields) | -0.03 | Soft penalty for bad routing |
| `SEARCH_PAGE` → finds target field hint | +0.03 | Rewards intelligent search |
| `SEARCH_PAGE` → no results | -0.01 | Small cost for wasted action |
| `INSPECT_ELEMENT` → valid selector hit | +0.02 | Rewards precision |
| `SKIP_PAGE` → page is actually irrelevant | +0.05 | Rewards correct relevance judgment |
| `SKIP_PAGE` → page contained target fields | -0.15 | Penalizes incorrect dismissal |
| `SEARCH_ENGINE` → query within 8-call budget | 0.00 | Neutral — search is a tool, not scored |
| `SEARCH_ENGINE` → discovers a new relevant domain | +0.08 | Rewards effective query formulation |
| `SEARCH_ENGINE` → call #9+ (over budget) | -0.05 | Diminishing returns signal |
| `VERIFY_FACT` → claimed value confirmed | +0.12 | Rewards verification behavior |
| `VERIFY_FACT` → claimed value contradicted | +0.08 | Still rewards checking (good epistemic practice) |
| `VERIFY_FACT` → verifying already-verified field | -0.05 | Penalizes redundant verification |
| `RESOLVE_CONFLICT` → correct authoritative source | +0.20 | High reward for correct reasoning |
| `RESOLVE_CONFLICT` → wrong authoritative source | -0.10 | Penalizes incorrect conflict resolution |
| `FETCH_URL` → returns useful content | +0.02 | Small reward for productive fetch |
| `FETCH_URL` → blocked (anti-scrape, no proxy set) | -0.03 | Mild penalty — should configure proxy |
| `FETCH_URL` → blocked (proxy active, retry succeeds) | +0.05 | Rewards using proxy correctly |
| Budget exhaustion (no SUBMIT) | -0.20 | Penalizes running out of budget |

### Terminal Reward (on SUBMIT)

```
R_terminal = grader_score × 2.0
```

This scales the terminal reward to dominate the trajectory reward, ensuring the agent optimizes for final output quality.

### Reward Range

- Minimum possible (all wrong, loops, budget exhausted): approximately -2.5
- Maximum possible (all correct, efficient path): approximately +2.5
- Typical good agent trajectory: +1.0 to +1.8
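
Putting the pieces together, the episode-level computation reduces to a sum of per-step rewards plus the scaled grader score (a minimal sketch; the real `RewardEngine` also tracks a per-event breakdown):

```python
# Sketch of the total-reward computation described above.
def total_reward(step_rewards: list[float], grader_score: float) -> float:
    # Terminal reward = grader_score * 2.0, scaled to dominate the trajectory sum.
    return sum(step_rewards) + grader_score * 2.0

# Example: three correct extractions, one wrong extraction, then a 0.9 grade.
episode_total = total_reward([0.15, 0.15, 0.15, -0.05], grader_score=0.9)
```

With these example values the trajectory contributes 0.40 and the terminal reward 1.80, landing squarely in the "typical good agent" range above.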

---

## 9. Network Layer — VPN & Proxy

The network layer is an optional but impactful system component. When active, all `NAVIGATE`, `FETCH_URL`, and `SEARCH_ENGINE` actions route outbound requests through the configured proxy or VPN. In simulation mode (default), the layer gates which simulated domains respond with 200 vs. 429 — giving agents a realistic incentive to configure networking.

---

### 9.1 Architecture

```
Agent Action (FETCH_URL / NAVIGATE / SEARCH_ENGINE)
        │
        ▼
┌───────────────────────┐
│     NetworkRouter     │
│                       │
│  active_proxy? ──────►│──► requests.Session(proxies={...})
│  active_vpn?   ──────►│──► subprocess → wireguard/openvpn tunnel
│  neither       ──────►│──► direct (or blocked by anti-scrape sim)
└───────────────────────┘
        │
        ▼
SimulatedWebServer / Real HTTP (if live mode enabled)
```

**Two operating modes:**

| Mode | Description | When used |
|---|---|---|
| `simulation` (default) | No real network; proxy/VPN settings control which simulated domains unblock | Always safe, deterministic, no credentials needed |
| `live` | Real HTTP requests routed through configured proxy/VPN | Optional; requires user-supplied credentials or public pool selection |

Mode is set in `Settings → Network → Mode`. `live` mode is off by default and requires explicit opt-in.

---

### 9.2 Proxy Configuration

Proxies can be configured three ways: user-supplied credentials, a pre-tested public proxy pool, or disabled.

**Settings model:**

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class ProxyConfig(BaseModel):
    enabled: bool = False
    mode: Literal["custom", "public_pool", "rotating"] = "custom"

    # ── Custom proxy (user-supplied) ──────────────────────────────
    host: str | None = None                # e.g. "proxy.mycompany.com"
    port: int | None = None                # e.g. 8080
    protocol: Literal["http", "https", "socks4", "socks5"] = "http"
    username: str | None = None            # Optional auth
    password: str | None = None            # Stored encrypted at rest (Fernet)
    auth_scheme: Literal["basic", "digest", "ntlm"] = "basic"

    # ── Public pool (no credentials required) ────────────────────
    public_pool_provider: str | None = None       # "webshare" | "proxyscrape" | "openproxy"
    public_pool_country_filter: str | None = None # ISO 3166-1 e.g. "US", "DE"

    # ── Rotating proxy ────────────────────────────────────────────
    rotating_endpoint: str | None = None   # e.g. "rotate.proxyservice.io:8080"
    rotate_every_n_requests: int = 10

    # ── Validation ────────────────────────────────────────────────
    test_url: str = "http://httpbin.org/ip"
    last_test_result: str | None = None    # "ok" | "timeout" | "auth_failed"
    last_tested_at: datetime | None = None
```

**Proxy URL construction (internal):**

```python
def build_proxy_url(cfg: ProxyConfig) -> str:
    if cfg.username and cfg.password:
        return f"{cfg.protocol}://{cfg.username}:{cfg.password}@{cfg.host}:{cfg.port}"
    return f"{cfg.protocol}://{cfg.host}:{cfg.port}"
```

**Public pool providers (pre-configured, no credentials):**

| Provider key | Type | Notes |
|---|---|---|
| `webshare` | HTTP rotating | 10 free proxies on free tier |
| `proxyscrape` | HTTP/SOCKS5 scraped list | Refreshed every 15 min |
| `openproxy` | HTTP/HTTPS | Community maintained |

The environment ships with a static list of ~50 pre-validated public proxies for simulation mode. In live mode, lists are fetched fresh from provider APIs.

---

### 9.3 VPN Configuration

VPN integration supports **WireGuard** and **OpenVPN** protocols. Users paste their config file content or fill individual fields in the Settings UI.

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class VPNConfig(BaseModel):
    enabled: bool = False
    protocol: Literal["wireguard", "openvpn"] = "wireguard"

    # ── WireGuard ─────────────────────────────────────────────────
    wg_config_content: str | None = None   # Full .conf file content (pasted in UI)
    wg_interface_name: str = "wg0"

    # ── OpenVPN ───────────────────────────────────────────────────
    ovpn_config_content: str | None = None # Full .ovpn file content
    ovpn_username: str | None = None
    ovpn_password: str | None = None       # Encrypted at rest

    # ── Common ────────────────────────────────────────────────────
    server_label: str | None = None        # Human label e.g. "US East — NordVPN"
    kill_switch: bool = True               # Block requests if tunnel drops
    last_test_result: str | None = None
    last_connected_at: datetime | None = None
```

**VPN lifecycle (live mode):**

```
POST /api/settings/vpn/connect
  → writes temp config file
  → subprocess: wg-quick up wg0  OR  openvpn --daemon --config temp.ovpn
  → polls interface for IP change
  → stores connected_ip in session

POST /api/settings/vpn/disconnect
  → subprocess: wg-quick down wg0  OR  killall openvpn
  → clears connected_ip
```

In **simulation mode**, VPN is purely logical — activating it marks the session as "VPN active", which causes the simulated anti-scrape layer to allow all domain requests.

> **Docker note:** WireGuard and OpenVPN require `NET_ADMIN` and `SYS_MODULE` capabilities. The Dockerfile exposes these only if `ENABLE_LIVE_NETWORK=true` is set. HF Spaces deployment runs in simulation mode only (capabilities not available).

---

### 9.4 Public Pool (Quick Start)

For users who don't have their own proxy or VPN, the Settings UI offers a **Public Pool** tab that requires zero configuration:

| Pool name | Protocol | Speed | Reliability | Notes |
|---|---|---|---|---|
| WebShare Free | HTTP rotating | Medium | High | Registration required (free) |
| ProxyScrape | HTTP/SOCKS5 | Variable | Medium | No registration |
| OpenProxy Space | HTTP/HTTPS | Slow | Low | Community pool, use as fallback |
| Simulation Bypass | Simulated | N/A | 100% | Always available; simulation only |

Selecting "Simulation Bypass" is the recommended option for evaluation runs — it unlocks all simulated anti-scrape gates without needing real network credentials.

---

### 9.5 Settings Persistence

All network settings are stored server-side in a lightweight JSON config file (`config/network_settings.json`). Passwords and VPN configs are encrypted using **Fernet symmetric encryption** with a key derived from a server-side secret (`SETTINGS_SECRET` env var).

```python
from typing import Literal

from pydantic import BaseModel

# ProxyConfig and VPNConfig as defined in Sections 9.2 and 9.3

class NetworkSettings(BaseModel):
    proxy: ProxyConfig = ProxyConfig()
    vpn: VPNConfig = VPNConfig()
    default_search_engine: Literal["google", "bing", "brave", "ddg"] = "brave"
    live_mode_enabled: bool = False
    request_timeout_seconds: int = 10
    max_retries: int = 3
    retry_backoff_factor: float = 1.5
    user_agent: str = "WebScraperOpenEnv/1.0"
```

The Settings UI reads from `GET /api/settings` and writes via `PUT /api/settings`. Passwords are never returned in GET responses — they are write-only from the UI's perspective.
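
The write-only password behavior can be sketched as a recursive redaction pass over the settings dict before it is serialized into a GET response (a minimal sketch; the field names in `SECRET_FIELDS` are illustrative):

```python
# Sketch: GET responses null out secret fields so passwords stay write-only.
SECRET_FIELDS = {"password", "ovpn_password", "wg_config_content"}

def redact(settings: dict) -> dict:
    out = {}
    for key, value in settings.items():
        if isinstance(value, dict):
            out[key] = redact(value)          # recurse into nested models
        else:
            out[key] = None if key in SECRET_FIELDS else value
    return out

safe = redact({"proxy": {"host": "proxy.example.com", "password": "secret"}})
# safe["proxy"]["password"] is None; non-secret fields pass through unchanged
```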

---

## 10. API Endpoint Specification

All endpoints accept and return `application/json`.

### `POST /api/reset`

Initialize or restart an episode.

**Request:**
```json
{ "task_id": "task_easy", "seed": 42 }
```
**Response:** `Observation` model

---

### `POST /api/step`

Advance the episode by one action.

**Request:**
```json
{
  "episode_id": "uuid-...",
  "action": {
    "action_type": "extract_field",
    "target_field": "price",
    "selector": ".product-price"
  }
}
```
**Response:**
```json
{
  "observation": { "..." : "..." },
  "reward": { "value": 0.15, "cumulative": 0.15, "breakdown": {}, "message": "..." },
  "done": false,
  "info": { "step": 1, "budget_remaining": 9 }
}
```

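A minimal client sketch for the step call above, using only the standard library (the base URL and payload values are examples taken from the request shape shown):

```python
# Sketch: building the POST /api/step request from the spec above.
import json
import urllib.request

BASE = "http://localhost:7860"

def build_step_request(episode_id: str, action: dict) -> urllib.request.Request:
    # POST /api/step with a JSON body, as specified above.
    payload = {"episode_id": episode_id, "action": action}
    return urllib.request.Request(
        BASE + "/api/step",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_step_request(
    "uuid-...",
    {"action_type": "extract_field", "target_field": "price", "selector": ".product-price"},
)
# urllib.request.urlopen(req) would return the observation/reward JSON shown above.
```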
835
+ ---
836
+
837
+ ### `GET /api/state`
838
+
839
+ Return current episode state. **Query param:** `episode_id=uuid-...`
840
+
841
+ ---
842
+
843
+ ### `GET /api/tasks`
844
+
845
+ Return all task definitions and their action schemas.
846
+
847
+ ---
848
+
849
+ ### `POST /api/grader`
850
+
851
+ Score a completed episode.
852
+
853
+ **Request:**
854
+ ```json
855
+ {
856
+ "episode_id": "uuid-...",
857
+ "submission": { "product_name": "...", "price": "..." }
858
+ }
859
+ ```
860
+ **Response:** `GraderResult` model
861
+
862
+ ---
863
+
864
+ ### `POST /api/baseline`
865
+
866
+ Trigger the built-in baseline inference script against all 3 tasks and return scores.
867
+
868
+ **Response:**
869
+ ```json
870
+ {
871
+ "baseline_model": "gpt-4o-mini",
872
+ "results": {
873
+ "task_easy": { "score": 0.92, "steps": 4, "fields_correct": 5 },
874
+ "task_medium": { "score": 0.67, "steps": 18, "fields_correct": 4 },
875
+ "task_hard": { "score": 0.38, "steps": 54, "fields_correct": 8 }
876
+ },
877
+ "aggregate_score": 0.66,
878
+ "run_id": "baseline-seed42"
879
+ }
880
+ ```

---

### `GET /api/settings`

Return current network settings. **Passwords are never returned** — password fields are always `null` in the response.

**Response:** `NetworkSettings` model (with password fields nulled)

---

### `PUT /api/settings`

Update network settings (full or partial).

**Request:** Partial `NetworkSettings` object — only provided fields are updated.

```json
{
  "proxy": {
    "enabled": true,
    "mode": "custom",
    "host": "proxy.example.com",
    "port": 8080,
    "protocol": "http",
    "username": "user",
    "password": "secret"
  }
}
```

---

### `POST /api/settings/proxy/test`

Test the current proxy configuration by making a request to `test_url`.

**Response:**
```json
{
  "success": true,
  "exit_ip": "45.33.32.156",
  "latency_ms": 312,
  "error": null
}
```

---

### `POST /api/settings/vpn/connect`

Activate the configured VPN tunnel (live mode only; simulation mode returns immediate success).

**Response:**
```json
{
  "connected": true,
  "tunnel_ip": "10.8.0.2",
  "exit_ip": "185.220.101.45",
  "protocol": "wireguard",
  "error": null
}
```

---

### `POST /api/settings/vpn/disconnect`

Tear down the active VPN tunnel.

---

### `GET /api/settings/network/status`

Returns the current active network configuration — what proxy/VPN is live right now.

**Response:**
```json
{
  "proxy_active": true,
  "proxy_host": "proxy.example.com:8080",
  "vpn_active": false,
  "vpn_server": null,
  "exit_ip": "45.33.32.156",
  "live_mode": false,
  "default_search_engine": "brave"
}
```

---

### `GET /api/settings/public-pool`

Returns the list of available public proxy/VPN pool options with current availability status.

**Response:**
```json
{
  "pools": [
    { "key": "simulation_bypass", "name": "Simulation Bypass", "available": true, "requires_auth": false },
    { "key": "webshare", "name": "WebShare Free", "available": true, "requires_auth": true },
    { "key": "proxyscrape", "name": "ProxyScrape", "available": true, "requires_auth": false },
    { "key": "openproxy", "name": "OpenProxy Space", "available": true, "requires_auth": false }
  ]
}
```

---

## 11. Data Models (Pydantic Schemas)

```python
# env/models.py

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional
import uuid

class ActionType(str, Enum):
    EXTRACT_FIELD = "extract_field"
    NAVIGATE = "navigate"
    SEARCH_PAGE = "search_page"
    INSPECT_ELEMENT = "inspect_element"
    SUBMIT = "submit"
    SKIP_PAGE = "skip_page"
    # Task 3 / network action classes (see Sections 8 and 9)
    SEARCH_ENGINE = "search_engine"
    FETCH_URL = "fetch_url"
    VERIFY_FACT = "verify_fact"
    RESOLVE_CONFLICT = "resolve_conflict"

class Action(BaseModel):
    action_type: ActionType
    target_field: Optional[str] = None
    selector: Optional[str] = None
    navigate_to: Optional[str] = None
    submit_extraction: Optional[dict] = None
    notes: Optional[str] = None

class Observation(BaseModel):
    episode_id: str
    task_id: str
    step_number: int
    current_url: str
    page_html: str
    page_title: str
    available_actions: list[str]
    extracted_so_far: dict
    pages_visited: list[str]
    budget_remaining: int
    task_description: str
    target_fields: list[str]
    hints: list[str]

class Reward(BaseModel):
    value: float
    cumulative: float
    breakdown: dict[str, float]
    message: str

class GraderResult(BaseModel):
    score: float = Field(ge=0.0, le=1.0)
    field_scores: dict[str, float]
    feedback: str
    penalty_applied: bool
    penalty_reason: Optional[str] = None

class EpisodeState(BaseModel):
    episode_id: str
    task_id: str
    seed: int
    step_number: int
    current_url: str
    pages_visited: list[str]
    extracted_data: dict
    budget_remaining: int
    status: str                     # "running" | "terminal"
    cumulative_reward: float
    created_at: str
    # Task 3 extras
    action_log: list[dict] = []     # Full action history for grader inspection
    search_calls_used: int = 0      # Track against 8-call free budget
    verified_fields: list[str] = [] # Fields that have passed VERIFY_FACT
    resolved_conflicts: list[str] = [] # Fields where RESOLVE_CONFLICT was issued

class SearchResult(BaseModel):
    rank: int
    title: str
    url: str
    snippet: str

class SearchEngineResponse(BaseModel):
    query: str
    results: list[SearchResult]
    total_results_simulated: int
    engine_used: str
    calls_remaining: int            # Free budget remaining (8 - used)

class VerifyFactResponse(BaseModel):
    field_name: str
    claimed_value: str
    verification_source: str
    verified: bool
    confidence: float               # 0.0 – 1.0
    supporting_text: str | None     # Excerpt from verification source
    contradicting_text: str | None

class NetworkStatus(BaseModel):
    proxy_active: bool
    proxy_host: Optional[str]
    vpn_active: bool
    vpn_server: Optional[str]
    exit_ip: Optional[str]
    live_mode: bool
    default_search_engine: str
```

---

## 12. Simulated Web Environment

The `SimulatedWebServer` class generates HTML pages on-the-fly using Jinja2 templates seeded by a deterministic RNG.

### Page Generator Pipeline

```
seed + task_id + url
        │
        ▼
RNG (random.Random)
        │
        ▼
Template Selector ──► Jinja2 template
        │
        ▼
Data Populator (products / company profiles / etc.)
        │
        ▼
Noise Injector ──► adds decoy elements, broken tags, ads
        │
        ▼
Anti-Scrape Layer ──► conditionally adds interstitials (task_hard)
        │
        ▼
HTML string (max 8,000 chars)
```

### Noise Types by Task

| Noise Type | Easy | Medium | Hard |
|---|---|---|---|
| Decoy fields with similar labels | ❌ | ✅ | ✅ |
| Inconsistent price formatting | ❌ | ✅ | ✅ |
| Broken/unclosed HTML tags | ❌ | ❌ | ✅ |
| Interstitial blocking page | ❌ | ❌ | ✅ |
| Contradictory values across pages | ❌ | ❌ | ✅ |
| JavaScript-only content (noscript fallback) | ❌ | ❌ | ✅ |
| Paginated content (multi-page) | ❌ | ✅ | ✅ |

### URL Scheme

Simulated URLs follow the pattern `sim://<domain>/<path>`. The environment maps these to page generators internally — no DNS or network calls occur.

```
sim://shop.example.com/product/42            → product page (task_easy)
sim://catalog.example.com/products?pg=1      → catalog page 1 of 3 (task_medium)
sim://company.example.com/about              → company homepage (task_hard)
sim://directory.example.com/org/acme         → directory listing (task_hard)
sim://news.example.com/search?q=acme         → news aggregator (task_hard)
sim://finance.example.com/ticker/ACME        → financial data (task_hard) ← 429 gate
sim://regulatory.example.com/filings/ACME    → SEC-style filing (task_hard, search-only)
sim://linkedin-sim.example.com/company/acme  → LinkedIn-style profile (task_hard, keyword gate)
```

**Anti-scrape simulation by domain:**

| Domain | Block type | Bypass method |
|---|---|---|
| `finance.example.com` | 429 Rate-limit on first visit | Retry after 1 step, or activate proxy |
| `linkedin-sim.example.com` | Keyword gate | `SEARCH_PAGE` with keyword "view_profile" |
| `regulatory.example.com` | Not linked — only discoverable via search | `SEARCH_ENGINE` with relevant query |

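The determinism of the pipeline can be sketched as follows (`make_rng` is a hypothetical helper, not the actual generator code; the point is that seeding on the `(seed, task_id, url)` triple makes page content reproducible):

```python
# Sketch: deterministic page generation keyed on (seed, task_id, url).
import random

def make_rng(seed: int, task_id: str, url: str) -> random.Random:
    # Identical (seed, task_id, url) triples always yield the same RNG stream,
    # and therefore the same generated page content.
    return random.Random(f"{seed}:{task_id}:{url}")

a = make_rng(42, "task_easy", "sim://shop.example.com/product/42")
b = make_rng(42, "task_easy", "sim://shop.example.com/product/42")
assert a.random() == b.random()   # same stream → same page
```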
---

## 13. Baseline Inference Script

`scripts/baseline.py` uses the OpenAI API to run a ReAct-style loop against the environment.

### Agent Strategy

```
System Prompt:
  You are a web scraping agent. You will be given an HTML page and a list
  of fields to extract. Use the available actions to extract all target
  fields as efficiently as possible and then submit your findings.

Loop:
  1. Call /reset with task_id and seed=42
  2. While not done:
     a. Format observation as: current URL, page HTML (truncated),
        fields still needed, steps remaining
     b. Prompt LLM for next action in JSON format
     c. Parse action → POST /step
     d. If done: record score
  3. Report all 3 task scores
```
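
Steps 2b–2c hinge on pulling a JSON action out of the model's free-form reply. A minimal sketch (the "bare JSON object after the thought" convention is an assumption about the prompt format, not confirmed implementation detail):

```python
# Sketch: parsing the LLM's JSON action from a ReAct-style reply.
import json
import re

def parse_action(reply: str) -> dict:
    # Grab the outermost {...} span from a reply like
    # "Thought: ...\n{\"action_type\": ...}" and parse it.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON action found in model reply")
    return json.loads(match.group(0))

action = parse_action(
    'Thought: the price is visible.\n'
    '{"action_type": "extract_field", "target_field": "price"}'
)
```

The parsed dict is then posted to `/api/step` as the `action` field of the request body.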

### Configuration

Read from environment variables:
```
OPENAI_API_KEY=...
BASELINE_MODEL=gpt-4o-mini   # default
BASELINE_SEED=42
BASELINE_MAX_RETRIES=3
```

### Reproducibility

- Fixed seed=42 for all tasks
- Deterministic page generation
- Temperature=0 for LLM calls
- Results logged to `results/baseline_<timestamp>.json`

### Expected Baseline Scores (gpt-4o-mini)

| Task | Expected Score | Notes |
|---|---|---|
| task_easy | ~0.90 | Near-perfect on clean pages |
| task_medium | ~0.60 | Pagination handling is tricky |
| task_hard | ~0.35 | Multi-source coordination challenges |
| **Aggregate** | **~0.62** | |

---

## 14. Project Structure

```
webscraper-openenv/
├── README.md
├── openenv.yaml
├── Dockerfile
├── requirements.txt
│
├── frontend/                          # Vite + React app
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── main.tsx
│       ├── App.tsx
│       ├── components/
│       │   ├── TaskSelector.tsx       # Pick task_easy / task_medium / task_hard
│       │   ├── EpisodeViewer.tsx      # Live observation display
│       │   ├── ActionPanel.tsx        # Manual action builder (for debugging)
│       │   ├── RewardChart.tsx        # Cumulative reward over steps
│       │   ├── BaselineRunner.tsx     # Trigger /api/baseline and show scores
│       │   └── settings/
│       │       ├── SettingsPage.tsx           # Top-level settings shell (tabbed layout)
│       │       ├── ProxySettings.tsx          # Proxy config form (custom / public pool / rotating)
│       │       ├── VPNSettings.tsx            # VPN config form (WireGuard / OpenVPN file paste)
│       │       ├── PublicPoolPicker.tsx       # Zero-config public proxy/VPN picker
│       │       ├── NetworkStatus.tsx          # Live badge: proxy active, VPN active, exit IP
│       │       └── SearchEngineSelector.tsx   # Default search engine picker
│       ├── hooks/
│       │   ├── useEpisode.ts          # Manages episode state via REST
│       │   ├── useNetworkSettings.ts  # Read/write /api/settings
│       │   └── useNetworkStatus.ts    # Polls /api/settings/network/status
│       └── api/
│           ├── client.ts              # Typed fetch wrappers for all endpoints
│           └── settingsClient.ts      # Settings-specific API calls
│
├── env/
│   ├── __init__.py
│   ├── environment.py                 # WebScraperEnv (step/reset/state)
│   ├── models.py                      # All Pydantic models
│   ├── reward.py                      # RewardEngine
│   ├── state.py                       # EpisodeState management
│   ├── tasks/
│   │   ├── task_easy.py
│   │   ├── task_medium.py
│   │   └── task_hard.py               # Includes search engine + verify + resolve logic
│   └── simulator/
│       ├── web_server.py
│       ├── page_generator.py
│       ├── search_engine.py           # SimulatedSearchEngine (ranked results by seed)
│       ├── fact_verifier.py           # FactVerifier (cross-source consistency check)
│       ├── noise_injector.py
│       └── templates/
│           ├── product.html
│           ├── catalog.html
│           ├── company.html
│           ├── directory.html
│           ├── news.html
│           ├── finance.html
│           ├── regulatory.html        # New: SEC-style filing page
│           └── linkedin_sim.html      # New: LinkedIn-style profile page
│
├── network/
│   ├── __init__.py
│   ├── router.py                      # NetworkRouter (proxy/VPN dispatch)
│   ├── proxy_manager.py               # ProxyManager (build URL, test, rotate)
│   ├── vpn_manager.py                 # VPNManager (wg-quick / openvpn subprocess)
│   ├── public_pool.py                 # PublicPoolFetcher (webshare, proxyscrape, openproxy)
│   └── settings_store.py              # Encrypted read/write of network_settings.json
│
├── config/
│   └── network_settings.json          # Persisted settings (passwords Fernet-encrypted)
│
├── api/
│   ├── __init__.py
│   ├── main.py                        # FastAPI app + static file mount
│   ├── routes/
│   │   ├── env_routes.py              # /api/reset, /api/step, /api/state, etc.
│   │   └── settings_routes.py         # /api/settings/*, /api/settings/vpn/*, etc.
│   └── schemas.py
│
├── scripts/
│   ├── baseline.py
│   └── validate.py
│
├── tests/
│   ├── test_environment.py
│   ├── test_graders.py
│   ├── test_reward.py
│   ├── test_task3_search.py           # Search engine + verify + resolve tests
│   ├── test_network.py                # Proxy/VPN config + routing tests
│   └── test_api.py
│
└── results/
    └── baseline_seed42.json
```

---

## 15. Dockerfile & Deployment

Everything ships in a **single Docker container**. The build is a two-stage process: Stage 1 compiles the Vite frontend into static files; Stage 2 installs the Python backend and copies the compiled frontend in. FastAPI then serves both the API and the frontend from port 7860.

### Request Routing (single port)

```
Port 7860
│
├── /api/*     → FastAPI routes (all OpenEnv endpoints)
├── /assets/*  → Vite static assets (JS, CSS, chunks)
└── /*         → index.html (SPA catch-all, handled by FastAPI StaticFiles)
```

FastAPI mounts the Vite build output (`frontend/dist/`) as a `StaticFiles` directory and adds a catch-all `GET /{full_path}` route that returns `index.html` so client-side routing works correctly.

```python
# api/main.py (relevant additions)
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

app.mount("/assets", StaticFiles(directory="frontend/dist/assets"), name="assets")

@app.get("/{full_path:path}", include_in_schema=False)
async def spa_fallback(full_path: str):
    return FileResponse("frontend/dist/index.html")
```

All API routes are prefixed with `/api` to avoid collisions with the SPA router:
```
POST /api/reset
POST /api/step
GET  /api/state
GET  /api/tasks
POST /api/grader
POST /api/baseline
```

The Vite frontend calls `fetch("/api/...")` — no base URL configuration needed in production since everything is on the same origin.

---

### Dockerfile (multi-stage)

```dockerfile
# ── Stage 1: Build Vite frontend ──────────────────────────────────────
FROM node:20-slim AS frontend-builder

WORKDIR /frontend

COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci

COPY frontend/ ./
RUN npm run build
# Output: /frontend/dist/


# ── Stage 2: Python backend + compiled frontend ────────────────────────
FROM python:3.11-slim

WORKDIR /app

# System packages:
#   wireguard-tools + iproute2 → wg-quick (live VPN, only used if ENABLE_LIVE_NETWORK=true)
#   openvpn                    → OpenVPN tunnel (same gate)
#   curl                       → proxy connectivity tests
RUN apt-get update && apt-get install -y --no-install-recommends \
    wireguard-tools \
    iproute2 \
    openvpn \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy backend source
COPY env/ ./env/
COPY network/ ./network/
COPY api/ ./api/
COPY scripts/ ./scripts/
COPY results/ ./results/
COPY config/ ./config/
COPY openenv.yaml .

# Copy compiled frontend from stage 1
COPY --from=frontend-builder /frontend/dist ./frontend/dist

ENV PYTHONUNBUFFERED=1
ENV PORT=7860
# ENABLE_LIVE_NETWORK=false → simulation mode (safe default, no NET_ADMIN needed)
# ENABLE_LIVE_NETWORK=true  → real proxy/VPN (requires --cap-add NET_ADMIN SYS_MODULE)
ENV ENABLE_LIVE_NETWORK=false
ENV SETTINGS_SECRET=changeme_generate_a_real_key_in_production

EXPOSE 7860

CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "7860"]
```

**Live network mode (local only, not for HF Spaces):**
```bash
docker run -p 7860:7860 \
  --cap-add NET_ADMIN \
  --cap-add SYS_MODULE \
  --sysctl net.ipv4.conf.all.src_valid_mark=1 \
  -e ENABLE_LIVE_NETWORK=true \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e SETTINGS_SECRET=$(openssl rand -hex 32) \
  webscraper-openenv
```

---

### requirements.txt

```
fastapi>=0.110.0
uvicorn[standard]>=0.29.0
pydantic>=2.6.0
jinja2>=3.1.3
openai>=1.20.0
pytest>=8.1.0
httpx>=0.27.0
aiofiles>=23.2.1       # FastAPI StaticFiles
cryptography>=42.0.0   # Fernet encryption for stored credentials
requests[socks]>=2.31.0  # SOCKS4/5 proxy support
```

During local development, Vite's dev server runs on `:5173` and the FastAPI backend runs on `:8000`. The dev-server proxy forwards all `/api` calls to the backend to avoid CORS issues:

```typescript
// vite.config.ts
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      '/api': {
        target: 'http://localhost:8000',
        changeOrigin: true,
      }
    }
  }
})
```

In production (inside Docker), no proxy is needed — both frontend and backend are on port 7860.

---

### Local Development Workflow

```bash
# Option A: Full Docker (production-identical)
docker build -t webscraper-openenv .
docker run -p 7860:7860 -e OPENAI_API_KEY=$OPENAI_API_KEY webscraper-openenv
# Visit: http://localhost:7860

# Option B: Split dev servers (fast iteration)
# Terminal 1 — backend
uvicorn api.main:app --reload --port 8000

# Terminal 2 — frontend
cd frontend && npm run dev
# Visit: http://localhost:5173 (proxies API to :8000)
```

1498
+ ### Build & Smoke Test
1499
+
1500
+ ```bash
1501
+ docker build -t webscraper-openenv .
1502
+
+ # Run the container so the endpoints below are live
+ docker run -d -p 7860:7860 -e OPENAI_API_KEY=$OPENAI_API_KEY webscraper-openenv
+
1503
+ # Smoke test the API
1504
+ curl http://localhost:7860/api/tasks
1505
+
1506
+ # Smoke test the frontend is served
1507
+ curl -s http://localhost:7860 | grep -q "<div id=\"root\">" && echo "Frontend OK"
1508
+
1509
+ # Full reset/step cycle
1510
+ curl -X POST http://localhost:7860/api/reset \
1511
+ -H "Content-Type: application/json" \
1512
+ -d '{"task_id": "task_easy", "seed": 42}'
1513
+ ```
1514
+
1515
+ ### Hugging Face Spaces Deployment
1516
+
1517
+ The Space will be tagged with `openenv` and configured as:
1518
+ - **SDK:** Docker
1519
+ - **App port:** 7860
1520
+ - **Secrets:** `OPENAI_API_KEY` set via HF Secrets UI
1521
+ - No extra build steps needed — the Dockerfile handles `npm ci && npm run build` internally in Stage 1
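The list above maps onto the Space's README front matter, which is where Spaces reads the SDK and port settings. A minimal sketch (the `title` value is illustrative; only `sdk`, `app_port`, and `tags` come from the configuration above):

```yaml
---
title: WebScraper OpenEnv
sdk: docker
app_port: 7860
tags:
  - openenv
---
```

The `OPENAI_API_KEY` secret is still set through the Spaces Secrets UI, not in this file.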
1522
+
1523
+ ---
1524
+
1525
+ ## 15. openenv.yaml
1526
+
1527
+ ```yaml
1528
+ name: webscraper-openenv
1529
+ version: "1.0.0"
1530
+ description: >
1531
+ A web scraping environment where AI agents extract structured data
1532
+ from simulated HTML pages with varying complexity, pagination,
1533
+ and adversarial noise patterns.
1534
+
1535
+ author: "[Your Name]"
1536
+ license: MIT
1537
+
1538
+ tags:
1539
+ - openenv
1540
+ - web-scraping
1541
+ - information-extraction
1542
+ - nlp
1543
+ - real-world
1544
+
1545
+ tasks:
1546
+ - id: task_easy
1547
+ name: "Static Page Field Extraction"
1548
+ difficulty: easy
1549
+ max_steps: 10
1550
+ description: "Extract 5 product fields from a single clean product page."
1551
+
1552
+ - id: task_medium
1553
+ name: "Paginated Catalog Scraping"
1554
+ difficulty: medium
1555
+ max_steps: 25
1556
+ description: "Find the 3 cheapest items across 3 pages of a product catalog."
1557
+
1558
+ - id: task_hard
1559
+ name: "Multi-Source Research Aggregation"
1560
+ difficulty: hard
1561
+ max_steps: 40
1562
+ description: "Aggregate a company profile from 4 different simulated web sources."
1563
+
1564
+ api:
1565
+ reset: POST /reset
1566
+ step: POST /step
1567
+ state: GET /state
1568
+ tasks: GET /tasks
1569
+ grader: POST /grader
1570
+ baseline: POST /baseline
1571
+
1572
+ observation_space:
1573
+ type: structured
1574
+ fields:
1575
+ - page_html: string
1576
+ - current_url: string
1577
+ - extracted_so_far: object
1578
+ - budget_remaining: integer
1579
+ - target_fields: array
1580
+
1581
+ action_space:
1582
+ type: structured
1583
+ action_types:
1584
+ - extract_field
1585
+ - navigate
1586
+ - search_page
1587
+ - inspect_element
1588
+ - submit
1589
+ - skip_page
1590
+
1591
+ reward_range: [-2.5, 2.5]
1592
+ episode_termination:
1593
+ - "SUBMIT action called"
1594
+ - "budget_remaining reaches 0"
1595
+ ```
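A client can enforce the declared `action_space` before calling `/step`; a minimal sketch (the validator and payload shape are illustrative, not part of the spec):

```python
# Action types declared in openenv.yaml's action_space
ACTION_TYPES = {
    "extract_field", "navigate", "search_page",
    "inspect_element", "submit", "skip_page",
}

def validate_action(action: dict) -> dict:
    """Reject payloads whose action_type is not in the declared action space."""
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return action

validate_action({"action_type": "extract_field", "field": "price"})  # passes
```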
1596
+
1597
+ ---
1598
+
1599
+ ## 16. Testing Strategy
1600
+
1601
+ ### Unit Tests
1602
+
1603
+ **`test_graders.py`**
1604
+ - Test each grader with perfect submission → expect score = 1.0
1605
+ - Test each grader with empty submission → expect score = 0.0
1606
+ - Test partial submissions → expect intermediate scores
1607
+ - Test normalization edge cases (price formats, whitespace, encoding)
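The normalization cases lend themselves to table-driven tests; `normalize_price` below is an illustrative helper, not the grader's actual function:

```python
def normalize_price(raw: str) -> float:
    """Illustrative normalizer: strip currency symbols, separators, and spacing."""
    cleaned = raw.strip().replace("$", "").replace(",", "").replace("\u00a0", "")
    return float(cleaned)

# Table-driven edge cases mirroring the bullet list above
CASES = [
    ("$49.99", 49.99),
    (" 1,299.00 ", 1299.00),
    ("$\u00a03.50", 3.50),  # non-breaking space copied from HTML
]

def test_price_normalization():
    for raw, expected in CASES:
        assert normalize_price(raw) == expected
```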
1608
+
1609
+ **`test_reward.py`**
1610
+ - Correct extraction event → reward > 0
1611
+ - Redundant extraction → reward < 0
1612
+ - Navigation loop → cumulative negative reward
1613
+ - SUBMIT with perfect answer → large positive reward
1614
+
1615
+ **`test_environment.py`**
1616
+ - `reset()` returns clean state with step_number=0
1617
+ - `state()` after 3 steps returns step_number=3
1618
+ - Budget exhaustion terminates episode
1619
+ - Same seed produces identical HTML
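The determinism check ("same seed produces identical HTML") can be written as below; `render_page` here is a stand-in for the environment's seeded generator, not its real implementation:

```python
import random

def render_page(seed: int) -> str:
    """Stand-in for the environment's seeded page generator."""
    rng = random.Random(seed)
    price = rng.randint(10, 99)
    return f"<div class='product'><span class='price'>${price}.00</span></div>"

def test_same_seed_same_html():
    # Re-seeding must reproduce the page byte-for-byte
    assert render_page(42) == render_page(42)

def test_different_seeds_can_differ():
    pages = {render_page(s) for s in range(20)}
    assert len(pages) > 1
```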
1620
+
1621
+ ### Integration Tests
1622
+
1623
+ **`test_api.py`**
1624
+ - Full episode run via HTTP for each task
1625
+ - `/baseline` endpoint completes without error
1626
+ - `/grader` returns score in [0.0, 1.0]
1627
+ - Invalid episode_id returns 404
1628
+
1629
+ ### Validation
1630
+
1631
+ ```bash
1632
+ openenv validate .
1633
+ ```
1634
+
1635
+ Expected: All checks pass, spec compliance confirmed.
1636
+
1637
+ ---
1638
+
1639
+ ## 17. Known Limitations & Future Work
1640
+
1641
+ | Limitation | Impact | Future Fix |
1642
+ |---|---|---|
1643
+ | HTML truncated to 8,000 chars | Very long pages lose content | Configurable window + scrolling action |
1644
+ | No JavaScript rendering simulation | JS-heavy sites not fully modeled | Add iframe/shadow DOM simulation |
1645
+ | Single in-memory episode store | Not horizontally scalable | Redis-backed episode store |
1646
+ | English-only pages | Non-English scraping not tested | Multilingual page templates |
1647
+ | Fixed set of 3 tasks | Limited evaluation breadth | Procedural task generation with task_level param |
1648
+ | No rate limiting simulation in easy/medium | Less realistic for those tiers | Progressive rate limiting across difficulty |
1649
+
1650
+ ---
1651
+
1652
+ *End of Software Design Document*
1653
+
1654
+ *WebScraper-OpenEnv — OpenEnv Round 1 Submission*
docs/agents.md ADDED
@@ -0,0 +1,204 @@
1
+ # Agents System Design
2
+
3
+ ## Overview
4
+
5
+ The agent runtime is a multi-agent, memory-aware RL orchestration layer for web extraction tasks. It supports:
6
+
7
+ - Single-agent and multi-agent execution modes
8
+ - Strategy selection (`search-first`, `direct-extraction`, `multi-hop-reasoning`)
9
+ - Human-in-the-loop intervention
10
+ - Explainable decision traces
11
+ - Self-improvement from past episodes
12
+
13
+ ## Agent Roles
14
+
15
+ ### 1. Planner Agent
16
+
17
+ Builds a plan before action:
18
+
19
+ - Goal decomposition
20
+ - Tool selection plan
21
+ - Risk and fallback path
22
+
23
+ ### 2. Navigator Agent
24
+
25
+ Explores pages and search results:
26
+
27
+ - URL prioritization
28
+ - Link traversal policy
29
+ - Page relevance scoring
30
+
31
+ ### 3. Extractor Agent
32
+
33
+ Extracts structured fields:
34
+
35
+ - Selector and schema inference
36
+ - Adaptive chunk extraction
37
+ - Long-page batch processing
38
+
39
+ ### 4. Verifier Agent
40
+
41
+ Checks consistency and trust:
42
+
43
+ - Cross-source verification
44
+ - Conflict resolution
45
+ - Confidence calibration
46
+
47
+ ### 5. Memory Agent
48
+
49
+ Manages memory write/read/search:
50
+
51
+ - Episode summaries
52
+ - Pattern persistence
53
+ - Retrieval ranking and pruning
54
+
55
+ ## Execution Modes
56
+
57
+ ### Single-Agent
58
+
59
+ One policy handles all actions.
60
+
61
+ Pros: low overhead, simple.
62
+ Cons: weaker specialization.
63
+
64
+ ### Multi-Agent
65
+
66
+ Coordinator delegates work:
67
+
68
+ 1. Planner emits execution graph
69
+ 2. Navigator discovers candidate pages
70
+ 3. Extractor parses and emits data
71
+ 4. Verifier validates outputs
72
+ 5. Memory Agent stores reusable patterns
73
+
74
+ Pros: modular, robust, scalable.
75
+ Cons: coordination overhead.
76
+
77
+ ## Agent Communication
78
+
79
+ Shared channels:
80
+
81
+ - `agent_messages`: async inter-agent messages
82
+ - `task_state`: current objective and progress
83
+ - `global_knowledge`: reusable facts and patterns
84
+
85
+ Message schema:
86
+
87
+ ```json
88
+ {
89
+ "message_id": "msg_123",
90
+ "from": "navigator",
91
+ "to": "extractor",
92
+ "type": "page_candidate",
93
+ "payload": {
94
+ "url": "https://site.com/p/123",
95
+ "relevance": 0.91
96
+ },
97
+ "timestamp": "2026-03-27T00:00:00Z"
98
+ }
99
+ ```
100
+
101
+ ## Decision Policy
102
+
103
+ Policy input includes:
104
+
105
+ - Observation
106
+ - Working memory context
107
+ - Retrieved long-term memory hits
108
+ - Tool registry availability
109
+ - Budget and constraints
110
+
111
+ Policy output includes:
112
+
113
+ - Next action
114
+ - Confidence
115
+ - Rationale
116
+ - Fallback action (optional)
117
+
118
+ ## Strategy Library
119
+
120
+ Built-in strategy templates:
121
+
122
+ - `search-first`: broad discovery then narrow extraction
123
+ - `direct-extraction`: immediate field extraction from target page
124
+ - `multi-hop-reasoning`: iterative search and verification
125
+ - `table-centric`: table-first parsing
126
+ - `form-centric`: forms and input structures prioritized
127
+
128
+ Strategy selection can be:
129
+
130
+ - Manual (user setting)
131
+ - Automatic (router based on task signature)
132
+
133
+ ## Self-Improving Agent Loop
134
+
135
+ After each episode:
136
+
137
+ 1. Compute reward breakdown
138
+ 2. Extract failed and successful patterns
139
+ 3. Update strategy performance table
140
+ 4. Store high-confidence selectors in long-term memory
141
+ 5. Penalize redundant navigation patterns
142
+
143
+ ## Explainable AI Mode
144
+
145
+ Each action can emit:
146
+
147
+ - Why this action was chosen
148
+ - Why alternatives were rejected
149
+ - Which memory/tool evidence was used
150
+
151
+ Example trace:
152
+
153
+ ```text
154
+ Action: EXTRACT_FIELD(price)
155
+ Why: Pattern "span.product-price" had 0.93 historical confidence on similar domains.
156
+ Alternatives rejected: ".price-box .value" (lower confidence 0.58), regex-only extraction (unstable on this layout).
157
+ ```
158
+
159
+ ## Human-in-the-Loop
160
+
161
+ Optional checkpoints:
162
+
163
+ - Approve/reject planned action
164
+ - Override selector/tool/model
165
+ - Force verification before submit
166
+
167
+ Intervention modes:
168
+
169
+ - `off`: fully autonomous
170
+ - `review`: pause on low-confidence steps
171
+ - `strict`: require approval on all submit/fetch/verify actions
172
+
173
+ ## Scenario Simulator Hooks
174
+
175
+ Agents can be tested against:
176
+
177
+ - Noisy HTML
178
+ - Missing fields
179
+ - Broken pagination
180
+ - Adversarial layouts
181
+ - Dynamic content with delayed rendering
182
+
183
+ Simulation metrics:
184
+
185
+ - Completion
186
+ - Recovery score
187
+ - Generalization score
188
+ - Cost and latency
189
+
190
+ ## APIs
191
+
192
+ - `POST /api/agents/run`
193
+ - `POST /api/agents/plan`
194
+ - `POST /api/agents/override`
195
+ - `GET /api/agents/state/{episode_id}`
196
+ - `GET /api/agents/trace/{episode_id}`
197
+
198
+ ## Dashboard Widgets
199
+
200
+ - Live thought stream
201
+ - Agent role timeline
202
+ - Inter-agent message feed
203
+ - Strategy performance chart
204
+ - Confidence and override panel
docs/api.md ADDED
@@ -0,0 +1,901 @@
1
+ # 🤖 Multi-Model API System
2
+
3
+ ## Table of Contents
4
+ 1. [Overview](#overview)
5
+ 2. [Supported Providers](#supported-providers)
6
+ 3. [Smart Model Router](#smart-model-router)
7
+ 4. [Model Ensemble](#model-ensemble)
8
+ 5. [Cost & Token Tracking](#cost--token-tracking)
9
+ 6. [Prompt Management](#prompt-management)
10
+ 7. [Configuration](#configuration)
11
+ 8. [API Reference](#api-reference)
12
+
13
+ ---
14
+
15
+ ## Overview
16
+
17
+ The **Multi-Model API System** provides a unified interface for interacting with multiple LLM providers (OpenAI, Anthropic, Google, Groq, etc.), enabling:
18
+
19
+ - **Flexibility:** Switch between models without code changes
20
+ - **Optimization:** Auto-route requests to the best model for each task
21
+ - **Cost Control:** Track spending and enforce budgets
22
+ - **Reliability:** Fallback to alternative models on failure
23
+ - **Experimentation:** A/B test prompts and models
24
+
25
+ ### Architecture
26
+
27
+ ```
28
+ ┌────────────────────────────────────────────────────────────────┐
29
+ │ Agent Request │
30
+ │ "Extract product price" │
31
+ └────────────────────────┬───────────────────────────────────────┘
32
+
33
+
34
+ ┌────────────────────────────────────────────────────────────────┐
35
+ │ Smart Model Router │
36
+ │ ┌──────────────────────────────────────────────────────────┐ │
37
+ │ │ Task Classifier: │ │
38
+ │ │ • Reasoning → GPT-4 / Claude │ │
39
+ │ │ • Fast extraction → Groq / Gemini Flash │ │
40
+ │ │ • Long context → Claude / GPT-4-32k │ │
41
+ │ │ • Cost-sensitive → Gemini / Groq │ │
42
+ │ └──────────────────────────────────────────────────────────┘ │
43
+ └────────────────────────┬───────────────────────────────────────┘
44
+
45
+ ┌───────────────┼───────────────┬───────────────┐
46
+ │ │ │ │
47
+ ▼ ▼ ▼ ▼
48
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
49
+ │ OpenAI │ │ Anthropic │ │ Google │ │ Groq │
50
+ │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │
51
+ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
52
+ │ │ │ │
53
+ ▼ ▼ ▼ ▼
54
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
55
+ │ gpt-4-turbo │ │ claude-3.5 │ │ gemini-pro │ │ llama-3-70b │
56
+ │ gpt-4o-mini │ │ claude-3 │ │ gemini-flash│ │ mixtral-8x7b│
57
+ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
58
+ ```
59
+
60
+ ---
61
+
62
+ ## Supported Providers
63
+
64
+ ### 1. OpenAI
65
+
66
+ **Models:**
67
+ - `gpt-4-turbo` - Best reasoning, multimodal
68
+ - `gpt-4o` - Fast GPT-4 variant
69
+ - `gpt-4o-mini` - Cost-effective, fast
70
+ - `gpt-3.5-turbo` - Legacy, cheap
71
+
72
+ **Capabilities:**
73
+ - Function calling
74
+ - JSON mode
75
+ - Vision (gpt-4-turbo, gpt-4o)
76
+ - 128k context (gpt-4-turbo)
77
+
78
+ **Configuration:**
79
+ ```python
80
+ {
81
+ "provider": "openai",
82
+ "api_key": "sk-...",
83
+ "organization": "org-...", # Optional
84
+ "models": {
85
+ "default": "gpt-4o-mini",
86
+ "reasoning": "gpt-4-turbo",
87
+ "fast": "gpt-4o-mini"
88
+ },
89
+ "parameters": {
90
+ "temperature": 0.7,
91
+ "max_tokens": 4096,
92
+ "timeout": 60
93
+ }
94
+ }
95
+ ```
96
+
97
+ ### 2. Anthropic (Claude)
98
+
99
+ **Models:**
100
+ - `claude-3-opus-20240229` - Most capable
101
+ - `claude-3-sonnet-20240229` - Balanced
102
+ - `claude-3-haiku-20240307` - Fast and cheap
103
+ - `claude-3-5-sonnet-20240620` - Latest, best
104
+
105
+ **Capabilities:**
106
+ - 200k context window
107
+ - Strong reasoning
108
+ - Excellent instruction following
109
+ - Tool use (function calling)
110
+
111
+ **Configuration:**
112
+ ```python
113
+ {
114
+ "provider": "anthropic",
115
+ "api_key": "sk-ant-...",
116
+ "models": {
117
+ "default": "claude-3-5-sonnet-20240620",
118
+ "reasoning": "claude-3-opus-20240229",
119
+ "fast": "claude-3-haiku-20240307"
120
+ },
121
+ "parameters": {
122
+ "temperature": 0.7,
123
+ "max_tokens": 4096,
124
+ "timeout": 90
125
+ }
126
+ }
127
+ ```
128
+
129
+ ### 3. Google (Gemini)
130
+
131
+ **Models:**
132
+ - `gemini-1.5-pro` - Best quality, 2M context
133
+ - `gemini-1.5-flash` - Fast, 1M context
134
+ - `gemini-1.0-pro` - Legacy
135
+
136
+ **Capabilities:**
137
+ - Massive context (1M-2M tokens)
138
+ - Multimodal (text, image, video, audio)
139
+ - Extremely cost-effective
140
+ - Function calling
141
+
142
+ **Configuration:**
143
+ ```python
144
+ {
145
+ "provider": "google",
146
+ "api_key": "AIza...",
147
+ "models": {
148
+ "default": "gemini-1.5-flash",
149
+ "reasoning": "gemini-1.5-pro",
150
+ "fast": "gemini-1.5-flash"
151
+ },
152
+ "parameters": {
153
+ "temperature": 0.7,
154
+ "max_output_tokens": 8192,
155
+ "timeout": 60
156
+ }
157
+ }
158
+ ```
159
+
160
+ ### 4. Groq
161
+
162
+ **Models:**
163
+ - `llama-3.1-405b` - Largest Llama
164
+ - `llama-3.1-70b-versatile` - Balanced
165
+ - `llama-3.1-8b-instant` - Ultra-fast
166
+ - `mixtral-8x7b-32768` - Good reasoning
167
+
168
+ **Capabilities:**
169
+ - **Extremely fast inference** (500+ tokens/sec)
170
+ - Free tier available
171
+ - Open-source models
172
+ - JSON mode
173
+
174
+ **Configuration:**
175
+ ```python
176
+ {
177
+ "provider": "groq",
178
+ "api_key": "gsk_...",
179
+ "models": {
180
+ "default": "llama-3.1-70b-versatile",
181
+ "reasoning": "llama-3.1-405b",
182
+ "fast": "llama-3.1-8b-instant"
183
+ },
184
+ "parameters": {
185
+ "temperature": 0.7,
186
+ "max_tokens": 8192,
187
+ "timeout": 30
188
+ }
189
+ }
190
+ ```
191
+
192
+ ### 5. Mistral AI
193
+
194
+ **Models:**
195
+ - `mistral-large-latest` - Best quality
196
+ - `mistral-medium-latest` - Balanced
197
+ - `mistral-small-latest` - Fast and cheap
198
+ - `mixtral-8x22b` - Open-source, strong
199
+
200
+ **Configuration:**
201
+ ```python
202
+ {
203
+ "provider": "mistral",
204
+ "api_key": "...",
205
+ "models": {
206
+ "default": "mistral-medium-latest",
207
+ "reasoning": "mistral-large-latest",
208
+ "fast": "mistral-small-latest"
209
+ }
210
+ }
211
+ ```
212
+
213
+ ### 6. Cohere
214
+
215
+ **Models:**
216
+ - `command-r-plus` - Best for RAG
217
+ - `command-r` - Balanced
218
+ - `command-light` - Fast
219
+
220
+ **Specialization:** RAG, embeddings, reranking
221
+
222
+ ### 7. Perplexity
223
+
224
+ **Models:**
225
+ - `pplx-70b-online` - Web-connected
226
+ - `pplx-7b-online` - Fast, web-connected
227
+
228
+ **Specialization:** Real-time web search and citations
229
+
230
+ ### 8. Together AI
231
+
232
+ **Models:** 50+ open-source models
233
+ - Llama variants
234
+ - Mistral variants
235
+ - Code models (CodeLlama, StarCoder)
236
+
237
+ **Use Case:** Access to latest open-source models
238
+
239
+ ### 9. Custom / Self-Hosted
240
+
241
+ **Supported:**
242
+ - **Ollama** (local models)
243
+ - **vLLM** (self-hosted inference)
244
+ - **LM Studio** (local GUI)
245
+ - **LocalAI** (OpenAI-compatible local server)
246
+
247
+ **Configuration:**
248
+ ```python
249
+ {
250
+ "provider": "custom",
251
+ "base_url": "http://localhost:11434/v1", # Ollama
252
+ "api_key": "not-needed",
253
+ "models": {
254
+ "default": "llama3:70b",
255
+ "fast": "llama3:8b"
256
+ }
257
+ }
258
+ ```
259
+
260
+ ---
261
+
262
+ ## Smart Model Router
263
+
264
+ The **Smart Model Router** automatically selects the best model for each request based on task characteristics.
265
+
266
+ ### Routing Strategy
267
+
268
+ ```python
269
+ class ModelRouter:
270
+ def route(self, task: Task, context: Dict) -> ModelConfig:
271
+ """Select the best model for this task."""
272
+
273
+ # 1. Explicit user preference
274
+ if context.get("preferred_model"):
275
+ return self.get_model(context["preferred_model"])
276
+
277
+ # 2. Task-based routing
278
+ if task.type == "reasoning":
279
+ return self.route_reasoning(task, context)
280
+ elif task.type == "extraction":
281
+ return self.route_extraction(task, context)
282
+ elif task.type == "classification":
283
+ return self.route_classification(task, context)
284
+
285
+ # 3. Fallback to default
286
+ return self.default_model
287
+
288
+ def route_reasoning(self, task: Task, context: Dict) -> ModelConfig:
289
+ """Route complex reasoning tasks."""
290
+ # Long context? Use Claude or Gemini
291
+ if context.get("input_tokens", 0) > 50000:
292
+ return self.get_model("claude-3-5-sonnet") # 200k context
293
+
294
+ # Need reliability? Use GPT-4 or Claude
295
+ if task.importance == "high":
296
+ return self.get_model("gpt-4-turbo")
297
+
298
+ # Cost-sensitive? Use Gemini or Groq
299
+ if context.get("budget_mode"):
300
+ return self.get_model("gemini-1.5-flash")
301
+
302
+ return self.get_model("claude-3-5-sonnet") # Default for reasoning
303
+
304
+ def route_extraction(self, task: Task, context: Dict) -> ModelConfig:
305
+ """Route simple extraction tasks."""
306
+ # Speed critical? Use Groq
307
+ if context.get("latency_critical"):
308
+ return self.get_model("llama-3.1-70b-versatile", provider="groq")
309
+
310
+ # Cost-sensitive? Use Gemini Flash or Groq
311
+ return self.get_model("gemini-1.5-flash")
312
+ ```
313
+
314
+ ### Routing Rules
315
+
316
+ | Task Type | Input Size | Priority | Recommended Model | Reason |
317
+ |-----------|-----------|----------|-------------------|--------|
318
+ | Reasoning | Any | High | `gpt-4-turbo` | Best quality |
319
+ | Reasoning | >50k tokens | Any | `claude-3-5-sonnet` | 200k context |
320
+ | Reasoning | Any | Budget | `gemini-1.5-flash` | Cheap, good quality |
321
+ | Extraction | <10k tokens | Speed | `groq/llama-3.1-70b` | 500+ tok/sec |
322
+ | Extraction | Any | Budget | `gpt-4o-mini` | $0.15/1M tokens |
323
+ | Classification | <5k tokens | Any | `groq/llama-3.1-8b` | Ultra-fast |
324
+ | Long Context | >100k tokens | Any | `gemini-1.5-pro` | 2M context |
325
+ | Vision | Images | Any | `gpt-4o` | Best multimodal |
326
+ | Web Search | Any | Any | `perplexity` | Web-connected |
327
+
328
+ ### Configuration
329
+
330
+ ```python
331
+ class RouterConfig(BaseModel):
332
+ enabled: bool = True
333
+ strategy: Literal["task_based", "cost_optimized", "speed_optimized", "quality_optimized"]
334
+
335
+ # Task-based routing rules
336
+ routing_rules: Dict[str, str] = {
337
+ "reasoning_high_priority": "gpt-4-turbo",
338
+ "reasoning_budget": "gemini-1.5-flash",
339
+ "extraction_fast": "groq/llama-3.1-70b",
340
+ "extraction_accurate": "claude-3-5-sonnet",
341
+ "long_context": "gemini-1.5-pro",
342
+ "vision": "gpt-4o"
343
+ }
344
+
345
+ # Fallback chain
346
+ fallback_order: List[str] = [
347
+ "claude-3-5-sonnet",
348
+ "gpt-4o-mini",
349
+ "gemini-1.5-flash",
350
+ "groq/llama-3.1-70b"
351
+ ]
352
+
353
+ # Auto-retry on failure
354
+ auto_retry: bool = True
355
+ max_retries: int = 3
356
+ ```
357
+
358
+ ---
359
+
360
+ ## Model Ensemble
361
+
362
+ **Model Ensemble** runs multiple models in parallel and merges their outputs for higher quality or consensus.
363
+
364
+ ### Ensemble Strategies
365
+
366
+ #### 1. Voting (Classification/Extraction)
367
+
368
+ Run 3+ models, take majority vote.
369
+
370
+ ```python
371
+ class VotingEnsemble:
372
+ async def predict(self, prompt: str, models: List[str]) -> Any:
373
+ """Run multiple models and vote on result."""
374
+ tasks = [self.call_model(model, prompt) for model in models]
375
+ results = await asyncio.gather(*tasks)
376
+
377
+ # Count votes
378
+ from collections import Counter
379
+ votes = Counter(results)
380
+ winner, count = votes.most_common(1)[0]
381
+
382
+ confidence = count / len(results)
383
+ return {
384
+ "result": winner,
385
+ "confidence": confidence,
386
+ "votes": dict(votes)
387
+ }
388
+
389
+ # Example: Extract price with 3 models
390
+ ensemble = VotingEnsemble()
391
+ result = await ensemble.predict(
392
+ prompt="Extract the product price: <html>...",
393
+ models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
394
+ )
395
+ # Result: {"result": "$49.99", "confidence": 1.0, "votes": {"$49.99": 3}}
396
+ ```
397
+
398
+ #### 2. Ranking (Quality Assessment)
399
+
400
+ Run multiple models, rank outputs by quality.
401
+
402
+ ```python
403
+ class RankingEnsemble:
404
+ async def generate(self, prompt: str, models: List[str]) -> List[Dict]:
405
+ """Generate with multiple models and rank by quality."""
406
+ tasks = [self.call_model(model, prompt) for model in models]
407
+ results = await asyncio.gather(*tasks)
408
+
409
+ # Score each result
410
+ scored_results = []
411
+ for model, output in zip(models, results):
412
+ score = self.quality_scorer.score(output, prompt)
413
+ scored_results.append({
414
+ "model": model,
415
+ "output": output,
416
+ "quality_score": score
417
+ })
418
+
419
+ # Sort by score
420
+ scored_results.sort(key=lambda x: x["quality_score"], reverse=True)
421
+ return scored_results
422
+
423
+ # Example: Generate reasoning with ranking
424
+ ensemble = RankingEnsemble()
425
+ results = await ensemble.generate(
426
+ prompt="Explain how to extract a price from HTML",
427
+ models=["gpt-4-turbo", "claude-3-5-sonnet", "gemini-1.5-pro"]
428
+ )
429
+ best_result = results[0] # Highest quality
430
+ ```
431
+
432
+ #### 3. Fusion (Merging Outputs)
433
+
434
+ Merge complementary outputs from multiple models.
435
+
436
+ ```python
437
+ class FusionEnsemble:
438
+ async def extract_structured(self, prompt: str, models: List[str]) -> Dict:
439
+ """Extract structured data with multiple models and merge."""
440
+ tasks = [self.call_model(model, prompt) for model in models]
441
+ results = await asyncio.gather(*tasks)
442
+
443
+ # Merge fields with confidence weighting
444
+ merged = {}
445
+ for field in self.extract_fields(results):
446
+ values = [r.get(field) for r in results if r.get(field)]
447
+ if not values:
448
+ continue
449
+
450
+ # Use most common value, or highest-confidence model's value
451
+ from collections import Counter
452
+ counts = Counter(values)
453
+ merged[field] = counts.most_common(1)[0][0]
454
+
455
+ return merged
456
+
457
+ # Example: Extract product data with fusion
458
+ ensemble = FusionEnsemble()
459
+ product = await ensemble.extract_structured(
460
+ prompt="Extract product details: <html>...",
461
+ models=["gpt-4o-mini", "gemini-1.5-flash", "claude-3-haiku"]
462
+ )
463
+ # Merges: {name: "...", price: "$X", rating: "Y" } from all models
464
+ ```
465
+
466
+ #### 4. Verification (Primary + Validator)
467
+
468
+ One model generates, another validates.
469
+
470
+ ```python
471
+ class VerificationEnsemble:
472
+ async def generate_and_verify(
473
+ self,
474
+ prompt: str,
475
+ generator_model: str,
476
+ validator_model: str
477
+ ) -> Dict:
478
+ """Generate with one model, verify with another."""
479
+ # Generate
480
+ output = await self.call_model(generator_model, prompt)
481
+
482
+ # Verify
483
+ verification_prompt = f"""
484
+ Original task: {prompt}
485
+ Generated output: {output}
486
+
487
+ Is this output correct and complete? Explain any issues.
488
+ """
489
+ verification = await self.call_model(validator_model, verification_prompt)
490
+
491
+ return {
492
+ "output": output,
493
+ "verification": verification,
494
+ "confidence": self.parse_confidence(verification)
495
+ }
496
+
497
+ # Example: Generate with Groq (fast), verify with Claude (accurate)
498
+ ensemble = VerificationEnsemble()
499
+ result = await ensemble.generate_and_verify(
500
+ prompt="Extract all product prices from this catalog page",
501
+ generator_model="groq/llama-3.1-70b",
502
+ validator_model="claude-3-5-sonnet"
503
+ )
504
+ ```
505
+
506
+ ### Ensemble Configuration
507
+
508
+ ```python
509
+ class EnsembleConfig(BaseModel):
510
+ enabled: bool = False # Off by default (costs more)
511
+ strategy: Literal["voting", "ranking", "fusion", "verification"]
512
+
513
+ # Model selection
514
+ models: List[str] = [] # If empty, router selects
515
+
516
+ # Voting settings
517
+ min_agreement: float = 0.67 # Require 67% agreement
518
+
519
+ # Ranking settings
520
+ quality_metric: Literal["coherence", "accuracy", "completeness"]
521
+
522
+ # Verification settings
523
+ generator_model: Optional[str] = None
524
+ validator_model: Optional[str] = None
525
+ ```
526
+
527
+ ---
528
+
529
+ ## Cost & Token Tracking
530
+
531
+ Track spending and token usage across all models.
532
+
533
+ ### Cost Tracker
534
+
535
+ ```python
536
+ class CostTracker:
537
+ # Pricing (as of March 2026, per 1M tokens)
538
+ PRICING = {
539
+ "gpt-4-turbo": {"input": 10.00, "output": 30.00},
540
+ "gpt-4o": {"input": 5.00, "output": 15.00},
541
+ "gpt-4o-mini": {"input": 0.15, "output": 0.60},
542
+ "claude-3-opus": {"input": 15.00, "output": 75.00},
543
+ "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
544
+ "claude-3-haiku": {"input": 0.25, "output": 1.25},
545
+ "gemini-1.5-pro": {"input": 3.50, "output": 10.50},
546
+ "gemini-1.5-flash": {"input": 0.35, "output": 1.05},
547
+ "groq/llama-3.1-70b": {"input": 0.59, "output": 0.79},
548
+ "groq/llama-3.1-8b": {"input": 0.05, "output": 0.08},
549
+ }
550
+
551
+ def calculate_cost(
552
+ self,
553
+ model: str,
554
+ input_tokens: int,
555
+ output_tokens: int
556
+ ) -> float:
557
+ """Calculate cost for this request."""
558
+ pricing = self.PRICING.get(model, {"input": 0, "output": 0})
559
+ cost = (
560
+ (input_tokens / 1_000_000) * pricing["input"] +
561
+ (output_tokens / 1_000_000) * pricing["output"]
562
+ )
563
+ return cost
564
+
565
+ def track_request(self, request: ModelRequest, response: ModelResponse):
566
+ """Track a model request."""
567
+ cost = self.calculate_cost(
568
+ model=request.model,
569
+ input_tokens=response.usage.prompt_tokens,
570
+ output_tokens=response.usage.completion_tokens
571
+ )
572
+
573
+ self.db.insert({
574
+ "timestamp": datetime.now(),
575
+ "model": request.model,
576
+ "input_tokens": response.usage.prompt_tokens,
577
+ "output_tokens": response.usage.completion_tokens,
578
+ "total_tokens": response.usage.total_tokens,
579
+ "cost_usd": cost,
580
+ "latency_ms": response.latency_ms,
581
+ "task_type": request.task_type,
582
+ "success": response.success
583
+ })
584
+ ```
585
+
586
+ ### Budget Enforcement
587
+
588
+ ```python
589
+ class BudgetEnforcer:
590
+ def __init__(self, daily_budget_usd: float):
591
+ self.daily_budget = daily_budget_usd
592
+ self.cost_tracker = CostTracker()
593
+
594
+ def check_budget(self) -> bool:
595
+ """Check if budget allows this request."""
596
+ today_cost = self.cost_tracker.get_today_cost()
597
+ return today_cost < self.daily_budget
598
+
599
+ async def call_with_budget(self, request: ModelRequest) -> ModelResponse:
600
+ """Make request only if budget allows."""
601
+ if not self.check_budget():
602
+ # Fallback to cheapest model
603
+ request.model = "groq/llama-3.1-8b-instant"
604
+ logger.warning(f"Budget exceeded, downgrading to {request.model}")
605
+
606
+ response = await self.call_model(request)
607
+ self.cost_tracker.track_request(request, response)
608
+ return response
609
+ ```
610
+
611
+ ### Token Usage Dashboard
612
+
613
+ **UI Display:**
614
+ ```
615
+ ┌──��───────────────────────────────────────────────────────────┐
616
+ │ Token Usage & Cost (Last 24h) │
617
+ ├──────────────────────────────────────────────────────────────┤
618
+ │ │
619
+ │ Total Tokens: 1,234,567 │
620
+ │ Total Cost: $12.34 │
621
+ │ Requests: 456 │
622
+ │ Avg Latency: 1.2s │
623
+ │ │
624
+ │ ┌────────────────────────────────────────────────────────┐ │
625
+ │ │ Cost by Model │ │
626
+ │ │ ████████████████████ gpt-4-turbo $6.50 (53%) │ │
627
+ │ │ ██████████ claude-3-5-sonnet $3.20 (26%) │ │
628
+ │ │ █████ gemini-1.5-flash $1.80 (15%) │ │
629
+ │ │ ██ groq/llama-3.1-70b $0.84 (6%) │ │
630
+ │ └────────────────────────────────────────────────────────┘ │
631
+ │ │
632
+ │ ┌────────────────────────────────────────────────────────┐ │
633
+ │ │ Token Usage by Model │ │
634
+ │ │ Model Input Output Total Cost │ │
635
+ │ │ gpt-4-turbo 123K 45K 168K $6.50 │ │
636
+ │ │ claude-3-5-sonnet 456K 89K 545K $3.20 │ │
637
+ │ │ gemini-1.5-flash 890K 234K 1124K $1.80 │ │
638
+ │ └────────────────────────────────────────────────────────┘ │
639
+ │ │
640
+ │ Budget: $12.34 / $20.00 (62% used) │
641
+ │ [█████████████████░░░░░░░░░░] │
642
+ │ │
643
+ │ ⚠️ Budget 80% threshold: Alert enabled │
644
+ │ │
645
+ └──────────────────────────────────────────────────────────────┘
646
+ ```
647

---

## Prompt Management

Manage, version, and A/B test prompts.

### Prompt Templates

```python
class PromptTemplate(BaseModel):
    template_id: str
    name: str
    template: str
    variables: List[str]
    version: int
    created_at: datetime
    performance_score: Optional[float] = None

class PromptManager:
    def get_template(self, template_id: str, version: Optional[int] = None) -> PromptTemplate:
        """Get prompt template by ID and version."""
        if version is None:
            return self.get_latest_version(template_id)
        return self.db.get(template_id, version)

    def render(self, template_id: str, variables: Dict, version: Optional[int] = None) -> str:
        """Render a template (optionally a specific version) with variables."""
        template = self.get_template(template_id, version)
        return template.template.format(**variables)

    def create_version(self, template_id: str, new_template: str) -> int:
        """Create new version of template."""
        current = self.get_template(template_id)
        new_version = current.version + 1

        self.db.insert(PromptTemplate(
            template_id=template_id,
            name=current.name,
            template=new_template,
            variables=current.variables,
            version=new_version,
            created_at=datetime.now()
        ))

        return new_version
```
694

### Example Templates

```python
# Extraction prompt
EXTRACTION_PROMPT = """
You are a web scraping agent. Extract the following fields from the HTML:

Target fields: {target_fields}

HTML content:
{html_content}

Return a JSON object with the extracted values. If a field is not found, use null.

Example output format:
{{
    "field1": "value1",
    "field2": "value2"
}}
"""

# Reasoning prompt
REASONING_PROMPT = """
You are analyzing a web page to plan your next extraction action.

Current goal: {goal}
Page URL: {url}
Available actions: {actions}
Previous attempts: {history}

Think step by step:
1. What information is most important for the goal?
2. What patterns do you see in the HTML structure?
3. Which action is most likely to succeed?
4. What could go wrong?

Provide your reasoning and then choose an action.
"""

# Register templates
prompt_manager = PromptManager()
prompt_manager.register("extraction_v1", EXTRACTION_PROMPT, ["target_fields", "html_content"])
prompt_manager.register("reasoning_v1", REASONING_PROMPT, ["goal", "url", "actions", "history"])
```
739

### A/B Testing

```python
import random

import numpy as np

class PromptABTest:
    def __init__(self, template_id: str, variants: List[int]):
        self.template_id = template_id
        self.variants = variants  # Version numbers
        self.results = {v: [] for v in variants}

    def get_variant(self) -> int:
        """Select variant (round-robin or random)."""
        return random.choice(self.variants)

    def track_result(self, variant: int, success: bool, score: float):
        """Track performance of a variant."""
        self.results[variant].append({"success": success, "score": score})

    def get_winner(self) -> int:
        """Determine which variant performs best."""
        avg_scores = {
            v: np.mean([r["score"] for r in results])
            for v, results in self.results.items()
            if results
        }
        return max(avg_scores, key=avg_scores.get)

# Run A/B test
test = PromptABTest("extraction_v1", variants=[1, 2, 3])

for episode in episodes:
    variant = test.get_variant()
    prompt = prompt_manager.render("extraction_v1", variables, version=variant)
    result = await model.generate(prompt)
    test.track_result(variant, result.success, result.score)

winner = test.get_winner()
print(f"Best variant: v{winner}")
```
778

---

## Configuration

### Settings Panel

```python
class APISettings(BaseModel):
    # Provider configurations
    providers: Dict[str, ProviderConfig] = {}

    # Default model
    default_model: str = "gpt-4o-mini"

    # Smart routing
    router: RouterConfig = RouterConfig()

    # Ensemble
    ensemble: EnsembleConfig = EnsembleConfig()

    # Cost control
    daily_budget_usd: float = 20.00
    alert_threshold: float = 0.8  # Alert at 80% of budget

    # Rate limiting
    max_requests_per_minute: int = 60

    # Retry policy
    max_retries: int = 3
    retry_delay_seconds: int = 2

    # Prompt management
    prompt_templates: Dict[str, str] = {}
```
813

**UI Example:**
```
┌────────────────────────────────────────────────────────────┐
│ API Settings                                               │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Model Providers:                                           │
│ ┌─────────────────────────────────────────────────────┐    │
│ │ ☑ OpenAI                                            │    │
│ │   API Key: [sk-proj-••••••••••••••••]  [Test]       │    │
│ │   Default: [gpt-4o-mini ▼]                          │    │
│ │                                                     │    │
│ │ ☑ Anthropic                                         │    │
│ │   API Key: [sk-ant-••••••••••••••••]  [Test]        │    │
│ │   Default: [claude-3-5-sonnet ▼]                    │    │
│ │                                                     │    │
│ │ ☑ Google                                            │    │
│ │   API Key: [AIza••••••••••••••••••••]  [Test]       │    │
│ │   Default: [gemini-1.5-flash ▼]                     │    │
│ │                                                     │    │
│ │ ☑ Groq                                              │    │
│ │   API Key: [gsk_••••••••••••••••••••]  [Test]       │    │
│ │   Default: [llama-3.1-70b-versatile ▼]              │    │
│ │                                                     │    │
│ │ ☐ Mistral   [Configure]                             │    │
│ │ ☐ Cohere    [Configure]                             │    │
│ │ ☐ Custom    [Configure]                             │    │
│ └─────────────────────────────────────────────────────┘    │
│                                                            │
│ Smart Routing:                                             │
│   ☑ Enabled                                                │
│   Strategy: [Task-Based ▼]                                 │
│   Fallback: [claude → gpt-4o-mini → gemini → groq]         │
│                                                            │
│ Model Ensemble:                                            │
│   ☐ Enabled (increases cost)                               │
│   Strategy: [Voting ▼]                                     │
│   Models: [gpt-4o-mini, gemini-flash, groq/llama ▼]        │
│                                                            │
│ Cost Control:                                              │
│   Daily Budget: [$20.00]                                   │
│   Alert at: [80%] of budget                                │
│   Current Usage: $12.34 / $20.00 (62%)                     │
│                                                            │
│ [Save Settings]  [Reset to Defaults]                       │
└────────────────────────────────────────────────────────────┘
```
861

---

## API Reference

### Python Client

```python
from webscraper_env import MultiModelAPI

# Initialize with config
api = MultiModelAPI(settings=APISettings())

# Simple generation
response = await api.generate(
    prompt="Extract product price from: <html>...",
    model="gpt-4o-mini"  # Optional, uses router if omitted
)

# With routing
response = await api.generate(
    prompt="Complex reasoning task...",
    task_type="reasoning",  # Router selects best model
    priority="high"
)

# With ensemble
response = await api.generate_ensemble(
    prompt="Extract all prices",
    strategy="voting",
    models=["gpt-4o-mini", "gemini-1.5-flash", "groq/llama-3.1-70b"]
)

# Streaming
async for chunk in api.generate_stream(prompt="...", model="claude-3-5-sonnet"):
    print(chunk.text, end="", flush=True)
```

---

**Next:** See [mcp.md](./mcp.md) for MCP server integration.
docs/architecture.md ADDED
@@ -0,0 +1,168 @@
# System Architecture

## Overview

WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing.

## High-Level Topology

```text
Frontend Dashboard (React/Vite)
        |
        v
FastAPI Control Plane
  - episode lifecycle
  - action dispatch
  - reward engine
  - tool registry API
  - settings + policy
        |
        +--> Agent Runtime
        |      - planner/navigator/extractor/verifier
        |      - memory manager
        |      - model router
        |
        +--> MCP Gateway
        |      - tool discovery
        |      - lazy install/load
        |      - schema + timeout + retries
        |
        +--> Search Layer
        |      - provider routing
        |      - query optimization
        |      - credibility scoring
        |
        +--> Memory Layer
        |      - short/working/long/shared
        |      - vector index + persistent storage
        |
        +--> Observability
               - traces/logs/metrics/cost dashboard
```

## Core Subsystems

### 1. Control Plane

Responsibilities:

- reset/step/state APIs
- request validation
- action authorization and policy checks
- deterministic episode management

### 2. Agent Runtime

Responsibilities:

- policy inference
- strategy execution
- fallback handling
- action explainability

### 3. Tooling Plane (MCP)

Responsibilities:

- dynamic tool registry
- server health checks
- lazy installation
- composition workflows

### 4. Data Plane

Responsibilities:

- HTML ingestion and chunking
- extraction and normalization
- verification and reconciliation
- output persistence

### 5. Analytics Plane

Responsibilities:

- reward component logging
- model/token/cost accounting
- tool usage telemetry
- memory quality analytics

## Processing Pipeline

1. `reset(task_id, seed)`
2. observation emitted
3. policy selects action
4. action executes (native/MCP/search/memory)
5. reward computed and logged
6. done check
7. repeat until terminal

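The pipeline above can be sketched as a driver loop. This is an illustrative sketch only: `env`, `policy`, and the step return shape are assumptions standing in for the control-plane API, not its actual signatures.

```python
def run_episode(env, policy, task_id: str, seed: int = 0, max_steps: int = 50):
    """Drive one episode through the reset/step pipeline (illustrative sketch)."""
    obs = env.reset(task_id, seed)                   # 1. reset -> 2. observation emitted
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy.select(obs)                  # 3. policy selects action
        obs, reward, done, info = env.step(action)   # 4. execute, 5. reward computed/logged
        total_reward += reward
        steps += 1
        if done:                                     # 6. done check
            break                                    # 7. otherwise repeat until terminal
    return total_reward, steps
```

The `max_steps` cap mirrors the per-step safety budget described under Reliability below.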
## Batch and Parallel Design

### Batch

- large HTML split into semantic chunks
- chunk extraction batched with bounded size
- merge + dedupe + confidence rank

### Parallel

- independent chunk tasks run concurrently
- search and verification can run in parallel branches
- configurable worker limits and queue priorities

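A minimal sketch of bounded-concurrency chunk processing with `asyncio`; the `extract_chunk` coroutine and the worker limit are illustrative assumptions, not the runtime's real API.

```python
import asyncio

async def process_chunks(chunks, extract_chunk, max_workers: int = 4):
    """Run independent chunk tasks concurrently, capped by a worker limit."""
    sem = asyncio.Semaphore(max_workers)   # configurable worker limit

    async def bounded(chunk):
        async with sem:                    # at most max_workers tasks in flight
            return await extract_chunk(chunk)

    # gather preserves input order, so results line up with chunks
    return await asyncio.gather(*(bounded(c) for c in chunks))
```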
## Queue and Scheduler

The task queue supports:

- priority classes (`high`, `normal`, `low`)
- cancellation tokens
- retry policy with backoff
- dead-letter queue for repeated failures

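A stdlib sketch of those four properties combined (minus cancellation tokens, omitted for brevity); class and field names here are illustrative, not the actual scheduler.

```python
import heapq
import itertools

PRIORITY = {"high": 0, "normal": 1, "low": 2}

class TaskQueue:
    """Priority queue with retry/backoff and a dead-letter queue (sketch)."""

    def __init__(self, max_retries: int = 3):
        self._heap, self._seq = [], itertools.count()  # seq breaks priority ties FIFO
        self.max_retries = max_retries
        self.dead_letter = []

    def put(self, task, priority: str = "normal", attempt: int = 0):
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._seq), attempt, task))

    def run(self, worker):
        while self._heap:
            _prio, _, attempt, task = heapq.heappop(self._heap)
            try:
                worker(task)
            except Exception:
                if attempt + 1 >= self.max_retries:
                    self.dead_letter.append(task)      # repeated failures end here
                else:
                    delay = 2 ** attempt               # exponential backoff; a real
                    del delay                          # scheduler would wait before requeue
                    self.put(task, "low", attempt + 1)
```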
## Storage Architecture

- Episode state: in-memory + optional persistence
- Long-term memory: vector DB + metadata store
- Logs/metrics: append-only time-series-friendly sink
- Exports: JSON/CSV trace packs

## Reliability

- per-tool timeout and retry
- per-step safety budget
- circuit breaker for failing providers
- deterministic fallback chains

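The circuit-breaker behavior can be sketched as follows; the threshold/cooldown values and method names are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Trip after N consecutive failures; allow a retry after a cooldown (sketch)."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                # circuit closed: call through
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0    # half-open: probe the provider
            return True
        return False                                   # circuit open: use fallback chain

    def record(self, success: bool):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()      # open the circuit
```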
## Security

- API key vaulting via env/config secrets
- MCP allowlist
- output sanitization
- redaction of sensitive tokens in logs

## Deployment

Single-container baseline:

- frontend static build served by the API backend
- optional sidecars for DB/vector/MCP infra

Scale-out profile:

- separate API and worker pools
- managed vector DB
- queue-backed distributed execution
- central observability backend

## Compatibility Goals

- local dev mode with minimal dependencies
- cloud mode with managed infra
- optional self-hosted LLM endpoints

## Future Architecture Extensions

- distributed multi-agent graph execution
- adaptive autoscaling by queue pressure
- global memory federation across projects
docs/features.md ADDED
@@ -0,0 +1,104 @@
# Advanced Features

## Overview

This document captures high-end platform capabilities beyond baseline extraction.

## 1) Self-Improving Agent

Post-episode learning loop:

- classify failures by root cause
- update selector/tool strategy priors
- persist successful patterns with confidence
- penalize repeated failure paths

## 2) Strategy Library

Built-in strategies:

- Search-first
- Direct extraction
- Multi-hop reasoning
- Verification-first
- Table-first

Each strategy tracks:

- win rate
- cost per success
- average latency
- domain affinity

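The per-strategy bookkeeping could be as simple as the sketch below; the field and method names are illustrative assumptions, not the library's real schema.

```python
from dataclasses import dataclass, field

@dataclass
class StrategyStats:
    """Per-strategy tracking sketch: win rate, cost per success, latency, domain affinity."""
    episodes: int = 0
    wins: int = 0
    total_cost_usd: float = 0.0
    total_latency_s: float = 0.0
    domain_wins: dict = field(default_factory=dict)   # domain -> win count (affinity signal)

    def record(self, domain: str, win: bool, cost_usd: float, latency_s: float):
        self.episodes += 1
        self.wins += int(win)
        self.total_cost_usd += cost_usd
        self.total_latency_s += latency_s
        if win:
            self.domain_wins[domain] = self.domain_wins.get(domain, 0) + 1

    @property
    def win_rate(self) -> float:
        return self.wins / self.episodes if self.episodes else 0.0

    @property
    def cost_per_success(self) -> float:
        return self.total_cost_usd / self.wins if self.wins else float("inf")
```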
## 3) Explainable AI Mode

For every decision, provide:

- selected action and confidence
- top alternatives considered
- evidence from memory/tools/search
- expected reward impact

## 4) Human-in-the-Loop

Intervention controls:

- approve/reject action
- force tool/model switch
- enforce verification before submit
- set hard constraints during runtime

## 5) Scenario Simulator

Stress-testing scenarios:

- noisy HTML
- broken DOM
- pagination traps
- conflicting facts
- anti-scraping patterns

Outputs:

- robustness score
- recovery score
- strategy suitability map

## 6) Context Compression

- rolling summaries
- salience-based pruning
- token-aware context packing
- differential memory refresh

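Token-aware context packing can be sketched as a greedy fill: keep the most salient items that still fit the budget. The whitespace token counter below is a stand-in assumption for a real tokenizer.

```python
def pack_context(items, max_tokens: int, count_tokens=lambda s: len(s.split())):
    """Greedy token-aware packing: most salient items first, skip what doesn't fit.

    `items` are (salience, text) pairs; higher salience wins budget first.
    """
    packed, used = [], 0
    for salience, text in sorted(items, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            packed.append(text)
            used += cost
    return packed
```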
## 7) Batch + Parallel Runtime

- task queue with priorities
- parallel extraction workers
- bounded concurrency
- idempotent retry handling

## 8) Prompt Versioning and Evaluation

- versioned prompt templates
- A/B testing by task type
- reward/cost comparison dashboards
- rollout and rollback controls

## 9) MCP Toolchain Composition

Composable flow examples:

- Browser MCP -> Parser MCP -> Validator MCP -> DB MCP
- Search MCP -> Fetch MCP -> Extract MCP -> Verify MCP

## 10) Governance and Safety

- tool allowlist/denylist
- PII redaction in logs
- budget and rate guardrails
- provenance tracking for extracted facts

## Feature Flags

All advanced features should be toggleable from Settings and safely disabled by default where cost/latency impact is high.
docs/html-processing.md ADDED
@@ -0,0 +1,739 @@
# 🌐 HTML Processing Engine

## Table of Contents
1. [Overview](#overview)
2. [Semantic Understanding](#semantic-understanding)
3. [Content Classification](#content-classification)
4. [Smart Extraction](#smart-extraction)
5. [Adaptive Chunking](#adaptive-chunking)
6. [Batch Processing](#batch-processing)
7. [Diff-Based Updates](#diff-based-updates)
8. [Schema Detection](#schema-detection)

---

## Overview

The **HTML Processing Engine** provides advanced capabilities for understanding, parsing, and extracting data from complex web pages.

### Challenges

Modern web pages are challenging:
- **Size:** 1MB+ of HTML
- **Complexity:** Nested divs, dynamic IDs, inline styles
- **Noise:** Ads, tracking scripts, navigation repeated on every page
- **Inconsistency:** The same site uses different structures across pages
- **Obfuscation:** Anti-scraping measures (randomized classes, etc.)

### Solution

Our engine provides:
- ✅ **Semantic understanding** of page structure
- ✅ **Content classification** (main content vs noise)
- ✅ **Smart extraction** with pattern recognition
- ✅ **Adaptive chunking** for large pages
- ✅ **Batch processing** with deduplication
- ✅ **Diff-based updates** for paginated content

---

## Semantic Understanding

### Architecture

```python
class SemanticHTMLAnalyzer:
    """Understands page structure at a semantic level."""

    def analyze(self, html: str) -> SemanticStructure:
        """Analyze HTML and identify semantic regions."""
        soup = BeautifulSoup(html, 'lxml')

        structure = SemanticStructure()
        structure.header = self.detect_header(soup)
        structure.navigation = self.detect_navigation(soup)
        structure.main_content = self.detect_main_content(soup)
        structure.sidebar = self.detect_sidebar(soup)
        structure.footer = self.detect_footer(soup)
        structure.ads = self.detect_ads(soup)
        structure.forms = self.detect_forms(soup)
        structure.tables = self.detect_tables(soup)
        structure.lists = self.detect_lists(soup)
        structure.product_cards = self.detect_product_cards(soup)

        return structure
```

### Semantic Regions

#### 1. Header Detection

```python
def detect_header(self, soup: BeautifulSoup) -> Optional[Tag]:
    """Detect the page header."""
    # Try semantic tags first
    header = soup.find('header')
    if header:
        return header

    # Try common patterns
    candidates = soup.find_all(['div', 'section'], class_=re.compile(r'header|top|banner', re.I))
    if candidates:
        # Pick the topmost element
        return min(candidates, key=lambda el: self.get_vertical_position(el))

    # Fallback: first div with logo + navigation
    for div in soup.find_all('div'):
        has_logo = div.find(['img', 'svg'], class_=re.compile(r'logo', re.I))
        has_nav = div.find(['nav', 'ul'], class_=re.compile(r'menu|nav', re.I))
        if has_logo and has_nav:
            return div

    return None
```

#### 2. Main Content Detection

```python
def detect_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
    """Detect the main content area (most important for extraction)."""
    # Try semantic tags
    main = soup.find('main')
    if main:
        return main

    article = soup.find('article')
    if article:
        return article

    # Content-scoring approach
    candidates = soup.find_all(['div', 'section'])
    scored = []

    for candidate in candidates:
        score = 0

        # More text = higher score
        text_length = len(candidate.get_text(strip=True))
        score += text_length * 0.1

        # Has article/main role
        if candidate.get('role') in ['main', 'article']:
            score += 100

        # Common content class names
        if candidate.get('class'):
            classes = ' '.join(candidate.get('class'))
            if re.search(r'content|main|article|post|product', classes, re.I):
                score += 50

        # Penalize if it contains nav/aside
        if candidate.find(['nav', 'aside']):
            score -= 30

        scored.append((candidate, score))

    if scored:
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[0][0]

    return None
```

#### 3. Product Card Detection

```python
def detect_product_cards(self, soup: BeautifulSoup) -> List[Tag]:
    """Detect product cards in e-commerce listings."""
    cards = []

    # Pattern 1: Schema.org markup
    cards.extend(soup.find_all(itemtype=re.compile(r'schema\.org/Product')))

    # Pattern 2: Common class patterns
    class_patterns = [
        r'product[-_]card',
        r'product[-_]item',
        r'product[-_]box',
        r'item[-_]card',
        r'listing[-_]item'
    ]

    for pattern in class_patterns:
        cards.extend(soup.find_all(class_=re.compile(pattern, re.I)))

    # Pattern 3: Structural detection
    # Look for repeated elements with image + title + price
    candidates = soup.find_all(['div', 'article', 'li'])

    for candidate in candidates:
        has_image = candidate.find('img')
        has_title = candidate.find(['h1', 'h2', 'h3', 'h4'], class_=re.compile(r'title|name', re.I))
        has_price = candidate.find(class_=re.compile(r'price', re.I))

        if has_image and has_title and has_price:
            cards.append(candidate)

    # Deduplicate while preserving document order
    seen_ids, unique = set(), []
    for card in cards:
        if id(card) not in seen_ids:
            seen_ids.add(id(card))
            unique.append(card)
    return unique
```

---

## Content Classification

### Classifier

```python
class ContentClassifier:
    """Classify HTML elements by type."""

    CATEGORIES = [
        'navigation',
        'header',
        'footer',
        'sidebar',
        'main_content',
        'advertisement',
        'product_listing',
        'product_detail',
        'form',
        'table',
        'pagination',
        'breadcrumb',
        'comment_section',
        'related_items'
    ]

    def classify_element(self, element: Tag) -> str:
        """Classify a single element."""
        features = self.extract_features(element)
        return self.model.predict(features)

    def extract_features(self, element: Tag) -> Dict:
        """Extract features for classification."""
        return {
            'tag_name': element.name,
            'class_names': element.get('class', []),
            'id': element.get('id', ''),
            'role': element.get('role', ''),
            'text_length': len(element.get_text(strip=True)),
            'link_density': self.calculate_link_density(element),
            'has_images': bool(element.find('img')),
            'has_forms': bool(element.find('form')),
            'position': self.get_vertical_position(element),
            'parent_classes': element.parent.get('class', []) if element.parent else [],
            'children_count': len(element.find_all(recursive=False)),
            'schema_type': element.get('itemtype', '')
        }
```
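
The `link_density` feature above relies on a helper that is not shown. A minimal stdlib sketch of what such a helper might look like follows; it operates on a raw HTML fragment rather than a parsed element, and the 0–1 "link text over total text" contract is an assumption.

```python
from html.parser import HTMLParser

class _LinkDensity(HTMLParser):
    """Accumulate total text length and the portion of it inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth_in_a = 0
        self.total = 0
        self.in_links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth_in_a += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth_in_a:
            self.depth_in_a -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.depth_in_a:
            self.in_links += n

def calculate_link_density(html_fragment: str) -> float:
    """Ratio of link text to all text: 0.0 = no links, 1.0 = all text is links."""
    parser = _LinkDensity()
    parser.feed(html_fragment)
    return parser.in_links / parser.total if parser.total else 0.0
```

High link density is a strong navigation/footer signal; main content tends to score low.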

### Classification Rules

```python
def classify_by_rules(self, element: Tag) -> Optional[str]:
    """Rule-based classification (fast, deterministic)."""

    # Navigation
    if element.name == 'nav':
        return 'navigation'

    if any('nav' in str(c) for c in element.get('class', [])):
        return 'navigation'

    # Header
    if element.name == 'header':
        return 'header'

    # Footer
    if element.name == 'footer':
        return 'footer'

    # Advertisement (common patterns)
    ad_patterns = ['ad', 'advertisement', 'sponsored', 'promo']
    classes = ' '.join(element.get('class', []))
    if any(pattern in classes.lower() for pattern in ad_patterns):
        return 'advertisement'

    # Product detail
    if element.get('itemtype') == 'http://schema.org/Product':
        return 'product_detail'

    # Form
    if element.name == 'form' or element.find('form'):
        return 'form'

    # Table
    if element.name == 'table':
        return 'table'

    return None
```

---

## Smart Extraction

### Pattern-Based Extraction

```python
class SmartExtractor:
    """Intelligently extract data based on field semantics."""

    def extract(self, html: str, field_name: str) -> ExtractionResult:
        """Extract a field using multiple strategies, most reliable first."""
        soup = BeautifulSoup(html, 'lxml')

        # Strategy 1: Schema.org markup
        result = self.extract_from_schema(soup, field_name)
        if result:
            return result

        # Strategy 2: OpenGraph / meta tags
        result = self.extract_from_meta(soup, field_name)
        if result:
            return result

        # Strategy 3: Pattern matching
        result = self.extract_by_pattern(soup, field_name)
        if result:
            return result

        # Strategy 4: ML-based extraction
        result = self.extract_by_ml(soup, field_name)
        if result:
            return result

        return ExtractionResult(value=None, confidence=0.0)
```

### Field-Specific Patterns

```python
EXTRACTION_PATTERNS = {
    'price': {
        'regexes': [
            r'\$\s*\d+[.,]\d{2}',   # $49.99
            r'€\s*\d+[.,]\d{2}',    # €49,99
            r'£\s*\d+[.,]\d{2}',    # £49.99
            r'\d+[.,]\d{2}\s*USD',  # 49.99 USD
        ],
        'css_selectors': [
            '[itemprop="price"]',
            '.price',
            '.product-price',
            'span.sale-price',
            'div.price-box span',
        ],
        'class_keywords': ['price', 'cost', 'sale', 'amount'],
        'text_indicators': ['$', '€', '£', 'USD', 'EUR', 'GBP']
    },

    'product_name': {
        'css_selectors': [
            '[itemprop="name"]',
            'h1.product-title',
            'h1.product-name',
            'div.product-info h1',
        ],
        'class_keywords': ['title', 'name', 'product-name'],
        'heading_tags': ['h1', 'h2']
    },

    'rating': {
        'regexes': [
            r'(\d+\.?\d*)\s*out of\s*5',
            r'(\d+\.?\d*)\s*/\s*5',
            r'(\d+\.?\d*)\s*stars?',
        ],
        'css_selectors': [
            '[itemprop="ratingValue"]',
            '.rating',
            '.star-rating',
            'span.rating-value',
        ],
        'class_keywords': ['rating', 'stars', 'score'],
    },

    'email': {
        'regexes': [
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        ],
        'css_selectors': [
            '[href^="mailto:"]',
            '[itemprop="email"]',
        ]
    },

    'phone': {
        'regexes': [
            r'\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}',              # US format
            r'\+\d{1,3}\s?\(?\d{1,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{3,4}',    # International
        ],
        'css_selectors': [
            '[href^="tel:"]',
            '[itemprop="telephone"]',
        ]
    }
}
```

### Confidence Scoring

```python
def score_extraction(self, value: Any, field_name: str, method: str) -> float:
    """Score extraction confidence."""
    confidence = 0.0

    # Method confidence
    method_confidence = {
        'schema.org': 0.95,
        'meta_tag': 0.90,
        'pattern_match': 0.70,
        'ml_model': 0.80,
        'class_name': 0.60
    }
    confidence += method_confidence.get(method, 0.5)

    # Value validation
    if field_name == 'price':
        if self.is_valid_price(value):
            confidence += 0.1
        else:
            confidence -= 0.3

    elif field_name == 'email':
        if self.is_valid_email(value):
            confidence += 0.1
        else:
            confidence = 0.0  # Invalid email

    # Context validation
    parent_context = self.get_parent_context(value)
    if field_name in parent_context:
        confidence += 0.1

    return min(confidence, 1.0)
```
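
The validators `is_valid_price` and `is_valid_email` are referenced but not defined. A minimal stdlib sketch of what they might look like, under the assumption that they perform purely syntactic checks:

```python
import re

def is_valid_price(value: str) -> bool:
    """Accept strings like '$49.99', '€49,99', '1,234.56', or '49.99 USD' (sketch)."""
    return bool(re.fullmatch(
        r'[\$€£]?\s*\d{1,3}(?:[ ,.]\d{3})*[.,]\d{2}(?:\s*(?:USD|EUR|GBP))?',
        value.strip()
    ))

def is_valid_email(value: str) -> bool:
    """Lightweight syntactic check; deliberately not a full RFC 5322 validator."""
    return bool(re.fullmatch(
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',
        value.strip()
    ))
```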

---

## Adaptive Chunking

### Chunking Strategy

```python
class AdaptiveChunker:
    """Split large HTML into processable chunks."""

    def chunk(self, html: str, max_size: int = 50000) -> List[Chunk]:
        """Split HTML intelligently."""
        soup = BeautifulSoup(html, 'lxml')

        if len(html) <= max_size:
            return [Chunk(html=html, type='full', index=0)]

        # Strategy 1: Split by semantic sections
        chunks = self.chunk_by_sections(soup, max_size)
        if chunks:
            return chunks

        # Strategy 2: Split by repeated elements (product cards)
        chunks = self.chunk_by_repeated_elements(soup, max_size)
        if chunks:
            return chunks

        # Strategy 3: Sliding window with overlap
        chunks = self.chunk_by_sliding_window(html, max_size, overlap=5000)
        return chunks

    def chunk_by_sections(self, soup: BeautifulSoup, max_size: int) -> List[Chunk]:
        """Split by major sections."""
        sections = soup.find_all(['article', 'section', 'div'], class_=re.compile(r'section|container', re.I))

        chunks = []
        current_chunk = ""
        current_index = 0

        for section in sections:
            section_html = str(section)

            if len(current_chunk) + len(section_html) > max_size:
                # Save current chunk
                if current_chunk:
                    chunks.append(Chunk(
                        html=current_chunk,
                        type='section',
                        index=current_index
                    ))
                    current_index += 1

                # Start new chunk
                current_chunk = section_html
            else:
                current_chunk += section_html

        # Add final chunk
        if current_chunk:
            chunks.append(Chunk(html=current_chunk, type='section', index=current_index))

        return chunks

    def chunk_by_repeated_elements(self, soup: BeautifulSoup, max_size: int) -> List[Chunk]:
        """Split by repeated elements (e.g., product cards)."""
        # Detect repeated pattern
        repeated = self.detect_repeated_elements(soup)

        if not repeated:
            return []

        chunks = []
        current_chunk = ""
        current_items = []
        current_index = 0

        for element in repeated:
            element_html = str(element)

            if len(current_chunk) + len(element_html) > max_size:
                # Save current chunk
                if current_chunk:
                    chunks.append(Chunk(
                        html=current_chunk,
                        type='repeated',
                        index=current_index,
                        item_count=len(current_items)
                    ))
                    current_index += 1

                # Start new chunk
                current_chunk = element_html
                current_items = [element]
            else:
                current_chunk += element_html
                current_items.append(element)

        # Add final chunk
        if current_chunk:
            chunks.append(Chunk(
                html=current_chunk,
                type='repeated',
                index=current_index,
                item_count=len(current_items)
            ))

        return chunks
```

---

## Batch Processing

### Parallel Processing

```python
class BatchProcessor:
    """Process large pages in parallel batches."""

    async def process_large_page(
        self,
        html: str,
        extraction_task: ExtractionTask
    ) -> List[Dict]:
        """Process a large page in parallel."""
        # 1. Chunk the HTML
        chunks = self.chunker.chunk(html)

        # 2. Process chunks in parallel
        tasks = [
            self.process_chunk(chunk, extraction_task)
            for chunk in chunks
        ]

        chunk_results = await asyncio.gather(*tasks)

        # 3. Merge and deduplicate results
        merged = self.merge_results(chunk_results)

        # 4. Cross-chunk validation
        validated = self.validate_across_chunks(merged, chunks)

        return validated

    async def process_chunk(
        self,
        chunk: Chunk,
        task: ExtractionTask
    ) -> List[Dict]:
        """Process a single chunk."""
        extractor = SmartExtractor()
        results = []

        for field in task.fields:
            result = extractor.extract(chunk.html, field)
            if result.value:
                results.append({
                    'field': field,
                    'value': result.value,
                    'confidence': result.confidence,
                    'chunk_index': chunk.index
                })

        return results

    def merge_results(self, chunk_results: List[List[Dict]]) -> List[Dict]:
        """Merge and deduplicate results from chunks."""
        merged = {}

        for chunk_result in chunk_results:
            for item in chunk_result:
                key = (item['field'], item['value'])

                if key in merged:
                    # Increase confidence if found in multiple chunks
                    merged[key]['confidence'] = max(
                        merged[key]['confidence'],
                        item['confidence']
                    )
                    merged[key]['chunk_count'] += 1
                else:
                    merged[key] = {
                        **item,
                        'chunk_count': 1
                    }

        return list(merged.values())
```
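
The dedup-and-boost logic in `merge_results` can be exercised on its own. A minimal standalone sketch with plain dicts (no `BatchProcessor` or extractor dependencies) shows how a value seen in two chunks is collapsed to one item that keeps the higher confidence:

```python
def merge_results(chunk_results):
    """Merge per-chunk extractions, keyed by (field, value).

    Duplicates found in several chunks keep the highest confidence
    and track how many chunks contained them.
    """
    merged = {}
    for chunk_result in chunk_results:
        for item in chunk_result:
            key = (item['field'], item['value'])
            if key in merged:
                merged[key]['confidence'] = max(
                    merged[key]['confidence'], item['confidence']
                )
                merged[key]['chunk_count'] += 1
            else:
                merged[key] = {**item, 'chunk_count': 1}
    return list(merged.values())

chunk_a = [{'field': 'price', 'value': '$49.99', 'confidence': 0.8, 'chunk_index': 0}]
chunk_b = [{'field': 'price', 'value': '$49.99', 'confidence': 0.9, 'chunk_index': 1}]
merged = merge_results([chunk_a, chunk_b])
# One deduplicated item with confidence 0.9, seen in 2 chunks
```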

---

## Diff-Based Updates

### Incremental Processing

```python
class DiffProcessor:
    """Process only changed content between page loads."""

    def __init__(self):
        self.page_cache = {}

    def process_with_diff(
        self,
        url: str,
        current_html: str,
        extraction_task: ExtractionTask
    ) -> Dict:
        """Process only the diff from the last visit."""
        previous_html = self.page_cache.get(url)

        if not previous_html:
            # First visit: process the full page and cache both the HTML
            # and the result, so a near-identical revisit can reuse it
            result = self.process_full(current_html, extraction_task)
            self.page_cache[url] = current_html
            self.page_cache[f"{url}_result"] = result
            return result

        # Calculate diff
        diff = self.calculate_diff(previous_html, current_html)

        if diff.similarity > 0.95:
            # Page barely changed, use cached results
            return self.page_cache.get(f"{url}_result")

        # Process only changed regions
        result = self.process_diff(diff, extraction_task)

        # Update cache
        self.page_cache[url] = current_html
        self.page_cache[f"{url}_result"] = result

        return result

    def calculate_diff(self, html1: str, html2: str) -> Diff:
        """Calculate structural diff between two HTML documents."""
        soup1 = BeautifulSoup(html1, 'lxml')
        soup2 = BeautifulSoup(html2, 'lxml')

        # Find added, removed, and modified elements
        diff = Diff()
        diff.added = self.find_added_elements(soup1, soup2)
        diff.removed = self.find_removed_elements(soup1, soup2)
        diff.modified = self.find_modified_elements(soup1, soup2)
        diff.similarity = self.calculate_similarity(soup1, soup2)

        return diff
```
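
`calculate_similarity` is left abstract above. As one possible stand-in (an assumption, not the project's actual metric), the tag sequences of two documents can be compared with the stdlib `difflib`; this version takes raw HTML strings rather than parsed soups:

```python
import difflib
from html.parser import HTMLParser

class _TagCollector(HTMLParser):
    """Collect the sequence of opening tag names in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def calculate_similarity(html1: str, html2: str) -> float:
    """Structural similarity in [0, 1] based on tag sequences."""
    seqs = []
    for html in (html1, html2):
        collector = _TagCollector()
        collector.feed(html)
        seqs.append(collector.tags)
    return difflib.SequenceMatcher(None, seqs[0], seqs[1]).ratio()

same = calculate_similarity("<div><p>a</p></div>", "<div><p>b</p></div>")
# Identical tag structure, different text, so similarity is 1.0
```

Because only tag names are compared, text-only edits (new prices, new timestamps) score as highly similar, which is the behavior the `> 0.95` short-circuit above wants.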

---

## Schema Detection

### Auto-Detect Data Schemas

```python
class SchemaDetector:
    """Automatically detect data schemas in HTML."""

    def detect_schema(self, html: str) -> Schema:
        """Detect the implicit schema of the page."""
        soup = BeautifulSoup(html, 'lxml')

        # 1. Check for schema.org markup
        schema_org = self.detect_schema_org(soup)
        if schema_org:
            return schema_org

        # 2. Detect repeated patterns
        repeated = self.detect_repeated_pattern(soup)
        if repeated:
            return self.infer_schema_from_pattern(repeated)

        # 3. Detect tables
        tables = soup.find_all('table')
        if tables:
            return self.infer_schema_from_table(tables[0])

        return Schema()

    def infer_schema_from_pattern(self, elements: List[Tag]) -> Schema:
        """Infer schema from repeated elements."""
        # Analyze first few elements
        sample = elements[:5]

        field_candidates = {}

        for element in sample:
            # Find all text-bearing children
            children = element.find_all(string=True, recursive=True)

            for child in children:
                # Classify by parent tag/class
                parent = child.parent
                key = (parent.name, ' '.join(parent.get('class', [])))

                if key not in field_candidates:
                    field_candidates[key] = []

                field_candidates[key].append(child.strip())

        # Build schema
        schema = Schema()

        for (tag, class_name), values in field_candidates.items():
            # Infer field type from values
            field_type = self.infer_type(values)
            field_name = self.guess_field_name(class_name, values)

            schema.add_field(Field(
                name=field_name,
                type=field_type,
                selector=f"{tag}.{class_name}" if class_name else tag,
                sample_values=values
            ))

        return schema
```
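
`infer_type` and `guess_field_name` are likewise left abstract. A heuristic sketch of `infer_type` (the patterns below are illustrative assumptions, not the shipped rules) might classify sample values by regex:

```python
import re

def infer_type(values):
    """Guess a field type from sample string values (heuristic sketch)."""
    cleaned = [v for v in values if v]
    if not cleaned:
        return 'string'
    # Currency symbol or bare number on every sample
    if all(re.fullmatch(r'[$€£]?\s?\d[\d,]*(\.\d+)?', v.strip()) for v in cleaned):
        return 'price' if any(v.strip()[0] in '$€£' for v in cleaned) else 'number'
    if all(re.fullmatch(r'https?://\S+', v.strip()) for v in cleaned):
        return 'url'
    return 'string'

t1 = infer_type(['$49.99', '$39.99'])     # price-like samples
t2 = infer_type(['12', '1,204'])          # numeric samples
t3 = infer_type(['https://a.example/x'])  # URL samples
```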

---

**Next:** See [search-engine.md](./search-engine.md) for search optimization.
docs/mcp.md ADDED
@@ -0,0 +1,977 @@
# 🔌 MCP Server Integration

## Table of Contents
1. [Overview](#overview)
2. [Available MCP Servers](#available-mcp-servers)
3. [Tool Registry & Discovery](#tool-registry--discovery)
4. [HTML Processing MCPs](#html-processing-mcps)
5. [Lazy Loading System](#lazy-loading-system)
6. [MCP Composition](#mcp-composition)
7. [Testing Panel](#testing-panel)
8. [Configuration](#configuration)

---

## Overview

The **Model Context Protocol (MCP)** enables the WebScraper agent to interact with external tools, databases, and services through a standardized interface. MCP servers expose **tools** that the agent can discover and use dynamically.

### Why MCP?

**Without MCP:**
- Agent is limited to built-in capabilities
- Cannot access external databases, APIs, or specialized libraries
- Difficult to extend without code changes

**With MCP:**
- ✅ Dynamically discover and use 100+ community tools
- ✅ Access databases (PostgreSQL, MongoDB, etc.)
- ✅ Use specialized libraries (BeautifulSoup, Selenium, Playwright)
- ✅ Integrate with external APIs (Google, GitHub, etc.)
- ✅ Extend agent capabilities without code changes

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      WebScraper Agent                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │               MCP Tool Registry                    │     │
│  │  - Discovers available tools from all MCP servers  │     │
│  │  - Provides tool metadata to agent                 │     │
│  │  - Routes tool calls to appropriate server         │     │
│  └────────────────┬───────────────────────────────────┘     │
│                   │                                         │
└───────────────────┼─────────────────────────────────────────┘
                    │
        ┌───────────┼────────────┬──────────────┬─────────────┐
        │           │            │              │             │
        ▼           ▼            ▼              ▼             ▼
┌───────────────┐ ┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ HTML Parser   │ │ Browser    │ │ Database │ │ File     │ │ Custom   │
│ MCP           │ │ MCP        │ │ MCP      │ │ System   │ │ MCP      │
│               │ │            │ │          │ │ MCP      │ │          │
│• BeautifulSoup│ │• Puppeteer │ │• Postgres│ │• Read    │ │• Your    │
│• lxml         │ │• Playwright│ │• MongoDB │ │• Write   │ │  tools   │
│• html5lib     │ │• Selenium  │ │• Redis   │ │• Search  │ │          │
└───────────────┘ └────────────┘ └──────────┘ └──────────┘ └──────────┘
```

---

## Available MCP Servers

### 1. HTML Processing & Parsing

#### **beautifulsoup-mcp**
Advanced HTML parsing and extraction.

**Tools:**
- `parse_html(html: str, parser: str = "html.parser")` → Parse HTML into DOM tree
- `find_all(html: str, selector: str)` → CSS selector search
- `extract_text(html: str, selector: str)` → Extract text content
- `extract_attributes(html: str, selector: str, attrs: List[str])` → Get element attributes
- `clean_html(html: str)` → Remove scripts, styles, comments
- `extract_tables(html: str)` → Parse all tables into structured data

**Configuration:**
```json
{
  "mcpServers": {
    "beautifulsoup": {
      "command": "python",
      "args": ["-m", "mcp_beautifulsoup"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "default_parser": "lxml",
        "encodings": ["utf-8", "latin-1"]
      }
    }
  }
}
```

**Example Usage:**
```python
# Agent action
action = Action(
    action_type="MCP_TOOL_CALL",
    tool_name="beautifulsoup.find_all",
    tool_params={
        "html": observation.page_html,
        "selector": "div.product-card"
    }
)

# Response
{
    "products": [
        {"name": "Widget", "price": "$49.99"},
        {"name": "Gadget", "price": "$39.99"}
    ]
}
```

#### **lxml-mcp**
Fast XML/HTML parsing with XPath support.

**Tools:**
- `xpath_query(html: str, xpath: str)` → XPath extraction
- `css_select(html: str, css: str)` → CSS selector (fast)
- `validate_html(html: str)` → Check well-formedness
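For well-formed markup, the idea behind `xpath_query` can be illustrated with the limited XPath subset in the stdlib `xml.etree.ElementTree` (a rough stand-in; the server itself presumably uses lxml's full XPath engine):

```python
import xml.etree.ElementTree as ET

markup = "<root><div class='price'>$10</div><div class='price'>$20</div><div>x</div></root>"
tree = ET.fromstring(markup)

# ElementTree supports a limited XPath subset, enough to show the shape
# of an xpath_query call such as //div[@class='price']
prices = [el.text for el in tree.findall(".//div[@class='price']")]
# → ['$10', '$20']
```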

#### **html5lib-mcp**
Standards-compliant HTML5 parsing.

**Tools:**
- `parse_html5(html: str)` → Parse like a browser would
- `sanitize_html(html: str, allowed_tags: List[str])` → Safe HTML cleaning

### 2. Browser Automation

#### **playwright-mcp**
Full browser automation with JavaScript rendering.

**Tools:**
- `navigate(url: str, wait_for: str = "networkidle")` → Load page with JS
- `click(selector: str)` → Click element
- `fill_form(selector: str, value: str)` → Fill input
- `screenshot(selector: str = None)` → Capture screenshot
- `wait_for_selector(selector: str, timeout: int = 5000)` → Wait for element
- `execute_script(script: str)` → Run custom JavaScript

**Use Cases:**
- Pages with client-side rendering (React, Vue, Angular)
- Infinite scroll / lazy loading
- Forms and interactions
- Captcha handling

**Configuration:**
```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp-server"],
      "enabled": false,  // Only enable when needed (heavy)
      "autoDownload": true,
      "config": {
        "browser": "chromium",
        "headless": true,
        "viewport": {"width": 1920, "height": 1080}
      }
    }
  }
}
```

#### **puppeteer-mcp**
Lightweight browser automation (Chrome DevTools Protocol).

Similar to Playwright but lighter weight.

#### **selenium-mcp**
Legacy browser automation (more compatible, slower).

### 3. Database Access

#### **postgresql-mcp**
Access PostgreSQL databases.

**Tools:**
- `query(sql: str, params: List = [])` → Execute SELECT
- `execute(sql: str, params: List = [])` → Execute INSERT/UPDATE/DELETE
- `list_tables()` → Get schema

**Use Case:** Store scraped data directly to a production database.

#### **mongodb-mcp**
Access MongoDB collections.

**Tools:**
- `find(collection: str, query: dict)` → Query documents
- `insert(collection: str, document: dict)` → Insert document
- `aggregate(collection: str, pipeline: List)` → Aggregation pipeline

#### **redis-mcp**
Fast cache and pub/sub.

**Tools:**
- `get(key: str)` → Retrieve cached value
- `set(key: str, value: str, ttl: int)` → Cache value
- `publish(channel: str, message: str)` → Pub/sub

**Use Case:** Cache parsed HTML, share state between agents.

### 4. File System

#### **filesystem-mcp**
Read/write local files.

**Tools:**
- `read_file(path: str)` → Read text/binary file
- `write_file(path: str, content: str)` → Write file
- `list_directory(path: str)` → List files
- `search_files(pattern: str)` → Glob search

**Use Case:** Save scraped data to CSV/JSON, read configuration files.

### 5. Search Engines

#### **google-search-mcp**
Google Search API integration.

**Tools:**
- `search(query: str, num: int = 10)` → Google Search results
- `search_images(query: str)` → Image search

**Configuration:**
```json
{
  "mcpServers": {
    "google-search": {
      "command": "python",
      "args": ["-m", "mcp_google_search"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_SEARCH_ENGINE_ID"
      }
    }
  }
}
```

#### **bing-search-mcp**
Bing Search API.

#### **brave-search-mcp**
Privacy-focused search (Brave Search API).

#### **duckduckgo-mcp**
Free, no-API-key search.

**Tools:**
- `search(query: str, max_results: int = 10)` → DDG results

### 6. Data Extraction

#### **readability-mcp**
Extract main article content (removes ads, navigation, etc.).

**Tools:**
- `extract_article(html: str)` → Returns clean article text + metadata

**Use Case:** Extract blog posts, news articles, documentation.

#### **trafilatura-mcp**
Advanced web scraping and text extraction.

**Tools:**
- `extract(url: str)` → Extract main content
- `extract_metadata(html: str)` → Get title, author, date, etc.

#### **newspaper-mcp**
News article extraction and NLP.

**Tools:**
- `parse_article(url: str)` → Full article data
- `extract_keywords(text: str)` → Keyword extraction
- `summarize(text: str)` → Auto-summarization

### 7. Data Validation

#### **cerberus-mcp**
Schema validation for extracted data.

**Tools:**
- `validate(data: dict, schema: dict)` → Validate against schema

**Example:**
```python
# Define schema
schema = {
    "product_name": {"type": "string", "required": True, "minlength": 1},
    "price": {"type": "float", "required": True, "min": 0},
    "rating": {"type": "float", "min": 0, "max": 5}
}

# Validate extracted data
result = mcp.call("cerberus.validate", data=extracted_data, schema=schema)
if not result["valid"]:
    print("Validation errors:", result["errors"])
```

#### **pydantic-mcp**
Pydantic model validation.

### 8. Computer Vision

#### **ocr-mcp**
Extract text from images (Tesseract OCR).

**Tools:**
- `extract_text(image_path: str, lang: str = "eng")` → OCR text

**Use Case:** Extract prices from product images, read captchas (where legal).

#### **image-analysis-mcp**
Vision AI (GPT-4 Vision, Claude Vision).

**Tools:**
- `describe_image(image_path: str)` → Natural language description
- `extract_structured(image_path: str, schema: dict)` → Extract structured data from images

### 9. HTTP & Networking

#### **requests-mcp**
HTTP client with retry and session management.

**Tools:**
- `get(url: str, headers: dict = {})` → HTTP GET
- `post(url: str, data: dict = {})` → HTTP POST

#### **proxy-manager-mcp**
Manage proxy rotation and IP reputation.

**Tools:**
- `get_proxy()` → Get next proxy from pool
- `report_dead_proxy(proxy: str)` → Mark proxy as failed

### 10. Utility

#### **regex-mcp**
Advanced regex operations.

**Tools:**
- `find_all(pattern: str, text: str)` → Find all matches
- `replace(pattern: str, replacement: str, text: str)` → Regex replace
- `validate(pattern: str)` → Check if regex is valid

#### **datetime-mcp**
Parse and normalize dates.

**Tools:**
- `parse_date(text: str)` → Parse natural language dates
- `normalize_timezone(date: str, tz: str)` → Convert timezone

#### **currency-mcp**
Currency parsing and conversion.

**Tools:**
- `parse_price(text: str)` → Extract price and currency
- `convert(amount: float, from_currency: str, to_currency: str)` → Convert

---

## Tool Registry & Discovery

The **Tool Registry** automatically discovers all available tools from enabled MCP servers.

### Architecture

```python
class MCPToolRegistry:
    def __init__(self):
        self.servers: Dict[str, MCPServer] = {}
        self.tools: Dict[str, Tool] = {}  # tool_name → Tool

    def discover_servers(self, config: MCPConfig):
        """Load and connect to all enabled MCP servers."""
        for server_name, server_config in config.mcpServers.items():
            if not server_config.enabled:
                continue

            # Auto-download if needed
            if server_config.autoDownload and not self.is_installed(server_config):
                self.download_and_install(server_name, server_config)

            # Connect to server
            server = self.connect_server(server_name, server_config)
            self.servers[server_name] = server

            # Discover tools
            for tool in server.list_tools():
                full_name = f"{server_name}.{tool.name}"
                self.tools[full_name] = tool

    def get_tool(self, tool_name: str) -> Tool:
        """Get tool by fully qualified name (server.tool)."""
        return self.tools.get(tool_name)

    def search_tools(self, query: str, category: str = None) -> List[Tool]:
        """Search tools by natural language query."""
        # Semantic search using tool descriptions
        candidates = list(self.tools.values())

        if category:
            candidates = [t for t in candidates if t.category == category]

        # Embed query and tools, rank by similarity
        scored = []
        for tool in candidates:
            score = self.semantic_similarity(query, tool.description)
            scored.append((tool, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [tool for tool, score in scored[:10]]
```
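
`semantic_similarity` is pluggable. A dependency-free stand-in (an illustrative assumption; production would use embeddings) is plain Jaccard overlap of the word sets:

```python
def semantic_similarity(query: str, description: str) -> float:
    """Jaccard overlap of lowercase word sets, a cheap embedding stand-in."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

score = semantic_similarity(
    "parse HTML and extract elements by CSS selector",
    "Find all HTML elements matching a CSS selector",
)
# Shared words (html, elements, css, selector) give a nonzero score
```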

### Tool Metadata

Each tool exposes rich metadata:

```python
class Tool(BaseModel):
    name: str                        # e.g., "find_all"
    full_name: str                   # e.g., "beautifulsoup.find_all"
    server: str                      # Server name
    description: str                 # Human-readable description
    category: str                    # "parsing" | "browser" | "database" | ...
    input_schema: Dict[str, Any]     # JSON Schema for parameters
    output_schema: Dict[str, Any]    # JSON Schema for return value
    examples: List[ToolExample]      # Usage examples
    cost: ToolCost                   # Time/resource cost estimate
    requires_auth: bool              # Needs API keys?
    rate_limit: Optional[RateLimit]  # Rate limiting info
```

**Example:**
```python
Tool(
    name="find_all",
    full_name="beautifulsoup.find_all",
    server="beautifulsoup",
    description="Find all HTML elements matching a CSS selector",
    category="parsing",
    input_schema={
        "type": "object",
        "properties": {
            "html": {"type": "string", "description": "HTML content to search"},
            "selector": {"type": "string", "description": "CSS selector"}
        },
        "required": ["html", "selector"]
    },
    output_schema={
        "type": "array",
        "items": {"type": "object"}
    },
    examples=[
        ToolExample(
            input={"html": "<div class='item'>A</div>", "selector": ".item"},
            output=[{"tag": "div", "text": "A", "class": "item"}]
        )
    ],
    cost=ToolCost(time_ms=10, cpu_intensive=False),
    requires_auth=False
)
```

### Auto Tool Discovery by Agent

The agent can query the registry to find relevant tools:

```python
# Agent needs to parse HTML
available_tools = tool_registry.search_tools(
    query="parse HTML and extract elements by CSS selector",
    category="parsing"
)

# Top result: beautifulsoup.find_all
tool = available_tools[0]

# Agent calls the tool
action = Action(
    action_type="MCP_TOOL_CALL",
    tool_name=tool.full_name,
    tool_params={
        "html": observation.page_html,
        "selector": "div.product-price"
    }
)
```

---

## HTML Processing MCPs

### BeautifulSoup MCP (Detailed)

**Installation:**
```bash
pip install mcp-beautifulsoup
```

**Tools:**

#### 1. `find_all(html, selector, limit=None)`
Find all elements matching a CSS selector.

```python
result = mcp.call("beautifulsoup.find_all", {
    "html": "<div class='price'>$10</div><div class='price'>$20</div>",
    "selector": "div.price"
})
# Returns: [{"text": "$10"}, {"text": "$20"}]
```

#### 2. `find_one(html, selector)`
Find the first matching element.

```python
result = mcp.call("beautifulsoup.find_one", {
    "html": obs.page_html,
    "selector": "h1.product-title"
})
# Returns: {"text": "Widget Pro", "tag": "h1"}
```

#### 3. `extract_tables(html)`
Parse all `<table>` elements into structured data.

```python
result = mcp.call("beautifulsoup.extract_tables", {"html": obs.page_html})
# Returns:
[
    {
        "headers": ["Product", "Price", "Stock"],
        "rows": [
            ["Widget", "$49.99", "In Stock"],
            ["Gadget", "$39.99", "Out of Stock"]
        ]
    }
]
```

#### 4. `extract_links(html, base_url=None)`
Extract all links from the page.

```python
result = mcp.call("beautifulsoup.extract_links", {
    "html": obs.page_html,
    "base_url": "https://example.com"
})
# Returns:
[
    {"url": "https://example.com/product/123", "text": "View Product"},
    {"url": "https://example.com/category/widgets", "text": "Widgets"}
]
```
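
Under the hood, `extract_links` amounts to walking anchor tags and resolving `href` values against the base URL. A minimal stdlib sketch (an illustration, not the server's code) using `html.parser` and `urljoin`:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect <a href> targets, resolving them against a base URL."""
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = urljoin(self.base_url, href)
                self._text_parts = []

    def handle_data(self, data):
        if self._current_href:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href:
            self.links.append({
                "url": self._current_href,
                "text": "".join(self._text_parts).strip(),
            })
            self._current_href = None

parser = LinkExtractor(base_url="https://example.com")
parser.feed('<a href="/product/123">View Product</a>')
# parser.links → [{'url': 'https://example.com/product/123', 'text': 'View Product'}]
```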

#### 5. `clean_html(html, remove=['script', 'style', 'noscript'])`
Remove unwanted elements.

```python
result = mcp.call("beautifulsoup.clean_html", {
    "html": obs.page_html,
    "remove": ["script", "style", "footer", "nav"]
})
# Returns: Clean HTML without ads, scripts, navigation
```

#### 6. `smart_extract(html, field_name)`
Intelligent extraction based on field name.

```python
# Agent wants to extract "price"
result = mcp.call("beautifulsoup.smart_extract", {
    "html": obs.page_html,
    "field_name": "price"
})
# MCP searches for:
# - Elements with class/id containing "price"
# - Text matching price patterns ($X.XX, €X,XX)
# - Schema.org markup (itemprop="price")
# Returns: {"value": "$49.99", "confidence": 0.92, "selector": "span.product-price"}
```
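
The price-pattern step of `smart_extract` can be sketched as a single regex pass (the pattern below is a simplified assumption, not the server's actual rules):

```python
import re

# Matches $1,299.99 / €49,99 / £10: a symbol, then digits with separators
PRICE_RE = re.compile(r'[$€£]\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?')

def find_prices(text: str):
    """Return all price-looking substrings in the text."""
    return PRICE_RE.findall(text)

prices = find_prices("Was $1,299.99, now $999.00 (about €949,00)")
# → ['$1,299.99', '$999.00', '€949,00']
```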

### Batch Processing for Long Content

When HTML is too large (> 100 KB), process it in batches:

```python
class HTMLBatchProcessor:
    def __init__(self, mcp_client, chunk_size: int = 50000):
        self.mcp = mcp_client
        self.chunk_size = chunk_size

    def process_large_html(self, html: str, selector: str) -> List[Dict]:
        """Process large HTML in chunks."""
        # Split HTML into meaningful chunks (by sections, not mid-tag)
        chunks = self.split_html_intelligently(html)

        results = []
        for i, chunk in enumerate(chunks):
            # Process each chunk
            chunk_results = self.mcp.call("beautifulsoup.find_all", {
                "html": chunk,
                "selector": selector
            })

            # Deduplicate across chunk boundaries
            results.extend(self.deduplicate(chunk_results, results))

        return results

    def split_html_intelligently(self, html: str) -> List[str]:
        """Split HTML at section boundaries, not mid-tag."""
        soup = BeautifulSoup(html, 'lxml')

        # Split by major sections (article, section, main)
        sections = soup.find_all(['article', 'section', 'main'])

        chunks = []
        current_chunk = ""

        for section in sections:
            section_html = str(section)

            if len(current_chunk) + len(section_html) > self.chunk_size:
                chunks.append(current_chunk)
                current_chunk = section_html
            else:
                current_chunk += section_html

        if current_chunk:
            chunks.append(current_chunk)

        return chunks
```

---

## Lazy Loading System

MCP servers are **not downloaded by default**. They are installed on demand when first used.

### Download-on-Demand Flow

```
Agent wants to use a tool
            │
            ▼
Is the MCP server installed?
            │
       ┌────┴────┐
      No         Yes
       │          │
       ▼          ▼
 Show dialog   Execute tool
 "Download
  server X?"
       │
   ┌───┴───┐
  No       Yes
   │        │
 Skip   Download & Install
            │
            ▼
   Cache for future use
            │
            ▼
       Execute tool
```

### Implementation

```python
class LazyMCPLoader:
    def __init__(self):
        self.installed_servers: Set[str] = set()
        self.download_queue: Queue[str] = Queue()

    def ensure_server(self, server_name: str, config: MCPServerConfig) -> bool:
        """Ensure the MCP server is installed; download if needed."""
        if server_name in self.installed_servers:
            return True

        if not config.autoDownload:
            # Prompt user
            if not self.prompt_user_download(server_name):
                return False

        # Download and install
        return self.download_server(server_name, config)

    def download_server(self, server_name: str, config: MCPServerConfig) -> bool:
        """Download and install an MCP server."""
        try:
            logger.info(f"Downloading MCP server: {server_name}")

            if config.command == "npx":
                # NPM package: the package name is the last arg
                subprocess.run([
                    "npm", "install", "-g", config.args[-1]
                ], check=True)

            elif config.command == "python":
                # Server is launched as `python -m <module>`; map the module
                # name to its pip package (e.g. mcp_beautifulsoup → mcp-beautifulsoup)
                module_name = config.args[-1]
                package_name = module_name.replace("_", "-")
                subprocess.run([
                    "pip", "install", package_name
                ], check=True)

            self.installed_servers.add(server_name)
            logger.info(f"✓ Installed {server_name}")
            return True

        except Exception as e:
            logger.error(f"Failed to install {server_name}: {e}")
            return False

    def prompt_user_download(self, server_name: str) -> bool:
        """Ask the user whether to download the server."""
        # In the UI, show a dialog:
        # "Tool X requires MCP server Y. Download and install? (50MB) [Yes] [No]"
        return self.show_download_dialog(server_name)
```
733
+
734
+ ### UI Dialog
735
+
736
+ ```
737
+ ┌──────────────────────────────────────────────────────────┐
738
+ │ MCP Server Required │
739
+ ├──────────────────────────────────────────────────────────┤
740
+ │ │
741
+ │ The tool "beautifulsoup.find_all" requires the MCP │
742
+ │ server "beautifulsoup" which is not installed. │
743
+ │ │
744
+ │ Package: mcp-beautifulsoup │
745
+ │ Size: ~5 MB │
746
+ │ │
747
+ │ Would you like to download and install it now? │
748
+ │ │
749
+ │ [Download & Install] [Skip] │
750
+ │ │
751
+ │ ☑ Remember my choice for this server │
752
+ └──────────────────────────────────────────────────────────┘
753
+ ```
754
+
755
+ ---
756
+
757
+ ## MCP Composition
758
+
759
+ Combine multiple MCP tools to create powerful workflows.
760
+
761
+ ### Example 1: Parse HTML → Extract Tables → Save to Database
762
+
763
+ ```python
764
+ # Step 1: Clean HTML
765
+ cleaned = mcp.call("beautifulsoup.clean_html", {
766
+ "html": observation.page_html
767
+ })
768
+
769
+ # Step 2: Extract tables
770
+ tables = mcp.call("beautifulsoup.extract_tables", {
771
+ "html": cleaned["html"]
772
+ })
773
+
774
+ # Step 3: Save to PostgreSQL
775
+ for table in tables:
776
+ mcp.call("postgresql.execute", {
777
+ "sql": "INSERT INTO scraped_data (data) VALUES (%s)",
778
+ "params": [json.dumps(table)]
779
+ })
780
+ ```
781
+
782
+ ### Example 2: Search Google → Navigate → Parse Article → Summarize
783
+
784
+ ```python
785
+ # Step 1: Search
786
+ results = mcp.call("google-search.search", {
787
+ "query": "best widgets 2026",
788
+ "num": 5
789
+ })
790
+
791
+ # Step 2: Navigate to top result
792
+ mcp.call("playwright.navigate", {
793
+ "url": results[0]["url"]
794
+ })
795
+
796
+ # Step 3: Extract article
797
+ article = mcp.call("readability.extract_article", {
798
+ "html": mcp.call("playwright.get_html", {})
799
+ })
800
+
801
+ # Step 4: Summarize
802
+ summary = mcp.call("llm.summarize", {
803
+ "text": article["text"],
804
+ "max_length": 200
805
+ })
806
+ ```
807
+
808
+ ### Composition DSL
809
+
810
+ Define reusable workflows:
811
+
812
+ ```python
813
+ from dataclasses import dataclass
+ from typing import Callable, Dict, List
+
+ @dataclass
+ class WorkflowStep:
+     tool: str                       # fully-qualified MCP tool name
+     params: Callable[[Dict], Dict]  # builds tool params from the workflow context
+     output_var: str                 # context key that receives the result
+
+ class MCPWorkflow:
814
+ def __init__(self, name: str, steps: List[WorkflowStep]):
815
+ self.name = name
816
+ self.steps = steps
817
+
818
+ async def execute(self, initial_input: Dict) -> Dict:
819
+ """Execute workflow steps sequentially."""
820
+ context = initial_input
821
+
822
+ for step in self.steps:
823
+ result = await mcp.call(step.tool, step.params(context))
824
+ context[step.output_var] = result
825
+
826
+ return context
827
+
828
+ # Define workflow
829
+ extract_and_save = MCPWorkflow(
830
+ name="extract_and_save",
831
+ steps=[
832
+ WorkflowStep(
833
+ tool="beautifulsoup.find_all",
834
+ params=lambda ctx: {"html": ctx["html"], "selector": ctx["selector"]},
835
+ output_var="extracted"
836
+ ),
837
+ WorkflowStep(
838
+ tool="cerberus.validate",
839
+ params=lambda ctx: {"data": ctx["extracted"], "schema": ctx["schema"]},
840
+ output_var="validated"
841
+ ),
842
+ WorkflowStep(
843
+ tool="postgresql.execute",
844
+ params=lambda ctx: {"sql": "INSERT INTO items ...", "params": ctx["validated"]},
845
+ output_var="saved"
846
+ )
847
+ ]
848
+ )
849
+
850
+ # Execute
851
+ result = await extract_and_save.execute({
852
+ "html": obs.page_html,
853
+ "selector": "div.product",
854
+ "schema": PRODUCT_SCHEMA
855
+ })
856
+ ```
857
+
858
+ ---
859
+
860
+ ## Testing Panel
861
+
862
+ Test MCP tools manually before using them in agent workflows.
863
+
864
+ ### UI
865
+
866
+ ```
867
+ ┌─────────────────────────────────────────────────────────────┐
868
+ │ MCP Testing Panel │
869
+ ├─────────────────────────────────────────────────────────────┤
870
+ │ │
871
+ │ Server: [beautifulsoup ▼] │
872
+ │ Tool: [find_all ▼] │
873
+ │ │
874
+ │ ┌──────────────────────────────────────────────────────┐ │
875
+ │ │ Input Parameters: │ │
876
+ │ │ │ │
877
+ │ │ html: │ │
878
+ │ │ ┌───────────────────────────────────────────────┐ │ │
879
+ │ │ │ <div class="item">Item 1</div> │ │ │
880
+ │ │ │ <div class="item">Item 2</div> │ │ │
881
+ │ │ └───────────────────────────────────────────────┘ │ │
882
+ │ │ │ │
883
+ │ │ selector: [div.item ] │ │
884
+ │ │ │ │
885
+ │ └──────────────────────────────────────────────────────┘ │
886
+ │ │
887
+ │ [Execute Tool] [Clear] │
888
+ │ │
889
+ │ ┌──────────────────────────────────────────────────────┐ │
890
+ │ │ Output: │ │
891
+ │ │ │ │
892
+ │ │ [ │ │
893
+ │ │ {"tag": "div", "class": "item", "text": "Item 1"}, │ │
894
+ │ │ {"tag": "div", "class": "item", "text": "Item 2"} │ │
895
+ │ │ ] │ │
896
+ │ │ │ │
897
+ │ │ Execution time: 12ms │ │
898
+ │ │ Status: ✓ Success │ │
899
+ │ └──────────────────────────────────────────────────────┘ │
900
+ │ │
901
+ │ [Save as Example] │
902
+ └─────────────────────────────────────────────────────────────┘
903
+ ```
904
+
905
+ ---
906
+
907
+ ## Configuration
908
+
909
+ ### Full MCP Configuration Example
910
+
911
+ ```json
912
+ {
913
+ "mcpServers": {
914
+ "beautifulsoup": {
915
+ "command": "python",
916
+ "args": ["-m", "mcp_beautifulsoup"],
917
+ "enabled": true,
918
+ "autoDownload": true,
919
+ "config": {
920
+ "default_parser": "lxml"
921
+ }
922
+ },
923
+ "playwright": {
924
+ "command": "npx",
925
+ "args": ["@playwright/mcp-server"],
926
+ "enabled": false,
927
+ "autoDownload": false,
928
+ "config": {
929
+ "browser": "chromium",
930
+ "headless": true
931
+ }
932
+ },
933
+ "postgresql": {
934
+ "command": "python",
935
+ "args": ["-m", "mcp_postgresql"],
936
+ "enabled": false,
937
+ "autoDownload": false,
938
+ "config": {
939
+ "host": "localhost",
940
+ "port": 5432,
941
+ "database": "scraper_db",
942
+ "user": "postgres",
943
+ "password": "${PG_PASSWORD}"
944
+ }
945
+ },
946
+ "google-search": {
947
+ "command": "python",
948
+ "args": ["-m", "mcp_google_search"],
949
+ "enabled": true,
950
+ "autoDownload": true,
951
+ "config": {
952
+ "api_key": "${GOOGLE_API_KEY}",
953
+ "search_engine_id": "${GOOGLE_SE_ID}"
954
+ }
955
+ },
956
+ "filesystem": {
957
+ "command": "npx",
958
+ "args": ["-y", "@modelcontextprotocol/server-filesystem", "./scraped_data"],
959
+ "enabled": true,
960
+ "autoDownload": true
961
+ }
962
+ },
963
+
964
+ "mcpSettings": {
965
+ "autoDiscoverTools": true,
966
+ "toolTimeout": 30,
967
+ "maxConcurrentCalls": 5,
968
+ "retryFailedCalls": true,
969
+ "cacheToolResults": true,
970
+ "cacheTTL": 3600
971
+ }
972
+ }
973
+ ```
974
+
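Values such as `${PG_PASSWORD}` suggest environment-variable substitution when the configuration is loaded. A minimal sketch of how those placeholders might be expanded (the `expand_env` helper is an illustrative assumption, not part of the spec):

```python
import os
import re

def expand_env(value):
    """Recursively replace ${VAR} placeholders with environment values (empty if unset)."""
    if isinstance(value, str):
        return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value
```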
975
+ ---
976
+
977
+ **Next:** See [settings.md](./settings.md) for complete dashboard settings.
docs/memory.md ADDED
@@ -0,0 +1,786 @@
1
+ # 🧠 Unified Memory System
2
+
3
+ ## Table of Contents
4
+ 1. [Overview](#overview)
5
+ 2. [Memory Architecture](#memory-architecture)
6
+ 3. [Memory Layers](#memory-layers)
7
+ 4. [Memory Operations](#memory-operations)
8
+ 5. [Implementation Details](#implementation-details)
9
+ 6. [Configuration](#configuration)
10
+ 7. [Best Practices](#best-practices)
11
+
12
+ ---
13
+
14
+ ## Overview
15
+
16
+ The **Unified Memory System** is the most critical upgrade for the WebScraper-OpenEnv agent. It provides persistent, contextual, and hierarchical memory across episodes, enabling the agent to learn from past experiences, maintain reasoning context, and share knowledge across multiple agents.
17
+
18
+ ### Why Memory Matters
19
+
20
+ Without memory:
21
+ - Agents repeat the same mistakes across episodes
22
+ - No learning from successful extraction patterns
23
+ - Cannot maintain context across long scraping sessions
24
+ - Unable to share knowledge between multiple agents
25
+ - Limited by context window size
26
+
27
+ With unified memory:
28
+ - ✅ Learn successful extraction strategies
29
+ - ✅ Remember failed approaches to avoid repetition
30
+ - ✅ Maintain reasoning context across steps
31
+ - ✅ Share discoveries across agent instances
32
+ - ✅ Overcome context window limitations
33
+
34
+ ---
35
+
36
+ ## Memory Architecture
37
+
38
+ ```
39
+ ┌─────────────────────────────────────────────────────────────────┐
40
+ │ Unified Memory System │
41
+ ├─────────────────────────────────────────────────────────────────┤
42
+ │ │
43
+ │ ┌────────────────┐ ┌────────────────┐ ┌──────────────────┐ │
44
+ │ │ Short-Term │ │ Working │ │ Long-Term │ │
45
+ │ │ Memory │ │ Memory │ │ Memory │ │
46
+ │ │ (Episode) │ │ (Reasoning) │ │ (Persistent) │ │
47
+ │ └────────┬───────┘ └───────┬────────┘ └────────┬─────────┘ │
48
+ │ │ │ │ │
49
+ │ └──────────────────┼─────────────────────┘ │
50
+ │ │ │
51
+ │ ┌─────────▼──────────┐ │
52
+ │ │ Memory Router │ │
53
+ │ │ - Query planner │ │
54
+ │ │ - Context builder │ │
55
+ │ │ - Summarizer │ │
56
+ │ └─────────┬──────────┘ │
57
+ │ │ │
58
+ │ ┌──────────────────┼──────────────────┐ │
59
+ │ │ │ │ │
60
+ │ ┌────────▼────────┐ ┌──────▼─────────┐ ┌───▼──────────┐ │
61
+ │ │ Shared Memory │ │ Vector Index │ │ MCP Storage │ │
62
+ │ │ (Multi-Agent) │ │ (FAISS/Qdrant)│ │ (File/DB) │ │
63
+ │ └─────────────────┘ └────────────────┘ └──────────────┘ │
64
+ │ │
65
+ └─────────────────────────────────────────────────────────────────┘
66
+ ```
67
+
68
+ ---
69
+
70
+ ## Memory Layers
71
+
72
+ ### 1. 🟢 Short-Term Memory (Per Episode)
73
+
74
+ **Purpose:** Tracks the current scraping session state.
75
+
76
+ **Lifecycle:** Exists for one episode, cleared on `reset()`.
77
+
78
+ **Data Structure:**
79
+ ```python
80
+ class EpisodeMemory(BaseModel):
81
+ episode_id: str
82
+ task_id: str
83
+ visited_urls: List[str] # Navigation history
84
+ extracted_data: Dict[str, Any] # Field → value mappings
85
+ actions_history: List[Action] # All actions taken
86
+ intermediate_notes: List[str] # Agent's reasoning notes
87
+ observations: List[Observation] # All observations received
88
+ page_summaries: Dict[str, str] # URL → content summary
89
+ extraction_attempts: Dict[str, List[Any]] # Field → list of attempts
90
+ timestamp_created: datetime
91
+ timestamp_updated: datetime
92
+ ```
93
+
94
+ **Use Cases:**
95
+ - Track which pages have been visited to avoid cycles
96
+ - Remember what data has been extracted
97
+ - Maintain action history for debugging
98
+ - Store intermediate reasoning
99
+
100
+ **Example:**
101
+ ```python
102
+ # Agent navigating a multi-page catalog
103
+ episode_memory = {
104
+ "visited_urls": [
105
+ "/catalog/page/1",
106
+ "/catalog/page/2",
107
+ "/product/12345"
108
+ ],
109
+ "extracted_data": {
110
+ "product_name": "Widget Pro",
111
+ "price": "$49.99"
112
+ },
113
+ "intermediate_notes": [
114
+ "Price found in span.product-price",
115
+ "Next page link present, continuing pagination"
116
+ ]
117
+ }
118
+ ```
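The `visited_urls` list above is what lets the agent avoid revisiting pages during pagination. A minimal sketch of that check (the helper names are assumptions):

```python
def should_visit(episode_memory: dict, url: str) -> bool:
    """True if the URL has not been seen in this episode (avoids pagination cycles)."""
    return url not in episode_memory["visited_urls"]

def record_visit(episode_memory: dict, url: str) -> None:
    """Append the URL to the navigation history exactly once."""
    if should_visit(episode_memory, url):
        episode_memory["visited_urls"].append(url)
```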
119
+
120
+ ### 2. 🔵 Working Memory (Agent Thinking)
121
+
122
+ **Purpose:** Temporary reasoning buffer for active decision-making.
123
+
124
+ **Lifecycle:** Cleared after each action decision, or kept for multi-step reasoning.
125
+
126
+ **Data Structure:**
127
+ ```python
128
+ class WorkingMemory(BaseModel):
129
+ current_goal: str # Active objective
130
+ reasoning_steps: List[str] # Chain of thought
131
+ considered_actions: List[Action] # Actions being evaluated
132
+ scratchpad: Dict[str, Any] # Temporary calculations
133
+ active_hypotheses: List[str] # Predictions to test
134
+ context_window: List[str] # Relevant memory chunks
135
+ attention_focus: Optional[str] # Current DOM element/area of focus
136
+ ```
137
+
138
+ **Use Cases:**
139
+ - Chain-of-thought reasoning before action selection
140
+ - Evaluate multiple action candidates
141
+ - Maintain focus during complex extraction
142
+ - Store temporary parsing results
143
+
144
+ **Example:**
145
+ ```python
146
+ working_memory = {
147
+ "current_goal": "Extract product price from listing",
148
+ "reasoning_steps": [
149
+ "Step 1: Search HTML for price indicators ($, €, price)",
150
+ "Step 2: Found 3 candidates: $49.99, $39.99 (strikethrough), $5.99 (shipping)",
151
+ "Step 3: $49.99 is in <span class='product-price'>, most likely correct",
152
+ "Step 4: Extract using selector span.product-price"
153
+ ],
154
+ "considered_actions": [
155
+ Action(action_type="EXTRACT_FIELD", selector="span.price"),
156
+ Action(action_type="EXTRACT_FIELD", selector="span.product-price"),
157
+ Action(action_type="SEARCH_PAGE", query="price.*\\$\\d+")
158
+ ],
159
+ "attention_focus": "div.product-details"
160
+ }
161
+ ```
162
+
163
+ ### 3. 🟡 Long-Term Memory (Persistent)
164
+
165
+ **Purpose:** Store learned patterns, strategies, and historical data across all episodes.
166
+
167
+ **Lifecycle:** Persists indefinitely via MCP storage and vector database.
168
+
169
+ **Data Structure:**
170
+ ```python
171
+ class LongTermMemory(BaseModel):
172
+ # Vector embeddings for semantic search
173
+ embeddings_index: VectorIndex # FAISS, Qdrant, or Pinecone
174
+
175
+ # Successful extraction patterns
176
+ learned_patterns: List[ExtractionPattern]
177
+
178
+ # Historical performance data
179
+ past_episodes: List[EpisodeSummary]
180
+
181
+ # Failed attempts (to avoid repetition)
182
+ failed_patterns: List[FailedPattern]
183
+
184
+ # Domain knowledge
185
+ website_schemas: Dict[str, WebsiteSchema] # domain → common patterns
186
+
187
+ # Selector library
188
+ selector_success_rate: Dict[str, float] # selector → success rate
189
+ ```
190
+
191
+ **Extraction Pattern:**
192
+ ```python
193
+ class ExtractionPattern(BaseModel):
194
+ pattern_id: str
195
+ field_name: str # e.g., "price"
196
+ selector: str # e.g., "span.product-price"
197
+ selector_type: str # "css" | "xpath" | "label"
198
+ success_count: int # How many times it worked
199
+ failure_count: int # How many times it failed
200
+ domains: List[str] # Which websites it works on
201
+ confidence: float # 0.0 to 1.0
202
+ examples: List[str] # Sample extracted values
203
+ created_at: datetime
204
+ last_used: datetime
205
+ ```
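The `confidence` field can be derived from the success/failure counters. One plausible formula — a Laplace-smoothed success rate, which is an assumption here rather than something the schema mandates — keeps unproven patterns near 0.5 instead of jumping straight to 1.0:

```python
def pattern_confidence(success_count: int, failure_count: int) -> float:
    """Laplace-smoothed success rate: (s + 1) / (s + f + 2)."""
    return (success_count + 1) / (success_count + failure_count + 2)
```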
206
+
207
+ **Use Cases:**
208
+ - Retrieve successful selectors for similar tasks
209
+ - Avoid repeating failed extraction attempts
210
+ - Learn website-specific patterns
211
+ - Build a library of proven strategies
212
+
213
+ **Example Query:**
214
+ ```python
215
+ # Agent needs to extract "price" from a new e-commerce page
216
+ similar_patterns = long_term_memory.search(
217
+ query="price extraction e-commerce",
218
+ filters={"field_name": "price", "confidence": ">0.8"},
219
+ limit=5
220
+ )
221
+
222
+ # Returns:
223
+ [
224
+ ExtractionPattern(
225
+ selector="span.product-price",
226
+ success_count=42,
227
+ confidence=0.95,
228
+ domains=["shop.example.com", "store.example.org"]
229
+ ),
230
+ ExtractionPattern(
231
+ selector="div.price-box span[itemprop='price']",
232
+ success_count=38,
233
+ confidence=0.92,
234
+ domains=["ecommerce.example.net"]
235
+ ),
236
+ ...
237
+ ]
238
+ ```
239
+
240
+ ### 4. 🔴 Shared Memory (Multi-Agent)
241
+
242
+ **Purpose:** Enable knowledge sharing across multiple agent instances.
243
+
244
+ **Lifecycle:** Persistent, synchronized across all agents.
245
+
246
+ **Data Structure:**
247
+ ```python
248
+ class SharedMemory(BaseModel):
249
+ global_knowledge_base: Dict[str, Any] # Shared facts and patterns
250
+ agent_messages: List[AgentMessage] # Inter-agent communication
251
+ task_state: Dict[str, TaskState] # Collaborative task status
252
+ distributed_discoveries: List[Discovery] # Findings from all agents
253
+ consensus_data: Dict[str, ConsensusValue] # Voted/validated facts
254
+ ```
255
+
256
+ **Use Cases:**
257
+ - Multiple agents scraping different sections of a large site
258
+ - Collaborative fact verification
259
+ - Distributed catalog scraping
260
+ - Consensus-based data validation
261
+
262
+ **Example:**
263
+ ```python
264
+ # Agent A discovers a pattern
265
+ agent_a.shared_memory.broadcast(
266
+ AgentMessage(
267
+ sender="agent_a",
268
+ message_type="PATTERN_DISCOVERED",
269
+ data={
270
+ "pattern": "Product SKU always in span.sku-code",
271
+ "confidence": 0.89,
272
+ "domain": "shop.example.com"
273
+ }
274
+ )
275
+ )
276
+
277
+ # Agent B receives and applies the pattern
278
+ agent_b_discovers = agent_b.shared_memory.receive_messages(
279
+ message_type="PATTERN_DISCOVERED"
280
+ )
281
+ # Agent B can now use this selector without rediscovering it
282
+ ```
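The `consensus_data` field suggests majority voting over values reported by different agents. A minimal sketch — the quorum rule below is an assumption:

```python
from collections import Counter

def consensus_value(reports, quorum: int = 2):
    """Return the most-reported value once at least `quorum` agents agree, else None."""
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= quorum else None
```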
283
+
284
+ ---
285
+
286
+ ## Memory Operations
287
+
288
+ ### Core Actions
289
+
290
+ The memory system exposes the following actions to the agent:
291
+
292
+ #### 1. WRITE_MEMORY
293
+ Store information in the appropriate memory layer.
294
+
295
+ ```python
296
+ class WriteMemoryAction(Action):
297
+ action_type: Literal["WRITE_MEMORY"]
298
+ memory_layer: Literal["short_term", "working", "long_term", "shared"]
299
+ key: str
300
+ value: Any
301
+ metadata: Optional[Dict[str, Any]] = None
302
+ ttl: Optional[int] = None # Time-to-live in seconds (for working memory)
303
+ ```
304
+
305
+ **Example:**
306
+ ```python
307
+ # Store a successful extraction pattern
308
+ Action(
309
+ action_type="WRITE_MEMORY",
310
+ memory_layer="long_term",
311
+ key="pattern:price:span.product-price",
312
+ value={
313
+ "selector": "span.product-price",
314
+ "field": "price",
315
+ "success_count": 1,
316
+ "domain": "shop.example.com"
317
+ },
318
+ metadata={"task_id": "task_medium", "episode_id": "ep_123"}
319
+ )
320
+ ```
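The optional `ttl` field above implies that working-memory entries can expire. A minimal in-process sketch (the class and method names are assumptions):

```python
import time

class TTLStore:
    """Working-memory store whose entries can expire after `ttl` seconds."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def write(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def read(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazy eviction on read
            return None
        return value
```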
321
+
322
+ #### 2. READ_MEMORY
323
+ Retrieve information from memory.
324
+
325
+ ```python
326
+ class ReadMemoryAction(Action):
327
+ action_type: Literal["READ_MEMORY"]
328
+ memory_layer: Literal["short_term", "working", "long_term", "shared"]
329
+ key: Optional[str] = None # Specific key (exact match)
330
+ query: Optional[str] = None # Semantic search query
331
+ filters: Optional[Dict] = None # Metadata filters
332
+ limit: int = 10 # Max results
333
+ ```
334
+
335
+ **Example:**
336
+ ```python
337
+ # Semantic search for price extraction patterns
338
+ Action(
339
+ action_type="READ_MEMORY",
340
+ memory_layer="long_term",
341
+ query="how to extract price from e-commerce product page",
342
+ filters={"field_name": "price", "confidence": ">0.7"},
343
+ limit=5
344
+ )
345
+ ```
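Filters such as `{"confidence": ">0.7"}` mix exact matches with string-encoded comparisons. One way they could be evaluated — the parser below is an illustrative assumption, not the documented implementation:

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def matches(item: dict, filters: dict) -> bool:
    """True if `item` satisfies every filter; strings like '>0.7' compare numerically."""
    for field, cond in filters.items():
        value = item.get(field)
        m = re.match(r"(>=|<=|>|<)\s*(.+)", cond) if isinstance(cond, str) else None
        if m:
            if value is None or not _OPS[m.group(1)](float(value), float(m.group(2))):
                return False
        elif value != cond:
            return False
    return True
```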
346
+
347
+ #### 3. SEARCH_MEMORY
348
+ Advanced semantic search across memory layers.
349
+
350
+ ```python
351
+ class SearchMemoryAction(Action):
352
+ action_type: Literal["SEARCH_MEMORY"]
353
+ query: str # Natural language query
354
+ memory_layers: List[str] # Which layers to search
355
+ search_mode: Literal["semantic", "keyword", "hybrid"]
356
+ time_range: Optional[TimeRange] # Filter by recency
357
+ min_relevance: float = 0.5 # Minimum similarity score
358
+ ```
359
+
360
+ **Example:**
361
+ ```python
362
+ # Find all successful pagination strategies
363
+ Action(
364
+ action_type="SEARCH_MEMORY",
365
+ query="successful pagination next page navigation strategies",
366
+ memory_layers=["long_term", "shared"],
367
+ search_mode="semantic",
368
+ min_relevance=0.7
369
+ )
370
+ ```
371
+
372
+ #### 4. SUMMARIZE_MEMORY
373
+ Compress and summarize memory to manage context window.
374
+
375
+ ```python
376
+ class SummarizeMemoryAction(Action):
377
+ action_type: Literal["SUMMARIZE_MEMORY"]
378
+ memory_layer: str
379
+ summarization_strategy: Literal["importance", "recency", "relevance"]
380
+ target_size: int # Target summary size in tokens
381
+ preserve_keys: List[str] # Never summarize these
382
+ ```
383
+
384
+ #### 5. PRUNE_MEMORY
385
+ Remove low-value or outdated memories.
386
+
387
+ ```python
388
+ class PruneMemoryAction(Action):
389
+ action_type: Literal["PRUNE_MEMORY"]
390
+ memory_layer: str
391
+ pruning_strategy: Literal["lru", "low_confidence", "old_age"]
392
+ threshold: float # Confidence/age threshold
393
+ ```
394
+
395
+ ---
396
+
397
+ ## Implementation Details
398
+
399
+ ### Vector Database Integration
400
+
401
+ **Supported Backends:**
402
+ - **FAISS** (default, local, no external dependencies)
403
+ - **Qdrant** (distributed, production-ready)
404
+ - **Pinecone** (managed, cloud-based)
405
+ - **Weaviate** (open-source, GraphQL API)
406
+
407
+ **Configuration:**
408
+ ```python
409
+ class VectorDBConfig(BaseModel):
410
+ provider: Literal["faiss", "qdrant", "pinecone", "weaviate"]
411
+ embedding_model: str = "text-embedding-3-small" # OpenAI
412
+ dimension: int = 1536
413
+ similarity_metric: Literal["cosine", "euclidean", "dot_product"] = "cosine"
414
+ index_type: str = "IVF" # FAISS-specific
415
+ connection_params: Dict[str, Any] # Provider-specific
416
+ ```
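For reference, the default `cosine` similarity metric is just the dot product of the two vectors divided by their norms. A minimal sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```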
417
+
418
+ **Embedding Pipeline:**
419
+ ```python
420
+ class MemoryEmbedder:
421
+ def embed_pattern(self, pattern: ExtractionPattern) -> np.ndarray:
422
+ """Convert extraction pattern to embedding."""
423
+ text = f"""
424
+ Field: {pattern.field_name}
425
+ Selector: {pattern.selector}
426
+ Type: {pattern.selector_type}
427
+ Context: {' '.join(pattern.examples[:3])}
428
+ """
429
+ return self.embedding_model.encode(text)
430
+
431
+ def embed_query(self, query: str) -> np.ndarray:
432
+ """Convert search query to embedding."""
433
+ return self.embedding_model.encode(query)
434
+ ```
435
+
436
+ ### MCP Storage Integration
437
+
438
+ **Storage Backends:**
439
+ - **File System MCP** (local JSON/SQLite files)
440
+ - **PostgreSQL MCP** (relational storage)
441
+ - **MongoDB MCP** (document storage)
442
+ - **Redis MCP** (fast cache + pub/sub for shared memory)
443
+
444
+ **Example MCP Configuration:**
445
+ ```json
446
+ {
447
+ "mcpServers": {
448
+ "memory-storage": {
449
+ "command": "npx",
450
+ "args": ["-y", "@modelcontextprotocol/server-filesystem", "./memory_data"],
451
+ "enabled": true,
452
+ "autoDownload": false
453
+ },
454
+ "memory-cache": {
455
+ "command": "redis-mcp-server",
456
+ "args": ["--host", "localhost", "--port", "6379"],
457
+ "enabled": true,
458
+ "autoDownload": true
459
+ }
460
+ }
461
+ }
462
+ ```
463
+
464
+ ### Memory Router
465
+
466
+ The **Memory Router** intelligently decides which memory layer to query based on the request:
467
+
468
+ ```python
469
+ class MemoryRouter:
470
+ def route_query(self, query: str, context: Dict) -> List[str]:
471
+ """Determine which memory layers to search."""
472
+ layers = []
473
+
474
+ # Recent action history → short-term
475
+ if "last few" in query or "current episode" in query:
476
+ layers.append("short_term")
477
+
478
+ # Active reasoning → working
479
+ if "consider" in query or "evaluate" in query:
480
+ layers.append("working")
481
+
482
+ # Historical patterns → long-term
483
+ if "similar" in query or "previously" in query or "learned" in query:
484
+ layers.append("long_term")
485
+
486
+ # Other agents' discoveries → shared
487
+ if "other agents" in query or "consensus" in query:
488
+ layers.append("shared")
489
+
490
+ return layers if layers else ["long_term"] # Default
491
+ ```
492
+
493
+ ### Context Window Optimization
494
+
495
+ **Problem:** LLMs have limited context windows. Memory must be compressed.
496
+
497
+ **Solutions:**
498
+
499
+ 1. **Hierarchical Summarization:**
500
+ ```python
501
+ class MemorySummarizer:
502
+ def summarize_episode(self, episode_memory: EpisodeMemory) -> str:
503
+ """Compress episode into key points."""
504
+ summary = f"Episode {episode_memory.episode_id} ({episode_memory.task_id}):\n"
505
+ summary += f"- Visited {len(episode_memory.visited_urls)} pages\n"
506
+ summary += f"- Extracted {len(episode_memory.extracted_data)} fields\n"
507
+ summary += f"- {len(episode_memory.actions_history)} actions taken\n"
508
+
509
+ # Highlight key discoveries
510
+ if episode_memory.intermediate_notes:
511
+ summary += f"\nKey findings:\n"
512
+ for note in episode_memory.intermediate_notes[-3:]: # Last 3 notes
513
+ summary += f" • {note}\n"
514
+
515
+ return summary
516
+ ```
517
+
518
+ 2. **Importance Scoring:**
519
+ ```python
520
+ class MemoryImportanceScorer:
521
+ def score(self, memory_item: Any) -> float:
522
+ """Rate importance of memory (0.0 to 1.0)."""
523
+ score = 0.0
524
+
525
+ # Recency bonus
526
+ age_days = (datetime.now() - memory_item.created_at).days
527
+ score += max(0, 1.0 - age_days / 30) * 0.3
528
+
529
+ # Success rate bonus
530
+ if hasattr(memory_item, 'success_count'):
531
+ score += memory_item.confidence * 0.4
532
+
533
+ # Usage frequency bonus
534
+ if hasattr(memory_item, 'last_used'):
535
+ days_since_use = (datetime.now() - memory_item.last_used).days
536
+ score += max(0, 1.0 - days_since_use / 7) * 0.3
537
+
538
+ return min(score, 1.0)
539
+ ```
540
+
541
+ 3. **Automatic Pruning:**
542
+ ```python
543
+ class MemoryPruner:
544
+ def prune_low_value(self, memory_store: Dict, threshold: float = 0.3):
545
+ """Remove memories below importance threshold."""
546
+ scorer = MemoryImportanceScorer()
547
+ to_remove = []
548
+
549
+ for key, item in memory_store.items():
550
+ if scorer.score(item) < threshold:
551
+ to_remove.append(key)
552
+
553
+ for key in to_remove:
554
+ del memory_store[key]
555
+
556
+ return len(to_remove)
557
+ ```
558
+
559
+ ---
560
+
561
+ ## Configuration
562
+
563
+ ### Settings Panel
564
+
565
+ **Memory Settings Tab:**
566
+ ```python
567
+ class MemorySettings(BaseModel):
568
+ # Enable/disable layers
569
+ enable_short_term: bool = True
570
+ enable_working: bool = True
571
+ enable_long_term: bool = True
572
+ enable_shared: bool = False # Off by default (multi-agent)
573
+
574
+ # Size limits
575
+ max_episode_memory_mb: int = 10
576
+ max_working_memory_items: int = 50
577
+ max_long_term_patterns: int = 10000
578
+
579
+ # Vector DB settings
580
+ vector_db_provider: str = "faiss"
581
+ embedding_model: str = "text-embedding-3-small"
582
+
583
+ # MCP storage settings
584
+ storage_backend: str = "filesystem"
585
+ storage_path: str = "./memory_data"
586
+
587
+ # Pruning settings
588
+ auto_prune: bool = True
589
+ prune_threshold: float = 0.3
590
+ prune_interval_hours: int = 24
591
+
592
+ # Context window optimization
593
+ auto_summarize: bool = True
594
+ max_context_tokens: int = 4000
595
+ ```
596
+
597
+ **UI Example:**
598
+ ```
599
+ ┌─────────────────────────────────────────────────────────────┐
600
+ │ Memory Settings │
601
+ ├─────────────────────────────────────────────────────────────┤
602
+ │ │
603
+ │ ☑ Enable Short-Term Memory (Episode) │
604
+ │ ☑ Enable Working Memory (Reasoning) │
605
+ │ ☑ Enable Long-Term Memory (Persistent) │
606
+ │ ☐ Enable Shared Memory (Multi-Agent) │
607
+ │ │
608
+ │ Memory Size Limits: │
609
+ │ Short-Term: [10] MB per episode │
610
+ │ Working: [50] items max │
611
+ │ Long-Term: [10000] patterns max │
612
+ │ │
613
+ │ Vector Database: │
614
+ │ Provider: [FAISS ▼] │
615
+ │ Embedding: [text-embedding-3-small ▼] │
616
+ │ │
617
+ │ Storage Backend: │
618
+ │ Type: [Filesystem ▼] │
619
+ │ Path: [./memory_data ] [Browse] │
620
+ │ │
621
+ │ Auto-Pruning: │
622
+ │ ☑ Enabled │
623
+ │ Threshold: [0.3] (0.0 = keep all, 1.0 = keep only best) │
624
+ │ Interval: [24] hours │
625
+ │ │
626
+ │ [Save Settings] [Reset to Defaults] │
627
+ └─────────────────────────────────────────────────────────────┘
628
+ ```
629
+
630
+ ---
631
+
632
+ ## Best Practices
633
+
634
+ ### 1. Memory Hygiene
635
+ ✅ **Do:**
636
+ - Summarize episode memory before storing in long-term
637
+ - Prune low-confidence patterns regularly
638
+ - Validate patterns before adding to long-term memory
639
+ - Tag memories with metadata (task_id, domain, confidence)
640
+
641
+ ❌ **Don't:**
642
+ - Store raw HTML in long-term memory (use summaries)
643
+ - Keep failed patterns without analysis
644
+ - Allow unbounded memory growth
645
+ - Store sensitive data without encryption
646
+
647
+ ### 2. Query Optimization
648
+ ✅ **Do:**
649
+ - Use semantic search for conceptual queries ("how to extract price")
650
+ - Use exact key lookup for known patterns
651
+ - Apply filters to narrow search space
652
+ - Limit results to top-K most relevant
653
+
654
+ ❌ **Don't:**
655
+ - Search all layers for every query (route intelligently)
656
+ - Ignore relevance scores (filter low scores)
657
+ - Retrieve full objects when summaries suffice
658
+
659
+ ### 3. Context Window Management
660
+ ✅ **Do:**
661
+ - Prioritize recent and high-confidence memories
662
+ - Summarize old episodes aggressively
663
+ - Use hierarchical memory retrieval (summary → details on demand)
664
+ - Monitor token usage and trigger summarization proactively
665
+
666
+ ❌ **Don't:**
667
+ - Include entire memory in every agent call
668
+ - Ignore context window limits
669
+ - Retrieve memories without relevance ranking
670
+
671
+ ### 4. Multi-Agent Coordination
672
+ ✅ **Do:**
673
+ - Broadcast significant discoveries to shared memory
674
+ - Implement consensus mechanisms for conflicting data
675
+ - Use message queues for asynchronous updates
676
+ - Version shared knowledge to handle conflicts
677
+
678
+ ❌ **Don't:**
679
+ - Allow race conditions on shared writes
680
+ - Broadcast every minor action (creates noise)
681
+ - Trust shared data without validation
682
+
683
+ ---
684
+
685
+ ## Performance Metrics
686
+
687
+ Track these metrics to evaluate memory system effectiveness:
688
+
689
+ ```python
690
+ class MemoryMetrics(BaseModel):
691
+ # Retrieval performance
692
+ avg_retrieval_time_ms: float
693
+ cache_hit_rate: float
694
+
695
+ # Effectiveness
696
+ pattern_reuse_rate: float # % of times learned patterns helped
697
+ memory_assisted_success_rate: float # Success with vs without memory
698
+
699
+ # Efficiency
700
+ memory_size_mb: float
701
+ pruned_items_count: int
702
+ summarization_ratio: float # Compressed size / original size
703
+
704
+ # Quality
705
+ avg_pattern_confidence: float
706
+ false_positive_rate: float # Patterns that failed when reused
707
+ ```
708
+
709
+ ---
710
+
711
+ ## Example Usage
712
+
713
+ ### Full Episode with Memory
714
+
715
+ ```python
716
+ # Initialize environment with memory
717
+ env = WebScraperEnv(memory_config=MemorySettings())
718
+
719
+ # Reset episode
720
+ obs = env.reset(task_id="task_medium", seed=42)
721
+
722
+ # Agent checks long-term memory for similar tasks
723
+ memory_query = Action(
724
+ action_type="SEARCH_MEMORY",
725
+ query=f"successful extraction patterns for {obs.task_description}",
726
+ memory_layers=["long_term"],
727
+ search_mode="semantic",
728
+ limit=5
729
+ )
730
+ obs, _, _, _ = env.step(memory_query)
+ similar_patterns = obs.memory_results  # hypothetical field carrying retrieved patterns
731
+
732
+ # Agent reasons using working memory
733
+ working_memory = {
734
+ "current_goal": "Extract product price",
735
+ "reasoning_steps": [
736
+ f"Retrieved {len(similar_patterns)} similar patterns",
737
+ f"Top pattern: {similar_patterns[0].selector} (confidence: {similar_patterns[0].confidence})",
738
+ "Will try this selector first"
739
+ ],
740
+ "considered_actions": [...]
741
+ }
742
+
743
+ # Agent extracts using learned pattern
744
+ extract_action = Action(
745
+ action_type="EXTRACT_FIELD",
746
+ target_field="price",
747
+ selector=similar_patterns[0].selector
748
+ )
749
+ obs, reward, done, info = env.step(extract_action)
750
+
751
+ # If successful, reinforce the pattern
752
+ if reward.value > 0:
753
+ env.step(Action(
754
+ action_type="WRITE_MEMORY",
755
+ memory_layer="long_term",
756
+ key=f"pattern:price:{similar_patterns[0].selector}",
757
+ value={
758
+ **similar_patterns[0].dict(),
759
+ "success_count": similar_patterns[0].success_count + 1,
760
+ "last_used": datetime.now()
761
+ }
762
+ ))
763
+
764
+ # Store episode summary
765
+ if done:
766
+ env.step(Action(
767
+ action_type="WRITE_MEMORY",
768
+ memory_layer="long_term",
769
+ key=f"episode:{obs.episode_id}",
770
+ value=env.summarize_episode()
771
+ ))
772
+ ```
773
+
774
+ ---
775
+
776
+ ## Future Enhancements
777
+
778
+ - **Active Learning:** Agent can request human labeling for ambiguous patterns
779
+ - **Federated Memory:** Share memory across organizations without revealing raw data
780
+ - **Memory Replay:** Train on stored episodes for offline RL
781
+ - **Causal Memory:** Track cause-effect relationships between actions and outcomes
782
+ - **Memory Debugging:** Visualize which memories influenced each decision
783
+
784
+ ---
785
+
786
+ **Next:** See [api.md](./api.md) for multi-model API integration.
docs/observability.md ADDED
@@ -0,0 +1,147 @@
# Observability and Dashboard

## Overview

Observability provides deep insight into runtime behavior, model usage, tool execution, memory quality, and rewards.

## Dashboard Sections

### 1. Live Thought Stream

- chronological reasoning notes
- model/router choice trace
- action confidence timeline
- override events

### 2. Navigation Map

Graph of visited pages:

- nodes = URLs
- edges = transitions
- node color = relevance/confidence
- revisit highlighting

### 3. MCP Usage Panel

- tool call count by server
- avg latency by tool
- error rate and retries
- top successful tool chains

### 4. Memory Viewer

- inspect short/working/long/shared memory
- filter by task/domain/confidence
- edit/delete entries
- prune previews

### 5. Reward Analytics

- per-step reward breakdown
- component contribution trends
- penalty heatmap
- episode comparison

### 6. Cost and Token Monitor

- per-provider usage
- per-model token counts
- cumulative cost vs budget
- forecasted burn rate

## Core Metrics

### Agent Metrics

- task completion rate
- avg steps to completion
- recovery score
- generalization score
- exploration ratio

### Tool Metrics

- tool success rate
- timeout ratio
- fallback frequency
- schema validation failures

### Memory Metrics

- retrieval hit rate
- relevance score distribution
- prune rate
- memory-assisted success ratio

### Search Metrics

- query success rate
- multi-hop depth distribution
- credibility score average
- duplicate result ratio

## Logging Model

Structured logs (JSON):

```json
{
  "timestamp": "2026-03-27T00:00:00Z",
  "episode_id": "ep_123",
  "step": 7,
  "event": "tool_call",
  "tool": "beautifulsoup.find_all",
  "latency_ms": 54,
  "success": true,
  "reward_delta": 0.08
}
```
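Records of this shape can be emitted with the standard library alone. A minimal sketch; the helper name is illustrative and the field set mirrors the example record above:

```python
import json
from datetime import datetime, timezone

def log_event(episode_id: str, step: int, event: str, **fields) -> str:
    """Serialize one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "episode_id": episode_id,
        "step": step,
        "event": event,
        **fields,  # free-form event-specific fields, e.g. tool, latency_ms
    }
    return json.dumps(record)

line = log_event("ep_123", 7, "tool_call", tool="beautifulsoup.find_all",
                 latency_ms=54, success=True, reward_delta=0.08)
```

One JSON object per line keeps the logs greppable and trivially ingestible by most log pipelines.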

## Tracing

Per-episode trace includes:

- observations
- actions
- rewards
- tool calls
- memory operations
- final submission and grader results

## Alerts

Configurable alerts:

- budget threshold crossed
- error spike
- tool outage
- memory bloat
- anomalous low reward streak

## APIs

- `GET /api/metrics/summary`
- `GET /api/metrics/timeseries`
- `GET /api/traces/{episode_id}`
- `GET /api/costs`
- `GET /api/memory/stats`
- `GET /api/tools/stats`

## Recommended Dashboard Layout

1. Top row: completion, cost, latency, error rate
2. Mid row: thought stream + navigation graph
3. Lower row: reward breakdown + MCP usage + memory viewer
4. Bottom row: raw trace and export controls

## Export and Audit

Exports:

- JSON trace
- CSV metrics
- reward analysis report
- model usage report

All exports include episode and configuration fingerprints for reproducibility.
docs/openenv.md ADDED
@@ -0,0 +1,220 @@
# OpenEnv Specification (Enhanced)

## Overview

This document defines the OpenEnv contract for WebScraper-OpenEnv with advanced memory, MCP tooling, multi-model routing, and long-page batch handling.

## Core Interfaces

### Observation

```python
class Observation(BaseModel):
    episode_id: str
    task_id: str
    step_number: int
    current_url: str
    page_html: str
    page_title: str
    available_actions: list[str]
    extracted_so_far: dict
    pages_visited: list[str]
    budget_remaining: int
    task_description: str
    target_fields: list[str]
    hints: list[str]

    # Enhanced
    memory_context: dict | None
    tool_registry_snapshot: list[dict] | None
    search_results: list[dict] | None
    page_chunks: list[dict] | None
```

### Action

```python
class Action(BaseModel):
    action_type: str

    # Existing
    target_field: str | None = None
    selector: str | None = None
    navigate_to: str | None = None
    submit_extraction: dict | None = None
    notes: str | None = None

    # Search
    query: str | None = None
    search_engine: str | None = None
    result_limit: int = 5

    # Verification
    field_name: str | None = None
    claimed_value: str | None = None
    verification_source: str | None = None

    # Conflict resolution
    conflicting_sources: list[str] | None = None
    chosen_source: str | None = None
    rationale: str | None = None

    # MCP + Memory
    tool_name: str | None = None
    tool_params: dict | None = None
    memory_layer: str | None = None
    memory_key: str | None = None
    memory_query: str | None = None
```

### Action Types

- `EXTRACT_FIELD`
- `NAVIGATE`
- `SEARCH_PAGE`
- `INSPECT_ELEMENT`
- `SUBMIT`
- `SKIP_PAGE`
- `SEARCH_ENGINE`
- `VERIFY_FACT`
- `RESOLVE_CONFLICT`
- `FETCH_URL`
- `MCP_TOOL_CALL`
- `WRITE_MEMORY`
- `READ_MEMORY`
- `SEARCH_MEMORY`
- `SUMMARIZE_MEMORY`
- `PRUNE_MEMORY`

### Reward

```python
class Reward(BaseModel):
    value: float
    cumulative: float
    breakdown: dict
    message: str
```

## Episode Lifecycle

```text
reset(task_id, seed?)
    -> observation(step=0)

step(action)
    -> observation, reward, done, info

state(episode_id)
    -> current snapshot
```

Terminal conditions:

- `SUBMIT` called
- budget exhausted
- max page limit reached
- fatal policy error

## State Machine

```text
RESET -> RUNNING -> TERMINAL
            |
            +-- NAVIGATE / EXTRACT / SEARCH / VERIFY / MCP / MEMORY
```

## Task Profiles

### Easy

- single-page extraction
- low noise
- hints enabled

### Medium

- pagination
- moderate noise
- partial hints

### Hard

- multi-hop search
- conflicting sources
- verification required
- no hints

## Long Page Handling

When HTML exceeds token/size thresholds:

1. Semantic segmentation
2. Adaptive chunking
3. Batch extraction
4. Merge + dedupe + confidence rank
5. Optional diff-based incremental update
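Step 4 (merge + dedupe + confidence rank) can be sketched as follows; the per-chunk result shape is an assumption for illustration:

```python
def merge_chunk_results(chunk_results: list[dict]) -> dict:
    """Merge per-chunk extractions, keeping the highest-confidence value per field.

    Each item is assumed to look like:
    {"field": str, "value": str, "confidence": float}
    """
    best: dict[str, dict] = {}
    for item in chunk_results:
        field = item["field"]
        # Dedupe: later chunks only win if they are more confident
        if field not in best or item["confidence"] > best[field]["confidence"]:
            best[field] = item

    # Rank fields by confidence for downstream reporting
    ranked = sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
    return {r["field"]: r for r in ranked}
```

The diff-based incremental update in step 5 would re-run this merge only over chunks whose content hash changed since the last fetch.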

## MCP Integration Contract

On each step, the environment may expose:

- tool registry snapshot
- per-tool input/output schema
- timeout and retry policy

Tool calls are evaluated for:

- correctness
- efficiency
- safety constraints

## Search Engine Contract

Search action supports provider routing:

- Google
- Bing
- Brave
- DuckDuckGo
- Perplexity
- custom providers

The environment stores query and result metadata for observability.

## Memory Contract

Layers:

- short-term (episode)
- working (reasoning)
- long-term (persistent)
- shared (multi-agent)

Mandatory metadata for write operations:

- `episode_id`
- `task_id`
- `confidence`
- `source`
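The mandatory-metadata rule can be enforced with a small validator at the WRITE_MEMORY boundary. A sketch; the function name and the confidence-range check are illustrative choices:

```python
REQUIRED_METADATA = {"episode_id", "task_id", "confidence", "source"}

def validate_memory_write(metadata: dict) -> None:
    """Reject WRITE_MEMORY payloads missing mandatory metadata."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"memory write missing metadata: {sorted(missing)}")
    # Confidence is assumed to be a probability-like score in [0, 1]
    if not 0.0 <= metadata["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
```

Rejecting writes early keeps the long-term and shared layers queryable by episode, task, and confidence later on.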

## API Surface

- `POST /api/reset`
- `POST /api/step`
- `GET /api/state/{episode_id}`
- `GET /api/tasks`
- `GET /api/reward/{episode_id}`
- `GET /api/tool-registry`
- `POST /api/tool-test`

## Determinism

Given `task_id + seed + config`, the environment should be reproducible for grading and benchmarking.
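One way to obtain this reproducibility is to derive every random stream from a stable digest of the identifying inputs. A sketch; the exact derivation scheme is an assumption, not part of the contract:

```python
import hashlib
import json
import random

def derive_rng(task_id: str, seed: int, config: dict) -> random.Random:
    """Build a deterministic RNG from the episode's identifying inputs."""
    # sort_keys makes the digest independent of dict insertion order
    payload = json.dumps({"task_id": task_id, "seed": seed, "config": config},
                         sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return random.Random(int(digest, 16))

a = derive_rng("task_001", 42, {"noise": "low"}).random()
b = derive_rng("task_001", 42, {"noise": "low"}).random()
```

Identical inputs yield identical streams, while any change to the task, seed, or config produces an independent stream.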

## Safety and Guardrails

- enforce max steps and request budgets
- enforce MCP tool allowlist/denylist
- prevent secret leakage from tool outputs
- sanitize logs and traces
docs/rewards.md ADDED
@@ -0,0 +1,637 @@
# 🎯 Advanced Reward Function

## Table of Contents

1. [Overview](#overview)
2. [Reward Components](#reward-components)
3. [Planning Quality](#planning-quality)
4. [Recovery Ability](#recovery-ability)
5. [Exploration Bonus](#exploration-bonus)
6. [Redundancy Penalty](#redundancy-penalty)
7. [Generalization Score](#generalization-score)
8. [Tool Usage Efficiency](#tool-usage-efficiency)
9. [Memory Utilization](#memory-utilization)
10. [Final Reward Formula](#final-reward-formula)
11. [Configuration](#configuration)

---

## Overview

The **Advanced Reward Function** provides dense, interpretable signals that guide the agent toward intelligent, efficient, and generalizable web scraping strategies.

### Design Principles

1. **Dense Rewards:** Provide feedback at every step, not just terminal states
2. **Interpretable:** Each component has a clear purpose agents (and humans) can understand
3. **Balanced:** Prevent reward hacking by balancing conflicting objectives
4. **Adaptive:** Adjust weights based on task difficulty and agent progress

### Basic vs Advanced

**Basic Reward (existing):**

```python
reward = task_completion_score  # 0.0 to 1.0
```

**Advanced Reward:**

```python
reward = (
    w1 * task_completion +
    w2 * efficiency +
    w3 * planning_quality +
    w4 * recovery_ability +
    w5 * exploration_bonus +
    w6 * tool_usage +
    w7 * memory_usage +
    w8 * generalization
) - penalties
```

---

## Reward Components

### 1. Task Completion (w1 = 0.40)

**Purpose:** Measure how much of the task is complete.

**Calculation:**

```python
def task_completion_score(extracted: Dict, ground_truth: Dict) -> float:
    """Score based on field completeness and accuracy."""
    if not ground_truth:
        return 0.0

    total_fields = len(ground_truth)
    correct_fields = 0
    partial_fields = 0

    for field, true_value in ground_truth.items():
        extracted_value = extracted.get(field)

        if extracted_value is None:
            continue  # Missing field, 0 points

        # Exact match
        if normalize(extracted_value) == normalize(true_value):
            correct_fields += 1
        # Partial match (fuzzy)
        elif similarity(extracted_value, true_value) > 0.7:
            partial_fields += 1

    score = (correct_fields + 0.5 * partial_fields) / total_fields
    return score
```

**Example:**

```python
# Task: Extract name, price, rating
ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}

# Agent extracted 2/3 correctly
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}
task_completion = 2/3  # = 0.67
```
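`normalize` and `similarity` are referenced above but not defined here. A minimal standard-library sketch, matching the 0.7 fuzzy threshold used in the scoring code; the real implementations may differ:

```python
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't fail a match."""
    return " ".join(str(value).lower().split())

def similarity(a: str, b: str) -> float:
    """Fuzzy match ratio in [0, 1] between two normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For domain-specific fields (prices, dates) a field-aware normalizer would be more robust than string similarity alone.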

---

### 2. Efficiency (w2 = 0.15)

**Purpose:** Reward completing tasks quickly with fewer actions.

**Calculation:**

```python
def efficiency_score(steps_taken: int, max_steps: int, pages_visited: int, task: Task) -> float:
    """Lower steps and pages = higher efficiency."""
    # Step efficiency
    step_efficiency = 1.0 - (steps_taken / max_steps)

    # Page efficiency (prefer fewer page visits)
    ideal_pages = estimate_ideal_page_count(task)
    page_efficiency = 1.0 - abs(pages_visited - ideal_pages) / ideal_pages
    page_efficiency = max(0.0, page_efficiency)

    return 0.7 * step_efficiency + 0.3 * page_efficiency
```

**Example:**

```python
# Task with max 20 steps
steps_taken = 8
step_efficiency = 1.0 - (8/20)   # = 0.60, good!

steps_taken = 18
step_efficiency = 1.0 - (18/20)  # = 0.10, inefficient
```

---

## Planning Quality

### 3. Planning Quality Score (w3 = 0.10)

**Purpose:** Reward agents that plan before acting.

**Signals:**

- Used WRITE_MEMORY with reasoning notes
- Actions follow a coherent strategy
- Fewer backtracking actions

**Calculation:**

```python
def planning_quality_score(episode_history: List[Action]) -> float:
    """Measure planning behavior."""
    score = 0.0

    # 1. Did agent write reasoning notes?
    reasoning_actions = [a for a in episode_history if a.notes]
    if reasoning_actions:
        score += 0.3

    # 2. Action coherence: Do actions follow a logical sequence?
    coherence = measure_action_coherence(episode_history)
    score += 0.4 * coherence

    # 3. Backtracking penalty: Visiting same page multiple times
    unique_pages = len(set(a.navigate_to for a in episode_history if a.navigate_to))
    total_navigations = len([a for a in episode_history if a.action_type == "NAVIGATE"])
    if total_navigations > 0:
        backtrack_ratio = 1.0 - (unique_pages / total_navigations)
        score += 0.3 * (1.0 - backtrack_ratio)  # Lower backtracking = higher score
    else:
        score += 0.3  # No navigation at all means no backtracking

    return min(score, 1.0)

def measure_action_coherence(actions: List[Action]) -> float:
    """Are actions logically connected?"""
    coherence_patterns = [
        # Good patterns
        ("SEARCH_PAGE", "EXTRACT_FIELD"),  # Search then extract
        ("NAVIGATE", "EXTRACT_FIELD"),     # Navigate then extract
        ("EXTRACT_FIELD", "VERIFY_FACT"),  # Extract then verify
        ("SEARCH_ENGINE", "NAVIGATE"),     # Search then visit
    ]

    coherent_pairs = 0
    total_pairs = len(actions) - 1

    for i in range(total_pairs):
        pair = (actions[i].action_type, actions[i+1].action_type)
        if pair in coherence_patterns:
            coherent_pairs += 1

    return coherent_pairs / total_pairs if total_pairs > 0 else 0.0
```

**Example:**

```python
# Good planning:
actions = [
    Action(type="SEARCH_PAGE", notes="Looking for price pattern"),
    Action(type="EXTRACT_FIELD", target="price"),
    Action(type="VERIFY_FACT", field="price")
]
planning_score = 0.3 (notes) + 0.4*1.0 (fully coherent) + 0.3 (no backtracking) = 1.0

# Poor planning:
actions = [
    Action(type="NAVIGATE", navigate_to="/page1"),
    Action(type="NAVIGATE", navigate_to="/page2"),
    Action(type="NAVIGATE", navigate_to="/page1"),  # Backtrack!
    Action(type="EXTRACT_FIELD")
]
planning_score = 0.0 (no notes) + 0.4*0.33 (weak coherence) + 0.3*0.67 (backtracking) = 0.33
```

---

## Recovery Ability

### 4. Recovery Ability Score (w4 = 0.08)

**Purpose:** Reward agents that recover from failures.

**Signals:**

- Action failed → Agent tried alternative approach
- Extraction returned empty → Agent searched with different selector
- Page blocked → Agent switched proxy/VPN

**Calculation:**

```python
def recovery_ability_score(episode_history: List[Tuple[Action, Reward]]) -> float:
    """Measure ability to recover from failures."""
    recoveries = 0
    failures = 0

    for i in range(len(episode_history) - 1):
        action, reward = episode_history[i]
        next_action, next_reward = episode_history[i + 1]

        # Detect failure (negative reward or empty result)
        if reward.value < 0 or "failed" in reward.message.lower():
            failures += 1

            # Check if next action was a recovery attempt
            if is_recovery_action(action, next_action):
                if next_reward.value > reward.value:  # Recovery succeeded
                    recoveries += 1

    return recoveries / failures if failures > 0 else 0.0

def is_recovery_action(failed_action: Action, next_action: Action) -> bool:
    """Is next_action a recovery attempt for failed_action?"""
    # Same action type with different parameters
    if failed_action.action_type == next_action.action_type:
        if failed_action.selector != next_action.selector:
            return True  # Tried different selector

    # Switched to alternative action type
    recovery_alternatives = {
        "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
        "NAVIGATE": ["FETCH_URL"],      # Try direct fetch if navigate blocked
        "SEARCH_ENGINE": ["NAVIGATE"],  # Try direct URL if search fails
    }

    if next_action.action_type in recovery_alternatives.get(failed_action.action_type, []):
        return True

    return False
```

**Example:**

```python
# Good recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1, message="Not found")),
    (Action(type="SEARCH_PAGE", query="price"), Reward(value=0.2, message="Found price pattern")),
    (Action(type="EXTRACT_FIELD", selector="span.product-price"), Reward(value=0.5, message="Extracted"))
]
recovery_score = 1/1 = 1.0  # 1 failure, 1 successful recovery

# No recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),  # Repeated same failed action!
    (Action(type="SUBMIT"), Reward(value=0.0))
]
recovery_score = 0/2 = 0.0  # 2 failures, 0 recoveries
```

---

## Exploration Bonus

### 5. Exploration Bonus (w5 = 0.05)

**Purpose:** Encourage discovering new pages and patterns early in training.

**Calculation:**

```python
import math

def exploration_bonus(
    pages_visited: List[str],
    known_pages: Set[str],  # From long-term memory
    episode_number: int
) -> float:
    """Bonus for discovering new pages/patterns."""
    new_pages = set(pages_visited) - known_pages

    # Bonus decreases over time (we want agent to eventually exploit)
    decay_factor = math.exp(-0.01 * episode_number)

    # Bonus per new page discovered
    bonus_per_page = 0.1

    return min(len(new_pages) * bonus_per_page * decay_factor, 1.0)
```

**Example:**

```python
# Episode 10: Agent discovers 3 new pages
exploration_bonus = 3 * 0.1 * exp(-0.01*10) = 0.3 * 0.90 = 0.27

# Episode 500: Same discovery
exploration_bonus = 3 * 0.1 * exp(-0.01*500) = 0.3 * 0.007 = 0.002  # Minimal bonus now
```

---

## Redundancy Penalty

### 6. Redundancy Penalty (penalty, not bonus)

**Purpose:** Penalize visiting the same page repeatedly without progress.

**Calculation:**

```python
from collections import Counter

def redundancy_penalty(pages_visited: List[str]) -> float:
    """Penalty for revisiting pages."""
    visit_counts = Counter(pages_visited)

    penalty = 0.0
    for page, count in visit_counts.items():
        if count > 1:
            # Exponential penalty for repeat visits
            penalty += 0.05 * (count - 1) ** 1.5

    return min(penalty, 1.0)
```

**Example:**

```python
pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
# page1 visited 3 times
redundancy_penalty = 0.05 * (3-1)**1.5 = 0.05 * 2.83 = 0.14
```

---

## Generalization Score

### 7. Generalization Score (w8 = 0.07)

**Purpose:** Reward strategies that work across different page layouts.

**Measurement:** After training, evaluate agent on unseen task variations.

**Calculation:**

```python
import numpy as np

def generalization_score(
    agent: Agent,
    test_tasks: List[Task],
    training_tasks: List[Task]
) -> float:
    """Test agent on unseen variations of trained tasks."""
    test_results = []

    for task in test_tasks:
        # Ensure task is not in training set
        if task.id in [t.id for t in training_tasks]:
            continue

        result = agent.run(task)
        test_results.append(result.completion_score)

    # Average performance on unseen tasks
    return np.mean(test_results) if test_results else 0.0
```

---

## Tool Usage Efficiency

### 8. Tool Usage (w6 = 0.05)

**Purpose:** Reward using the right tools at the right time.

**Calculation:**

```python
def tool_usage_score(actions: List[Action]) -> float:
    """Reward appropriate tool usage."""
    score = 0.0

    # 1. Used memory appropriately
    memory_actions = [a for a in actions if a.action_type in ["READ_MEMORY", "WRITE_MEMORY"]]
    if memory_actions:
        score += 0.3

    # 2. Used MCP tools when appropriate
    mcp_actions = [a for a in actions if a.action_type == "MCP_TOOL_CALL"]
    if mcp_actions:
        score += 0.3

    # 3. Verified important extractions
    verify_actions = [a for a in actions if a.action_type == "VERIFY_FACT"]
    extract_actions = [a for a in actions if a.action_type == "EXTRACT_FIELD"]
    if verify_actions and extract_actions:
        verification_ratio = len(verify_actions) / len(extract_actions)
        score += 0.4 * min(verification_ratio, 1.0)

    return min(score, 1.0)
```

---

## Memory Utilization

### 9. Memory Usage (w7 = 0.05)

**Purpose:** Reward effective use of the memory system.

**Calculation:**

```python
def memory_usage_score(episode: Episode) -> float:
    """Reward effective memory usage."""
    score = 0.0

    # 1. Did agent query long-term memory for similar patterns?
    if episode.memory_queries > 0:
        score += 0.4

    # 2. Did agent write successful patterns to long-term memory?
    if episode.memory_writes > 0:
        score += 0.3

    # 3. Did memory queries lead to successful actions?
    memory_assisted_success = episode.memory_assisted_actions / episode.total_actions
    score += 0.3 * memory_assisted_success

    return min(score, 1.0)
```

---

## Final Reward Formula

### Complete Formula

```python
def calculate_reward(episode: Episode, config: RewardConfig) -> Reward:
    """Calculate comprehensive reward."""

    # Positive components
    R_completion = task_completion_score(episode.extracted, episode.ground_truth)
    R_efficiency = efficiency_score(episode.steps, episode.max_steps, len(episode.pages), episode.task)
    R_planning = planning_quality_score(episode.actions)
    R_recovery = recovery_ability_score(episode.history)
    R_exploration = exploration_bonus(episode.pages, episode.memory.known_pages, episode.number)
    R_tools = tool_usage_score(episode.actions)
    R_memory = memory_usage_score(episode)
    R_generalization = generalization_score(episode.agent, episode.test_tasks, episode.training_tasks)

    # Penalties
    P_redundancy = redundancy_penalty(episode.pages)
    P_timeout = 1.0 if episode.timed_out else 0.0
    P_invalid = sum(1 for a in episode.actions if not a.valid) * 0.1

    # Weighted sum
    w = config.weights
    reward_value = (
        w.completion * R_completion +
        w.efficiency * R_efficiency +
        w.planning * R_planning +
        w.recovery * R_recovery +
        w.exploration * R_exploration +
        w.tools * R_tools +
        w.memory * R_memory +
        w.generalization * R_generalization
    ) - (P_redundancy + P_timeout + P_invalid)

    # Clamp to [-1, 1]
    reward_value = max(-1.0, min(1.0, reward_value))

    # Build breakdown for interpretability
    breakdown = {
        "task_completion": R_completion,
        "efficiency": R_efficiency,
        "planning_quality": R_planning,
        "recovery_ability": R_recovery,
        "exploration_bonus": R_exploration,
        "tool_usage": R_tools,
        "memory_usage": R_memory,
        "generalization": R_generalization,
        "redundancy_penalty": -P_redundancy,
        "timeout_penalty": -P_timeout,
        "invalid_action_penalty": -P_invalid
    }

    # Generate explanation
    message = generate_reward_explanation(breakdown, reward_value)

    return Reward(
        value=reward_value,
        cumulative=episode.cumulative_reward + reward_value,
        breakdown=breakdown,
        message=message
    )
```
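`generate_reward_explanation` is referenced above but not defined here. A minimal sketch; the thresholds, wording, and helper name are illustrative:

```python
def generate_reward_explanation(breakdown: dict, total: float) -> str:
    """Turn the component breakdown into a short human-readable summary."""
    lines = []
    for component, value in breakdown.items():
        label = component.replace("_", " ")
        if value >= 0.7:
            lines.append(f"✓ Strong {label} ({value:.2f})")
        elif value < 0:
            lines.append(f"⚠ {label} ({value:.2f})")
    verdict = "Strong performance!" if total >= 0.6 else "Needs improvement."
    lines.append(f"→ Overall: {verdict}")
    return "\n".join(lines)
```

A production version would also mention middling components and cite the underlying counts (steps used, failures recovered) rather than only the scores.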

### Default Weights

```python
class RewardWeights(BaseModel):
    completion: float = 0.40      # Most important
    efficiency: float = 0.15      # Moderate importance
    planning: float = 0.10        # Encourages good habits
    recovery: float = 0.08        # Resilience
    exploration: float = 0.05     # Early training
    tools: float = 0.05           # Appropriate tool use
    memory: float = 0.05          # Effective memory
    generalization: float = 0.07  # Transfer learning
    # Total: 0.95, leaves room for penalties
```

---

## Configuration

### Settings

```typescript
interface RewardConfig {
  weights: RewardWeights;

  // Component toggles
  enablePlanningReward: boolean;
  enableRecoveryReward: boolean;
  enableExplorationBonus: boolean;
  enableGeneralizationTest: boolean;

  // Penalty settings
  redundancyThreshold: number;   // Penalize after N visits to same page
  timeoutPenalty: number;        // Penalty for exceeding time limit
  invalidActionPenalty: number;  // Penalty per invalid action

  // Exploration decay
  explorationDecayRate: number;  // Default: 0.01

  // Generalization
  testTaskCount: number;         // Number of unseen tasks to test on
}
```

### UI Component

```jsx
<RewardSettings>
  <Section title="Component Weights">
    <Slider label="Task Completion" value={weights.completion} min={0} max={1} step={0.05} />
    <Slider label="Efficiency" value={weights.efficiency} min={0} max={1} step={0.05} />
    <Slider label="Planning Quality" value={weights.planning} min={0} max={1} step={0.05} />
    <Slider label="Recovery Ability" value={weights.recovery} min={0} max={1} step={0.05} />
    <Slider label="Exploration Bonus" value={weights.exploration} min={0} max={1} step={0.05} />
    <Slider label="Tool Usage" value={weights.tools} min={0} max={1} step={0.05} />
    <Slider label="Memory Usage" value={weights.memory} min={0} max={1} step={0.05} />
    <Slider label="Generalization" value={weights.generalization} min={0} max={1} step={0.05} />

    <TotalWeight value={Object.values(weights).reduce((a,b) => a+b, 0)} max={1.0} />
  </Section>

  <Section title="Penalties">
    <NumberInput label="Redundancy Threshold (page visits)" value={redundancyThreshold} />
    <NumberInput label="Timeout Penalty" value={timeoutPenalty} min={0} max={1} step={0.1} />
    <NumberInput label="Invalid Action Penalty" value={invalidActionPenalty} min={0} max={1} step={0.1} />
  </Section>

  <Section title="Exploration">
    <NumberInput label="Decay Rate" value={explorationDecayRate} min={0} max={0.1} step={0.001} />
    <HelpText>How quickly exploration bonus decreases over episodes</HelpText>
  </Section>

  <Section title="Presets">
    <Button onClick={() => loadPreset('balanced')}>Balanced (Default)</Button>
    <Button onClick={() => loadPreset('efficiency_focused')}>Efficiency Focused</Button>
    <Button onClick={() => loadPreset('quality_focused')}>Quality Focused</Button>
    <Button onClick={() => loadPreset('exploration')}>Exploration Mode</Button>
  </Section>
</RewardSettings>
```

---

## Reward Visualization

```jsx
<RewardBreakdown>
  <BarChart>
    {Object.entries(breakdown).map(([component, value]) => (
      <Bar
        key={component}
        label={component}
        value={value}
        color={value >= 0 ? 'green' : 'red'}
      />
    ))}
  </BarChart>

  <TotalReward value={reward.value} />

  <Explanation>{reward.message}</Explanation>
</RewardBreakdown>
```

**Example Output:**

```
Reward Breakdown (Total: 0.72)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Task Completion:     ████████████████████ 0.85
Efficiency:          ████████████░░░░░░░░ 0.65
Planning Quality:    ███████████████░░░░░ 0.78
Recovery Ability:    ██████████████████░░ 0.90
Exploration:         ████░░░░░░░░░░░░░░░░ 0.20
Tool Usage:          ███████████████████░ 0.95
Memory Usage:        ████████░░░░░░░░░░░░ 0.40
Generalization:      ██████████████░░░░░░ 0.72
Redundancy Penalty:  ░░░░░░░░░░░░░░░░░░░░ -0.15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Explanation:
✓ Excellent task completion (85% of fields extracted correctly)
✓ Good efficiency (completed in 8/20 steps)
✓ Strong recovery ability (recovered from 2/2 failures)
⚠ Moderate redundancy (visited homepage 3 times)
→ Overall: Strong performance!
```

---

**Next:** See [html-processing.md](./html-processing.md) for advanced HTML handling.
docs/search-engine.md ADDED
@@ -0,0 +1,782 @@
# 🔍 Search Engine Layer

## Table of Contents
1. [Overview](#overview)
2. [Supported Search Engines](#supported-search-engines)
3. [Query Optimization](#query-optimization)
4. [Multi-Hop Search](#multi-hop-search)
5. [Source Credibility Scoring](#source-credibility-scoring)
6. [Result Ranking](#result-ranking)
7. [Caching & Deduplication](#caching--deduplication)
8. [Configuration](#configuration)

---

## Overview

The **Search Engine Layer** enables agents to search the web intelligently, optimize queries, perform multi-hop searches, and evaluate source credibility.

### Capabilities

- ✅ Multiple search engine APIs (Google, Bing, Brave, DuckDuckGo, Perplexity)
- ✅ Query optimization and rewriting
- ✅ Multi-hop search (search → refine → search again)
- ✅ Source credibility scoring
- ✅ Result ranking and filtering
- ✅ Caching and deduplication
- ✅ Cost tracking

---

## Supported Search Engines

### 1. Google Search API

**Pros:**
- Most comprehensive results
- High quality
- Advanced search operator support

**Cons:**
- Requires an API key + Custom Search Engine ID
- Costs $5 per 1,000 queries after the free tier

**Configuration:**
```python
{
    "google": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_CSE_ID",
        "region": "us",
        "safe_search": True,
        "num_results": 10
    }
}
```

**Usage:**
```python
results = search_engine.search(
    query="product reviews for Widget Pro",
    engine="google",
    num_results=10
)
```

### 2. Bing Search API

**Pros:**
- Good quality results
- Competitive pricing ($7 per 1,000 queries)
- News search included

**Cons:**
- Smaller index than Google
- Fewer advanced operators

**Configuration:**
```python
{
    "bing": {
        "api_key": "YOUR_BING_API_KEY",
        "market": "en-US",
        "safe_search": "Moderate",
        "freshness": None  # "Day", "Week", "Month"
    }
}
```

### 3. Brave Search API

**Pros:**
- Privacy-focused
- Independent index
- Good pricing ($5 per 1,000 queries)
- No tracking

**Cons:**
- Smaller index
- Newer service

**Configuration:**
```python
{
    "brave": {
        "api_key": "YOUR_BRAVE_API_KEY",
        "country": "US",
        "safe_search": "moderate",
        "freshness": None
    }
}
```

### 4. DuckDuckGo (Free, No API Key)

**Pros:**
- Completely free
- No API key required
- Privacy-focused
- Good for testing

**Cons:**
- Rate limited
- Less control over results
- Smaller result set

**Usage:**
```python
from duckduckgo_search import DDGS

results = DDGS().text(
    keywords="web scraping tools",
    max_results=10
)
```

### 5. Perplexity AI (AI-Powered Search)

**Pros:**
- Returns AI-summarized answers with citations
- Real-time web access
- Conversational queries

**Cons:**
- More expensive
- Designed for Q&A, not traditional search

**Configuration:**
```python
{
    "perplexity": {
        "api_key": "YOUR_PERPLEXITY_API_KEY",
        "model": "pplx-70b-online",
        "include_citations": True
    }
}
```

---

## Query Optimization

### Query Rewriter

```python
import re
from typing import Dict


class QueryOptimizer:
    """Optimize search queries for better results."""

    def optimize(self, query: str, context: Dict = None) -> str:
        """Optimize a search query."""
        optimized = query

        # 1. Expand abbreviations
        optimized = self.expand_abbreviations(optimized)

        # 2. Add context keywords
        if context:
            optimized = self.add_context(optimized, context)

        # 3. Remove stop words (optional)
        # optimized = self.remove_stop_words(optimized)

        # 4. Add search operators
        optimized = self.add_operators(optimized)

        return optimized

    def expand_abbreviations(self, query: str) -> str:
        """Expand common abbreviations."""
        expansions = {
            "AI": "artificial intelligence",
            "ML": "machine learning",
            "API": "application programming interface",
            "UI": "user interface",
            "UX": "user experience",
        }

        for abbr, full in expansions.items():
            # Only expand if the abbreviation stands alone
            query = re.sub(rf'\b{abbr}\b', full, query)

        return query

    def add_context(self, query: str, context: Dict) -> str:
        """Add contextual keywords."""
        if context.get('domain'):
            query = f"{query} site:{context['domain']}"

        if context.get('year'):
            query = f"{query} {context['year']}"

        if context.get('location'):
            query = f"{query} {context['location']}"

        return query

    def add_operators(self, query: str) -> str:
        """Add search operators for precision."""
        # If the query has multiple important terms, wrap them in quotes
        important_terms = self.extract_important_terms(query)

        if len(important_terms) > 1:
            # Exact-phrase search for multi-word key terms
            for term in important_terms:
                if len(term.split()) > 1:
                    query = query.replace(term, f'"{term}"')

        return query
```
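The word-boundary anchors in the abbreviation step matter: `\b` prevents partial matches inside longer tokens. A condensed, standalone illustration (function-level rather than the class method above):

```python
import re

EXPANSIONS = {"AI": "artificial intelligence", "ML": "machine learning"}


def expand_abbreviations(query: str) -> str:
    """Expand standalone abbreviations using word boundaries."""
    for abbr, full in EXPANSIONS.items():
        # \b ensures "AI" matches, but the "AI" inside "MAINTENANCE" does not
        query = re.sub(rf'\b{abbr}\b', full, query)
    return query


print(expand_abbreviations("AI tools for MAINTENANCE"))
# → artificial intelligence tools for MAINTENANCE
```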

### Query Expansion

```python
from typing import List


class QueryExpander:
    """Expand queries with synonyms and related terms."""

    def expand(self, query: str) -> List[str]:
        """Generate query variations."""
        variations = [query]

        # 1. Synonym replacement
        synonyms = self.get_synonyms(query)
        for synonym_set in synonyms:
            for term, synonym in synonym_set:
                varied = query.replace(term, synonym)
                variations.append(varied)

        # 2. Add modifiers
        modifiers = ["best", "top", "review", "comparison", "guide"]
        for modifier in modifiers:
            variations.append(f"{modifier} {query}")

        # 3. Question forms
        variations.extend([
            f"what is {query}",
            f"how to {query}",
            f"why {query}"
        ])

        return variations[:5]  # Limit to top 5
```

### Bad Query Detection

```python
import re


def is_bad_query(query: str) -> bool:
    """Detect poorly formed queries."""
    # Too short
    if len(query.split()) < 2:
        return True

    # All stop words
    stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be'}
    words = set(query.lower().split())
    if words.issubset(stop_words):
        return True

    # No meaningful content
    if not re.search(r'[a-zA-Z]{3,}', query):
        return True

    return False
```
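These heuristics catch three failure modes: single tokens, stop-word-only strings, and symbol noise. A condensed, self-contained version showing each case (logic as above, collapsed into one expression):

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be'}


def is_bad_query(query: str) -> bool:
    """True if the query is unlikely to return useful results."""
    words = query.lower().split()
    return (
        len(words) < 2                            # too short
        or set(words).issubset(STOP_WORDS)        # only stop words
        or not re.search(r'[a-zA-Z]{3,}', query)  # no real content
    )


for q in ("scraping", "the a is", "best Python scrapers", "?? !!"):
    print(q, "->", is_bad_query(q))
```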

---

## Multi-Hop Search

### Multi-Hop Strategy

```python
from typing import List


class MultiHopSearch:
    """Perform multi-hop search with refinement."""

    async def search_multi_hop(
        self,
        initial_query: str,
        max_hops: int = 3
    ) -> MultiHopResult:
        """Perform a multi-hop search."""
        results_by_hop = []
        current_query = initial_query

        for hop in range(max_hops):
            # Execute search
            results = await self.search(current_query)
            results_by_hop.append(results)

            # Analyze results
            analysis = self.analyze_results(results)

            # Check if we found what we need
            if analysis.is_satisfactory:
                break

            # Refine the query for the next hop
            current_query = self.refine_query(
                current_query,
                results,
                analysis
            )

        return MultiHopResult(
            hops=results_by_hop,
            final_query=current_query,
            best_results=self.rank_all_results(results_by_hop)
        )

    def refine_query(
        self,
        original_query: str,
        results: List[SearchResult],
        analysis: ResultAnalysis
    ) -> str:
        """Refine the query based on previous results."""
        # Extract new keywords from the top results
        new_keywords = self.extract_keywords_from_results(results[:3])

        # If results were too broad, add specificity
        if analysis.too_broad:
            specific_terms = [kw for kw in new_keywords if len(kw.split()) > 1]
            if specific_terms:
                return f"{original_query} {specific_terms[0]}"

        # If results were off-topic, add negative keywords
        if analysis.off_topic_terms:
            negative = ' '.join(f"-{term}" for term in analysis.off_topic_terms)
            return f"{original_query} {negative}"

        # If there were no results, try synonyms
        if analysis.no_results:
            return self.query_expander.expand(original_query)[0]

        return original_query
```

### Example Multi-Hop Flow

```python
# Hop 1: Initial broad search
query_1 = "best web scraping tools"
results_1 = search(query_1)
# Results: General articles about scraping tools

# Hop 2: Refine to a specific use case
query_2 = "best web scraping tools for e-commerce Python"
results_2 = search(query_2)
# Results: More specific, Python-focused

# Hop 3: Add a recency constraint
query_3 = "best web scraping tools for e-commerce Python 2026"
results_3 = search(query_3)
# Results: Latest tools with recent reviews
```

---

## Source Credibility Scoring

### Credibility Scorer

```python
import re
from datetime import datetime
from typing import Optional


class SourceCredibilityScorer:
    """Score the credibility of search result sources."""

    def score(self, url: str, domain: str, result: SearchResult) -> float:
        """Calculate a credibility score (0.0 to 1.0)."""
        score = 0.5  # Base score

        # 1. Domain reputation
        score += self.domain_reputation_score(domain) * 0.3

        # 2. Domain age
        score += self.domain_age_score(domain) * 0.1

        # 3. HTTPS
        if url.startswith('https://'):
            score += 0.05

        # 4. TLD credibility
        score += self.tld_score(domain) * 0.1

        # 5. Snippet quality
        score += self.snippet_quality_score(result.snippet) * 0.15

        # 6. Backlinks (if available)
        score += self.backlink_score(domain) * 0.2

        # 7. Freshness
        score += self.freshness_score(result.date_published) * 0.1

        return min(max(score, 0.0), 1.0)

    def domain_reputation_score(self, domain: str) -> float:
        """Score based on known domain reputation."""
        # Trusted domains
        trusted = {
            'wikipedia.org': 1.0,
            'github.com': 0.95,
            'stackoverflow.com': 0.95,
            'nytimes.com': 0.9,
            'bbc.com': 0.9,
            'reuters.com': 0.9,
            'arxiv.org': 0.95,
            'nature.com': 0.95,
            'sciencedirect.com': 0.9,
        }

        # Known spammy/low-quality domains
        untrusted = {
            'contentvilla.com': 0.1,
            'ehow.com': 0.3,
        }

        if domain in trusted:
            return trusted[domain]

        if domain in untrusted:
            return untrusted[domain]

        # Medium trust for unknown domains
        return 0.5

    def tld_score(self, domain: str) -> float:
        """Score based on top-level domain."""
        tld = domain.split('.')[-1]

        tld_scores = {
            'edu': 0.9,    # Educational institutions
            'gov': 0.95,   # Government
            'org': 0.8,    # Organizations
            'com': 0.6,    # Commercial (neutral)
            'net': 0.6,
            'io': 0.6,
            'info': 0.4,   # Often spammy
            'xyz': 0.3,    # Cheap, often spam
        }

        return tld_scores.get(tld, 0.5)

    def snippet_quality_score(self, snippet: str) -> float:
        """Score snippet quality."""
        score = 0.5

        # Penalize clickbait patterns
        clickbait_patterns = [
            r'you won\'t believe',
            r'shocking',
            r'one weird trick',
            r'\d+ reasons why',
        ]

        for pattern in clickbait_patterns:
            if re.search(pattern, snippet, re.I):
                score -= 0.2

        # Reward factual language
        if re.search(r'according to|research|study|data|analysis', snippet, re.I):
            score += 0.2

        return max(0.0, score)

    def freshness_score(self, date_published: Optional[datetime]) -> float:
        """Score based on content freshness."""
        if not date_published:
            return 0.3  # Unknown date

        age_days = (datetime.now() - date_published).days

        # Decay function: fresh content scores higher
        if age_days < 30:
            return 1.0
        elif age_days < 90:
            return 0.8
        elif age_days < 365:
            return 0.6
        elif age_days < 730:
            return 0.4
        else:
            return 0.2
```

### Domain Blacklist

```python
from urllib.parse import urlparse

DOMAIN_BLACKLIST = [
    'contentvilla.com',
    'pastebin.com',       # Often scraped/duplicated content
    'scam-detector.com',
    'pinterest.com',      # Image aggregator, not original content
    # Add more as needed
]


def is_blacklisted(url: str) -> bool:
    """Check if a URL is blacklisted."""
    domain = urlparse(url).netloc
    return any(blocked in domain for blocked in DOMAIN_BLACKLIST)
```
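A quick runnable check of the substring-based blacklist, reproduced standalone with a shortened list for illustration:

```python
from urllib.parse import urlparse

DOMAIN_BLACKLIST = ['pinterest.com', 'pastebin.com']


def is_blacklisted(url: str) -> bool:
    """True if the URL's host contains a blacklisted domain."""
    domain = urlparse(url).netloc
    return any(blocked in domain for blocked in DOMAIN_BLACKLIST)


print(is_blacklisted("https://www.pinterest.com/pin/123"))  # → True
print(is_blacklisted("https://docs.python.org/3/"))         # → False
```

Note a side effect of substring matching: it also flags subdomains (`www.pinterest.com`), which is usually wanted, but it would equally flag an unrelated host like `notpinterest.com`; a suffix match (`domain == blocked or domain.endswith('.' + blocked)`) is stricter if that matters.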

---

## Result Ranking

### Ranking Algorithm

```python
from typing import Dict, List


class ResultRanker:
    """Rank search results by relevance and quality."""

    def rank(
        self,
        results: List[SearchResult],
        query: str,
        context: Dict = None
    ) -> List[RankedResult]:
        """Rank results by multiple factors."""
        ranked = []

        for result in results:
            score = self.calculate_score(result, query, context)
            ranked.append(RankedResult(
                result=result,
                score=score
            ))

        # Sort by score (highest first)
        ranked.sort(key=lambda x: x.score, reverse=True)

        return ranked

    def calculate_score(
        self,
        result: SearchResult,
        query: str,
        context: Dict
    ) -> float:
        """Calculate the ranking score."""
        score = 0.0

        # 1. Credibility (40%)
        credibility = self.credibility_scorer.score(
            result.url,
            result.domain,
            result
        )
        score += credibility * 0.4

        # 2. Relevance (35%)
        relevance = self.calculate_relevance(result, query)
        score += relevance * 0.35

        # 3. Freshness (10%)
        freshness = self.credibility_scorer.freshness_score(result.date_published)
        score += freshness * 0.1

        # 4. Engagement signals (10%)
        # (If available: click-through rate, dwell time, etc.)
        score += result.engagement_score * 0.1

        # 5. Diversity bonus (5%)
        # Prefer results from different domains
        if context and context.get('seen_domains'):
            if result.domain not in context['seen_domains']:
                score += 0.05

        return score

    def calculate_relevance(self, result: SearchResult, query: str) -> float:
        """Calculate query-result relevance."""
        # Simple keyword matching (can be enhanced with embeddings)
        query_terms = set(query.lower().split())

        # Check the title
        title_terms = set(result.title.lower().split())
        title_overlap = len(query_terms & title_terms) / len(query_terms)

        # Check the snippet
        snippet_terms = set(result.snippet.lower().split())
        snippet_overlap = len(query_terms & snippet_terms) / len(query_terms)

        # Weighted average: titles matter more than snippets
        relevance = 0.6 * title_overlap + 0.4 * snippet_overlap

        return relevance
```
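The relevance term is just the fraction of query words found in the title and snippet, weighted 60/40. A standalone sketch of that calculation (strings, not `SearchResult` objects, for illustration):

```python
def keyword_relevance(query: str, title: str, snippet: str) -> float:
    """Fraction of query terms present in title/snippet, weighted 60/40."""
    q = set(query.lower().split())
    title_overlap = len(q & set(title.lower().split())) / len(q)
    snippet_overlap = len(q & set(snippet.lower().split())) / len(q)
    return 0.6 * title_overlap + 0.4 * snippet_overlap


score = keyword_relevance(
    "python web scraping",
    "Web Scraping in Python Guide",                      # all 3 terms → 1.0
    "Learn scraping with requests and BeautifulSoup.",   # 1 of 3 terms → 1/3
)
```

Because tokenization is whitespace-only, punctuation sticks to words ("Python:" would not match "python"); embeddings or a proper tokenizer avoid that, as the comment in the ranker notes.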

---

## Caching & Deduplication

### Search Result Cache

```python
from datetime import datetime
from typing import List, Optional


class SearchCache:
    """Cache search results to reduce API calls."""

    def __init__(self, ttl_seconds: int = 3600):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, query: str, engine: str) -> Optional[List[SearchResult]]:
        """Get cached results."""
        key = self.make_key(query, engine)

        if key in self.cache:
            cached, timestamp = self.cache[key]

            # Check if the entry is still valid
            age = (datetime.now() - timestamp).total_seconds()
            if age < self.ttl:
                return cached
            else:
                # Expired, remove it
                del self.cache[key]

        return None

    def set(self, query: str, engine: str, results: List[SearchResult]):
        """Cache results."""
        key = self.make_key(query, engine)
        self.cache[key] = (results, datetime.now())

    def make_key(self, query: str, engine: str) -> str:
        """Generate a cache key."""
        normalized = query.lower().strip()
        return f"{engine}:{normalized}"
```
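The cache's core behavior is lazy expiry: entries are only checked (and evicted) on read. A minimal, runnable sketch of the same get/set logic with a generic key (the `TTLCache` name is illustrative):

```python
from datetime import datetime, timedelta


class TTLCache:
    """Minimal TTL cache mirroring SearchCache's get/set logic."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}

    def set(self, key, value):
        self.store[key] = (value, datetime.now())

    def get(self, key):
        if key in self.store:
            value, ts = self.store[key]
            if (datetime.now() - ts).total_seconds() < self.ttl:
                return value
            del self.store[key]  # expired: evict lazily on read
        return None


cache = TTLCache(ttl_seconds=3600)
cache.set("google:web scraping", ["result1", "result2"])
print(cache.get("google:web scraping"))  # → ['result1', 'result2']
print(cache.get("google:unknown"))       # → None
```

Lazy expiry keeps writes O(1); the trade-off is that dead entries linger until the next read, so a periodic sweep may be worth adding for long-running processes.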

### Result Deduplication

```python
from difflib import SequenceMatcher
from typing import List, Set
from urllib.parse import urlparse


class ResultDeduplicator:
    """Remove duplicate results across multiple searches."""

    def deduplicate(self, results: List[SearchResult]) -> List[SearchResult]:
        """Remove duplicates."""
        seen_urls = set()
        seen_titles = set()
        unique = []

        for result in results:
            # Normalize the URL (remove query params, fragments)
            normalized_url = self.normalize_url(result.url)

            # Normalize the title
            normalized_title = result.title.lower().strip()

            # Skip exact URL duplicates
            if normalized_url in seen_urls:
                continue

            # Skip near-duplicate titles
            if self.is_near_duplicate_title(normalized_title, seen_titles):
                continue

            # Add to the unique set
            unique.append(result)
            seen_urls.add(normalized_url)
            seen_titles.add(normalized_title)

        return unique

    def normalize_url(self, url: str) -> str:
        """Normalize a URL for comparison."""
        parsed = urlparse(url)
        # Remove query params and fragment
        normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
        # Remove trailing slash
        return normalized.rstrip('/')

    def is_near_duplicate_title(self, title: str, seen_titles: Set[str]) -> bool:
        """Check whether a title is a near-duplicate of any seen title."""
        for seen in seen_titles:
            similarity = SequenceMatcher(None, title, seen).ratio()
            if similarity > 0.85:  # 85% similar
                return True

        return False
```
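Both checks can be seen in isolation: URL normalization makes tracking-parameter variants compare equal, and `SequenceMatcher.ratio()` catches titles that differ only cosmetically. A standalone sketch:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Drop query string, fragment, and trailing slash."""
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}{p.path}".rstrip('/')


a = normalize_url("https://example.com/tools/?utm_source=x#top")
b = normalize_url("https://example.com/tools")
print(a == b)  # → True

ratio = SequenceMatcher(
    None,
    "10 best web scraping tools",
    "10 best web scraping tools!",
).ratio()
print(ratio > 0.85)  # → True: flagged as a near-duplicate title
```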

---

## Configuration

### Search Engine Settings

```typescript
interface SearchEngineConfig {
  default: 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

  providers: {
    google?: GoogleConfig;
    bing?: BingConfig;
    brave?: BraveConfig;
    duckduckgo?: DuckDuckGoConfig;
    perplexity?: PerplexityConfig;
  };

  // Global settings
  maxResults: number;            // Default: 10
  timeout: number;               // Seconds
  cacheResults: boolean;         // Default: true
  cacheTTL: number;              // Seconds

  // Query optimization
  optimizeQueries: boolean;      // Default: true
  expandQueries: boolean;        // Default: false

  // Multi-hop
  enableMultiHop: boolean;       // Default: false
  maxHops: number;               // Default: 3

  // Filtering
  filterByCredibility: boolean;  // Default: true
  minCredibilityScore: number;   // Default: 0.4
  blacklistedDomains: string[];

  // Cost tracking
  trackCosts: boolean;           // Default: true
  dailyQueryLimit: number;       // Default: 1000
}
```

### Usage Example

```python
# Initialize the search engine
search = SearchEngine(config)

# Simple search
results = await search.search(
    query="best Python web scraping libraries",
    engine="google",
    num_results=10
)

# Optimized search
results = await search.search_optimized(
    query="web scraping",
    context={"domain": "python.org", "year": 2026},
    optimize=True,
    filter_credibility=True
)

# Multi-hop search
multi_hop_results = await search.search_multi_hop(
    initial_query="web scraping tools",
    max_hops=3
)

# Get ranked results
ranked = search.rank_results(
    results,
    query="web scraping tools",
    context={"seen_domains": ["github.com"]}
)
```

---

**Next:** See [agents.md](./agents.md) for agent architecture.
docs/settings.md ADDED
@@ -0,0 +1,750 @@
# ⚙️ Dashboard Settings

## Table of Contents
1. [Overview](#overview)
2. [Memory Settings](#memory-settings)
3. [API & Model Settings](#api--model-settings)
4. [MCP Server Management](#mcp-server-management)
5. [Agent Behavior](#agent-behavior)
6. [Search Engine Configuration](#search-engine-configuration)
7. [Network & Proxy](#network--proxy)
8. [Cost Control](#cost-control)
9. [Performance Tuning](#performance-tuning)
10. [Import/Export](#importexport)

---

## Overview

The **Settings Dashboard** provides comprehensive configuration for all aspects of the WebScraper environment, models, MCPs, agents, and observability.

### Settings Structure

```
Settings
├── Memory
│   ├── Short-Term Memory
│   ├── Working Memory
│   ├── Long-Term Memory
│   └── Shared Memory
├── API & Models
│   ├── OpenAI
│   ├── Anthropic
│   ├── Google
│   ├── Groq
│   ├── Custom Providers
│   └── Model Routing
├── MCP Servers
│   ├── Installed Servers
│   ├── Available Servers
│   └── Custom Servers
├── Agent Behavior
│   ├── Exploration vs Exploitation
│   ├── Retry Strategy
│   ├── Planning Depth
│   └── Risk Tolerance
├── Search Engines
│   ├── Google Search
│   ├── Bing Search
│   ├── Brave Search
│   └── DuckDuckGo
├── Network & Proxy
│   ├── Proxy Pool
│   ├── VPN Configuration
│   ├── Rate Limiting
│   └── User Agent Rotation
├── Cost Control
│   ├── Daily Budget
│   ├── Model Costs
│   └── Alerts
└── Performance
    ├── Batch Processing
    ├── Parallel Execution
    ├── Caching
    └── Context Optimization
```

---

## Memory Settings

### Configuration

```typescript
interface MemorySettings {
  // Layer toggles
  enableShortTerm: boolean;        // Episode memory
  enableWorking: boolean;          // Reasoning buffer
  enableLongTerm: boolean;         // Persistent patterns
  enableShared: boolean;           // Multi-agent memory

  // Size limits
  maxEpisodeMemoryMB: number;      // Default: 10
  maxWorkingMemoryItems: number;   // Default: 50
  maxLongTermPatterns: number;     // Default: 10000

  // Vector database
  vectorDB: {
    provider: 'faiss' | 'qdrant' | 'pinecone' | 'weaviate';
    embeddingModel: string;        // Default: 'text-embedding-3-small'
    dimension: number;             // Default: 1536
    similarityMetric: 'cosine' | 'euclidean' | 'dot_product';
  };

  // Storage backend
  storage: {
    backend: 'filesystem' | 'postgresql' | 'mongodb' | 'redis';
    path: string;                  // For filesystem
    connectionString?: string;     // For databases
  };

  // Optimization
  autoPrune: boolean;              // Default: true
  pruneThreshold: number;          // Default: 0.3 (keep if score > 0.3)
  pruneIntervalHours: number;      // Default: 24
  autoSummarize: boolean;          // Default: true
  maxContextTokens: number;        // Default: 4000
}
```
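The `autoPrune`/`pruneThreshold` pair means the memory store periodically drops entries whose value score falls at or below the threshold. A minimal sketch of that rule (the `score` field name is an assumption for illustration):

```python
def prune_memories(memories, threshold=0.3):
    """Keep only memories whose value score exceeds the prune threshold."""
    # Entries scoring at or below the threshold are dropped
    return [m for m in memories if m["score"] > threshold]


kept = prune_memories([
    {"id": "m1", "score": 0.9},
    {"id": "m2", "score": 0.3},   # at the threshold → pruned
    {"id": "m3", "score": 0.1},
])
print([m["id"] for m in kept])  # → ['m1']
```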
109
+
110
+ ### UI Component
111
+
112
+ ```jsx
113
+ <MemorySettings>
114
+ <Section title="Memory Layers">
115
+ <Toggle label="Short-Term Memory (Episode)" value={enableShortTerm} />
116
+ <Toggle label="Working Memory (Reasoning)" value={enableWorking} />
117
+ <Toggle label="Long-Term Memory (Persistent)" value={enableLongTerm} />
118
+ <Toggle label="Shared Memory (Multi-Agent)" value={enableShared} />
119
+ </Section>
120
+
121
+ <Section title="Size Limits">
122
+ <NumberInput label="Episode Memory (MB)" value={maxEpisodeMemoryMB} min={1} max={100} />
123
+ <NumberInput label="Working Memory Items" value={maxWorkingMemoryItems} min={10} max={500} />
124
+ <NumberInput label="Long-Term Patterns" value={maxLongTermPatterns} min={100} max={100000} />
125
+ </Section>
126
+
127
+ <Section title="Vector Database">
128
+ <Select label="Provider" options={['FAISS', 'Qdrant', 'Pinecone', 'Weaviate']} />
129
+ <Select label="Embedding Model" options={['text-embedding-3-small', 'text-embedding-3-large', 'custom']} />
130
+ <NumberInput label="Dimension" value={dimension} disabled />
131
+ <Select label="Similarity Metric" options={['Cosine', 'Euclidean', 'Dot Product']} />
132
+ </Section>
133
+
134
+ <Section title="Auto-Optimization">
135
+ <Toggle label="Auto-Prune Low-Value Memories" value={autoPrune} />
136
+ <Slider label="Prune Threshold" value={pruneThreshold} min={0} max={1} step={0.1} />
137
+ <NumberInput label="Prune Interval (hours)" value={pruneIntervalHours} />
138
+ <Toggle label="Auto-Summarize Episodes" value={autoSummarize} />
139
+ <NumberInput label="Max Context Tokens" value={maxContextTokens} />
140
+ </Section>
141
+ </MemorySettings>
142
+ ```
143
+
144
+ ---
145
+
146
+ ## API & Model Settings
147
+
148
+ ### Multi-Provider Configuration
149
+
150
+ ```typescript
151
+ interface APISettings {
152
+ providers: {
153
+ openai?: {
154
+ apiKey: string;
155
+ organization?: string;
156
+ models: {
157
+ default: string;
158
+ reasoning: string;
159
+ fast: string;
160
+ };
161
+ temperature: number;
162
+ maxTokens: number;
163
+ };
164
+
165
+ anthropic?: {
166
+ apiKey: string;
167
+ models: {
168
+ default: string;
169
+ reasoning: string;
170
+ fast: string;
171
+ };
172
+ temperature: number;
173
+ maxTokens: number;
174
+ };
175
+
176
+ google?: {
177
+ apiKey: string;
178
+ models: {
179
+ default: string;
180
+ reasoning: string;
181
+ fast: string;
182
+ };
183
+ temperature: number;
184
+ maxOutputTokens: number;
185
+ };
186
+
187
+ groq?: {
188
+ apiKey: string;
189
+ models: {
190
+ default: string;
191
+ reasoning: string;
192
+ fast: string;
193
+ };
194
+ temperature: number;
195
+ maxTokens: number;
196
+ };
197
+
198
+ custom?: {
199
+ baseURL: string;
200
+ apiKey: string;
201
+ models: Record<string, string>;
202
+ };
203
+ };
204
+
205
+ // Smart routing
206
+ router: {
207
+ enabled: boolean;
208
+ strategy: 'task_based' | 'cost_optimized' | 'speed_optimized' | 'quality_optimized';
209
+ fallbackOrder: string[];
210
+ autoRetry: boolean;
211
+ maxRetries: number;
212
+ };
213
+
214
+ // Ensemble
215
+ ensemble: {
216
+ enabled: boolean;
217
+ strategy: 'voting' | 'ranking' | 'fusion' | 'verification';
218
+ models: string[];
219
+ minAgreement: number;
220
+ };
221
+ }
222
+ ```
223
+
224
+ ### UI Component
225
+
226
+ ```jsx
227
+ <APISettings>
228
+ <Tabs>
229
+ <Tab label="OpenAI">
230
+ <TextInput label="API Key" type="password" value={openaiKey} />
231
      <TextInput label="Organization (optional)" value={openaiOrg} />
      <Select label="Default Model" options={['gpt-4o-mini', 'gpt-4-turbo', 'gpt-4o']} />
      <Select label="Reasoning Model" options={['gpt-4-turbo', 'gpt-4o']} />
      <Select label="Fast Model" options={['gpt-4o-mini', 'gpt-3.5-turbo']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Anthropic">
      <TextInput label="API Key" type="password" value={anthropicKey} />
      <Select label="Default Model" options={['claude-3-5-sonnet', 'claude-3-opus']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Google">
      <TextInput label="API Key" type="password" value={googleKey} />
      <Select label="Default Model" options={['gemini-1.5-flash', 'gemini-1.5-pro']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Groq">
      <TextInput label="API Key" type="password" value={groqKey} />
      <Select label="Default Model" options={['llama-3.1-70b-versatile', 'llama-3.1-405b']} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>

    <Tab label="Custom">
      <TextInput label="Base URL" value={customBaseURL} placeholder="http://localhost:11434/v1" />
      <TextInput label="API Key" type="password" value={customKey} />
      <DynamicList label="Models" items={customModels} />
      <Button onClick={testConnection}>Test Connection</Button>
    </Tab>
  </Tabs>

  <Section title="Smart Model Routing">
    <Toggle label="Enable Smart Routing" value={routerEnabled} />
    <Select label="Strategy" options={['Task-Based', 'Cost Optimized', 'Speed Optimized', 'Quality Optimized']} />
    <SortableList label="Fallback Order" items={fallbackOrder} />
    <Toggle label="Auto-Retry on Failure" value={autoRetry} />
    <NumberInput label="Max Retries" value={maxRetries} min={1} max={10} />
  </Section>

  <Section title="Model Ensemble">
    <Toggle label="Enable Ensemble (⚠️ Increases Cost)" value={ensembleEnabled} />
    <Select label="Strategy" options={['Voting', 'Ranking', 'Fusion', 'Verification']} />
    <MultiSelect label="Models" options={allModels} selected={ensembleModels} />
    <Slider label="Min Agreement (%)" value={minAgreement} min={50} max={100} />
  </Section>
</APISettings>
```
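The fallback order and retry settings above imply a simple routing loop: try each model in `fallbackOrder` up to `maxRetries` times before moving on. A minimal sketch, assuming a provider client behind the hypothetical `callModel` (all names here are illustrative, not part of the spec):

```typescript
// Illustrative sketch of the smart-routing fallback loop.
// `callModel` is a stand-in for the real provider client.
type ModelCall = (model: string, prompt: string) => Promise<string>;

async function routeWithFallback(
  fallbackOrder: string[],
  maxRetries: number,
  prompt: string,
  callModel: ModelCall,
): Promise<string> {
  let lastError: unknown;
  for (const model of fallbackOrder) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await callModel(model, prompt);
      } catch (err) {
        lastError = err; // retry this model, then fall through to the next one
      }
    }
  }
  // Every model in the fallback chain failed.
  throw lastError;
}
```

For example, with two models and `maxRetries = 2`, a provider that is down is attempted twice before the router tries the next entry in the list.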

---

## MCP Server Management

### Configuration

```typescript
interface MCPSettings {
  servers: Record<string, MCPServerConfig>;

  autoDiscoverTools: boolean;
  toolTimeout: number; // Seconds
  maxConcurrentCalls: number;
  retryFailedCalls: boolean;
  cacheToolResults: boolean;
  cacheTTL: number; // Seconds
}

interface MCPServerConfig {
  command: string;
  args: string[];
  enabled: boolean;
  autoDownload: boolean;
  config: Record<string, any>;

  // Metadata
  name: string;
  description: string;
  category: string;
  installSize: string;
  status: 'installed' | 'not_installed' | 'downloading' | 'error';
}
```
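How `toolTimeout`, `cacheToolResults`, and `cacheTTL` interact can be sketched as a wrapper around the raw tool call. This is an assumption about the intended semantics, not the actual implementation; `invoke` is a placeholder for the real MCP client call:

```typescript
// Illustrative sketch: cache fresh tool results, and time out slow calls.
type Invoke = (tool: string, args: string) => Promise<string>;

interface CacheEntry { value: string; expiresAt: number; }

function makeCachedCaller(invoke: Invoke, toolTimeoutMs: number, cacheTTLMs: number) {
  const cache = new Map<string, CacheEntry>();

  return async function call(tool: string, args: string): Promise<string> {
    const key = `${tool}:${args}`;
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cached and still fresh

    // Run the tool call with a timeout; clear the timer on settle
    // so it cannot fire after a successful call.
    const value = await new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => reject(new Error(`${tool} timed out`)), toolTimeoutMs);
      invoke(tool, args).then(
        v => { clearTimeout(timer); resolve(v); },
        e => { clearTimeout(timer); reject(e); },
      );
    });

    cache.set(key, { value, expiresAt: Date.now() + cacheTTLMs });
    return value;
  };
}
```

A second call with the same tool and arguments inside the TTL window would return the cached value without re-invoking the server.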

### UI Component

```jsx
<MCPServerManagement>
  <Tabs>
    <Tab label="Installed">
      <ServerList>
        {installedServers.map(server => (
          <ServerCard key={server.name}>
            <ServerIcon category={server.category} />
            <ServerInfo>
              <Title>{server.name}</Title>
              <Description>{server.description}</Description>
              <Meta>
                <Badge>{server.category}</Badge>
                <Badge>{server.toolCount} tools</Badge>
              </Meta>
            </ServerInfo>
            <Actions>
              <Toggle value={server.enabled} onChange={toggleServer} />
              <Button onClick={() => testServer(server)}>Test</Button>
              <Button onClick={() => uninstallServer(server)}>Uninstall</Button>
            </Actions>
          </ServerCard>
        ))}
      </ServerList>
    </Tab>

    <Tab label="Available">
      <SearchInput placeholder="Search MCP servers..." />
      <FilterBar>
        <Filter label="All" />
        <Filter label="Parsing" />
        <Filter label="Browser" />
        <Filter label="Database" />
        <Filter label="Search" />
      </FilterBar>

      <ServerGallery>
        {availableServers.map(server => (
          <ServerCard key={server.name}>
            <ServerIcon category={server.category} />
            <Title>{server.name}</Title>
            <Description>{server.description}</Description>
            <InstallSize>{server.installSize}</InstallSize>
            <Button onClick={() => installServer(server)}>
              Install
            </Button>
          </ServerCard>
        ))}
      </ServerGallery>
    </Tab>

    <Tab label="Custom">
      <Form onSubmit={addCustomServer}>
        <TextInput label="Server Name" required />
        <TextInput label="Command" placeholder="python" required />
        <TextInput label="Arguments" placeholder="-m mcp_server" required />
        <JsonEditor label="Config (JSON)" />
        <Button type="submit">Add Server</Button>
      </Form>
    </Tab>
  </Tabs>

  <Section title="Global MCP Settings">
    <Toggle label="Auto-Discover Tools on Startup" value={autoDiscoverTools} />
    <NumberInput label="Tool Timeout (seconds)" value={toolTimeout} />
    <NumberInput label="Max Concurrent Calls" value={maxConcurrentCalls} />
    <Toggle label="Retry Failed Calls" value={retryFailedCalls} />
    <Toggle label="Cache Tool Results" value={cacheToolResults} />
    <NumberInput label="Cache TTL (seconds)" value={cacheTTL} />
  </Section>
</MCPServerManagement>
```

---

## Agent Behavior

### Configuration

```typescript
interface AgentBehaviorSettings {
  // Exploration vs Exploitation
  explorationRate: number; // 0.0 = exploit only, 1.0 = explore only
  explorationDecay: number; // Decay rate per episode

  // Planning
  planningDepth: number; // How many steps ahead to plan
  replanThreshold: number; // Replan if reward drops by X%

  // Retry strategy
  maxRetries: number; // Per action
  retryDelay: number; // Seconds
  adaptiveRetry: boolean; // Increase delay after each failure

  // Risk tolerance
  riskTolerance: 'conservative' | 'balanced' | 'aggressive';

  // Memory usage
  memoryIntensity: 'low' | 'medium' | 'high';

  // Learning
  learningRate: number;
  enableOnlineLearning: boolean;
  updateMemoryAfterEpisode: boolean;
}
```
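The `explorationRate` and `explorationDecay` fields describe an epsilon-greedy trade-off. One plausible reading (a sketch only; the function names are illustrative) is multiplicative decay per episode plus a random explore-or-exploit choice:

```typescript
// Illustrative epsilon-greedy sketch of explorationRate / explorationDecay.
function decayedRate(initialRate: number, decay: number, episode: number): number {
  // Multiplicative decay per episode, never below 0.
  return Math.max(0, initialRate * Math.pow(1 - decay, episode));
}

function chooseStrategy(
  rate: number,
  explore: () => string,
  exploit: () => string,
  rand: () => number = Math.random, // injectable for deterministic tests
): string {
  return rand() < rate ? explore() : exploit();
}
```

With `explorationRate = 0.5` and `explorationDecay = 0.1`, the effective rate after two episodes would be 0.5 × 0.9² = 0.405.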

### UI Component

```jsx
<AgentBehaviorSettings>
  <Section title="Exploration Strategy">
    <Slider label="Exploration Rate" value={explorationRate} min={0} max={1} step={0.1} />
    <HelpText>
      0.0 = Always use best known strategy (exploit)
      1.0 = Always try new approaches (explore)
    </HelpText>
    <Slider label="Exploration Decay" value={explorationDecay} min={0} max={0.1} step={0.01} />
  </Section>

  <Section title="Planning">
    <Slider label="Planning Depth" value={planningDepth} min={1} max={10} />
    <HelpText>How many steps ahead the agent plans (higher = slower but smarter)</HelpText>
    <NumberInput label="Replan Threshold (%)" value={replanThreshold} min={0} max={100} />
  </Section>

  <Section title="Retry Strategy">
    <NumberInput label="Max Retries" value={maxRetries} min={0} max={10} />
    <NumberInput label="Retry Delay (seconds)" value={retryDelay} min={0} max={60} />
    <Toggle label="Adaptive Retry (exponential backoff)" value={adaptiveRetry} />
  </Section>

  <Section title="Risk Tolerance">
    <RadioGroup value={riskTolerance}>
      <Radio value="conservative">
        <Label>Conservative</Label>
        <Description>Prefer proven patterns, avoid risky actions</Description>
      </Radio>
      <Radio value="balanced">
        <Label>Balanced</Label>
        <Description>Balance exploration and exploitation</Description>
      </Radio>
      <Radio value="aggressive">
        <Label>Aggressive</Label>
        <Description>Try new approaches quickly, higher failure tolerance</Description>
      </Radio>
    </RadioGroup>
  </Section>

  <Section title="Memory & Learning">
    <Select label="Memory Intensity" options={['Low', 'Medium', 'High']} />
    <Toggle label="Enable Online Learning" value={enableOnlineLearning} />
    <Toggle label="Update Memory After Each Episode" value={updateMemoryAfterEpisode} />
  </Section>
</AgentBehaviorSettings>
```

---

## Search Engine Configuration

### Configuration

```typescript
interface SearchEngineSettings {
  default: 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

  google?: {
    apiKey: string;
    searchEngineId: string;
    region: string;
    safeSearch: boolean;
  };

  bing?: {
    apiKey: string;
    market: string;
  };

  brave?: {
    apiKey: string;
    country: string;
  };

  duckduckgo?: {
    region: string;
    safeSearch: boolean;
  };

  perplexity?: {
    apiKey: string;
    model: string;
  };

  // Global settings
  maxResults: number;
  timeout: number;
  cacheResults: boolean;
  cacheTTL: number;
}
```
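Because every provider block except DuckDuckGo is optional and keyed, a resolver has to handle a default engine that was never configured. A minimal sketch of one reasonable policy (an assumption, not specified above): fall back to the keyless DuckDuckGo provider when the chosen engine has no API key.

```typescript
// Illustrative resolver over the optional provider blocks above.
type Engine = 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

function resolveEngine(settings: {
  default: Engine;
  google?: { apiKey: string };
  bing?: { apiKey: string };
  brave?: { apiKey: string };
  perplexity?: { apiKey: string };
}): Engine {
  const chosen = settings.default;
  if (chosen === 'duckduckgo') return chosen; // no key required
  const cfg = settings[chosen];
  // Configured with a key? Use it; otherwise degrade to the keyless engine.
  return cfg && cfg.apiKey ? chosen : 'duckduckgo';
}
```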

### UI Component

```jsx
<SearchEngineSettings>
  <Select label="Default Search Engine" options={['Google', 'Bing', 'Brave', 'DuckDuckGo', 'Perplexity']} />

  <Tabs>
    <Tab label="Google">
      <TextInput label="API Key" type="password" />
      <TextInput label="Search Engine ID" />
      <Select label="Region" options={regions} />
      <Toggle label="Safe Search" />
      <Button onClick={testGoogle}>Test</Button>
    </Tab>

    <Tab label="Bing">
      <TextInput label="API Key" type="password" />
      <Select label="Market" options={markets} />
      <Button onClick={testBing}>Test</Button>
    </Tab>

    <Tab label="Brave">
      <TextInput label="API Key" type="password" />
      <Select label="Country" options={countries} />
      <Button onClick={testBrave}>Test</Button>
    </Tab>

    <Tab label="DuckDuckGo">
      <Info>No API key required (free)</Info>
      <Select label="Region" options={regions} />
      <Toggle label="Safe Search" />
    </Tab>

    <Tab label="Perplexity">
      <TextInput label="API Key" type="password" />
      <Select label="Model" options={['pplx-70b-online', 'pplx-7b-online']} />
      <Button onClick={testPerplexity}>Test</Button>
    </Tab>
  </Tabs>

  <Section title="Global Settings">
    <NumberInput label="Max Results" value={maxResults} min={1} max={100} />
    <NumberInput label="Timeout (seconds)" value={timeout} min={5} max={60} />
    <Toggle label="Cache Results" value={cacheResults} />
    <NumberInput label="Cache TTL (seconds)" value={cacheTTL} />
  </Section>
</SearchEngineSettings>
```

---

## Network & Proxy

### Configuration

```typescript
interface NetworkSettings {
  proxy: {
    enabled: boolean;
    pools: ProxyPool[];
    rotationStrategy: 'round_robin' | 'random' | 'health_based';
    maxRetries: number;
  };

  vpn: {
    enabled: boolean;
    provider: string;
    server: string;
    credentials: {
      username: string;
      password: string;
    };
  };

  rateLimiting: {
    enabled: boolean;
    requestsPerSecond: number;
    burstSize: number;
  };

  userAgent: {
    rotationEnabled: boolean;
    customUserAgents: string[];
  };

  timeout: {
    connect: number;
    read: number;
  };
}
```
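The `rateLimiting` block maps naturally onto a token bucket: `requestsPerSecond` is the refill rate and `burstSize` the bucket capacity. A self-contained sketch of that reading (the class name and injectable clock are illustrative, not part of the spec):

```typescript
// Illustrative token-bucket reading of the rateLimiting settings above.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private requestsPerSecond: number,
    private burstSize: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {
    this.tokens = burstSize; // start full, allowing an initial burst
    this.lastRefill = now();
  }

  /** Returns true if a request may proceed, false if it should be throttled. */
  tryAcquire(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at burstSize.
    this.tokens = Math.min(this.burstSize, this.tokens + elapsedSec * this.requestsPerSecond);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With `requestsPerSecond = 1` and `burstSize = 2`, two back-to-back requests pass, a third is throttled, and one more token accrues per second of idle time.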

### UI

See the [Network Layer – VPN & Proxy](./WebScraper_OpenEnv_SoftwareDoc.md#9-network-layer--vpn--proxy) section of the legacy document for full details.

---

## Cost Control

### Configuration

```typescript
interface CostControlSettings {
  dailyBudget: number; // USD
  monthlyBudget: number; // USD
  alertThresholds: number[]; // [0.5, 0.8, 0.9] = 50%, 80%, 90%

  modelCosts: Record<string, { input: number; output: number }>;

  enforcements: {
    stopOnBudgetExceeded: boolean;
    downgradeToCheaperModel: boolean;
    notifyOnHighCost: boolean;
  };
}
```
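The `enforcements` flags suggest a simple decision order once spend reaches the daily budget: stop takes precedence over downgrade. A sketch of that reading (function and type names are illustrative assumptions):

```typescript
// Illustrative enforcement decision over CostControlSettings.enforcements.
type BudgetDecision = 'proceed' | 'downgrade' | 'stop';

function enforceBudget(
  spentToday: number,
  dailyBudget: number,
  enforcements: { stopOnBudgetExceeded: boolean; downgradeToCheaperModel: boolean },
): BudgetDecision {
  if (spentToday < dailyBudget) return 'proceed'; // still within budget
  if (enforcements.stopOnBudgetExceeded) return 'stop';
  if (enforcements.downgradeToCheaperModel) return 'downgrade';
  return 'proceed'; // budget exceeded but no enforcement enabled
}
```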

### UI Component

```jsx
<CostControlSettings>
  <Section title="Budgets">
    <NumberInput label="Daily Budget (USD)" value={dailyBudget} prefix="$" />
    <NumberInput label="Monthly Budget (USD)" value={monthlyBudget} prefix="$" />

    <CurrentUsage>
      <Stat>
        <Label>Today</Label>
        <Value>${todayCost.toFixed(2)} / ${dailyBudget.toFixed(2)}</Value>
        <ProgressBar value={todayCost / dailyBudget} />
      </Stat>
      <Stat>
        <Label>This Month</Label>
        <Value>${monthCost.toFixed(2)} / ${monthlyBudget.toFixed(2)}</Value>
        <ProgressBar value={monthCost / monthlyBudget} />
      </Stat>
    </CurrentUsage>
  </Section>

  <Section title="Alerts">
    <TagInput label="Alert Thresholds (%)" values={alertThresholds} />
    <HelpText>Get notified at these percentages of budget</HelpText>
  </Section>

  <Section title="Enforcement">
    <Toggle label="Stop Execution on Budget Exceeded" value={stopOnBudgetExceeded} />
    <Toggle label="Auto-Downgrade to Cheaper Model" value={downgradeToCheaperModel} />
    <Toggle label="Notify on High-Cost Requests" value={notifyOnHighCost} />
  </Section>

  <Section title="Model Costs">
    <Table>
      <thead>
        <tr>
          <th>Model</th>
          <th>Input (per 1M tokens)</th>
          <th>Output (per 1M tokens)</th>
          <th>Estimated Cost/Episode</th>
        </tr>
      </thead>
      <tbody>
        {models.map(model => (
          <tr key={model.name}>
            <td>{model.name}</td>
            <td>${model.inputCost.toFixed(2)}</td>
            <td>${model.outputCost.toFixed(2)}</td>
            <td>${model.estimatedCostPerEpisode.toFixed(4)}</td>
          </tr>
        ))}
      </tbody>
    </Table>
  </Section>
</CostControlSettings>
```

---

## Performance Tuning

### Configuration

```typescript
interface PerformanceSettings {
  batchProcessing: {
    enabled: boolean;
    batchSize: number;
    maxConcurrent: number;
  };

  parallelExecution: {
    enabled: boolean;
    maxWorkers: number;
  };

  caching: {
    enabled: boolean;
    cacheHTML: boolean;
    cacheAPIResponses: boolean;
    cacheDuration: number; // Seconds
    maxCacheSize: number; // MB
  };

  contextOptimization: {
    enabled: boolean;
    summarizeOldObservations: boolean;
    pruneThreshold: number;
    maxContextTokens: number;
  };
}
```
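The `batchProcessing.batchSize` setting implies splitting work items into fixed-size groups before dispatch. A minimal sketch of that step (the helper name is illustrative):

```typescript
// Illustrative batching helper for batchProcessing.batchSize above:
// split items into consecutive groups of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error('batchSize must be >= 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch could then be processed with up to `maxConcurrent` in-flight requests; the final batch may be smaller than `batchSize`.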

---

## Import/Export

```jsx
<ImportExportSettings>
  <Section title="Export Settings">
    <Button onClick={exportAll}>Export All Settings (JSON)</Button>
    <Button onClick={exportMemory}>Export Memory Database</Button>
    <Button onClick={exportLogs}>Export Logs</Button>
  </Section>

  <Section title="Import Settings">
    <FileUpload label="Import Settings (JSON)" accept=".json" onChange={importSettings} />
    <Button onClick={resetToDefaults}>Reset to Defaults</Button>
  </Section>
</ImportExportSettings>
```

---

**Next:** See [rewards.md](./rewards.md) for advanced reward function design.