VibecoderMcSwaggins committed on
Commit a90c302 · 1 Parent(s): 15566c9

docs: Remove outdated architecture and guide documents

- Delete several obsolete files, including `design-patterns.md`, `HF_FREE_TIER_ANALYSIS.md`, `overview.md`, `deployment.md`, and `workflow-diagrams.md`, as they no longer align with the current project structure and objectives.
- This cleanup improves the clarity and relevance of the documentation for developers and stakeholders.

docs/architecture/HF_FREE_TIER_ANALYSIS.md DELETED
@@ -1,113 +0,0 @@
# Hugging Face Free Tier Reliability Analysis (December 2025)

## Executive Summary

**Root Cause:** The recurring 500/401 errors on the Free Tier (Advanced Mode without API keys) are caused by implicit routing of large models (70B+) to unstable third-party "Inference Providers" (Novita, Hyperbolic) instead of running natively on Hugging Face's infrastructure.

**Solution:** Switch the default Free Tier model from flagship-class models (72B) to high-performance mid-sized models (7B-32B) that are hosted natively by Hugging Face's Serverless Inference API.

---

## 1. The "Inference Providers" Trap

Hugging Face offers two distinct execution paths for its Inference API:

1. **Serverless Inference API (Native):**
    * **Host:** Hugging Face's own infrastructure.
    * **Reliability:** High (direct control).
    * **Constraints:** Limited to models that fit on standard inference hardware (typically <10-30GB VRAM usage).
    * **Typical Models:** `bert-base`, `gpt2`, `Mistral-7B`, `Qwen2.5-7B`.

2. **Inference Providers (Third-Party Marketplace):**
    * **Host:** Partners like Novita, Hyperbolic, Together AI, SambaNova.
    * **Reliability:** Variable. "Staging mode" authentication issues, rate limits, and service outages (500 errors) are common on the free routing layer.
    * **Purpose:** To serve massive models (Llama-3.1-405B, Qwen2.5-72B) that are too expensive for HF to host for free.

**The Problem:**
When we request `Qwen/Qwen2.5-72B-Instruct` (or `Llama-3.1-70B`) without an API key, HF transparently routes this request to a partner (Novita/Hyperbolic).
* **Novita Status:** Currently returning 500 Internal Server Errors.
* **Hyperbolic Status:** Previously returned 401 Unauthorized (staging-mode auth bug).

We are effectively relying on a "best effort" chain of third-party providers for our core application stability.
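
As a guard against silent provider routing, here is a minimal sketch of pinning requests to HF-native hosting, assuming a recent `huggingface_hub` client (the `provider` argument and model ID are illustrative of the approach, not a verified fix):

```python
# Sketch: pin inference to HF's native Serverless API instead of the
# provider marketplace. Assumes huggingface_hub >= 0.28, where
# InferenceClient accepts a `provider` argument; names are illustrative.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="Qwen/Qwen2.5-7B-Instruct",
    provider="hf-inference",  # stay on native hosting, skip Novita/Hyperbolic
)

response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize CoQ10 evidence for fatigue."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```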
## 2. The "Golden Path" for Free Tier

To ensure stability, the Free Tier must target models that reside on the **Native** path.

**Criteria for Native Stability:**
* **Size:** < 30B parameters (ideal: 7B-12B).
* **Popularity:** "Warm" models (high traffic keeps them loaded in memory).
* **Architecture:** Standard transformers (easy for HF to serve).

**Candidate Models (Dec 2025):**

| Model | Size | Provider Risk | Native Capability |
|-------|------|---------------|-------------------|
| **Qwen/Qwen2.5-7B-Instruct** | 7B | **Low** | **Excellent** (Math: 75.5, Code: 84.8) |
| **mistralai/Mistral-Nemo-Instruct-2407** | 12B | Low | Very Good |
| **Qwen/Qwen2.5-72B-Instruct** | 72B | **High** (Novita) | Excellent (but unreliable) |
| **meta-llama/Llama-3.1-70B-Instruct** | 70B | **High** (Hyperbolic) | Excellent (but unreliable) |

## 3. Recommendation

**Immediate Fix:**
Change the default `HUGGINGFACE_MODEL` in `src/utils/config.py` from `Qwen/Qwen2.5-72B-Instruct` to **`Qwen/Qwen2.5-7B-Instruct`**, as sketched below.

**Why Qwen2.5-7B?**
* **Performance:** Outperforms Llama-3.1-8B and matches GPT-3.5 in many benchmarks.
* **Reliability:** Small enough to be hosted natively.
* **Context:** 128k context window (well suited to RAG).
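
A minimal sketch of the default change; the settings class and field name are assumptions about the layout of `src/utils/config.py`:

```python
# Sketch of the proposed default in src/utils/config.py.
# The Settings class and field name are assumed, not verified.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Old default routed to third-party providers and failed intermittently:
    # huggingface_model: str = "Qwen/Qwen2.5-72B-Instruct"
    huggingface_model: str = "Qwen/Qwen2.5-7B-Instruct"  # HF-native, stable

settings = Settings()
```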
## 4. Future Architecture (Unified Client)

For the Unified Chat Client architecture (a tier-selection sketch follows this list):
1. **Tier 0 (Free):** Hardcoded to native models (Qwen 7B, Mistral Nemo).
2. **Tier 1 (BYO Key):** Allow the user to select any model (70B+), assuming they provide a key that grants access to premium providers or the PRO tier.
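A minimal sketch of the tier split; the function and its signature are illustrative, and the model IDs mirror the table above:

```python
# Sketch: choose a model tier based on whether the user supplied an API key.
NATIVE_FREE_MODELS = ["Qwen/Qwen2.5-7B-Instruct", "mistralai/Mistral-Nemo-Instruct-2407"]

def select_model(user_api_key: str | None, requested_model: str | None) -> str:
    if user_api_key and requested_model:
        # Tier 1 (BYO key): the user's key must grant access to premium providers.
        return requested_model
    # Tier 0 (Free): never route to the provider marketplace.
    return NATIVE_FREE_MODELS[0]
```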
---

## 5. Known Content Quality Limitations (7B Models)

**Status**: As of December 2025, the Free Tier (Qwen 2.5 7B) produces **working multi-agent orchestration** but with notable content quality limitations.

### What Works Well
- Multi-agent coordination (Manager → Search → Hypothesis → Report)
- Clean streaming output (no garbage tokens, no raw JSON)
- Proper agent handoffs and progress tracking
- Coherent narrative structure

### Known Limitations

| Issue | Description | Severity |
|-------|-------------|----------|
| **Hallucinated Citations** | Model generates plausible-sounding but fake paper titles/authors instead of using actual search results | Medium |
| **Anatomical Confusion** | May apply male anatomy (e.g., "penile rigidity") to female health queries | High |
| **Nonsensical Medical Claims** | May generate claims like "prostate cancer risk" in the context of female patients | High |
| **Duplicate Content** | Final reports sometimes contain repeated sections | Low |

### Why This Happens

7B-parameter models have limited:
- **World knowledge**: Can't reliably recall specific paper titles/authors
- **Context grounding**: May ignore search results and hallucinate instead
- **Domain reasoning**: Complex medical topics exceed their reasoning capacity

### User Guidance

**Free Tier is best for:**
- Understanding the research workflow
- Getting general topic overviews
- Testing the system before committing to the paid tier

**For accurate medical research:**
- Use the Paid Tier (GPT-5) for citation accuracy
- Always verify citations against actual databases
- Treat Free Tier output as "draft quality"

### Not a Stack Bug

These are **model capability limitations**, not bugs in the DeepBoner architecture. The orchestration, streaming, and agent coordination are working correctly.

---
*Analysis performed by Gemini CLI Agent, Dec 2, 2025*
*Content quality section added Dec 3, 2025*
docs/architecture/design-patterns.md DELETED
@@ -1,1509 +0,0 @@
# Design Patterns & Technical Decisions
## Explicit Answers to Architecture Questions

---

## Purpose of This Document

This document explicitly answers all the "design pattern" questions raised in team discussions. It provides clear technical decisions with rationale.

---

## 1. Primary Architecture Pattern

### Decision: Orchestrator with Search-Judge Loop

**Pattern Name**: Iterative Research Orchestrator

**Structure**:
```
┌─────────────────────────────────────┐
│       Research Orchestrator         │
│  ┌───────────────────────────────┐  │
│  │   Search Strategy Planner     │  │
│  └───────────────────────────────┘  │
│                ↓                    │
│  ┌───────────────────────────────┐  │
│  │   Tool Coordinator            │  │
│  │   - PubMed Search             │  │
│  │   - Web Search                │  │
│  │   - Clinical Trials           │  │
│  └───────────────────────────────┘  │
│                ↓                    │
│  ┌───────────────────────────────┐  │
│  │   Evidence Aggregator         │  │
│  └───────────────────────────────┘  │
│                ↓                    │
│  ┌───────────────────────────────┐  │
│  │   Quality Judge               │  │
│  │   (LLM-based assessment)      │  │
│  └───────────────────────────────┘  │
│                ↓                    │
│        Loop or Synthesize?          │
│                ↓                    │
│  ┌───────────────────────────────┐  │
│  │   Report Generator            │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
```

**Why NOT single-agent?**
- Need coordinated multi-tool queries
- Need iterative refinement
- Need quality assessment between searches

**Why NOT pure ReAct?**
- Medical research requires structured workflow
- Need explicit quality gates
- Want deterministic tool selection

**Why THIS pattern?**
- Clear separation of concerns
- Testable components
- Easy to debug
- Proven in similar systems
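To make the diagram concrete, here is a minimal sketch of the loop. `should_continue` and `search_all_tools` are defined in Sections 4 and 6; `plan_query`, `active_tools`, `quality_judge`, and `synthesize_report` are illustrative names, not committed APIs:

```python
# Sketch of the search-judge loop from the diagram above; names are illustrative.
async def run_research(question: str, state: ResearchState) -> ResearchReport:
    while should_continue(state):                               # stopping tiers, Section 4
        query = plan_query(question, state)                     # Search Strategy Planner
        evidence = await search_all_tools(query, active_tools)  # Tool Coordinator, Section 6
        state.evidence.extend(evidence)                         # Evidence Aggregator
        state.iteration += 1
        verdict = await quality_judge.assess(question, state.evidence)  # Quality Judge
        if verdict.recommendation == "synthesize":              # Loop or Synthesize?
            break
    return await synthesize_report(question, state)             # Report Generator
```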
---

## 2. Tool Selection & Orchestration Pattern

### Decision: Static Tool Registry with Dynamic Selection

**Pattern**:
```python
class ToolRegistry:
    """Central registry of available research tools."""
    tools = {
        'pubmed': PubMedSearchTool(),
        'web': WebSearchTool(),
        'trials': ClinicalTrialsTool(),
        'drugs': DrugInfoTool(),
    }

class Orchestrator:
    def select_tools(self, question: str, iteration: int, context: ResearchState) -> List[Tool]:
        """Dynamically choose tools based on context."""
        if iteration == 0:
            # First pass: broad search
            return [ToolRegistry.tools['pubmed'], ToolRegistry.tools['web']]
        else:
            # Refinement: targeted search
            return self.judge.recommend_tools(question, context)
```

**Why NOT on-the-fly agent factories?**
- 6-day timeline (too complex)
- Tools are known upfront
- Simpler to test and debug

**Why NOT a single tool?**
- Need multiple evidence sources
- Different tools for different info types
- Better coverage

**Why THIS pattern?**
- Balances flexibility and simplicity
- Tools can be added easily
- Selection logic is transparent
110
-
111
- ## 3. Judge Pattern
112
-
113
- ### Decision: Dual-Judge System (Quality + Budget)
114
-
115
- **Pattern**:
116
- ```python
117
- class QualityJudge:
118
- """LLM-based evidence quality assessment"""
119
-
120
- def is_sufficient(self, question: str, evidence: List[Evidence]) -> bool:
121
- """Main decision: do we have enough?"""
122
- return (
123
- self.has_mechanism_explanation(evidence) and
124
- self.has_drug_candidates(evidence) and
125
- self.has_clinical_evidence(evidence) and
126
- self.confidence_score(evidence) > threshold
127
- )
128
-
129
- def identify_gaps(self, question: str, evidence: List[Evidence]) -> List[str]:
130
- """What's missing?"""
131
- gaps = []
132
- if not self.has_mechanism_explanation(evidence):
133
- gaps.append("disease mechanism")
134
- if not self.has_drug_candidates(evidence):
135
- gaps.append("potential drug candidates")
136
- if not self.has_clinical_evidence(evidence):
137
- gaps.append("clinical trial data")
138
- return gaps
139
-
140
- class BudgetJudge:
141
- """Resource constraint enforcement"""
142
-
143
- def should_stop(self, state: ResearchState) -> bool:
144
- """Hard limits"""
145
- return (
146
- state.tokens_used >= max_tokens or
147
- state.iterations >= max_iterations or
148
- state.time_elapsed >= max_time
149
- )
150
- ```
151
-
152
- **Why NOT just LLM judge?**
153
- - Cost control (prevent runaway queries)
154
- - Time bounds (hackathon demo needs to be fast)
155
- - Safety (prevent infinite loops)
156
-
157
- **Why NOT just token budget?**
158
- - Want early exit when answer is good
159
- - Quality matters, not just quantity
160
- - Better user experience
161
-
162
- **Why THIS pattern?**
163
- - Best of both worlds
164
- - Clear separation (quality vs resources)
165
- - Each judge has single responsibility
166
-
167
- ---
168
-
169
- ## 4. Break/Stopping Pattern
170
-
171
- ### Decision: Three-Tier Break Conditions
172
-
173
- **Pattern**:
174
- ```python
175
- def should_continue(state: ResearchState) -> bool:
176
- """Multi-tier stopping logic"""
177
-
178
- # Tier 1: Quality-based (ideal stop)
179
- if quality_judge.is_sufficient(state.question, state.evidence):
180
- state.stop_reason = "sufficient_evidence"
181
- return False
182
-
183
- # Tier 2: Budget-based (cost control)
184
- if state.tokens_used >= config.max_tokens:
185
- state.stop_reason = "token_budget_exceeded"
186
- return False
187
-
188
- # Tier 3: Iteration-based (safety)
189
- if state.iterations >= config.max_iterations:
190
- state.stop_reason = "max_iterations_reached"
191
- return False
192
-
193
- # Tier 4: Time-based (demo friendly)
194
- if state.time_elapsed >= config.max_time:
195
- state.stop_reason = "timeout"
196
- return False
197
-
198
- return True # Continue researching
199
- ```
200
-
201
- **Configuration**:
202
- ```toml
203
- [research.limits]
204
- max_tokens = 50000 # ~$0.50 at Claude pricing
205
- max_iterations = 5 # Reasonable depth
206
- max_time_seconds = 120 # 2 minutes for demo
207
- judge_threshold = 0.8 # Quality confidence score
208
- ```
209
-
210
- **Why multiple conditions?**
211
- - Defense in depth
212
- - Different failure modes
213
- - Graceful degradation
214
-
215
- **Why these specific limits?**
216
- - Tokens: Balances cost vs quality
217
- - Iterations: Enough for refinement, not too deep
218
- - Time: Fast enough for live demo
219
- - Judge: High bar for quality
220
-
221
- ---
---

## 5. State Management Pattern

### Decision: Pydantic State Machine with Checkpoints

**Pattern**:
```python
from pathlib import Path

class ResearchState(BaseModel):
    """Immutable state snapshots."""
    query_id: str
    question: str
    iteration: int = 0
    evidence: List[Evidence] = []
    tokens_used: int = 0
    search_history: List[SearchQuery] = []
    stop_reason: Optional[str] = None
    created_at: datetime
    updated_at: datetime

class StateManager:
    def save_checkpoint(self, state: ResearchState) -> None:
        """Save state to disk."""
        path = Path(f".deepresearch/checkpoints/{state.query_id}_iter{state.iteration}.json")
        path.write_text(state.model_dump_json(indent=2))

    def load_checkpoint(self, query_id: str, iteration: int) -> ResearchState:
        """Resume from a checkpoint."""
        path = Path(f".deepresearch/checkpoints/{query_id}_iter{iteration}.json")
        return ResearchState.model_validate_json(path.read_text())
```

**Directory Structure**:
```
.deepresearch/
├── state/
│   └── current_123.json          # Active research state
├── checkpoints/
│   ├── query_123_iter0.json      # Checkpoint after iteration 0
│   ├── query_123_iter1.json      # Checkpoint after iteration 1
│   └── query_123_iter2.json      # Checkpoint after iteration 2
└── workspace/
    └── query_123/
        ├── papers/               # Downloaded PDFs
        ├── search_results/       # Raw search results
        └── analysis/             # Intermediate analysis
```

**Why Pydantic?**
- Type safety
- Validation
- Easy serialization
- Integration with Pydantic AI

**Why checkpoints?**
- Resume interrupted research
- Debugging (inspect state at each iteration)
- Cost savings (don't re-query)
- Demo resilience
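A short usage sketch of resuming an interrupted run (the query ID and iteration number are illustrative):

```python
# Sketch: resume the search-judge loop from the last saved checkpoint.
manager = StateManager()
state = manager.load_checkpoint("query_123", iteration=2)
assert state.stop_reason is None  # only resume runs that were interrupted
# ...continue the loop from state.iteration + 1 without re-querying tools
```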
281
- ---
282
-
283
- ## 6. Tool Interface Pattern
284
-
285
- ### Decision: Async Unified Tool Protocol
286
-
287
- **Pattern**:
288
- ```python
289
- from typing import Protocol, Optional, List, Dict
290
- import asyncio
291
-
292
- class ResearchTool(Protocol):
293
- """Standard async interface all tools must implement"""
294
-
295
- async def search(
296
- self,
297
- query: str,
298
- max_results: int = 10,
299
- filters: Optional[Dict] = None
300
- ) -> List[Evidence]:
301
- """Execute search and return structured evidence"""
302
- ...
303
-
304
- def get_metadata(self) -> ToolMetadata:
305
- """Tool capabilities and requirements"""
306
- ...
307
-
308
- class PubMedSearchTool:
309
- """Concrete async implementation"""
310
-
311
- def __init__(self):
312
- self._rate_limiter = asyncio.Semaphore(3) # 3 req/sec
313
- self._cache: Dict[str, List[Evidence]] = {}
314
-
315
- async def search(self, query: str, max_results: int = 10, **kwargs) -> List[Evidence]:
316
- # Check cache first
317
- cache_key = f"{query}:{max_results}"
318
- if cache_key in self._cache:
319
- return self._cache[cache_key]
320
-
321
- async with self._rate_limiter:
322
- # 1. Query PubMed E-utilities API (async httpx)
323
- async with httpx.AsyncClient() as client:
324
- response = await client.get(
325
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
326
- params={"db": "pubmed", "term": query, "retmax": max_results}
327
- )
328
- # 2. Parse XML response
329
- # 3. Extract: title, abstract, authors, citations
330
- # 4. Convert to Evidence objects
331
- evidence_list = self._parse_response(response.text)
332
-
333
- # Cache results
334
- self._cache[cache_key] = evidence_list
335
- return evidence_list
336
-
337
- def get_metadata(self) -> ToolMetadata:
338
- return ToolMetadata(
339
- name="PubMed",
340
- description="Biomedical literature search",
341
- rate_limit="3 requests/second",
342
- requires_api_key=False
343
- )
344
- ```
345
-
346
- **Parallel Tool Execution**:
347
- ```python
348
- async def search_all_tools(query: str, tools: List[ResearchTool]) -> List[Evidence]:
349
- """Run all tool searches in parallel"""
350
- tasks = [tool.search(query) for tool in tools]
351
- results = await asyncio.gather(*tasks, return_exceptions=True)
352
-
353
- # Flatten and filter errors
354
- evidence = []
355
- for result in results:
356
- if isinstance(result, Exception):
357
- logger.warning(f"Tool failed: {result}")
358
- else:
359
- evidence.extend(result)
360
- return evidence
361
- ```
362
-
363
- **Why Async?**
364
- - Tools are I/O bound (network calls)
365
- - Parallel execution = faster searches
366
- - Better UX (streaming progress)
367
- - Standard in 2025 Python
368
-
369
- **Why Protocol?**
370
- - Loose coupling
371
- - Easy to add new tools
372
- - Testable with mocks
373
- - Clear contract
374
-
375
- **Why NOT abstract base class?**
376
- - More Pythonic (PEP 544)
377
- - Duck typing friendly
378
- - Runtime checking with isinstance
379
-
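One caveat on the last point: isinstance checks against a Protocol only work when the Protocol is decorated with `typing.runtime_checkable`, and they verify method presence only, not signatures. A minimal sketch:

```python
# Sketch: Protocol isinstance checks require the runtime_checkable decorator.
from typing import Protocol, runtime_checkable

@runtime_checkable
class ResearchTool(Protocol):
    async def search(self, query: str, max_results: int = 10) -> list: ...

# Structural check: passes if the object has a `search` method, regardless
# of its exact signature.
assert isinstance(PubMedSearchTool(), ResearchTool)
```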
---

## 7. Report Generation Pattern

### Decision: Structured Output with Citations

**Pattern**:
```python
class DrugCandidate(BaseModel):
    name: str
    mechanism: str
    evidence_quality: Literal["strong", "moderate", "weak"]
    clinical_status: str  # "FDA approved", "Phase 2", etc.
    citations: List[Citation]

class ResearchReport(BaseModel):
    query: str
    disease_mechanism: str
    candidates: List[DrugCandidate]
    methodology: str  # How we searched
    confidence: float
    sources_used: List[str]
    generated_at: datetime

    def to_markdown(self) -> str:
        """Human-readable format."""
        ...

    def to_json(self) -> str:
        """Machine-readable format."""
        ...
```

**Output Example**:
```markdown
# Research Report: Long COVID Fatigue

## Disease Mechanism
Long COVID fatigue is associated with mitochondrial dysfunction
and persistent inflammation [1, 2].

## Drug Candidates

### 1. Coenzyme Q10 (CoQ10) - STRONG EVIDENCE
- **Mechanism**: Mitochondrial support, ATP production
- **Status**: FDA approved (supplement)
- **Evidence**: 2 randomized controlled trials showing fatigue reduction
- **Citations**:
  - Smith et al. (2023) - PubMed: 12345678
  - Johnson et al. (2023) - PubMed: 87654321

### 2. Low-dose Naltrexone (LDN) - MODERATE EVIDENCE
- **Mechanism**: Anti-inflammatory, immune modulation
- **Status**: FDA approved (different indication)
- **Evidence**: 3 case studies, 1 ongoing Phase 2 trial
- **Citations**: ...

## Methodology
- Searched PubMed: 45 papers reviewed
- Searched Web: 12 sources
- Clinical trials: 8 trials identified
- Total iterations: 3
- Tokens used: 12,450

## Confidence: 85%

## Sources
- PubMed E-utilities
- ClinicalTrials.gov
- OpenFDA Database
```

**Why structured?**
- Parseable by other systems
- Consistent format
- Easy to validate
- Good for datasets

**Why markdown?**
- Human-readable
- Renders nicely in Gradio
- Easy to convert to PDF
- Standard format
---

## 8. Error Handling Pattern

### Decision: Graceful Degradation with Fallbacks

**Pattern**:
```python
class ResearchAgent:
    def research(self, question: str) -> ResearchReport:
        try:
            return self._research_with_retry(question)
        except TokenBudgetExceeded as e:
            # Return partial results from the state gathered so far
            return self._synthesize_partial(e.state)
        except ToolFailure as e:
            # Try alternate tools
            return self._research_with_fallback(question, failed_tool=e.tool)
        except Exception as e:
            # Log and return an error report
            logger.error(f"Research failed: {e}")
            return self._error_report(question, error=e)
```

**Why NOT fail fast?**
- A hackathon demo must be robust
- Partial results are better than nothing
- Good user experience

**Why NOT silent failures?**
- Need visibility for debugging
- The user should know the limitations
- Honest about confidence
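The custom exceptions above are not defined elsewhere in this document; a minimal sketch of what they would need to carry for the handlers to work:

```python
# Sketch: exceptions carry enough context for partial synthesis and fallback.
class TokenBudgetExceeded(Exception):
    def __init__(self, state: "ResearchState"):
        self.state = state  # evidence gathered before the budget ran out
        super().__init__(f"Budget exceeded after {state.iteration} iterations")

class ToolFailure(Exception):
    def __init__(self, tool: str, cause: Exception):
        self.tool = tool    # name of the failed tool, for fallback routing
        self.cause = cause
        super().__init__(f"{tool} failed: {cause}")
```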
---

## 9. Configuration Pattern

### Decision: Hydra-inspired but Simpler

**Pattern**:
```toml
# config.toml

[research]
max_iterations = 5
max_tokens = 50000
max_time_seconds = 120
judge_threshold = 0.85

[tools]
enabled = ["pubmed", "web", "trials"]

[tools.pubmed]
max_results = 20
rate_limit = 3  # per second

[tools.web]
engine = "serpapi"
max_results = 10

[llm]
provider = "anthropic"
model = "claude-3-5-sonnet-20241022"
temperature = 0.1

[output]
format = "markdown"
include_citations = true
include_methodology = true
```

**Loading**:
```python
from pathlib import Path
import tomllib

def load_config() -> dict:
    config_path = Path("config.toml")
    with open(config_path, "rb") as f:
        return tomllib.load(f)
```

**Why NOT full Hydra?**
- Simpler for a hackathon
- Easier to understand
- Faster to modify
- Can upgrade later

**Why TOML?**
- Human-readable
- Standard (PEP 680)
- Fewer edge cases than YAML
- Native in Python 3.11+ (`tomllib`)
---

## 10. Testing Pattern

### Decision: Three-Level Testing Strategy

**Pattern**:
```python
import pytest

# Level 1: Unit tests (fast, isolated)
# Note: tool.search is async (Section 6), so the test must await it.
@pytest.mark.asyncio
async def test_pubmed_tool():
    tool = PubMedSearchTool()
    results = await tool.search("aspirin cardiovascular")
    assert len(results) > 0
    assert all(isinstance(r, Evidence) for r in results)

# Level 2: Integration tests (tools + agent)
def test_research_loop():
    agent = ResearchAgent(config=test_config)
    report = agent.research("aspirin repurposing")
    assert report.candidates
    assert report.confidence > 0

# Level 3: End-to-end tests (full system)
def test_full_workflow():
    # Simulate a user query through the Gradio UI
    response = gradio_app.predict("test query")
    assert "Drug Candidates" in response
```

**Why three levels?**
- Fast feedback (unit tests)
- Confidence (integration tests)
- Reality check (e2e tests)

**Test Data**:
```
# tests/fixtures/
- mock_pubmed_response.xml
- mock_web_results.json
- sample_research_query.txt
- expected_report.md
```
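A sketch of a unit test that exercises the PubMed tool against these fixtures without touching the network; using the `respx` mock transport for httpx is one option (the decorator stack and fixture path are illustrative):

```python
# Sketch: unit-test the PubMed tool offline by mocking httpx with respx.
import httpx
import pytest
import respx

@pytest.mark.asyncio
@respx.mock
async def test_pubmed_tool_offline():
    xml = open("tests/fixtures/mock_pubmed_response.xml").read()
    respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
        return_value=httpx.Response(200, text=xml)
    )
    results = await PubMedSearchTool().search("aspirin cardiovascular")
    assert all(isinstance(r, Evidence) for r in results)
```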
---

## 11. Judge Prompt Templates

### Decision: Structured JSON Output with Domain-Specific Criteria

**Quality Judge System Prompt**:
```python
QUALITY_JUDGE_SYSTEM = """You are a medical research quality assessor specializing in drug repurposing.
Your task is to evaluate if collected evidence is sufficient to answer a drug repurposing question.

You assess evidence against four criteria specific to drug repurposing research:
1. MECHANISM: Understanding of the disease's molecular/cellular mechanisms
2. CANDIDATES: Identification of potential drug candidates with known mechanisms
3. EVIDENCE: Clinical or preclinical evidence supporting repurposing
4. SOURCES: Quality and credibility of sources (peer-reviewed > preprints > web)

You MUST respond with valid JSON only. No other text."""
```

**Quality Judge User Prompt**:
````python
QUALITY_JUDGE_USER = """
## Research Question
{question}

## Evidence Collected (Iteration {iteration} of {max_iterations})
{evidence_summary}

## Token Budget
Used: {tokens_used} / {max_tokens}

## Your Assessment

Evaluate the evidence and respond with this exact JSON structure:

```json
{{
  "assessment": {{
    "mechanism_score": <0-10>,
    "mechanism_reasoning": "<Step-by-step analysis of mechanism understanding>",
    "candidates_score": <0-10>,
    "candidates_found": ["<drug1>", "<drug2>", ...],
    "evidence_score": <0-10>,
    "evidence_reasoning": "<Critical evaluation of clinical/preclinical support>",
    "sources_score": <0-10>,
    "sources_breakdown": {{
      "peer_reviewed": <count>,
      "clinical_trials": <count>,
      "preprints": <count>,
      "other": <count>
    }}
  }},
  "overall_confidence": <0.0-1.0>,
  "sufficient": <true/false>,
  "gaps": ["<missing info 1>", "<missing info 2>"],
  "recommended_searches": ["<search query 1>", "<search query 2>"],
  "recommendation": "<continue|synthesize>"
}}
```

Decision rules:
- sufficient=true if overall_confidence >= 0.8 AND mechanism_score >= 6 AND candidates_score >= 6
- sufficient=true if remaining budget < 10% (must synthesize with what we have)
- Otherwise, provide recommended_searches to fill gaps
"""
````

**Report Synthesis Prompt**:
```python
SYNTHESIS_PROMPT = """You are a medical research synthesizer creating a drug repurposing report.

## Research Question
{question}

## Collected Evidence
{all_evidence}

## Judge Assessment
{final_assessment}

## Your Task
Create a comprehensive research report with this structure:

1. **Executive Summary** (2-3 sentences)
2. **Disease Mechanism** - What we understand about the condition
3. **Drug Candidates** - For each candidate:
   - Drug name and current FDA status
   - Proposed mechanism for this condition
   - Evidence quality (strong/moderate/weak)
   - Key citations
4. **Methodology** - How we searched (tools used, queries, iterations)
5. **Limitations** - What we couldn't find or verify
6. **Confidence Score** - Overall confidence in findings

Format as Markdown. Include PubMed IDs as citations [PMID: 12345678].
Be scientifically accurate. Do not hallucinate drug names or mechanisms.
If evidence is weak, say so clearly."""
```

**Why Structured JSON?**
- Parseable by code (not just LLM output)
- Consistent format for logging/debugging
- Can trigger specific actions (continue vs synthesize)
- Testable with expected outputs

**Why Domain-Specific Criteria?**
- Generic "is this good?" prompts fail
- Drug repurposing has specific requirements
- The physician on the team validated the criteria
- Maps to the real research workflow
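A sketch of turning the judge's reply into the `JudgeAssessment` model from the appendix. Note that the prompt nests the four scores under `"assessment"` while the model is flat, so the parser must merge them; the regex extraction is an illustrative convenience, and real replies may need more defensive handling:

```python
# Sketch: parse the judge's JSON reply into the typed model from the appendix.
import json
import re

def parse_judge_reply(raw: str) -> JudgeAssessment:
    """Extract the JSON object from the judge's reply and validate it."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("Judge reply contained no JSON object")
    data = json.loads(match.group(0))
    # The prompt nests the scores under "assessment"; the model is flat.
    # Pydantic v2 ignores the extra reasoning/breakdown keys by default.
    return JudgeAssessment(**{**data.pop("assessment", {}), **data})
```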
714
- ---
715
-
716
- ## 12. MCP Server Integration (Hackathon Track)
717
-
718
- ### Decision: Tools as MCP Servers for Reusability
719
-
720
- **Why MCP?**
721
- - Hackathon has dedicated MCP track
722
- - Makes our tools reusable by others
723
- - Standard protocol (Model Context Protocol)
724
- - Future-proof (industry adoption growing)
725
-
726
- **Architecture**:
727
- ```
728
- ┌─────────────────────────────────────────────────┐
729
- │ DeepBoner Agent │
730
- │ (uses tools directly OR via MCP) │
731
- └─────────────────────────────────────────────────┘
732
-
733
- ┌────────────┼────────────┐
734
- ↓ ↓ ↓
735
- ┌─────────────┐ ┌──────────┐ ┌───────────────┐
736
- │ PubMed MCP │ │ Web MCP │ │ Trials MCP │
737
- │ Server │ │ Server │ │ Server │
738
- └─────────────┘ └──────────┘ └───────────────┘
739
- │ │ │
740
- ↓ ↓ ↓
741
- PubMed API Brave/DDG ClinicalTrials.gov
742
- ```
743
-
744
- **PubMed MCP Server Implementation**:
745
- ```python
746
- # src/mcp_servers/pubmed_server.py
747
- from fastmcp import FastMCP
748
-
749
- mcp = FastMCP("PubMed Research Tool")
750
-
751
- @mcp.tool()
752
- async def search_pubmed(
753
- query: str,
754
- max_results: int = 10,
755
- date_range: str = "5y"
756
- ) -> dict:
757
- """
758
- Search PubMed for biomedical literature.
759
-
760
- Args:
761
- query: Search terms (supports PubMed syntax like [MeSH])
762
- max_results: Maximum papers to return (default 10, max 100)
763
- date_range: Time filter - "1y", "5y", "10y", or "all"
764
-
765
- Returns:
766
- dict with papers list containing title, abstract, authors, pmid, date
767
- """
768
- tool = PubMedSearchTool()
769
- results = await tool.search(query, max_results)
770
- return {
771
- "query": query,
772
- "count": len(results),
773
- "papers": [r.model_dump() for r in results]
774
- }
775
-
776
- @mcp.tool()
777
- async def get_paper_details(pmid: str) -> dict:
778
- """
779
- Get full details for a specific PubMed paper.
780
-
781
- Args:
782
- pmid: PubMed ID (e.g., "12345678")
783
-
784
- Returns:
785
- Full paper metadata including abstract, MeSH terms, references
786
- """
787
- tool = PubMedSearchTool()
788
- return await tool.get_details(pmid)
789
-
790
- if __name__ == "__main__":
791
- mcp.run()
792
- ```
793
-
794
- **Running the MCP Server**:
795
- ```bash
796
- # Start the server
797
- python -m src.mcp_servers.pubmed_server
798
-
799
- # Or with uvx (recommended)
800
- uvx fastmcp run src/mcp_servers/pubmed_server.py
801
-
802
- # Note: fastmcp uses stdio transport by default, which is perfect
803
- # for local integration with Claude Desktop or the main agent.
804
- ```
805
-
806
- **Claude Desktop Integration** (for demo):
807
- ```json
808
- // ~/Library/Application Support/Claude/claude_desktop_config.json
809
- {
810
- "mcpServers": {
811
- "pubmed": {
812
- "command": "python",
813
- "args": ["-m", "src.mcp_servers.pubmed_server"],
814
- "cwd": "/path/to/deepboner"
815
- }
816
- }
817
- }
818
- ```
819
-
820
- **Why FastMCP?**
821
- - Simple decorator syntax
822
- - Handles protocol complexity
823
- - Good docs and examples
824
- - Works with Claude Desktop and API
825
-
826
- **MCP Track Submission Requirements**:
827
- - [ ] At least one tool as MCP server
828
- - [ ] README with setup instructions
829
- - [ ] Demo showing MCP usage
830
- - [ ] Bonus: Multiple tools as MCP servers
831
-
---

## 13. Gradio UI Pattern (Hackathon Track)

### Decision: Streaming Progress with Modern UI

**Pattern**:
```python
import gradio as gr
from typing import AsyncGenerator

async def research_with_streaming(question: str) -> AsyncGenerator[str, None]:
    """Stream research progress to the UI.

    Must be an async generator (not a sync def): it iterates an async
    stream with `async for`, and Gradio supports async generators natively.
    """
    yield "🔍 Starting research...\n\n"

    agent = ResearchAgent()

    async for event in agent.research_stream(question):
        match event.type:
            case "search_start":
                yield f"📚 Searching {event.tool}...\n"
            case "search_complete":
                yield f"✅ Found {event.count} results from {event.tool}\n"
            case "judge_thinking":
                yield f"🤔 Evaluating evidence quality...\n"
            case "judge_decision":
                yield f"📊 Confidence: {event.confidence:.0%}\n"
            case "iteration_complete":
                yield f"🔄 Iteration {event.iteration} complete\n\n"
            case "synthesis_start":
                yield f"📝 Generating report...\n"
            case "complete":
                yield f"\n---\n\n{event.report}"

# Gradio 5 UI
with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# 🔬 DeepBoner: Drug Repurposing Research Agent")
    gr.Markdown("Ask a question about potential drug repurposing opportunities.")

    with gr.Row():
        with gr.Column(scale=2):
            question = gr.Textbox(
                label="Research Question",
                placeholder="What existing drugs might help treat long COVID fatigue?",
                lines=2
            )
            examples = gr.Examples(
                examples=[
                    "What existing drugs might help treat long COVID fatigue?",
                    "Find existing drugs that might slow Alzheimer's progression",
                    "Which diabetes drugs show promise for cancer treatment?"
                ],
                inputs=question
            )
            submit = gr.Button("🚀 Start Research", variant="primary")

        with gr.Column(scale=3):
            output = gr.Markdown(label="Research Progress & Report")

    submit.click(
        fn=research_with_streaming,
        inputs=question,
        outputs=output,
    )

demo.launch()
```

**Why Streaming?**
- The user sees progress, not a loading spinner
- Builds trust (the system is visibly working)
- Better UX for long operations
- Gradio 5 native support

**Why gr.Markdown Output?**
- Research reports are markdown
- Renders citations nicely
- Code blocks for methodology
- Tables for drug comparisons
---

## Summary: Design Decision Table

| # | Question | Decision | Why |
|---|----------|----------|-----|
| 1 | **Architecture** | Orchestrator with search-judge loop | Clear, testable, proven |
| 2 | **Tools** | Static registry, dynamic selection | Balance flexibility vs simplicity |
| 3 | **Judge** | Dual (quality + budget) | Quality + cost control |
| 4 | **Stopping** | Four-tier conditions | Defense in depth |
| 5 | **State** | Pydantic + checkpoints | Type-safe, resumable |
| 6 | **Tool Interface** | Async Protocol + parallel execution | Fast I/O, modern Python |
| 7 | **Output** | Structured + Markdown | Human & machine readable |
| 8 | **Errors** | Graceful degradation + fallbacks | Robust for demo |
| 9 | **Config** | TOML (Hydra-inspired) | Simple, standard |
| 10 | **Testing** | Three levels | Fast feedback + confidence |
| 11 | **Judge Prompts** | Structured JSON + domain criteria | Parseable, medical-specific |
| 12 | **MCP** | Tools as MCP servers | Hackathon track, reusability |
| 13 | **UI** | Gradio 5 streaming | Progress visibility, modern UX |

---

## Answers to Specific Questions

### "What's the orchestrator pattern?"
**Answer**: See Section 1 - Iterative Research Orchestrator with a search-judge loop

### "LLM-as-judge or token budget?"
**Answer**: Both - see Section 3 (Dual-Judge System) and Section 4 (Four-Tier Break Conditions)

### "What's the break pattern?"
**Answer**: See Section 4 - four stopping conditions: quality threshold, token budget, max iterations, and a time limit

### "Should we use agent factories?"
**Answer**: No - see Section 2. A static tool registry is simpler for a 6-day timeline

### "How do we handle state?"
**Answer**: See Section 5 - Pydantic state machine with checkpoints
---

## Appendix: Complete Data Models

```python
# src/deepresearch/models.py
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from datetime import datetime

class Citation(BaseModel):
    """Reference to a source."""
    source_type: Literal["pubmed", "web", "trial", "fda"]
    identifier: str  # PMID, URL, NCT number, etc.
    title: str
    authors: Optional[List[str]] = None
    date: Optional[str] = None
    url: Optional[str] = None

class Evidence(BaseModel):
    """Single piece of evidence from a search."""
    content: str
    source: Citation
    relevance_score: float = Field(ge=0, le=1)
    evidence_type: Literal["mechanism", "candidate", "clinical", "safety"]

class DrugCandidate(BaseModel):
    """Potential drug for repurposing."""
    name: str
    generic_name: Optional[str] = None
    mechanism: str
    current_indications: List[str]
    proposed_mechanism: str
    evidence_quality: Literal["strong", "moderate", "weak"]
    fda_status: str
    citations: List[Citation]

class JudgeAssessment(BaseModel):
    """Output from the quality judge."""
    mechanism_score: int = Field(ge=0, le=10)
    candidates_score: int = Field(ge=0, le=10)
    evidence_score: int = Field(ge=0, le=10)
    sources_score: int = Field(ge=0, le=10)
    overall_confidence: float = Field(ge=0, le=1)
    sufficient: bool
    gaps: List[str]
    recommended_searches: List[str]
    recommendation: Literal["continue", "synthesize"]

class ResearchState(BaseModel):
    """Complete state of a research session."""
    query_id: str
    question: str
    iteration: int = 0
    evidence: List[Evidence] = []
    assessments: List[JudgeAssessment] = []
    tokens_used: int = 0
    search_history: List[str] = []
    stop_reason: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)
    updated_at: datetime = Field(default_factory=datetime.utcnow)

class ResearchReport(BaseModel):
    """Final output report."""
    query: str
    executive_summary: str
    disease_mechanism: str
    candidates: List[DrugCandidate]
    methodology: str
    limitations: str
    confidence: float
    sources_used: int
    tokens_used: int
    iterations: int
    generated_at: datetime = Field(default_factory=datetime.utcnow)

    def to_markdown(self) -> str:
        """Render as markdown for Gradio."""
        md = f"# Research Report: {self.query}\n\n"
        md += f"## Executive Summary\n{self.executive_summary}\n\n"
        md += f"## Disease Mechanism\n{self.disease_mechanism}\n\n"
        md += "## Drug Candidates\n\n"
        for i, drug in enumerate(self.candidates, 1):
            md += f"### {i}. {drug.name} - {drug.evidence_quality.upper()} EVIDENCE\n"
            md += f"- **Mechanism**: {drug.proposed_mechanism}\n"
            md += f"- **FDA Status**: {drug.fda_status}\n"
            md += f"- **Current Uses**: {', '.join(drug.current_indications)}\n"
            md += f"- **Citations**: {len(drug.citations)} sources\n\n"
        md += f"## Methodology\n{self.methodology}\n\n"
        md += f"## Limitations\n{self.limitations}\n\n"
        md += f"## Confidence: {self.confidence:.0%}\n"
        return md
```

---

## 14. Alternative Frameworks Considered

We researched the major agent frameworks before settling on our stack. Here's why we chose what we chose, and what we'd steal if we're shipping like animals and have time for Gucci upgrades.

### Frameworks Evaluated

| Framework | Repo | What It Does |
|-----------|------|--------------|
| **Microsoft AutoGen** | [github.com/microsoft/autogen](https://github.com/microsoft/autogen) | Multi-agent orchestration, complex workflows |
| **Claude Agent SDK** | [github.com/anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) | Anthropic's official agent framework |
| **Pydantic AI** | [github.com/pydantic/pydantic-ai](https://github.com/pydantic/pydantic-ai) | Type-safe agents, structured outputs |

### Why NOT AutoGen (Microsoft)?

**Pros:**
- Battle-tested multi-agent orchestration
- `reflect_on_tool_use` - the model reviews its own tool results
- `max_tool_iterations` - built-in iteration limits
- Concurrent tool execution
- Rich ecosystem (AutoGen Studio, benchmarks)

**Cons for MVP:**
- Heavy dependency tree (50+ packages)
- Complex configuration (YAML + Python)
- Overkill for a single-agent search-judge loop
- Learning curve eats into the 6-day timeline

**Verdict:** Great for multi-agent systems. Overkill for our MVP.

### Why NOT Claude Agent SDK (Anthropic)?

**Pros:**
- Official Anthropic framework
- Clean `@tool` decorator pattern
- In-process MCP servers (no subprocess)
- Hooks for pre/post tool execution
- Direct Claude Code integration

**Cons for MVP:**
- Requires the Claude Code CLI to be bundled
- Node.js dependency for some features
- Designed for the Claude Code ecosystem, not standalone agents
- Less flexible for custom LLM providers

**Verdict:** Would be great if we were building ON Claude Code. We're building a standalone agent.

### Why Pydantic AI + FastMCP (Our Choice)

**Pros:**
- ✅ Simple, Pythonic API
- ✅ Native async/await
- ✅ Type-safe with Pydantic
- ✅ Works with any LLM provider
- ✅ FastMCP for clean MCP servers
- ✅ Minimal dependencies
- ✅ Can ship the MVP in 6 days

**Cons:**
- Newer framework (less battle-tested)
- Smaller ecosystem
- May need to build more from scratch

**Verdict:** Right tool for the job. Ship fast, iterate later.

---

## 15. Stretch Goals: Gucci Bangers (If We're Shipping Like Animals)

If the MVP ships early and we're crushing it, here's what we'd steal from other frameworks:

### Tier 1: Quick Wins (2-4 hours each)

#### From Claude Agent SDK: `@tool` Decorator Pattern
Replace our Protocol-based tools with cleaner decorators:

```python
import json

# CURRENT (Protocol-based)
class PubMedSearchTool:
    async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
        ...

# UPGRADE (decorator-based, stolen from the Claude SDK)
from claude_agent_sdk import tool

@tool("search_pubmed", "Search PubMed for biomedical papers", {
    "query": str,
    "max_results": int
})
async def search_pubmed(args):
    results = await _do_pubmed_search(args["query"], args["max_results"])
    return {"content": [{"type": "text", "text": json.dumps(results)}]}
```

**Why it's Gucci:** Cleaner syntax, automatic schema generation, less boilerplate.

#### From AutoGen: Reflect on Tool Use
Add a reflection step where the model reviews its own tool results:

```python
# CURRENT: Judge evaluates evidence
assessment = await judge.assess(question, evidence)

# UPGRADE: Add a reflection step (stolen from AutoGen)
class ReflectiveJudge:
    async def assess_with_reflection(self, question, evidence, tool_results):
        # First pass: raw assessment
        initial = await self._assess(question, evidence)

        # Reflection: "Did I use the tools correctly?"
        reflection = await self._reflect_on_tool_use(tool_results)

        # Final: combine assessment + reflection
        return self._combine(initial, reflection)
```

**Why it's Gucci:** Catches tool misuse, improves accuracy, makes the judge more robust.

### Tier 2: Medium Lifts (4-8 hours each)

#### From AutoGen: Concurrent Tool Execution
Run multiple tools in parallel with proper error handling:

```python
# CURRENT: Parallel fan-out with plain asyncio.gather
results = await asyncio.gather(*[tool.search(query) for tool in tools])

# UPGRADE: AutoGen-style with cancellation + timeout
from autogen_core import CancellationToken

async def execute_tools_concurrent(tools, query, timeout=30):
    token = CancellationToken()

    async def run_with_timeout(tool):
        try:
            return await asyncio.wait_for(
                tool.search(query, cancellation_token=token),
                timeout=timeout
            )
        except asyncio.TimeoutError:
            token.cancel()  # Cancel the other tools
            return ToolError(f"{tool.name} timed out")

    return await asyncio.gather(*[run_with_timeout(t) for t in tools])
```

**Why it's Gucci:** Proper timeout handling, cancellation propagation, production-ready.

#### From Claude SDK: Hooks System
Add pre/post hooks for logging, validation, and cost tracking:

```python
# UPGRADE: Hook system (stolen from the Claude SDK)
class HookManager:
    async def pre_tool_use(self, tool_name, args):
        """Called before every tool execution."""
        logger.info(f"Calling {tool_name} with {args}")
        self.cost_tracker.start_timer()

    async def post_tool_use(self, tool_name, result, duration):
        """Called after every tool execution."""
        self.cost_tracker.record(tool_name, duration)
        if result.is_error:
            self.error_tracker.record(tool_name, result.error)
```

**Why it's Gucci:** Observability, debugging, cost tracking, production-ready.

### Tier 3: Big Lifts (Post-Hackathon)

#### Full AutoGen Integration
If we want multi-agent capabilities later:

```python
# POST-HACKATHON: Multi-agent drug repurposing
from autogen_agentchat import AssistantAgent, GroupChat

literature_agent = AssistantAgent(
    name="LiteratureReviewer",
    tools=[pubmed_search, web_search],
    system_message="You search and summarize medical literature."
)

mechanism_agent = AssistantAgent(
    name="MechanismAnalyzer",
    tools=[pathway_db, protein_db],
    system_message="You analyze disease mechanisms and drug targets."
)

synthesis_agent = AssistantAgent(
    name="ReportSynthesizer",
    system_message="You synthesize findings into actionable reports."
)

# Orchestrate the multi-agent workflow
group_chat = GroupChat(
    agents=[literature_agent, mechanism_agent, synthesis_agent],
    max_round=10
)
```

**Why it's Gucci:** True multi-agent collaboration, specialized roles, scalable.
---

## Priority Order for Stretch Goals

| Priority | Feature | Source | Effort | Impact |
|----------|---------|--------|--------|--------|
| 1 | `@tool` decorator | Claude SDK | 2 hrs | High - cleaner code |
| 2 | Reflect on tool use | AutoGen | 3 hrs | High - better accuracy |
| 3 | Hooks system | Claude SDK | 4 hrs | Medium - observability |
| 4 | Concurrent + cancellation | AutoGen | 4 hrs | Medium - robustness |
| 5 | Multi-agent | AutoGen | 8+ hrs | Post-hackathon |

---

## The Bottom Line

```
┌─────────────────────────────────────────────────────────────┐
│ MVP (Days 1-4): Pydantic AI + FastMCP                       │
│ - Ship a working drug repurposing agent                     │
│ - Search-judge loop with PubMed + Web                       │
│ - Gradio UI with streaming                                  │
│ - MCP server for the hackathon track                        │
├─────────────────────────────────────────────────────────────┤
│ If Crushing It (Days 5-6): Steal the Gucci                  │
│ - @tool decorators from the Claude SDK                      │
│ - Reflect on tool use from AutoGen                          │
│ - Hooks for observability                                   │
├─────────────────────────────────────────────────────────────┤
│ Post-Hackathon: Full AutoGen Integration                    │
│ - Multi-agent workflows                                     │
│ - Specialized agent roles                                   │
│ - Production-grade orchestration                            │
└─────────────────────────────────────────────────────────────┘
```

**Ship MVP first. Steal bangers if time. Scale later.**

---

## 16. Reference Implementation Resources

We've cloned production-ready repos into `reference_repos/` that we can vendor, copy from, or just USE directly. This section documents what's available and how to leverage it.

### Cloned Repositories

| Repository | Location | What It Provides |
|------------|----------|------------------|
| **pydanticai-research-agent** | `reference_repos/pydanticai-research-agent/` | Complete PydanticAI agent with Brave Search |
| **pubmed-mcp-server** | `reference_repos/pubmed-mcp-server/` | Production-grade PubMed MCP server (TypeScript) |
| **autogen-microsoft** | `reference_repos/autogen-microsoft/` | Microsoft's multi-agent framework |
| **claude-agent-sdk** | `reference_repos/claude-agent-sdk/` | Anthropic's agent SDK with the @tool decorator |

### 🔥 CHEAT CODE: A Production PubMed MCP Already Exists

The `pubmed-mcp-server` is **production-grade** and has EVERYTHING we need:

```bash
# Tools already available in pubmed-mcp-server:
pubmed_search_articles      # Search PubMed with filters, date ranges
pubmed_fetch_contents       # Get full article details by PMID
pubmed_article_connections  # Find citations, related articles
pubmed_research_agent       # Generate research plan outlines
pubmed_generate_chart       # Create PNG charts from data
```

**Option 1: Use it directly via npx**
```json
{
  "mcpServers": {
    "pubmed": {
      "command": "npx",
      "args": ["@cyanheads/pubmed-mcp-server"],
      "env": { "NCBI_API_KEY": "your_key" }
    }
  }
}
```

**Option 2: Vendor the logic into Python**
The TypeScript code in `reference_repos/pubmed-mcp-server/src/` shows exactly how to:
- Construct PubMed E-utilities queries
- Handle rate limiting (3/sec without a key, 10/sec with a key)
- Parse XML responses
- Extract article metadata

### PydanticAI Research Agent Patterns

The `pydanticai-research-agent` repo provides copy-paste patterns:

**Agent Definition** (`agents/research_agent.py`):
```python
from typing import Any, Dict, List, Optional
from dataclasses import dataclass

from pydantic_ai import Agent, RunContext

@dataclass
class ResearchAgentDependencies:
    brave_api_key: str
    session_id: Optional[str] = None

research_agent = Agent(
    get_llm_model(),
    deps_type=ResearchAgentDependencies,
    system_prompt=SYSTEM_PROMPT
)

@research_agent.tool
async def search_web(
    ctx: RunContext[ResearchAgentDependencies],
    query: str,
    max_results: int = 10
) -> List[Dict[str, Any]]:
    """Search with context access via ctx.deps."""
    results = await search_web_tool(ctx.deps.brave_api_key, query, max_results)
    return results
```

**Brave Search Tool** (`tools/brave_search.py`):
```python
import httpx

async def search_web_tool(api_key: str, query: str, count: int = 10) -> List[Dict]:
    headers = {"X-Subscription-Token": api_key, "Accept": "application/json"}
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.search.brave.com/res/v1/web/search",
            headers=headers,
            params={"q": query, "count": count},
            timeout=30.0
        )
        # Handle 429 rate-limit and 401 auth errors here
        data = response.json()
        return data.get("web", {}).get("results", [])
```

**Pydantic Models** (`models/research_models.py`):
```python
class BraveSearchResult(BaseModel):
    title: str
    url: str
    description: str
    score: float = Field(ge=0.0, le=1.0)
```

### Microsoft Agent Framework Orchestration Patterns

From [deepwiki.com/microsoft/agent-framework](https://deepwiki.com/microsoft/agent-framework/3.4-workflows-and-orchestration):

#### Sequential Orchestration
```
Agent A → Agent B → Agent C   (each receives prior outputs)
```
**Use when:** Tasks have dependencies and results inform next steps.

#### Concurrent (Fan-out/Fan-in)
```
             ┌→ Agent A ─┐
Dispatcher   ├→ Agent B ─┼→ Aggregator
             └→ Agent C ─┘
```
**Use when:** Independent tasks can run in parallel and results need consolidation.
**Our use:** Parallel PubMed + Web search.

#### Handoff Orchestration
```
Coordinator → routes to → Specialist A, B, or C based on the request
```
**Use when:** A router decides the search strategy based on query type.
**Our use:** Route "mechanism" vs "clinical trial" vs "drug info" queries (see the router sketch below).

#### HITL (Human-in-the-Loop)
```
Agent → RequestInfoEvent → Human validates → Agent continues
```
**Use when:** Critical judgment points need human validation.
**Our use:** An optional "approve drug candidates before synthesis" step.
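A minimal sketch of the handoff router for our three query types; the keyword heuristics are illustrative placeholders, and a small LLM classifier could replace them later:

```python
# Sketch: keyword-based handoff router; heuristics are illustrative placeholders.
def route_query(question: str) -> list[str]:
    """Return the tool names (keys into ToolRegistry.tools) to fan out to."""
    q = question.lower()
    if any(k in q for k in ("mechanism", "pathway", "target")):
        return ["pubmed"]                    # mechanism: literature-heavy
    if any(k in q for k in ("trial", "phase", "recruiting")):
        return ["trials", "pubmed"]          # clinical: registry first
    if any(k in q for k in ("dose", "interaction", "approved")):
        return ["drugs", "web"]              # drug info: label/FDA sources
    return ["pubmed", "web"]                 # default: broad fan-out
```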

### Recommended Hybrid Pattern for Our Agent

Based on all the research, here's our recommended implementation:

```
┌─────────────────────────────────────────────────────────┐
│ 1. ROUTER (Handoff Pattern)                             │
│    - Analyze query type                                 │
│    - Choose search strategy                             │
├─────────────────────────────────────────────────────────┤
│ 2. SEARCH (Concurrent Pattern)                          │
│    - Fan-out to PubMed + Web in parallel                │
│    - Timeout handling per AutoGen patterns              │
│    - Aggregate results                                  │
├─────────────────────────────────────────────────────────┤
│ 3. JUDGE (Sequential + Budget)                          │
│    - Quality assessment                                 │
│    - Token/iteration budget check                       │
│    - Recommend: continue or synthesize                  │
├─────────────────────────────────────────────────────────┤
│ 4. SYNTHESIZE (Final Agent)                             │
│    - Generate research report                           │
│    - Include citations                                  │
│    - Stream to Gradio UI                                │
└─────────────────────────────────────────────────────────┘
```

### Quick Start: Minimal Implementation Path

**Day 1-2: Core Loop**
1. Copy `search_web_tool` from `pydanticai-research-agent/tools/brave_search.py`
2. Implement PubMed search (reference `pubmed-mcp-server/src/` for E-utilities patterns)
3. Wire up the basic search-judge loop

**Day 3: Judge + State**
1. Implement the quality judge with JSON structured output
2. Add the budget judge
3. Add Pydantic state management

**Day 4: UI + MCP**
1. Gradio streaming UI
2. Wrap the PubMed tool as a FastMCP server

**Day 5-6: Polish + Deploy**
1. HuggingFace Spaces deployment
2. Demo video
3. Stretch goals if time allows
---

## 17. External Resources & MCP Servers

### Available PubMed MCP Servers (Community)

| Server | Author | Features | Link |
|--------|--------|----------|------|
| **pubmed-mcp-server** | cyanheads | Full E-utilities, research agent, charts | [GitHub](https://github.com/cyanheads/pubmed-mcp-server) |
| **BioMCP** | GenomOncology | PubMed + ClinicalTrials + MyVariant | [GitHub](https://github.com/genomoncology/biomcp) |
| **PubMed-MCP-Server** | JackKuo666 | Basic search, metadata access | [GitHub](https://github.com/JackKuo666/PubMed-MCP-Server) |

### Web Search Options

| Tool | Free Tier | API Key | Async Support |
|------|-----------|---------|---------------|
| **Brave Search** | 2000/month | Required | Yes (httpx) |
| **DuckDuckGo** | Unlimited | No | Yes (duckduckgo-search) |
| **SerpAPI** | None | Required | Yes |

**Recommended:** Start with DuckDuckGo (free, no key), upgrade to Brave for production.

```python
# DuckDuckGo search (no API key needed!). DDGS is synchronous, so run it
# in a worker thread to avoid blocking the event loop.
import asyncio
from typing import Dict, List

from duckduckgo_search import DDGS

def _search_ddg_sync(query: str, max_results: int = 10) -> List[Dict]:
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=max_results))
    return [{"title": r["title"], "url": r["href"], "description": r["body"]} for r in results]

async def search_ddg(query: str, max_results: int = 10) -> List[Dict]:
    return await asyncio.to_thread(_search_ddg_sync, query, max_results)
```

---

**Document Status**: Official Architecture Spec
**Review Score**: 100/100 (Ironclad Gucci Banger Edition)
**Sections**: 17 design patterns + data models appendix + reference repos + stretch goals
**Last Updated**: November 2025
 
docs/architecture/overview.md DELETED
@@ -1,474 +0,0 @@
1
- # DeepBoner: Medical Drug Repurposing Research Agent
2
- ## Project Overview
3
-
4
- ---
5
-
6
- ## Executive Summary
7
-
8
- **DeepBoner** is a deep research agent designed to accelerate medical drug repurposing research by autonomously searching, analyzing, and synthesizing evidence from multiple biomedical databases.
9
-
10
- ### The Problem We Solve
11
-
12
- Drug repurposing - finding new therapeutic uses for existing FDA-approved drugs - can take years of manual literature review. Researchers must:
13
- - Search thousands of papers across multiple databases
14
- - Identify molecular mechanisms
15
- - Find relevant clinical trials
16
- - Assess safety profiles
17
- - Synthesize evidence into actionable insights
18
-
19
- **DeepBoner automates this process from hours to minutes.**
20
-
21
- ### What Is Drug Repurposing?
22
-
23
- **Simple Explanation:**
24
- Using existing approved drugs to treat NEW diseases they weren't originally designed for.
25
-
26
- **Real Examples:**
27
- - **Viagra** (sildenafil): Originally for heart disease → Now treats erectile dysfunction
28
- - **Thalidomide**: Once banned → Now treats multiple myeloma
29
- - **Aspirin**: Pain reliever → Heart attack prevention
30
- - **Metformin**: Diabetes drug → Being tested for aging/longevity
31
-
32
- **Why It Matters:**
33
- - Faster than developing new drugs (years vs decades)
34
- - Cheaper (known safety profiles)
35
- - Lower risk (already FDA approved)
36
- - Immediate patient benefit potential
37
-
38
- ---
39
-
40
- ## Core Use Case
41
-
42
- ### Primary Query Type
43
- > "What existing drugs might help treat [disease/condition]?"
44
-
45
- ### Example Queries
46
-
47
- 1. **Long COVID Fatigue**
48
- - Query: "What existing drugs might help treat long COVID fatigue?"
49
- - Agent searches: PubMed, clinical trials, drug databases
50
- - Output: List of candidate drugs with mechanisms + evidence + citations
51
-
52
- 2. **Alzheimer's Disease**
53
- - Query: "Find existing drugs that target beta-amyloid pathways"
54
- - Agent identifies: Disease mechanisms → Drug candidates → Clinical evidence
55
- - Output: Comprehensive research report with drug candidates
56
-
57
- 3. **Rare Disease Treatment**
58
- - Query: "What drugs might help with fibrodysplasia ossificans progressiva?"
59
- - Agent finds: Similar conditions → Shared pathways → Potential treatments
60
- - Output: Evidence-based treatment suggestions
61
-
62
- ---
63
-
64
- ## System Architecture
65
-
66
- ### High-Level Design (Phases 1-8)
67
-
68
- ```text
69
- User Query
70
-
71
- Gradio UI (Phase 4)
72
-
73
- Magentic Manager (Phase 5) ← LLM-powered coordinator
74
- ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6)
75
- ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning
76
- ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment
77
- └── ReportAgent (Phase 8) ←→ Final Synthesis
78
-
79
- Structured Research Report
80
- ```
81
-
82
- ### Key Components
83
-
84
- 1. **Magentic Manager (Orchestrator)**
85
- - LLM-powered multi-agent coordinator
86
- - Dynamic planning and agent selection
87
- - Built-in stall detection and replanning
88
- - Microsoft Agent Framework integration
89
-
90
- 2. **SearchAgent (Phase 2+5+6)**
91
- - PubMed E-utilities search
92
- - DuckDuckGo web search
93
- - Semantic search via ChromaDB (Phase 6)
94
- - Evidence deduplication
95
-
96
- 3. **HypothesisAgent (Phase 7)**
97
- - Generates Drug → Target → Pathway → Effect hypotheses
98
- - Guides targeted searches
99
- - Scientific reasoning about mechanisms
100
-
101
- 4. **JudgeAgent (Phase 3+5)**
102
- - LLM-based evidence assessment
103
- - Mechanism score + Clinical score
104
- - Recommends continue/synthesize
105
- - Generates refined search queries
106
-
107
- 5. **ReportAgent (Phase 8)**
108
- - Structured scientific reports
109
- - Executive summary, methodology
110
- - Hypotheses tested with evidence counts
111
- - Proper citations and limitations
112
-
113
- 6. **Gradio UI (Phase 4)**
114
- - Chat interface for questions
115
- - Real-time progress via events
116
- - Mode toggle (Simple/Magentic)
117
- - Formatted markdown output
118
-
119
- ---
120
-
121
- ## Design Patterns
122
-
123
- ### 1. Search-and-Judge Loop (Primary Pattern)
124
-
125
- ```python
126
- def research(question: str) -> Report:
127
- context = []
128
- for iteration in range(max_iterations):
129
- # SEARCH: Query relevant tools
130
- results = search_tools(question, context)
131
- context.extend(results)
132
-
133
- # JUDGE: Evaluate quality
134
- if judge.is_sufficient(question, context):
135
- break
136
-
137
- # REFINE: Adjust search strategy
138
- query = refine_query(question, context)
139
-
140
- # SYNTHESIZE: Generate report
141
- return synthesize_report(question, context)
142
- ```
143
-
144
- **Why This Pattern:**
145
- - Simple to implement and debug
146
- - Clear loop termination conditions
147
- - Iterative improvement of search quality
148
- - Balances depth vs speed
149
-
150
- ### 2. Multi-Tool Orchestration
151
-
152
- ```
153
- Question → Agent decides which tools to use
154
-
155
- ┌───┴────┬─────────┬──────────┐
156
- ↓ ↓ ↓ ↓
157
- PubMed Web Search Trials DB Drug DB
158
- ↓ ↓ ↓ ↓
159
- └───┬────┴─────────┴──────────┘
160
-
161
- Aggregate Results → Judge
162
- ```
163
-
164
- **Why This Pattern:**
165
- - Different sources provide different evidence types
166
- - Parallel tool execution (when possible)
167
- - Comprehensive coverage
168
-
169
- ### 3. LLM-as-Judge with Token Budget
170
-
171
- **Dual Stopping Conditions:**
172
- - **Smart Stop**: LLM judge says "we have sufficient evidence"
173
- - **Hard Stop**: Token budget exhausted OR max iterations reached
174
-
175
- **Why Both:**
176
- - Judge enables early exit when answer is good
177
- - Budget prevents runaway costs
178
- - Iterations prevent infinite loops
179
-
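- A minimal sketch of the combined rule (the 50K cap matches the risk table below; the iteration default is illustrative):
-
- ```python
- # Dual stopping rule: smart stop (judge) plus hard stops (budget, iterations)
- def should_stop(judge_says_sufficient: bool, tokens_used: int, iteration: int,
-                 token_budget: int = 50_000, max_iterations: int = 10) -> bool:
-     if judge_says_sufficient:  # smart stop: evidence is good enough
-         return True
-     # hard stops: cost cap and loop cap
-     return tokens_used >= token_budget or iteration >= max_iterations
- ```
-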
180
- ### 4. Stateful Checkpointing
181
-
182
- ```
183
- .deepresearch/
184
- ├── state/
185
- │ └── query_123.json # Current research state
186
- ├── checkpoints/
187
- │ └── query_123_iter3/ # Checkpoint at iteration 3
188
- └── workspace/
189
- └── query_123/ # Downloaded papers, data
190
- ```
191
-
192
- **Why This Pattern:**
193
- - Resume interrupted research
194
- - Debugging and analysis
195
- - Cost savings (don't re-search)
196
-
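- A sketch of the state half (paths follow the layout above; function names are illustrative):
-
- ```python
- import json
- from pathlib import Path
-
- STATE_DIR = Path(".deepresearch/state")
-
- def save_state(query_id: str, state: dict) -> None:
-     # Persist the current research state so an interrupted run can resume
-     STATE_DIR.mkdir(parents=True, exist_ok=True)
-     (STATE_DIR / f"{query_id}.json").write_text(json.dumps(state, indent=2))
-
- def load_state(query_id: str) -> dict | None:
-     # Return the saved state, or None if this query has no checkpoint yet
-     path = STATE_DIR / f"{query_id}.json"
-     return json.loads(path.read_text()) if path.exists() else None
- ```
-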
197
- ---
198
-
199
- ## Component Breakdown
200
-
201
- ### Agent (Orchestrator)
202
- - **Responsibility**: Coordinate research process
203
- - **Size**: ~100 lines
204
- - **Key Methods**:
205
- - `research(question)` - Main entry point
206
- - `plan_search_strategy()` - Decide what to search
207
- - `execute_search()` - Run tool queries
208
- - `evaluate_progress()` - Call judge
209
- - `synthesize_findings()` - Generate report
210
-
211
- ### Tools
212
- - **Responsibility**: Interface with external data sources
213
- - **Size**: ~50 lines per tool
214
- - **Implementations**:
215
- - `PubMedTool` - Search biomedical literature
216
- - `WebSearchTool` - General medical information
217
- - `ClinicalTrialsTool` - Trial data (optional)
218
- - `DrugInfoTool` - FDA drug database (optional)
219
-
220
- ### Judge
221
- - **Responsibility**: Evaluate evidence quality
222
- - **Size**: ~50 lines
223
- - **Key Methods**:
224
- - `is_sufficient(question, evidence)` → bool
225
- - `assess_quality(evidence)` → score
226
- - `identify_gaps(question, evidence)` → missing_info
227
-
228
- ### Gradio App
229
- - **Responsibility**: User interface
230
- - **Size**: ~50 lines
231
- - **Features**:
232
- - Text input for questions
233
- - Progress indicators
234
- - Formatted output with citations
235
- - Download research report
236
-
237
- ---
238
-
239
- ## Technical Stack
240
-
241
- ### Core Dependencies
242
- ```toml
243
- [dependencies]
244
- python = ">=3.10"
245
- pydantic = "^2.7"
246
- pydantic-ai = "^0.0.16"
247
- fastmcp = "^0.1.0"
248
- gradio = "^5.0"
249
- beautifulsoup4 = "^4.12"
250
- httpx = "^0.27"
251
- ```
252
-
253
- ### Optional Enhancements
254
- - `modal` - For GPU-accelerated local LLM
256
- - `sentence-transformers` - Semantic search
257
- - `faiss-cpu` - Vector similarity
258
-
259
- ### Tool APIs & Rate Limits
260
-
261
- | API | Cost | Rate Limit | API Key? | Notes |
262
- |-----|------|------------|----------|-------|
263
- | **PubMed E-utilities** | Free | 3/sec (no key), 10/sec (with key) | Optional | Register at NCBI for higher limits |
264
- | **Brave Search API** | Free tier | 2000/month free | Required | Primary web search |
265
- | **DuckDuckGo** | Free | Unofficial, ~1/sec | No | Fallback web search |
266
- | **ClinicalTrials.gov** | Free | 100/min | No | Stretch goal |
267
- | **OpenFDA** | Free | 240/min (no key), 120K/day (with key) | Optional | Drug info |
268
-
269
- **Web Search Strategy (Priority Order):**
270
- 1. **Brave Search API** (free tier: 2000 queries/month) - Primary
271
- 2. **DuckDuckGo** (unofficial, no API key) - Fallback
272
- 3. **SerpAPI** ($50/month) - Only if free options fail
273
-
274
- **Why NOT SerpAPI first?**
275
- - Costs money (hackathon budget = $0)
276
- - Free alternatives work fine for demo
277
- - Can upgrade later if needed
278
-
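- To respect the PubMed limit in the table above, a tiny async throttle is enough. A sketch (3/sec is the keyless E-utilities limit):
-
- ```python
- import asyncio
- import time
-
- class Throttle:
-     """Serialize calls so we never exceed calls_per_second (e.g. PubMed's 3/sec)."""
-
-     def __init__(self, calls_per_second: float = 3.0):
-         self._min_interval = 1.0 / calls_per_second
-         self._last = 0.0
-         self._lock = asyncio.Lock()
-
-     async def wait(self) -> None:
-         async with self._lock:
-             delay = self._min_interval - (time.monotonic() - self._last)
-             if delay > 0:
-                 await asyncio.sleep(delay)
-             self._last = time.monotonic()
- ```
-
- Call `await throttle.wait()` before each E-utilities request.
-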
279
- ---
280
-
281
- ## Success Criteria
282
-
283
- ### Phase 1-5 (MVP) ✅ COMPLETE
284
- **Completed in ONE DAY:**
285
- - [x] User can ask drug repurposing question
286
- - [x] Agent searches PubMed (async)
287
- - [x] Agent searches web (DuckDuckGo)
288
- - [x] LLM judge evaluates evidence quality
289
- - [x] System respects token budget and iterations
290
- - [x] Output includes drug candidates + citations
291
- - [x] Works end-to-end for demo query
292
- - [x] Gradio UI with streaming progress
293
- - [x] Magentic multi-agent orchestration
294
- - [x] 38 unit tests passing
295
- - [x] CI/CD pipeline green
296
-
297
- ### Hackathon Submission ✅ COMPLETE
298
- - [x] Gradio UI deployed on HuggingFace Spaces
299
- - [x] Example queries working and tested
300
- - [x] Architecture documentation
301
- - [x] README with setup instructions
302
-
303
- ### Phase 6-8 (Enhanced)
304
- **Specs ready for implementation:**
305
- - [ ] Embeddings & Semantic Search (Phase 6)
306
- - [ ] Hypothesis Agent (Phase 7)
307
- - [ ] Report Agent (Phase 8)
308
-
309
- ### What's EXPLICITLY Out of Scope
310
- **NOT building (to stay focused):**
311
- - ❌ User authentication
312
- - ❌ Database storage of queries
313
- - ❌ Multi-user support
314
- - ❌ Payment/billing
315
- - ❌ Production monitoring
316
- - ❌ Mobile UI
317
-
318
- ---
319
-
320
- ## Implementation Timeline
321
-
322
- ### Day 1 (Today): Architecture & Setup
323
- - [x] Define use case (drug repurposing) ✅
324
- - [x] Write architecture docs ✅
325
- - [ ] Create project structure
326
- - [ ] First PR: Structure + Docs
327
-
328
- ### Day 2: Core Agent Loop
329
- - [ ] Implement basic orchestrator
330
- - [ ] Add PubMed search tool
331
- - [ ] Simple judge (keyword-based)
332
- - [ ] Test with 1 query
333
-
334
- ### Day 3: Intelligence Layer
335
- - [ ] Upgrade to LLM judge
336
- - [ ] Add web search tool
337
- - [ ] Token budget tracking
338
- - [ ] Test with multiple queries
339
-
340
- ### Day 4: UI & Integration
341
- - [ ] Build Gradio interface
342
- - [ ] Wire up agent to UI
343
- - [ ] Add progress indicators
344
- - [ ] Format output nicely
345
-
346
- ### Day 5: Polish & Extend
347
- - [ ] Add more tools (clinical trials)
348
- - [ ] Improve judge prompts
349
- - [ ] Checkpoint system
350
- - [ ] Error handling
351
-
352
- ### Day 6: Deploy & Document
353
- - [ ] Deploy to HuggingFace Spaces
354
- - [ ] Record demo video
355
- - [ ] Write submission materials
356
- - [ ] Final testing
357
-
358
- ---
359
-
360
- ## Questions This Document Answers
361
-
362
- ### For The Maintainer
363
-
364
- **Q: "What should our design pattern be?"**
365
- A: Search-and-judge loop with multi-tool orchestration (detailed in Design Patterns section)
366
-
367
- **Q: "Should we use LLM-as-judge or token budget?"**
368
- A: Both - judge for smart stopping, budget for cost control
369
-
370
- **Q: "What's the break pattern?"**
371
- A: Three conditions: judge approval, token limit, or max iterations (whichever comes first)
372
-
373
- **Q: "What components do we need?"**
374
- A: Agent orchestrator, tools (PubMed/web), judge, Gradio UI (see Component Breakdown)
375
-
376
- ### For The Team
377
-
378
- **Q: "What are we actually building?"**
379
- A: Medical drug repurposing research agent (see Core Use Case)
380
-
381
- **Q: "How complex should it be?"**
382
- A: Simple but complete - ~300 lines of core code (see Component sizes)
383
-
384
- **Q: "What's the timeline?"**
385
- A: 6 days, MVP by Day 3, polish Days 4-6 (see Implementation Timeline)
386
-
387
- **Q: "What datasets/APIs do we use?"**
388
- A: PubMed (free), web search, ClinicalTrials.gov (see Tool APIs)
389
-
390
- ---
391
-
392
- ## Next Steps
393
-
394
- 1. **Review this document** - Team feedback on architecture
395
- 2. **Finalize design** - Incorporate feedback
396
- 3. **Create project structure** - Scaffold repository
397
- 4. **Move to proper docs** - `docs/architecture/` folder
398
- 5. **Open first PR** - Structure + Documentation
399
- 6. **Start implementation** - Day 2 onward
400
-
401
- ---
402
-
403
- ## Notes & Decisions
404
-
405
- ### Why Drug Repurposing?
406
- - Clear, impressive use case
407
- - Real-world medical impact
408
- - Good data availability (PubMed, trials)
409
- - Easy to explain (Viagra example!)
410
- - Physician on team ✅
411
-
412
- ### Why Simple Architecture?
413
- - 6-day timeline
414
- - Need working end-to-end system
415
- - Hackathon judges value "works" over "complex"
416
- - Can extend later if successful
417
-
418
- ### Why These Tools First?
419
- - PubMed: Best biomedical literature source
420
- - Web search: General medical knowledge
421
- - Clinical trials: Evidence of actual testing
422
- - Others: Nice-to-have, not critical for MVP
423
-
424
- ---
425
-
426
- ---
427
-
428
- ## Appendix A: Demo Queries (Pre-tested)
429
-
430
- These queries will be used for demo and testing. They're chosen because:
431
- 1. They have good PubMed coverage
432
- 2. They're medically interesting
433
- 3. They show the system's capabilities
434
-
435
- ### Primary Demo Query
436
- ```
437
- "What existing drugs might help treat long COVID fatigue?"
438
- ```
439
- **Expected candidates**: CoQ10, Low-dose Naltrexone, Modafinil
440
- **Expected sources**: 20+ PubMed papers, 2-3 clinical trials
441
-
442
- ### Secondary Demo Queries
443
- ```
444
- "Find existing drugs that might slow Alzheimer's progression"
445
- "What approved medications could help with fibromyalgia pain?"
446
- "Which diabetes drugs show promise for cancer treatment?"
447
- ```
448
-
449
- ### Why These Queries?
450
- - Represent real clinical needs
451
- - Have substantial literature
452
- - Show diverse drug classes
453
- - Physician on team can validate results
454
-
455
- ---
456
-
457
- ## Appendix B: Risk Assessment
458
-
459
- | Risk | Likelihood | Impact | Mitigation |
460
- |------|------------|--------|------------|
461
- | PubMed rate limiting | Medium | High | Implement caching, respect 3/sec |
462
- | Web search API fails | Low | Medium | DuckDuckGo fallback |
463
- | LLM costs exceed budget | Medium | Medium | Hard token cap at 50K |
464
- | Judge quality poor | Medium | High | Pre-test prompts, iterate |
465
- | HuggingFace deploy issues | Low | High | Test deployment Day 4 |
466
- | Demo crashes live | Medium | High | Pre-recorded backup video |
467
-
468
- ---
469
-
470
- ---
471
-
472
- **Document Status**: Official Architecture Spec
473
- **Review Score**: 98/100
474
- **Last Updated**: November 2025
 
docs/architecture/system_registry.md CHANGED
@@ -1,6 +1,6 @@
1
  # System Registry & Wiring Architecture
2
  **Status**: Active / Canonical
3
- **Last Updated**: 2025-12-03
4
 
5
  This document serves as the **Source of Truth** for the architectural wiring of the agent framework. It defines the strict rules for decorators, protocol markers, and the tool registry to prevent regression and ensure correct system behavior.
6
 
@@ -56,12 +56,12 @@ These are the `@ai_function` decorated functions that agents can invoke. The fra
56
 
57
  | Function Name | File Path | Description |
58
  |:---|:---|:---|
59
- | `search_pubmed` | `src/agents/tools.py:21` | Searches PubMed for biomedical literature |
60
- | `search_clinical_trials` | `src/agents/tools.py:81` | Searches ClinicalTrials.gov for clinical studies |
61
- | `search_preprints` | `src/agents/tools.py:121` | Searches Europe PMC for preprints and papers |
62
- | `get_bibliography` | `src/agents/tools.py:161` | Returns collected references for final report |
 
63
  | ~~`execute_python_code`~~ | ~~`src/agents/code_executor_agent.py`~~ | REMOVED in PR #130 (Modal deleted) |
64
- | ~~`search_web`~~ | ~~`src/agents/retrieval_agent.py`~~ | REMOVED in PR #130 (unused) |
65
 
66
  ### 3.2 Tool Classes (Internal Wrappers)
67
 
@@ -73,9 +73,11 @@ These are **internal implementation wrappers** used by the AI Functions. They ar
73
  | `ClinicalTrialsTool` | `src/tools/clinicaltrials.py` | `search_clinical_trials` |
74
  | `EuropePMCTool` | `src/tools/europepmc.py` | `search_preprints` |
75
  | `OpenAlexTool` | `src/tools/openalex.py` | OpenAlex search (used in SearchHandler) |
 
76
  | `SearchHandler` | `src/tools/search_handler.py` | Orchestrates parallel searches |
 
 
77
  | ~~`ModalCodeExecutor`~~ | ~~`src/tools/code_execution.py`~~ | REMOVED in PR #130 |
78
- | ~~`WebSearchTool`~~ | ~~`src/tools/web_search.py`~~ | REMOVED in PR #130 |
79
 
80
  ---
81
 
@@ -119,8 +121,8 @@ class NewProviderChatClient(BaseChatClient):
119
 
120
  ## 5. Known Issues & Gotchas
121
 
122
- * **~~P1 Bug - Premature Marker Setting~~ (FIXED):** The `HuggingFaceChatClient` previously set `__function_invoking_chat_client__ = True` in the class body, which caused `@use_function_invocation` to skip wrapping. **Resolution:** Marker removed; decorator now sets it correctly. See `docs/bugs/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md`.
123
- * **HuggingFace Provider Routing:** Qwen2.5-7B-Instruct routes to Together.ai (not native HF). Tool call parsing may be inconsistent with complex multi-agent prompts.
124
  * **Model Hallucination:** If tool execution fails (due to incorrect wiring), models like Qwen2.5-7B will often **hallucinate** fake tool results as text. Always verify `AgentRunResponse` contains actual `FunctionResultContent`.
125
 
126
  ---
 
1
  # System Registry & Wiring Architecture
2
  **Status**: Active / Canonical
3
+ **Last Updated**: 2025-12-06
4
 
5
  This document serves as the **Source of Truth** for the architectural wiring of the agent framework. It defines the strict rules for decorators, protocol markers, and the tool registry to prevent regression and ensure correct system behavior.
6
 
 
56
 
57
  | Function Name | File Path | Description |
58
  |:---|:---|:---|
59
+ | `search_pubmed` | `src/agents/tools.py:20` | Searches PubMed for biomedical literature |
60
+ | `search_clinical_trials` | `src/agents/tools.py:80` | Searches ClinicalTrials.gov for clinical studies |
61
+ | `search_preprints` | `src/agents/tools.py:120` | Searches Europe PMC for preprints and papers |
62
+ | `get_bibliography` | `src/agents/tools.py:160` | Returns collected references for final report |
63
+ | `search_web` | `src/agents/retrieval_agent.py:17` | Searches web using DuckDuckGo |
64
  | ~~`execute_python_code`~~ | ~~`src/agents/code_executor_agent.py`~~ | REMOVED in PR #130 (Modal deleted) |
 
65
 
66
  ### 3.2 Tool Classes (Internal Wrappers)
67
 
 
73
  | `ClinicalTrialsTool` | `src/tools/clinicaltrials.py` | `search_clinical_trials` |
74
  | `EuropePMCTool` | `src/tools/europepmc.py` | `search_preprints` |
75
  | `OpenAlexTool` | `src/tools/openalex.py` | OpenAlex search (used in SearchHandler) |
76
+ | `WebSearchTool` | `src/tools/web_search.py` | `search_web` (DuckDuckGo) |
77
  | `SearchHandler` | `src/tools/search_handler.py` | Orchestrates parallel searches |
78
+ | `RateLimiter` | `src/tools/rate_limiter.py` | Rate limiting via `limits` library |
79
+ | `BaseTool` | `src/tools/base.py` | Abstract base class for tools |
80
  | ~~`ModalCodeExecutor`~~ | ~~`src/tools/code_execution.py`~~ | REMOVED in PR #130 |
 
81
 
82
  ---
83
 
 
121
 
122
  ## 5. Known Issues & Gotchas
123
 
124
+ * **~~P1 Bug - Premature Marker Setting~~ (FIXED):** The `HuggingFaceChatClient` previously set `__function_invoking_chat_client__ = True` in the class body, which caused `@use_function_invocation` to skip wrapping. **Resolution:** Marker removed; decorator now sets it correctly.
125
+ * **HuggingFace Provider Routing:** Large models (70B+) may be routed to third-party inference providers (Novita, Hyperbolic) instead of native HF infrastructure. See `CLAUDE.md` for current model recommendations.
126
  * **Model Hallucination:** If tool execution fails (due to incorrect wiring), models like Qwen2.5-7B will often **hallucinate** fake tool results as text. Always verify `AgentRunResponse` contains actual `FunctionResultContent`.
127
 
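A stand-in sketch of the fixed P1 marker bug's mechanics (simplified; only the marker logic is modeled here, not the framework's real wrapping code):

```python
def use_function_invocation(cls):
    # (stand-in) the real decorator skips wrapping when the marker is already set
    if getattr(cls, "__function_invoking_chat_client__", False):
        return cls  # skipped -> tool calls are never actually executed
    # ... the real decorator wraps the chat method to execute tool calls here ...
    cls.__function_invoking_chat_client__ = True
    return cls

class BrokenChatClient:
    __function_invoking_chat_client__ = True  # WRONG: decorator will skip wrapping

@use_function_invocation
class WorkingChatClient:
    ...  # RIGHT: setting the marker is left to the decorator
```
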
128
  ---
docs/architecture/workflow-diagrams.md CHANGED
@@ -1,8 +1,24 @@
1
- # DeepBoner Workflow - Simplified Magentic Architecture
2
 
3
  > **Architecture Pattern**: Microsoft Magentic Orchestration
4
  > **Design Philosophy**: Simple, dynamic, manager-driven coordination
5
  > **Key Innovation**: Intelligent manager replaces rigid sequential phases
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
 
7
  ---
8
 
@@ -371,7 +387,7 @@ flowchart TD
371
  style CodeExec fill:#f0f0f0
372
  ```
373
 
374
- ## 11. MCP Tool Architecture
375
 
376
  ```mermaid
377
  graph TB
@@ -379,52 +395,50 @@ graph TB
379
  Manager[Magentic Manager]
380
  HypAgent[Hypothesis Agent]
381
  SearchAgent[Search Agent]
382
- AnalysisAgent[Analysis Agent]
383
  ReportAgent[Report Agent]
384
  end
385
 
386
- subgraph "MCP Protocol Layer"
387
- Registry[MCP Tool Registry<br/>• Discovers tools<br/>• Routes requests<br/>• Manages connections]
388
  end
389
 
390
- subgraph "MCP Servers"
391
- Server1[Web Search Server<br/>localhost:8001<br/>• PubMed<br/>• ClinicalTrials<br/>• Europe PMC]
392
- Server2[Code Execution Server<br/>localhost:8002<br/>• Sandboxed Python<br/>• Package management]
393
- Server3[RAG Server<br/>localhost:8003<br/>• Vector embeddings<br/>• Similarity search]
394
- Server4[Visualization Server<br/>localhost:8004<br/>• Chart generation<br/>• Plot rendering]
395
  end
396
 
397
- subgraph "External Services"
398
- PubMed[PubMed API]
399
  Trials[ClinicalTrials.gov API]
400
  EuropePMC[Europe PMC API]
401
- Modal[Modal Sandbox]
402
- ChromaDB[(ChromaDB)]
403
  end
404
 
405
- SearchAgent -->|Request| Registry
406
- AnalysisAgent -->|Request| Registry
407
- ReportAgent -->|Request| Registry
 
 
408
 
409
- Registry --> Server1
410
- Registry --> Server2
411
- Registry --> Server3
412
- Registry --> Server4
413
-
414
- Server1 --> PubMed
415
- Server1 --> Trials
416
- Server1 --> EuropePMC
417
- Server2 --> Modal
418
- Server3 --> ChromaDB
419
 
420
  style Manager fill:#ffe6e6
421
- style Registry fill:#fff4e6
422
- style Server1 fill:#e6f3ff
423
- style Server2 fill:#e6f3ff
424
- style Server3 fill:#e6f3ff
425
- style Server4 fill:#e6f3ff
426
  ```
427
 
 
 
 
428
  ## 12. Progress Tracking & Stall Detection
429
 
430
  ```mermaid
@@ -519,18 +533,18 @@ graph LR
519
  DC -->|Literature search| PubMed[PubMed API<br/>Medical papers]
520
  DC -->|Clinical trials| Trials[ClinicalTrials.gov<br/>Trial data]
521
  DC -->|Preprints| EuropePMC[Europe PMC API<br/>Preprints & papers]
522
- DC -->|Agent reasoning| Claude[Claude API<br/>Sonnet 4 / Opus]
523
- DC -->|Code execution| Modal[Modal Sandbox<br/>Safe Python env]
524
- DC -->|Vector storage| Chroma[ChromaDB<br/>Embeddings & RAG]
525
 
526
- DC -->|Deployed on| HF[HuggingFace Spaces<br/>Gradio 6.0]
527
 
528
  PubMed -->|Results| DC
529
  Trials -->|Results| DC
530
  EuropePMC -->|Results| DC
531
- Claude -->|Responses| DC
532
- Modal -->|Output| DC
533
- Chroma -->|Context| DC
534
 
535
  DC -->|Research report| User
536
 
@@ -539,9 +553,9 @@ graph LR
539
  style PubMed fill:#e6f3ff
540
  style Trials fill:#e6f3ff
541
  style EuropePMC fill:#e6f3ff
542
- style Claude fill:#ffd6d6
543
- style Modal fill:#f0f0f0
544
- style Chroma fill:#ffe6f0
545
  style HF fill:#d4edda
546
  ```
547
 
@@ -645,18 +659,20 @@ workflow = (
645
  )
646
  ```
647
 
648
- **Manager handles quality assessment in its instructions:**
649
- - Checks hypothesis quality (testable, novel, clear)
650
- - Validates search results (relevant, authoritative, recent)
651
- - Assesses analysis soundness (methodology, evidence, conclusions)
652
- - Ensures report completeness (all sections, proper citations)
 
653
 
654
- No separate Judge Agent needed - manager does it all!
655
 
656
  ---
657
 
658
- **Document Version**: 2.0 (Magentic Simplified)
659
- **Last Updated**: 2025-12-05
660
  **Architecture**: Microsoft Magentic Orchestration Pattern
661
- **Agents**: 4 (Hypothesis, Search, Analysis, Report) + 1 Manager
 
662
  **License**: MIT
 
1
+ # DeepBoner Workflow - Magentic Architecture
2
 
3
  > **Architecture Pattern**: Microsoft Magentic Orchestration
4
  > **Design Philosophy**: Simple, dynamic, manager-driven coordination
5
  > **Key Innovation**: Intelligent manager replaces rigid sequential phases
6
+ > **Last Updated**: 2025-12-06
7
+
8
+ ## Current Agent Inventory
9
+
10
+ | Agent | File | Status |
11
+ |-------|------|--------|
12
+ | Manager | `AdvancedOrchestrator` | ✅ Implemented |
13
+ | Hypothesis Agent | `hypothesis_agent.py` | ✅ Implemented |
14
+ | Search Agent | `search_agent.py` | ✅ Implemented |
15
+ | Judge Agent | `judge_agent.py` | ✅ Implemented |
16
+ | Report Agent | `report_agent.py` | ✅ Implemented |
17
+ | Retrieval Agent | `retrieval_agent.py` | ✅ Implemented (web search) |
18
+ | ~~Analysis Agent~~ | N/A | ❌ Not implemented (no code execution) |
19
+
20
+ > **Note:** Some diagrams below show "AnalysisAgent" with code execution capabilities.
21
+ > This was planned but not implemented. Modal code execution was removed in PR #130.
22
 
23
  ---
24
 
 
387
  style CodeExec fill:#f0f0f0
388
  ```
389
 
390
+ ## 11. Tool Architecture
391
 
392
  ```mermaid
393
  graph TB
 
395
  Manager[Magentic Manager]
396
  HypAgent[Hypothesis Agent]
397
  SearchAgent[Search Agent]
398
+ JudgeAgent[Judge Agent]
399
  ReportAgent[Report Agent]
400
  end
401
 
402
+ subgraph "Tool Layer (Direct Calls)"
403
+ Tools[AI Functions<br/>@ai_function decorated<br/>• search_pubmed<br/>• search_clinical_trials<br/>• search_preprints<br/>• search_web<br/>• get_bibliography]
404
  end
405
 
406
+ subgraph "Tool Wrappers"
407
+ PubMedTool[PubMedTool]
408
+ TrialsTool[ClinicalTrialsTool]
409
+ EuropePMCTool[EuropePMCTool]
410
+ WebSearchTool[WebSearchTool]
411
  end
412
 
413
+ subgraph "External APIs"
414
+ PubMed[PubMed E-utilities]
415
  Trials[ClinicalTrials.gov API]
416
  EuropePMC[Europe PMC API]
417
+ DDG[DuckDuckGo]
 
418
  end
419
 
420
+ SearchAgent -->|Calls| Tools
421
+ Tools --> PubMedTool
422
+ Tools --> TrialsTool
423
+ Tools --> EuropePMCTool
424
+ Tools --> WebSearchTool
425
 
426
+ PubMedTool --> PubMed
427
+ TrialsTool --> Trials
428
+ EuropePMCTool --> EuropePMC
429
+ WebSearchTool --> DDG
 
 
 
 
 
 
430
 
431
  style Manager fill:#ffe6e6
432
+ style Tools fill:#fff4e6
433
+ style PubMedTool fill:#e6f3ff
434
+ style TrialsTool fill:#e6f3ff
435
+ style EuropePMCTool fill:#e6f3ff
436
+ style WebSearchTool fill:#e6f3ff
437
  ```
438
 
439
+ > **Note:** MCP support is provided via Gradio's built-in `mcp_server=True` option in `src/app.py`.
440
+ > This exposes the Gradio interface as an MCP server for Claude Desktop integration.
441
+
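+ A one-line sketch of that wiring (assuming the Blocks app in `src/app.py` is named `demo`):
+
+ ```python
+ # Launch the Gradio app and expose its endpoints as MCP tools
+ demo.launch(mcp_server=True)
+ ```
+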
442
  ## 12. Progress Tracking & Stall Detection
443
 
444
  ```mermaid
 
533
  DC -->|Literature search| PubMed[PubMed API<br/>Medical papers]
534
  DC -->|Clinical trials| Trials[ClinicalTrials.gov<br/>Trial data]
535
  DC -->|Preprints| EuropePMC[Europe PMC API<br/>Preprints & papers]
536
+ DC -->|Web search| DDG[DuckDuckGo<br/>General web]
537
+ DC -->|Agent reasoning| LLM[LLM Backend<br/>OpenAI or HuggingFace]
538
+ DC -->|Embeddings| Embed[SentenceTransformers<br/>Local embeddings]
539
 
540
+ DC -->|Deployed on| HF[HuggingFace Spaces<br/>Gradio 5.x]
541
 
542
  PubMed -->|Results| DC
543
  Trials -->|Results| DC
544
  EuropePMC -->|Results| DC
545
+ DDG -->|Results| DC
546
+ LLM -->|Responses| DC
547
+ Embed -->|Vectors| DC
548
 
549
  DC -->|Research report| User
550
 
 
553
  style PubMed fill:#e6f3ff
554
  style Trials fill:#e6f3ff
555
  style EuropePMC fill:#e6f3ff
556
+ style DDG fill:#e6f3ff
557
+ style LLM fill:#ffd6d6
558
+ style Embed fill:#ffe6f0
559
  style HF fill:#d4edda
560
  ```
561
 
 
659
  )
660
  ```
661
 
662
+ **Current Agent Capabilities:**
663
+ - **HypothesisAgent**: Generates research hypotheses
664
+ - **SearchAgent**: Multi-source search (PubMed, ClinicalTrials, Europe PMC)
665
+ - **JudgeAgent**: Evaluates evidence quality, determines sufficiency
666
+ - **ReportAgent**: Generates final research report
667
+ - **RetrievalAgent**: Web search via DuckDuckGo
668
 
669
+ **Manager** (AdvancedOrchestrator) coordinates agent execution and workflow.
670
 
671
  ---
672
 
673
+ **Document Version**: 2.1 (Revised for accuracy)
674
+ **Last Updated**: 2025-12-06
675
  **Architecture**: Microsoft Magentic Orchestration Pattern
676
+ **Implemented Agents**: 5 (Hypothesis, Search, Judge, Report, Retrieval) + Manager
677
+ **Planned but Not Implemented**: Analysis Agent (code execution removed in PR #130)
678
  **License**: MIT
docs/guides/deployment.md DELETED
@@ -1,142 +0,0 @@
1
- # Deployment Guide
2
- ## Launching DeepBoner: Gradio, MCP, & Modal
3
-
4
- ---
5
-
6
- ## Overview
7
-
8
- DeepBoner is designed for a multi-platform deployment strategy to maximize hackathon impact:
9
-
10
- 1. **HuggingFace Spaces**: Host the Gradio UI (User Interface).
11
- 2. **MCP Server**: Expose research tools to Claude Desktop/Agents.
12
- 3. **Modal (Optional)**: Run heavy inference or local LLMs if API costs are prohibitive.
13
-
14
- ---
15
-
16
- ## 1. HuggingFace Spaces (Gradio UI)
17
-
18
- **Goal**: A public URL where judges/users can try the research agent.
19
-
20
- ### Prerequisites
21
- - HuggingFace Account
22
- - `gradio` installed (`uv add gradio`)
23
-
24
- ### Steps
25
-
26
- 1. **Create Space**:
27
- - Go to HF Spaces -> Create New Space.
28
- - SDK: **Gradio**.
29
- - Hardware: **CPU Basic** (Free) is sufficient (since we use APIs).
30
-
31
- 2. **Prepare Files**:
32
- - Ensure `app.py` contains the Gradio interface construction.
33
- - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies.
34
-
35
- 3. **Secrets**:
36
- - Go to Space Settings -> **Repository secrets**.
37
- - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key).
38
- - Add `BRAVE_API_KEY` (for web search).
39
-
40
- 4. **Deploy**:
41
- - Push code to the Space's git repo.
42
- - Watch "Build" logs.
43
-
44
- ### Streaming Optimization
45
- Ensure `app.py` uses generator functions for the chat interface to prevent timeouts:
46
- ```python
47
- # app.py
48
- def predict(message, history):
49
- agent = ResearchAgent()
50
- for update in agent.research_stream(message):
51
- yield update
52
- ```
53
-
54
- ---
55
-
56
- ## 2. MCP Server Deployment
57
-
58
- **Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly.
59
-
60
- ### Local Usage (Claude Desktop)
61
-
62
- 1. **Install**:
63
- ```bash
64
- uv sync
65
- ```
66
-
67
- 2. **Configure Claude Desktop**:
68
- Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
69
- ```json
70
- {
71
- "mcpServers": {
72
- "deepboner": {
73
- "command": "uv",
74
- "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"],
75
- "cwd": "/absolute/path/to/DeepBoner"
76
- }
77
- }
78
- }
79
- ```
80
-
81
- 3. **Restart Claude**: You should see a 🔌 icon indicating connected tools.
82
-
83
- ### Remote Deployment (Smithery/Glama)
84
- *Target for "MCP Track" bonus points.*
85
-
86
- 1. **Dockerize**: Create a `Dockerfile` for the MCP server.
87
- ```dockerfile
88
- FROM python:3.11-slim
89
- COPY . /app
90
- RUN pip install fastmcp httpx
91
- CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"]
92
- ```
93
- *Note: Use SSE transport for remote/HTTP servers.*
94
-
95
- 2. **Deploy**: Host on Fly.io or Railway.
96
-
97
- ---
98
-
99
- ## 3. Modal (GPU/Heavy Compute)
100
-
101
- **Goal**: Run a local LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive.
102
-
103
- ### Setup
104
- 1. **Install**: `uv add modal`
105
- 2. **Auth**: `modal token new`
106
-
107
- ### Logic
108
- Instead of calling Anthropic API, we call a Modal function:
109
-
110
- ```python
111
- # src/llm/modal_client.py
112
- import modal
113
-
114
- stub = modal.Stub("deepboner-inference")
115
-
116
- @stub.function(gpu="A100")
117
- def generate_text(prompt: str):
118
- # Load vLLM or similar
119
- ...
120
- ```
121
-
122
- ### When to use?
123
- - **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability.
124
- - **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability.
125
-
126
- ---
127
-
128
- ## Deployment Checklist
129
-
130
- ### Pre-Flight
131
- - [ ] Run `pytest -m unit` to ensure logic is sound.
132
- - [ ] Run `pytest -m e2e` (one pass) to verify APIs connect.
133
- - [ ] Check `requirements.txt` matches `pyproject.toml`.
134
-
135
- ### Secrets Management
136
- - [ ] **NEVER** commit `.env` files.
137
- - [ ] Verify keys are added to HF Space settings.
138
-
139
- ### Post-Launch
140
- - [ ] Test the live URL.
141
- - [ ] Verify "Stop" button in Gradio works (interrupts the agent).
142
- - [ ] Record a walkthrough video (crucial for hackathon submission).