VibecoderMcSwaggins commited on
Commit
d7e5abb
·
1 Parent(s): 32e3b61

feat(phase3): implement judge slice (LLM Judge, Prompts, Models)

Browse files

Implemented LLM-based Judge Agent with PydanticAI for structured output. Includes 100% test coverage and fallback mechanisms.

docs/implementation/05_phase_magentic.md ADDED
@@ -0,0 +1,582 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Phase 5 Implementation Spec: Magentic Integration (Optional)
2
+
3
+ **Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
4
+ **Philosophy**: "Same API, Better Engine."
5
+ **Prerequisite**: Phase 4 complete (MVP working end-to-end)
6
+
7
+ ---
8
+
9
+ ## 1. Why Magentic?
10
+
11
+ Magentic-One provides:
12
+ - **LLM-powered manager** that dynamically plans, selects agents, tracks progress
13
+ - **Built-in stall detection** and automatic replanning
14
+ - **Checkpointing** for pause/resume workflows
15
+ - **Event streaming** for real-time UI updates
16
+ - **Multi-agent coordination** with round limits and reset logic
17
+
18
+ This is **NOT required for MVP**. Only implement if time permits after Phase 4.
19
+
20
+ ---
21
+
22
+ ## 2. Architecture Alignment
23
+
24
+ ### Current Phase 4 Architecture
25
+ ```
26
+ User Query
27
+
28
+ Orchestrator (while loop)
29
+ ├── SearchHandler.execute() → Evidence
30
+ ├── JudgeHandler.assess() → JudgeAssessment
31
+ └── Loop/Synthesize decision
32
+
33
+ Research Report
34
+ ```
35
+
36
+ ### Phase 5 Magentic Architecture
37
+ ```
38
+ User Query
39
+
40
+ MagenticBuilder
41
+ ├── SearchAgent (wraps SearchHandler)
42
+ ├── JudgeAgent (wraps JudgeHandler)
43
+ └── StandardMagenticManager (LLM coordinator)
44
+
45
+ Research Report (same output format)
46
+ ```
47
+
48
+ **Key Insight**: We wrap existing handlers as `AgentProtocol` implementations. The domain logic stays the same.
49
+
50
+ ---
51
+
52
+ ## 3. Design for Seamless Integration
53
+
54
+ ### 3.1 Protocol-Based Design (Phase 4 prep)
55
+
56
+ In Phase 4, define handlers using Protocols so they can be wrapped later:
57
+
58
+ ```python
59
+ # src/orchestrator.py (Phase 4)
60
+ from typing import Protocol, List
61
+ from src.utils.models import Evidence, SearchResult, JudgeAssessment
62
+
63
+
64
+ class SearchHandlerProtocol(Protocol):
65
+ """Protocol for search handler - can be wrapped as Agent later."""
66
+ async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
67
+ ...
68
+
69
+
70
+ class JudgeHandlerProtocol(Protocol):
71
+ """Protocol for judge handler - can be wrapped as Agent later."""
72
+ async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment:
73
+ ...
74
+
75
+
76
+ class OrchestratorProtocol(Protocol):
77
+ """Protocol for orchestrator - allows swapping implementations."""
78
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
79
+ ...
80
+ ```
81
+
82
+ ### 3.2 Facade Pattern
83
+
84
+ The `Orchestrator` class is a facade. In Phase 5, we create `MagenticOrchestrator` with the same interface:
85
+
86
+ ```python
87
+ # Phase 4: Simple orchestrator
88
+ orchestrator = Orchestrator(search_handler, judge_handler)
89
+
90
+ # Phase 5: Magentic orchestrator (SAME API)
91
+ orchestrator = MagenticOrchestrator(search_handler, judge_handler)
92
+
93
+ # Usage is identical
94
+ async for event in orchestrator.run("metformin alzheimer"):
95
+ print(event.to_markdown())
96
+ ```
97
+
98
+ ---
99
+
100
+ ## 4. Phase 5 Implementation
101
+
102
+ ### 4.1 Install Agent Framework
103
+
104
+ Add to `pyproject.toml`:
105
+
106
+ ```toml
107
+ [project.optional-dependencies]
108
+ magentic = [
109
+ "agent-framework-core>=0.1.0",
110
+ ]
111
+ ```
112
+
113
+ ### 4.2 Agent Wrappers (`src/agents/search_agent.py`)
114
+
115
+ Wrap `SearchHandler` as an `AgentProtocol`:
116
+
117
+ ```python
118
+ """Search agent wrapper for Magentic integration."""
119
+ from typing import Any
120
+ from agent_framework import AgentProtocol, AgentRunResponse, ChatMessage, Role
121
+
122
+ from src.tools.search_handler import SearchHandler
123
+ from src.utils.models import SearchResult
124
+
125
+
126
+ class SearchAgent:
127
+ """Wraps SearchHandler as an AgentProtocol for Magentic."""
128
+
129
+ def __init__(self, search_handler: SearchHandler):
130
+ self._handler = search_handler
131
+ self._id = "search-agent"
132
+ self._name = "SearchAgent"
133
+
134
+ @property
135
+ def id(self) -> str:
136
+ return self._id
137
+
138
+ @property
139
+ def name(self) -> str | None:
140
+ return self._name
141
+
142
+ @property
143
+ def display_name(self) -> str:
144
+ return self._name
145
+
146
+ @property
147
+ def description(self) -> str | None:
148
+ return "Searches PubMed and web for drug repurposing evidence"
149
+
150
+ async def run(
151
+ self,
152
+ messages: list[ChatMessage] | None = None,
153
+ *,
154
+ thread: Any = None,
155
+ **kwargs: Any,
156
+ ) -> AgentRunResponse:
157
+ """Execute search based on the last user message."""
158
+ # Extract query from messages
159
+ query = ""
160
+ if messages:
161
+ for msg in reversed(messages):
162
+ if msg.role == Role.USER and msg.text:
163
+ query = msg.text
164
+ break
165
+
166
+ if not query:
167
+ return AgentRunResponse(
168
+ messages=[ChatMessage(role=Role.ASSISTANT, text="No query provided")],
169
+ response_id="search-no-query",
170
+ )
171
+
172
+ # Execute search
173
+ result: SearchResult = await self._handler.execute(query, max_results_per_tool=10)
174
+
175
+ # Format response
176
+ evidence_text = "\n".join([
177
+ f"- [{e.citation.title}]({e.citation.url}): {e.content[:200]}..."
178
+ for e in result.evidence[:5]
179
+ ])
180
+
181
+ response_text = f"Found {result.total_found} sources:\n\n{evidence_text}"
182
+
183
+ return AgentRunResponse(
184
+ messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
185
+ response_id=f"search-{result.total_found}",
186
+ metadata={"evidence": [e.model_dump() for e in result.evidence]},
187
+ )
188
+
189
+ def run_stream(self, messages=None, *, thread=None, **kwargs):
190
+ """Streaming not implemented for search."""
191
+ async def _stream():
192
+ result = await self.run(messages, thread=thread, **kwargs)
193
+ from agent_framework import AgentRunResponseUpdate
194
+ yield AgentRunResponseUpdate(messages=result.messages)
195
+ return _stream()
196
+ ```
197
+
198
+ ### 4.3 Judge Agent Wrapper (`src/agents/judge_agent.py`)
199
+
200
+ ```python
201
+ """Judge agent wrapper for Magentic integration."""
202
+ from typing import Any, List
203
+ from agent_framework import AgentProtocol, AgentRunResponse, ChatMessage, Role
204
+
205
+ from src.agent_factory.judges import JudgeHandler
206
+ from src.utils.models import Evidence, JudgeAssessment
207
+
208
+
209
+ class JudgeAgent:
210
+ """Wraps JudgeHandler as an AgentProtocol for Magentic."""
211
+
212
+ def __init__(self, judge_handler: JudgeHandler, evidence_store: dict[str, List[Evidence]]):
213
+ self._handler = judge_handler
214
+ self._evidence_store = evidence_store # Shared state for evidence
215
+ self._id = "judge-agent"
216
+ self._name = "JudgeAgent"
217
+
218
+ @property
219
+ def id(self) -> str:
220
+ return self._id
221
+
222
+ @property
223
+ def name(self) -> str | None:
224
+ return self._name
225
+
226
+ @property
227
+ def display_name(self) -> str:
228
+ return self._name
229
+
230
+ @property
231
+ def description(self) -> str | None:
232
+ return "Evaluates evidence quality and determines if sufficient for synthesis"
233
+
234
+ async def run(
235
+ self,
236
+ messages: list[ChatMessage] | None = None,
237
+ *,
238
+ thread: Any = None,
239
+ **kwargs: Any,
240
+ ) -> AgentRunResponse:
241
+ """Assess evidence quality."""
242
+ # Extract original question from messages
243
+ question = ""
244
+ if messages:
245
+ for msg in messages:
246
+ if msg.role == Role.USER and msg.text:
247
+ question = msg.text
248
+ break
249
+
250
+ # Get evidence from shared store
251
+ evidence = self._evidence_store.get("current", [])
252
+
253
+ # Assess
254
+ assessment: JudgeAssessment = await self._handler.assess(question, evidence)
255
+
256
+ # Format response
257
+ response_text = f"""## Assessment
258
+
259
+ **Sufficient**: {assessment.sufficient}
260
+ **Confidence**: {assessment.confidence:.0%}
261
+ **Recommendation**: {assessment.recommendation}
262
+
263
+ ### Scores
264
+ - Mechanism: {assessment.details.mechanism_score}/10
265
+ - Clinical: {assessment.details.clinical_evidence_score}/10
266
+
267
+ ### Reasoning
268
+ {assessment.reasoning}
269
+ """
270
+
271
+ if assessment.next_search_queries:
272
+ response_text += f"\n### Next Queries\n" + "\n".join(
273
+ f"- {q}" for q in assessment.next_search_queries
274
+ )
275
+
276
+ return AgentRunResponse(
277
+ messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
278
+ response_id=f"judge-{assessment.recommendation}",
279
+ metadata={"assessment": assessment.model_dump()},
280
+ )
281
+
282
+ def run_stream(self, messages=None, *, thread=None, **kwargs):
283
+ """Streaming not implemented for judge."""
284
+ async def _stream():
285
+ result = await self.run(messages, thread=thread, **kwargs)
286
+ from agent_framework import AgentRunResponseUpdate
287
+ yield AgentRunResponseUpdate(messages=result.messages)
288
+ return _stream()
289
+ ```
290
+
291
+ ### 4.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
292
+
293
+ ```python
294
+ """Magentic-based orchestrator for DeepCritical."""
295
+ from typing import AsyncGenerator, List
296
+ import structlog
297
+
298
+ from agent_framework import (
299
+ MagenticBuilder,
300
+ MagenticFinalResultEvent,
301
+ MagenticAgentMessageEvent,
302
+ MagenticOrchestratorMessageEvent,
303
+ WorkflowOutputEvent,
304
+ )
305
+ from agent_framework.openai import OpenAIChatClient
306
+
307
+ from src.agents.search_agent import SearchAgent
308
+ from src.agents.judge_agent import JudgeAgent
309
+ from src.tools.search_handler import SearchHandler
310
+ from src.agent_factory.judges import JudgeHandler
311
+ from src.utils.models import AgentEvent, Evidence
312
+
313
+ logger = structlog.get_logger()
314
+
315
+
316
+ class MagenticOrchestrator:
317
+ """
318
+ Magentic-based orchestrator - same API as Orchestrator.
319
+
320
+ Uses Microsoft Agent Framework's MagenticBuilder for multi-agent coordination.
321
+ """
322
+
323
+ def __init__(
324
+ self,
325
+ search_handler: SearchHandler,
326
+ judge_handler: JudgeHandler,
327
+ max_rounds: int = 10,
328
+ ):
329
+ self._search_handler = search_handler
330
+ self._judge_handler = judge_handler
331
+ self._max_rounds = max_rounds
332
+ self._evidence_store: dict[str, List[Evidence]] = {"current": []}
333
+
334
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
335
+ """
336
+ Run the Magentic workflow - same API as simple Orchestrator.
337
+
338
+ Yields AgentEvent objects for real-time UI updates.
339
+ """
340
+ logger.info("Starting Magentic orchestrator", query=query)
341
+
342
+ yield AgentEvent(
343
+ type="started",
344
+ message=f"Starting research (Magentic mode): {query}",
345
+ iteration=0,
346
+ )
347
+
348
+ # Create agent wrappers
349
+ search_agent = SearchAgent(self._search_handler)
350
+ judge_agent = JudgeAgent(self._judge_handler, self._evidence_store)
351
+
352
+ # Build Magentic workflow
353
+ workflow = (
354
+ MagenticBuilder()
355
+ .participants(
356
+ searcher=search_agent,
357
+ judge=judge_agent,
358
+ )
359
+ .with_standard_manager(
360
+ chat_client=OpenAIChatClient(),
361
+ max_round_count=self._max_rounds,
362
+ max_stall_count=3,
363
+ max_reset_count=2,
364
+ )
365
+ .build()
366
+ )
367
+
368
+ # Task instruction for the manager
369
+ task = f"""Research drug repurposing opportunities for: {query}
370
+
371
+ Instructions:
372
+ 1. Use SearcherAgent to find evidence from PubMed and web
373
+ 2. Use JudgeAgent to evaluate if evidence is sufficient
374
+ 3. If JudgeAgent says "continue", search with refined queries
375
+ 4. If JudgeAgent says "synthesize", provide final synthesis
376
+ 5. Stop when synthesis is ready or max rounds reached
377
+
378
+ Focus on finding:
379
+ - Mechanism of action evidence
380
+ - Clinical/preclinical studies
381
+ - Specific drug candidates
382
+ """
383
+
384
+ iteration = 0
385
+ try:
386
+ async for event in workflow.run_stream(task):
387
+ if isinstance(event, MagenticOrchestratorMessageEvent):
388
+ yield AgentEvent(
389
+ type="judging",
390
+ message=f"Manager: {event.kind}",
391
+ iteration=iteration,
392
+ )
393
+
394
+ elif isinstance(event, MagenticAgentMessageEvent):
395
+ iteration += 1
396
+ agent_name = event.agent_id or "unknown"
397
+
398
+ if "search" in agent_name.lower():
399
+ yield AgentEvent(
400
+ type="search_complete",
401
+ message=f"Search agent responded",
402
+ iteration=iteration,
403
+ )
404
+ elif "judge" in agent_name.lower():
405
+ yield AgentEvent(
406
+ type="judge_complete",
407
+ message=f"Judge agent evaluated evidence",
408
+ iteration=iteration,
409
+ )
410
+
411
+ elif isinstance(event, MagenticFinalResultEvent):
412
+ final_text = event.message.text if event.message else "No result"
413
+ yield AgentEvent(
414
+ type="complete",
415
+ message=final_text,
416
+ data={"iterations": iteration},
417
+ iteration=iteration,
418
+ )
419
+
420
+ elif isinstance(event, WorkflowOutputEvent):
421
+ if event.data:
422
+ yield AgentEvent(
423
+ type="complete",
424
+ message=str(event.data),
425
+ iteration=iteration,
426
+ )
427
+
428
+ except Exception as e:
429
+ logger.error("Magentic workflow failed", error=str(e))
430
+ yield AgentEvent(
431
+ type="error",
432
+ message=f"Workflow error: {str(e)}",
433
+ iteration=iteration,
434
+ )
435
+ ```
436
+
437
+ ### 4.5 Factory Pattern (`src/orchestrator_factory.py`)
438
+
439
+ Allow switching between implementations:
440
+
441
+ ```python
442
+ """Factory for creating orchestrators."""
443
+ from typing import Literal
444
+
445
+ from src.orchestrator import Orchestrator
446
+ from src.tools.search_handler import SearchHandler
447
+ from src.agent_factory.judges import JudgeHandler
448
+ from src.utils.models import OrchestratorConfig
449
+
450
+
451
+ def create_orchestrator(
452
+ search_handler: SearchHandler,
453
+ judge_handler: JudgeHandler,
454
+ config: OrchestratorConfig | None = None,
455
+ mode: Literal["simple", "magentic"] = "simple",
456
+ ):
457
+ """
458
+ Create an orchestrator instance.
459
+
460
+ Args:
461
+ search_handler: The search handler
462
+ judge_handler: The judge handler
463
+ config: Optional configuration
464
+ mode: "simple" for Phase 4 loop, "magentic" for Phase 5 multi-agent
465
+
466
+ Returns:
467
+ Orchestrator instance (same interface regardless of mode)
468
+ """
469
+ if mode == "magentic":
470
+ try:
471
+ from src.orchestrator_magentic import MagenticOrchestrator
472
+ return MagenticOrchestrator(
473
+ search_handler=search_handler,
474
+ judge_handler=judge_handler,
475
+ max_rounds=config.max_iterations if config else 10,
476
+ )
477
+ except ImportError:
478
+ # Fallback to simple if agent-framework not installed
479
+ pass
480
+
481
+ return Orchestrator(
482
+ search_handler=search_handler,
483
+ judge_handler=judge_handler,
484
+ config=config,
485
+ )
486
+ ```
487
+
488
+ ---
489
+
490
+ ## 5. Directory Structure After Phase 5
491
+
492
+ ```
493
+ src/
494
+ ├── app.py # Gradio UI (unchanged)
495
+ ├── orchestrator.py # Phase 4 simple orchestrator
496
+ ├── orchestrator_magentic.py # Phase 5 Magentic orchestrator
497
+ ├── orchestrator_factory.py # Factory to switch implementations
498
+ ├── agents/ # NEW: Agent wrappers
499
+ │ ├── __init__.py
500
+ │ ├── search_agent.py # SearchHandler as AgentProtocol
501
+ │ └── judge_agent.py # JudgeHandler as AgentProtocol
502
+ ├── agent_factory/
503
+ │ └── judges.py # JudgeHandler (unchanged)
504
+ ├── tools/
505
+ │ ├── pubmed.py # PubMed tool (unchanged)
506
+ │ ├── websearch.py # Web tool (unchanged)
507
+ │ └── search_handler.py # SearchHandler (unchanged)
508
+ └── utils/
509
+ └── models.py # Models (unchanged)
510
+ ```
511
+
512
+ ---
513
+
514
+ ## 6. Implementation Checklist
515
+
516
+ - [ ] Ensure Phase 4 uses Protocol-based handler interfaces
517
+ - [ ] Add `agent-framework-core` to optional dependencies
518
+ - [ ] Create `src/agents/` directory
519
+ - [ ] Implement `SearchAgent` wrapper
520
+ - [ ] Implement `JudgeAgent` wrapper
521
+ - [ ] Implement `MagenticOrchestrator`
522
+ - [ ] Implement `orchestrator_factory.py`
523
+ - [ ] Add tests for agent wrappers
524
+ - [ ] Test Magentic flow end-to-end
525
+ - [ ] Update `src/app.py` to use factory with mode toggle
526
+
527
+ ---
528
+
529
+ ## 7. Definition of Done
530
+
531
+ Phase 5 is **COMPLETE** when:
532
+
533
+ 1. All Phase 4 tests still pass (no regression)
534
+ 2. `MagenticOrchestrator` has same API as `Orchestrator`
535
+ 3. Can switch between modes via factory:
536
+
537
+ ```python
538
+ # Simple mode (Phase 4)
539
+ orchestrator = create_orchestrator(search, judge, mode="simple")
540
+
541
+ # Magentic mode (Phase 5)
542
+ orchestrator = create_orchestrator(search, judge, mode="magentic")
543
+
544
+ # Same usage!
545
+ async for event in orchestrator.run("metformin alzheimer"):
546
+ print(event.to_markdown())
547
+ ```
548
+
549
+ 4. UI works with both modes
550
+ 5. Graceful fallback if agent-framework not installed
551
+
552
+ ---
553
+
554
+ ## 8. When to Implement
555
+
556
+ **Priority**: LOW (optional enhancement)
557
+
558
+ Implement ONLY after:
559
+ 1. ✅ Phase 1: Foundation
560
+ 2. ✅ Phase 2: Search
561
+ 3. ✅ Phase 3: Judge
562
+ 4. ✅ Phase 4: Orchestrator + UI (MVP SHIPPED)
563
+
564
+ If hackathon deadline is approaching, **SKIP Phase 5**. Ship the MVP.
565
+
566
+ ---
567
+
568
+ ## 9. Benefits of This Design
569
+
570
+ 1. **No breaking changes** - Phase 4 code works unchanged
571
+ 2. **Same API** - `run()` returns `AsyncGenerator[AgentEvent, None]`
572
+ 3. **Gradual adoption** - Optional dependency, factory fallback
573
+ 4. **Testable** - Each component can be tested independently
574
+ 5. **Aligns with Tonic's vision** - Uses Microsoft Agent Framework patterns
575
+
576
+ ---
577
+
578
+ ## 10. Reference
579
+
580
+ - Microsoft Agent Framework: `reference_repos/agent-framework/`
581
+ - Magentic samples: `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/magentic.py`
582
+ - AgentProtocol: `reference_repos/agent-framework/python/packages/core/agent_framework/_agents.py`
docs/implementation/roadmap.md CHANGED
@@ -115,11 +115,26 @@ tests/
115
 
116
  ---
117
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
118
  ## Spec Documents
119
 
120
  1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)**
121
  2. **[Phase 2 Spec: Search Slice](02_phase_search.md)**
122
  3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)**
123
  4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)**
 
124
 
125
  *Start by reading Phase 1 Spec to initialize the repo.*
 
115
 
116
  ---
117
 
118
+ ### **Phase 5: Magentic Integration (OPTIONAL - Post-MVP)**
119
+
120
+ *Goal: Upgrade orchestrator to use Microsoft Agent Framework patterns.*
121
+
122
+ - [ ] Wrap SearchHandler as `AgentProtocol` (SearchAgent)
123
+ - [ ] Wrap JudgeHandler as `AgentProtocol` (JudgeAgent)
124
+ - [ ] Implement `MagenticOrchestrator` using `MagenticBuilder`
125
+ - [ ] Create factory pattern for switching implementations
126
+ - **Deliverable**: Same API, better multi-agent orchestration engine.
127
+
128
+ **NOTE**: Only implement Phase 5 if time permits after MVP is shipped.
129
+
130
+ ---
131
+
132
  ## Spec Documents
133
 
134
  1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)**
135
  2. **[Phase 2 Spec: Search Slice](02_phase_search.md)**
136
  3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)**
137
  4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)**
138
+ 5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** *(Optional)*
139
 
140
  *Start by reading Phase 1 Spec to initialize the repo.*
pyproject.toml CHANGED
@@ -10,6 +10,10 @@ dependencies = [
10
  "pydantic-settings>=2.2", # For BaseSettings (config)
11
  "pydantic-ai>=0.0.16", # Agent framework
12
 
 
 
 
 
13
  # HTTP & Parsing
14
  "httpx>=0.27", # Async HTTP client
15
  "beautifulsoup4>=4.12", # HTML parsing
 
10
  "pydantic-settings>=2.2", # For BaseSettings (config)
11
  "pydantic-ai>=0.0.16", # Agent framework
12
 
13
+ # AI Providers
14
+ "openai>=1.0.0",
15
+ "anthropic>=0.18.0",
16
+
17
  # HTTP & Parsing
18
  "httpx>=0.27", # Async HTTP client
19
  "beautifulsoup4>=4.12", # HTML parsing
src/agent_factory/judges.py CHANGED
@@ -0,0 +1,185 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Judge handler for evidence assessment using PydanticAI."""
2
+
3
+ from typing import Any, cast
4
+
5
+ import structlog
6
+ from pydantic_ai import Agent
7
+ from pydantic_ai.models.anthropic import AnthropicModel
8
+ from pydantic_ai.models.openai import OpenAIModel
9
+
10
+ from src.prompts.judge import (
11
+ SYSTEM_PROMPT,
12
+ format_empty_evidence_prompt,
13
+ format_user_prompt,
14
+ )
15
+ from src.utils.config import settings
16
+ from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
17
+
18
+ logger = structlog.get_logger()
19
+
20
+
21
+ def get_model() -> Any:
22
+ """Get the LLM model based on configuration."""
23
+ provider = settings.llm_provider
24
+
25
+ if provider == "anthropic":
26
+ return AnthropicModel(settings.anthropic_model)
27
+ return OpenAIModel(settings.openai_model)
28
+
29
+
30
+ class JudgeHandler:
31
+ """
32
+ Handles evidence assessment using an LLM with structured output.
33
+
34
+ Uses PydanticAI to ensure responses match the JudgeAssessment schema.
35
+ """
36
+
37
+ def __init__(self, model: Any = None) -> None:
38
+ """
39
+ Initialize the JudgeHandler.
40
+
41
+ Args:
42
+ model: Optional PydanticAI model. If None, uses config default.
43
+ """
44
+ self.model = model or get_model()
45
+ self.agent = Agent(
46
+ model=self.model,
47
+ result_type=JudgeAssessment,
48
+ system_prompt=SYSTEM_PROMPT,
49
+ retries=3,
50
+ )
51
+
52
+ async def assess(
53
+ self,
54
+ question: str,
55
+ evidence: list[Evidence],
56
+ ) -> JudgeAssessment:
57
+ """
58
+ Assess evidence and determine if it's sufficient.
59
+
60
+ Args:
61
+ question: The user's research question
62
+ evidence: List of Evidence objects from search
63
+
64
+ Returns:
65
+ JudgeAssessment with evaluation results
66
+
67
+ Raises:
68
+ JudgeError: If assessment fails after retries
69
+ """
70
+ logger.info(
71
+ "Starting evidence assessment",
72
+ question=question[:100],
73
+ evidence_count=len(evidence),
74
+ )
75
+
76
+ # Format the prompt based on whether we have evidence
77
+ if evidence:
78
+ user_prompt = format_user_prompt(question, evidence)
79
+ else:
80
+ user_prompt = format_empty_evidence_prompt(question)
81
+
82
+ try:
83
+ # Run the agent with structured output
84
+ result = await self.agent.run(user_prompt)
85
+ assessment = cast(JudgeAssessment, result.data)
86
+
87
+ logger.info(
88
+ "Assessment complete",
89
+ sufficient=assessment.sufficient,
90
+ recommendation=assessment.recommendation,
91
+ confidence=assessment.confidence,
92
+ )
93
+
94
+ return assessment
95
+
96
+ except Exception as e:
97
+ logger.error("Assessment failed", error=str(e))
98
+ # Return a safe default assessment on failure
99
+ return self._create_fallback_assessment(question, str(e))
100
+
101
+ def _create_fallback_assessment(
102
+ self,
103
+ question: str,
104
+ error: str,
105
+ ) -> JudgeAssessment:
106
+ """
107
+ Create a fallback assessment when LLM fails.
108
+
109
+ Args:
110
+ question: The original question
111
+ error: The error message
112
+
113
+ Returns:
114
+ Safe fallback JudgeAssessment
115
+ """
116
+ return JudgeAssessment(
117
+ details=AssessmentDetails(
118
+ mechanism_score=0,
119
+ mechanism_reasoning="Assessment failed due to LLM error",
120
+ clinical_evidence_score=0,
121
+ clinical_reasoning="Assessment failed due to LLM error",
122
+ drug_candidates=[],
123
+ key_findings=[],
124
+ ),
125
+ sufficient=False,
126
+ confidence=0.0,
127
+ recommendation="continue",
128
+ next_search_queries=[
129
+ f"{question} mechanism",
130
+ f"{question} clinical trials",
131
+ f"{question} drug candidates",
132
+ ],
133
+ reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
134
+ )
135
+
136
+
137
+ class MockJudgeHandler:
138
+ """
139
+ Mock JudgeHandler for testing without LLM calls.
140
+
141
+ Use this in unit tests to avoid API calls.
142
+ """
143
+
144
+ def __init__(self, mock_response: JudgeAssessment | None = None) -> None:
145
+ """
146
+ Initialize with optional mock response.
147
+
148
+ Args:
149
+ mock_response: The assessment to return. If None, uses default.
150
+ """
151
+ self.mock_response = mock_response
152
+ self.call_count = 0
153
+ self.last_question: str | None = None
154
+ self.last_evidence: list[Evidence] | None = None
155
+
156
+ async def assess(
157
+ self,
158
+ question: str,
159
+ evidence: list[Evidence],
160
+ ) -> JudgeAssessment:
161
+ """Return the mock response."""
162
+ self.call_count += 1
163
+ self.last_question = question
164
+ self.last_evidence = evidence
165
+
166
+ if self.mock_response:
167
+ return self.mock_response
168
+
169
+ min_evidence = 3
170
+ # Default mock response
171
+ return JudgeAssessment(
172
+ details=AssessmentDetails(
173
+ mechanism_score=7,
174
+ mechanism_reasoning="Mock assessment - good mechanism evidence",
175
+ clinical_evidence_score=6,
176
+ clinical_reasoning="Mock assessment - moderate clinical evidence",
177
+ drug_candidates=["Drug A", "Drug B"],
178
+ key_findings=["Finding 1", "Finding 2"],
179
+ ),
180
+ sufficient=len(evidence) >= min_evidence,
181
+ confidence=0.75,
182
+ recommendation="synthesize" if len(evidence) >= min_evidence else "continue",
183
+ next_search_queries=["query 1", "query 2"] if len(evidence) < min_evidence else [],
184
+ reasoning="Mock assessment for testing purposes",
185
+ )
src/prompts/judge.py ADDED
@@ -0,0 +1,101 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Judge prompts for evidence assessment."""
2
+
3
+ from src.utils.models import Evidence
4
+
5
+ SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
6
+
7
+ Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to
8
+ recommend drug candidates for a given condition.
9
+
10
+ ## Evaluation Criteria
11
+
12
+ 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
13
+ - 0-3: No clear mechanism, speculative
14
+ - 4-6: Some mechanistic insight, but gaps exist
15
+ - 7-10: Clear, well-supported mechanism of action
16
+
17
+ 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
18
+ - 0-3: No clinical data, only theoretical
19
+ - 4-6: Preclinical or early clinical data
20
+ - 7-10: Strong clinical evidence (trials, meta-analyses)
21
+
22
+ 3. **Sufficiency**: Evidence is sufficient when:
23
+ - Combined scores >= 12 AND
24
+ - At least one specific drug candidate identified AND
25
+ - Clear mechanistic rationale exists
26
+
27
+ ## Output Rules
28
+
29
+ - Always output valid JSON matching the schema
30
+ - Be conservative: only recommend "synthesize" when truly confident
31
+ - If continuing, suggest specific, actionable search queries
32
+ - Never hallucinate drug names or findings not in the evidence
33
+ """
34
+
35
+
36
+ def format_user_prompt(question: str, evidence: list[Evidence]) -> str:
37
+ """
38
+ Format the user prompt with question and evidence.
39
+
40
+ Args:
41
+ question: The user's research question
42
+ evidence: List of Evidence objects from search
43
+
44
+ Returns:
45
+ Formatted prompt string
46
+ """
47
+ max_content_len = 1500
48
+ evidence_text = "\n\n".join(
49
+ [
50
+ f"### Evidence {i + 1}\n"
51
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
52
+ f"**URL**: {e.citation.url}\n"
53
+ f"**Date**: {e.citation.date}\n"
54
+ f"**Content**:\n{e.content[:max_content_len]}..."
55
+ if len(e.content) > max_content_len
56
+ else f"### Evidence {i + 1}\n"
57
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
58
+ f"**URL**: {e.citation.url}\n"
59
+ f"**Date**: {e.citation.date}\n"
60
+ f"**Content**:\n{e.content}"
61
+ for i, e in enumerate(evidence)
62
+ ]
63
+ )
64
+
65
+ return f"""## Research Question
66
+ {question}
67
+
68
+ ## Available Evidence ({len(evidence)} sources)
69
+
70
+ {evidence_text}
71
+
72
+ ## Your Task
73
+
74
+ Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
75
+ Respond with a JSON object matching the JudgeAssessment schema.
76
+ """
77
+
78
+
79
+ def format_empty_evidence_prompt(question: str) -> str:
80
+ """
81
+ Format prompt when no evidence was found.
82
+
83
+ Args:
84
+ question: The user's research question
85
+
86
+ Returns:
87
+ Formatted prompt string
88
+ """
89
+ return f"""## Research Question
90
+ {question}
91
+
92
+ ## Available Evidence
93
+
94
+ No evidence was found from the search.
95
+
96
+ ## Your Task
97
+
98
+ Since no evidence was found, recommend search queries that might yield better results.
99
+ Set sufficient=False and recommendation=\"continue\".
100
+ Suggest 3-5 specific search queries.
101
+ """
src/utils/models.py CHANGED
@@ -43,3 +43,50 @@ class SearchResult(BaseModel):
43
  sources_searched: list[Literal["pubmed", "web"]]
44
  total_found: int
45
  errors: list[str] = Field(default_factory=list)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  sources_searched: list[Literal["pubmed", "web"]]
44
  total_found: int
45
  errors: list[str] = Field(default_factory=list)
46
+
47
+
48
+ class AssessmentDetails(BaseModel):
49
+ """Detailed assessment of evidence quality."""
50
+
51
+ mechanism_score: int = Field(
52
+ ...,
53
+ ge=0,
54
+ le=10,
55
+ description="How well does the evidence explain the mechanism? 0-10",
56
+ )
57
+ mechanism_reasoning: str = Field(
58
+ ..., min_length=10, description="Explanation of mechanism score"
59
+ )
60
+ clinical_evidence_score: int = Field(
61
+ ...,
62
+ ge=0,
63
+ le=10,
64
+ description="Strength of clinical/preclinical evidence. 0-10",
65
+ )
66
+ clinical_reasoning: str = Field(
67
+ ..., min_length=10, description="Explanation of clinical evidence score"
68
+ )
69
+ drug_candidates: list[str] = Field(
70
+ default_factory=list, description="List of specific drug candidates mentioned"
71
+ )
72
+ key_findings: list[str] = Field(
73
+ default_factory=list, description="Key findings from the evidence"
74
+ )
75
+
76
+
77
+ class JudgeAssessment(BaseModel):
78
+ """Complete assessment from the Judge."""
79
+
80
+ details: AssessmentDetails
81
+ sufficient: bool = Field(..., description="Is evidence sufficient to provide a recommendation?")
82
+ confidence: float = Field(..., ge=0.0, le=1.0, description="Confidence in the assessment (0-1)")
83
+ recommendation: Literal["continue", "synthesize"] = Field(
84
+ ...,
85
+ description="continue = need more evidence, synthesize = ready to answer",
86
+ )
87
+ next_search_queries: list[str] = Field(
88
+ default_factory=list, description="If continue, what queries to search next"
89
+ )
90
+ reasoning: str = Field(
91
+ ..., min_length=20, description="Overall reasoning for the recommendation"
92
+ )
tests/unit/agent_factory/test_judges.py ADDED
@@ -0,0 +1,211 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Unit tests for JudgeHandler."""
2
+
3
+ from unittest.mock import AsyncMock, MagicMock, patch
4
+
5
+ import pytest
6
+
7
+ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
8
+ from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
+
10
+
11
+ class TestJudgeHandler:
12
+ """Tests for JudgeHandler."""
13
+
14
+ @pytest.mark.asyncio
15
+ async def test_assess_returns_assessment(self):
16
+ """JudgeHandler should return JudgeAssessment from LLM."""
17
+ # Create mock assessment
18
+ expected_confidence = 0.85
19
+ mock_assessment = JudgeAssessment(
20
+ details=AssessmentDetails(
21
+ mechanism_score=8,
22
+ mechanism_reasoning="Strong mechanistic evidence",
23
+ clinical_evidence_score=7,
24
+ clinical_reasoning="Good clinical support",
25
+ drug_candidates=["Metformin"],
26
+ key_findings=["Neuroprotective effects"],
27
+ ),
28
+ sufficient=True,
29
+ confidence=expected_confidence,
30
+ recommendation="synthesize",
31
+ next_search_queries=[],
32
+ reasoning="Evidence is sufficient for synthesis",
33
+ )
34
+
35
+ # Mock the PydanticAI agent
36
+ mock_result = MagicMock()
37
+ mock_result.data = mock_assessment
38
+
39
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
40
+ mock_agent = AsyncMock()
41
+ mock_agent.run = AsyncMock(return_value=mock_result)
42
+ mock_agent_class.return_value = mock_agent
43
+
44
+ handler = JudgeHandler()
45
+ # Replace the agent with our mock
46
+ handler.agent = mock_agent
47
+
48
+ evidence = [
49
+ Evidence(
50
+ content="Metformin shows neuroprotective properties...",
51
+ citation=Citation(
52
+ source="pubmed",
53
+ title="Metformin in AD",
54
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
55
+ date="2024-01-01",
56
+ ),
57
+ )
58
+ ]
59
+
60
+ result = await handler.assess("metformin alzheimer", evidence)
61
+
62
+ assert result.sufficient is True
63
+ assert result.recommendation == "synthesize"
64
+ assert result.confidence == expected_confidence
65
+ assert "Metformin" in result.details.drug_candidates
66
+
67
+ @pytest.mark.asyncio
68
+ async def test_assess_empty_evidence(self):
69
+ """JudgeHandler should handle empty evidence gracefully."""
70
+ mock_assessment = JudgeAssessment(
71
+ details=AssessmentDetails(
72
+ mechanism_score=0,
73
+ mechanism_reasoning="No evidence to assess",
74
+ clinical_evidence_score=0,
75
+ clinical_reasoning="No evidence to assess",
76
+ drug_candidates=[],
77
+ key_findings=[],
78
+ ),
79
+ sufficient=False,
80
+ confidence=0.0,
81
+ recommendation="continue",
82
+ next_search_queries=["metformin alzheimer mechanism"],
83
+ reasoning="No evidence found, need to search more",
84
+ )
85
+
86
+ mock_result = MagicMock()
87
+ mock_result.data = mock_assessment
88
+
89
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
90
+ mock_agent = AsyncMock()
91
+ mock_agent.run = AsyncMock(return_value=mock_result)
92
+ mock_agent_class.return_value = mock_agent
93
+
94
+ handler = JudgeHandler()
95
+ handler.agent = mock_agent
96
+
97
+ result = await handler.assess("metformin alzheimer", [])
98
+
99
+ assert result.sufficient is False
100
+ assert result.recommendation == "continue"
101
+ assert len(result.next_search_queries) > 0
102
+
103
+ @pytest.mark.asyncio
104
+ async def test_assess_handles_llm_failure(self):
105
+ """JudgeHandler should return fallback on LLM failure."""
106
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
107
+ mock_agent = AsyncMock()
108
+ mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
109
+ mock_agent_class.return_value = mock_agent
110
+
111
+ handler = JudgeHandler()
112
+ handler.agent = mock_agent
113
+
114
+ evidence = [
115
+ Evidence(
116
+ content="Some content",
117
+ citation=Citation(
118
+ source="pubmed",
119
+ title="Title",
120
+ url="url",
121
+ date="2024",
122
+ ),
123
+ )
124
+ ]
125
+
126
+ result = await handler.assess("test question", evidence)
127
+
128
+ # Should return fallback, not raise
129
+ assert result.sufficient is False
130
+ assert result.recommendation == "continue"
131
+ assert "failed" in result.reasoning.lower()
132
+
133
+
134
+ class TestMockJudgeHandler:
135
+ """Tests for MockJudgeHandler."""
136
+
137
+ @pytest.mark.asyncio
138
+ async def test_mock_handler_returns_default(self):
139
+ """MockJudgeHandler should return default assessment."""
140
+ handler = MockJudgeHandler()
141
+
142
+ evidence = [
143
+ Evidence(
144
+ content="Content 1",
145
+ citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
146
+ ),
147
+ Evidence(
148
+ content="Content 2",
149
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
150
+ ),
151
+ ]
152
+
153
+ result = await handler.assess("test", evidence)
154
+
155
+ expected_mech_score = 7
156
+ expected_evidence_len = 2
157
+
158
+ assert handler.call_count == 1
159
+ assert handler.last_question == "test"
160
+ assert handler.last_evidence is not None
161
+ assert len(handler.last_evidence) == expected_evidence_len
162
+ assert result.details.mechanism_score == expected_mech_score
163
+
164
+ @pytest.mark.asyncio
165
+ async def test_mock_handler_custom_response(self):
166
+ """MockJudgeHandler should return custom response when provided."""
167
+ expected_score = 10
168
+ custom_assessment = JudgeAssessment(
169
+ details=AssessmentDetails(
170
+ mechanism_score=expected_score,
171
+ mechanism_reasoning="Custom reasoning",
172
+ clinical_evidence_score=expected_score,
173
+ clinical_reasoning="Custom clinical",
174
+ drug_candidates=["CustomDrug"],
175
+ key_findings=["Custom finding"],
176
+ ),
177
+ sufficient=True,
178
+ confidence=1.0,
179
+ recommendation="synthesize",
180
+ next_search_queries=[],
181
+ reasoning="Custom assessment logic for testing purposes must be at least 20 chars long",
182
+ )
183
+
184
+ handler = MockJudgeHandler(mock_response=custom_assessment)
185
+ result = await handler.assess("test", [])
186
+
187
+ assert result.details.mechanism_score == expected_score
188
+ assert result.details.drug_candidates == ["CustomDrug"]
189
+
190
+ @pytest.mark.asyncio
191
+ async def test_mock_handler_insufficient_with_few_evidence(self):
192
+ """MockJudgeHandler should recommend continue with < 3 evidence."""
193
+ handler = MockJudgeHandler()
194
+
195
+ # Only 2 pieces of evidence
196
+ evidence = [
197
+ Evidence(
198
+ content="Content",
199
+ citation=Citation(source="pubmed", title="T", url="u", date="2024"),
200
+ ),
201
+ Evidence(
202
+ content="Content 2",
203
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
204
+ ),
205
+ ]
206
+
207
+ result = await handler.assess("test", evidence)
208
+
209
+ assert result.sufficient is False
210
+ assert result.recommendation == "continue"
211
+ assert len(result.next_search_queries) > 0
uv.lock CHANGED
@@ -657,10 +657,12 @@ name = "deepcritical"
657
  version = "0.1.0"
658
  source = { editable = "." }
659
  dependencies = [
 
660
  { name = "beautifulsoup4" },
661
  { name = "duckduckgo-search" },
662
  { name = "gradio" },
663
  { name = "httpx" },
 
664
  { name = "pydantic" },
665
  { name = "pydantic-ai" },
666
  { name = "pydantic-settings" },
@@ -685,11 +687,13 @@ dev = [
685
 
686
  [package.metadata]
687
  requires-dist = [
 
688
  { name = "beautifulsoup4", specifier = ">=4.12" },
689
  { name = "duckduckgo-search", specifier = ">=6.0" },
690
  { name = "gradio", specifier = ">=5.0" },
691
  { name = "httpx", specifier = ">=0.27" },
692
  { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.10" },
 
693
  { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7" },
694
  { name = "pydantic", specifier = ">=2.7" },
695
  { name = "pydantic-ai", specifier = ">=0.0.16" },
 
657
  version = "0.1.0"
658
  source = { editable = "." }
659
  dependencies = [
660
+ { name = "anthropic" },
661
  { name = "beautifulsoup4" },
662
  { name = "duckduckgo-search" },
663
  { name = "gradio" },
664
  { name = "httpx" },
665
+ { name = "openai" },
666
  { name = "pydantic" },
667
  { name = "pydantic-ai" },
668
  { name = "pydantic-settings" },
 
687
 
688
  [package.metadata]
689
  requires-dist = [
690
+ { name = "anthropic", specifier = ">=0.18.0" },
691
  { name = "beautifulsoup4", specifier = ">=4.12" },
692
  { name = "duckduckgo-search", specifier = ">=6.0" },
693
  { name = "gradio", specifier = ">=5.0" },
694
  { name = "httpx", specifier = ">=0.27" },
695
  { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.10" },
696
+ { name = "openai", specifier = ">=1.0.0" },
697
  { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7" },
698
  { name = "pydantic", specifier = ">=2.7" },
699
  { name = "pydantic-ai", specifier = ">=0.0.16" },