Spaces:

DataQuests
/

DeepCritical

Running

VibecoderMcSwaggins commited on 13 days ago

Commit

d7e5abb

1 Parent(s): 32e3b61

feat(phase3): implement judge slice (LLM Judge, Prompts, Models)

Implemented LLM-based Judge Agent with PydanticAI for structured output. Includes 100% test coverage and fallback mechanisms.

Files changed (8) hide show

docs/implementation/05_phase_magentic.md +582 -0
docs/implementation/roadmap.md +15 -0
pyproject.toml +4 -0
src/agent_factory/judges.py +185 -0
src/prompts/judge.py +101 -0
src/utils/models.py +47 -0
tests/unit/agent_factory/test_judges.py +211 -0
uv.lock +4 -0

docs/implementation/05_phase_magentic.md ADDED Viewed

	@@ -0,0 +1,582 @@

+# Phase 5 Implementation Spec: Magentic Integration (Optional)
+**Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
+**Philosophy**: "Same API, Better Engine."
+**Prerequisite**: Phase 4 complete (MVP working end-to-end)
+---
+## 1. Why Magentic?
+Magentic-One provides:
+- **LLM-powered manager** that dynamically plans, selects agents, tracks progress
+- **Built-in stall detection** and automatic replanning
+- **Checkpointing** for pause/resume workflows
+- **Event streaming** for real-time UI updates
+- **Multi-agent coordination** with round limits and reset logic
+This is **NOT required for MVP**. Only implement if time permits after Phase 4.
+---
+## 2. Architecture Alignment
+### Current Phase 4 Architecture
+```
+User Query
+    ↓
+Orchestrator (while loop)
+    ├── SearchHandler.execute() → Evidence
+    ├── JudgeHandler.assess() → JudgeAssessment
+    └── Loop/Synthesize decision
+    ↓
+Research Report
+```
+### Phase 5 Magentic Architecture
+```
+User Query
+    ↓
+MagenticBuilder
+    ├── SearchAgent (wraps SearchHandler)
+    ├── JudgeAgent (wraps JudgeHandler)
+    └── StandardMagenticManager (LLM coordinator)
+    ↓
+Research Report (same output format)
+```
+**Key Insight**: We wrap existing handlers as `AgentProtocol` implementations. The domain logic stays the same.
+---
+## 3. Design for Seamless Integration
+### 3.1 Protocol-Based Design (Phase 4 prep)
+In Phase 4, define handlers using Protocols so they can be wrapped later:
+```python
+# src/orchestrator.py (Phase 4)
+from typing import Protocol, List
+from src.utils.models import Evidence, SearchResult, JudgeAssessment
+class SearchHandlerProtocol(Protocol):
+    """Protocol for search handler - can be wrapped as Agent later."""
+    async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
+        ...
+class JudgeHandlerProtocol(Protocol):
+    """Protocol for judge handler - can be wrapped as Agent later."""
+    async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment:
+        ...
+class OrchestratorProtocol(Protocol):
+    """Protocol for orchestrator - allows swapping implementations."""
+    async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
+        ...
+```
+### 3.2 Facade Pattern
+The `Orchestrator` class is a facade. In Phase 5, we create `MagenticOrchestrator` with the same interface:
+```python
+# Phase 4: Simple orchestrator
+orchestrator = Orchestrator(search_handler, judge_handler)
+# Phase 5: Magentic orchestrator (SAME API)
+orchestrator = MagenticOrchestrator(search_handler, judge_handler)
+# Usage is identical
+async for event in orchestrator.run("metformin alzheimer"):
+    print(event.to_markdown())
+```
+---
+## 4. Phase 5 Implementation
+### 4.1 Install Agent Framework
+Add to `pyproject.toml`:
+```toml
+[project.optional-dependencies]
+magentic = [
+    "agent-framework-core>=0.1.0",
+]
+```
+### 4.2 Agent Wrappers (`src/agents/search_agent.py`)
+Wrap `SearchHandler` as an `AgentProtocol`:
+```python
+"""Search agent wrapper for Magentic integration."""
+from typing import Any
+from agent_framework import AgentProtocol, AgentRunResponse, ChatMessage, Role
+from src.tools.search_handler import SearchHandler
+from src.utils.models import SearchResult
+class SearchAgent:
+    """Wraps SearchHandler as an AgentProtocol for Magentic."""
+    def __init__(self, search_handler: SearchHandler):
+        self._handler = search_handler
+        self._id = "search-agent"
+        self._name = "SearchAgent"
+    @property
+    def id(self) -> str:
+        return self._id
+    @property
+    def name(self) -> str | None:
+        return self._name
+    @property
+    def display_name(self) -> str:
+        return self._name
+    @property
+    def description(self) -> str | None:
+        return "Searches PubMed and web for drug repurposing evidence"
+    async def run(
+        self,
+        messages: list[ChatMessage] | None = None,
+        *,
+        thread: Any = None,
+        **kwargs: Any,
+    ) -> AgentRunResponse:
+        """Execute search based on the last user message."""
+        # Extract query from messages
+        query = ""
+        if messages:
+            for msg in reversed(messages):
+                if msg.role == Role.USER and msg.text:
+                    query = msg.text
+                    break
+        if not query:
+            return AgentRunResponse(
+                messages=[ChatMessage(role=Role.ASSISTANT, text="No query provided")],
+                response_id="search-no-query",
+            )
+        # Execute search
+        result: SearchResult = await self._handler.execute(query, max_results_per_tool=10)
+        # Format response
+        evidence_text = "\n".join([
+            f"- [{e.citation.title}]({e.citation.url}): {e.content[:200]}..."
+            for e in result.evidence[:5]
+        ])
+        response_text = f"Found {result.total_found} sources:\n\n{evidence_text}"
+        return AgentRunResponse(
+            messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
+            response_id=f"search-{result.total_found}",
+            metadata={"evidence": [e.model_dump() for e in result.evidence]},
+        )
+    def run_stream(self, messages=None, *, thread=None, **kwargs):
+        """Streaming not implemented for search."""
+        async def _stream():
+            result = await self.run(messages, thread=thread, **kwargs)
+            from agent_framework import AgentRunResponseUpdate
+            yield AgentRunResponseUpdate(messages=result.messages)
+        return _stream()
+```
+### 4.3 Judge Agent Wrapper (`src/agents/judge_agent.py`)
+```python
+"""Judge agent wrapper for Magentic integration."""
+from typing import Any, List
+from agent_framework import AgentProtocol, AgentRunResponse, ChatMessage, Role
+from src.agent_factory.judges import JudgeHandler
+from src.utils.models import Evidence, JudgeAssessment
+class JudgeAgent:
+    """Wraps JudgeHandler as an AgentProtocol for Magentic."""
+    def __init__(self, judge_handler: JudgeHandler, evidence_store: dict[str, List[Evidence]]):
+        self._handler = judge_handler
+        self._evidence_store = evidence_store  # Shared state for evidence
+        self._id = "judge-agent"
+        self._name = "JudgeAgent"
+    @property
+    def id(self) -> str:
+        return self._id
+    @property
+    def name(self) -> str | None:
+        return self._name
+    @property
+    def display_name(self) -> str:
+        return self._name
+    @property
+    def description(self) -> str | None:
+        return "Evaluates evidence quality and determines if sufficient for synthesis"
+    async def run(
+        self,
+        messages: list[ChatMessage] | None = None,
+        *,
+        thread: Any = None,
+        **kwargs: Any,
+    ) -> AgentRunResponse:
+        """Assess evidence quality."""
+        # Extract original question from messages
+        question = ""
+        if messages:
+            for msg in messages:
+                if msg.role == Role.USER and msg.text:
+                    question = msg.text
+                    break
+        # Get evidence from shared store
+        evidence = self._evidence_store.get("current", [])
+        # Assess
+        assessment: JudgeAssessment = await self._handler.assess(question, evidence)
+        # Format response
+        response_text = f"""## Assessment
+**Sufficient**: {assessment.sufficient}
+**Confidence**: {assessment.confidence:.0%}
+**Recommendation**: {assessment.recommendation}
+### Scores
+- Mechanism: {assessment.details.mechanism_score}/10
+- Clinical: {assessment.details.clinical_evidence_score}/10
+### Reasoning
+{assessment.reasoning}
+"""
+        if assessment.next_search_queries:
+            response_text += f"\n### Next Queries\n" + "\n".join(
+                f"- {q}" for q in assessment.next_search_queries
+            )
+        return AgentRunResponse(
+            messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
+            response_id=f"judge-{assessment.recommendation}",
+            metadata={"assessment": assessment.model_dump()},
+        )
+    def run_stream(self, messages=None, *, thread=None, **kwargs):
+        """Streaming not implemented for judge."""
+        async def _stream():
+            result = await self.run(messages, thread=thread, **kwargs)
+            from agent_framework import AgentRunResponseUpdate
+            yield AgentRunResponseUpdate(messages=result.messages)
+        return _stream()
+```
+### 4.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
+```python
+"""Magentic-based orchestrator for DeepCritical."""
+from typing import AsyncGenerator, List
+import structlog
+from agent_framework import (
+    MagenticBuilder,
+    MagenticFinalResultEvent,
+    MagenticAgentMessageEvent,
+    MagenticOrchestratorMessageEvent,
+    WorkflowOutputEvent,
+)
+from agent_framework.openai import OpenAIChatClient
+from src.agents.search_agent import SearchAgent
+from src.agents.judge_agent import JudgeAgent
+from src.tools.search_handler import SearchHandler
+from src.agent_factory.judges import JudgeHandler
+from src.utils.models import AgentEvent, Evidence
+logger = structlog.get_logger()
+class MagenticOrchestrator:
+    """
+    Magentic-based orchestrator - same API as Orchestrator.
+    Uses Microsoft Agent Framework's MagenticBuilder for multi-agent coordination.
+    """
+    def __init__(
+        self,
+        search_handler: SearchHandler,
+        judge_handler: JudgeHandler,
+        max_rounds: int = 10,
+    ):
+        self._search_handler = search_handler
+        self._judge_handler = judge_handler
+        self._max_rounds = max_rounds
+        self._evidence_store: dict[str, List[Evidence]] = {"current": []}
+    async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
+        """
+        Run the Magentic workflow - same API as simple Orchestrator.
+        Yields AgentEvent objects for real-time UI updates.
+        """
+        logger.info("Starting Magentic orchestrator", query=query)
+        yield AgentEvent(
+            type="started",
+            message=f"Starting research (Magentic mode): {query}",
+            iteration=0,
+        )
+        # Create agent wrappers
+        search_agent = SearchAgent(self._search_handler)
+        judge_agent = JudgeAgent(self._judge_handler, self._evidence_store)
+        # Build Magentic workflow
+        workflow = (
+            MagenticBuilder()
+            .participants(
+                searcher=search_agent,
+                judge=judge_agent,
+            )
+            .with_standard_manager(
+                chat_client=OpenAIChatClient(),
+                max_round_count=self._max_rounds,
+                max_stall_count=3,
+                max_reset_count=2,
+            )
+            .build()
+        )
+        # Task instruction for the manager
+        task = f"""Research drug repurposing opportunities for: {query}
+Instructions:
+1. Use SearcherAgent to find evidence from PubMed and web
+2. Use JudgeAgent to evaluate if evidence is sufficient
+3. If JudgeAgent says "continue", search with refined queries
+4. If JudgeAgent says "synthesize", provide final synthesis
+5. Stop when synthesis is ready or max rounds reached
+Focus on finding:
+- Mechanism of action evidence
+- Clinical/preclinical studies
+- Specific drug candidates
+"""
+        iteration = 0
+        try:
+            async for event in workflow.run_stream(task):
+                if isinstance(event, MagenticOrchestratorMessageEvent):
+                    yield AgentEvent(
+                        type="judging",
+                        message=f"Manager: {event.kind}",
+                        iteration=iteration,
+                    )
+                elif isinstance(event, MagenticAgentMessageEvent):
+                    iteration += 1
+                    agent_name = event.agent_id or "unknown"
+                    if "search" in agent_name.lower():
+                        yield AgentEvent(
+                            type="search_complete",
+                            message=f"Search agent responded",
+                            iteration=iteration,
+                        )
+                    elif "judge" in agent_name.lower():
+                        yield AgentEvent(
+                            type="judge_complete",
+                            message=f"Judge agent evaluated evidence",
+                            iteration=iteration,
+                        )
+                elif isinstance(event, MagenticFinalResultEvent):
+                    final_text = event.message.text if event.message else "No result"
+                    yield AgentEvent(
+                        type="complete",
+                        message=final_text,
+                        data={"iterations": iteration},
+                        iteration=iteration,
+                    )
+                elif isinstance(event, WorkflowOutputEvent):
+                    if event.data:
+                        yield AgentEvent(
+                            type="complete",
+                            message=str(event.data),
+                            iteration=iteration,
+                        )
+        except Exception as e:
+            logger.error("Magentic workflow failed", error=str(e))
+            yield AgentEvent(
+                type="error",
+                message=f"Workflow error: {str(e)}",
+                iteration=iteration,
+            )
+```
+### 4.5 Factory Pattern (`src/orchestrator_factory.py`)
+Allow switching between implementations:
+```python
+"""Factory for creating orchestrators."""
+from typing import Literal
+from src.orchestrator import Orchestrator
+from src.tools.search_handler import SearchHandler
+from src.agent_factory.judges import JudgeHandler
+from src.utils.models import OrchestratorConfig
+def create_orchestrator(
+    search_handler: SearchHandler,
+    judge_handler: JudgeHandler,
+    config: OrchestratorConfig | None = None,
+    mode: Literal["simple", "magentic"] = "simple",
+):
+    """
+    Create an orchestrator instance.
+    Args:
+        search_handler: The search handler
+        judge_handler: The judge handler
+        config: Optional configuration
+        mode: "simple" for Phase 4 loop, "magentic" for Phase 5 multi-agent
+    Returns:
+        Orchestrator instance (same interface regardless of mode)
+    """
+    if mode == "magentic":
+        try:
+            from src.orchestrator_magentic import MagenticOrchestrator
+            return MagenticOrchestrator(
+                search_handler=search_handler,
+                judge_handler=judge_handler,
+                max_rounds=config.max_iterations if config else 10,
+            )
+        except ImportError:
+            # Fallback to simple if agent-framework not installed
+            pass
+    return Orchestrator(
+        search_handler=search_handler,
+        judge_handler=judge_handler,
+        config=config,
+    )
+```
+---
+## 5. Directory Structure After Phase 5
+```
+src/
+├── app.py                      # Gradio UI (unchanged)
+├── orchestrator.py             # Phase 4 simple orchestrator
+├── orchestrator_magentic.py    # Phase 5 Magentic orchestrator
+├── orchestrator_factory.py     # Factory to switch implementations
+├── agents/                     # NEW: Agent wrappers
+│   ├── __init__.py
+│   ├── search_agent.py         # SearchHandler as AgentProtocol
+│   └── judge_agent.py          # JudgeHandler as AgentProtocol
+├── agent_factory/
+│   └── judges.py               # JudgeHandler (unchanged)
+├── tools/
+│   ├── pubmed.py               # PubMed tool (unchanged)
+│   ├── websearch.py            # Web tool (unchanged)
+│   └── search_handler.py       # SearchHandler (unchanged)
+└── utils/
+    └── models.py               # Models (unchanged)
+```
+---
+## 6. Implementation Checklist
+- [ ] Ensure Phase 4 uses Protocol-based handler interfaces
+- [ ] Add `agent-framework-core` to optional dependencies
+- [ ] Create `src/agents/` directory
+- [ ] Implement `SearchAgent` wrapper
+- [ ] Implement `JudgeAgent` wrapper
+- [ ] Implement `MagenticOrchestrator`
+- [ ] Implement `orchestrator_factory.py`
+- [ ] Add tests for agent wrappers
+- [ ] Test Magentic flow end-to-end
+- [ ] Update `src/app.py` to use factory with mode toggle
+---
+## 7. Definition of Done
+Phase 5 is **COMPLETE** when:
+1. All Phase 4 tests still pass (no regression)
+2. `MagenticOrchestrator` has same API as `Orchestrator`
+3. Can switch between modes via factory:
+```python
+# Simple mode (Phase 4)
+orchestrator = create_orchestrator(search, judge, mode="simple")
+# Magentic mode (Phase 5)
+orchestrator = create_orchestrator(search, judge, mode="magentic")
+# Same usage!
+async for event in orchestrator.run("metformin alzheimer"):
+    print(event.to_markdown())
+```
+4. UI works with both modes
+5. Graceful fallback if agent-framework not installed
+---
+## 8. When to Implement
+**Priority**: LOW (optional enhancement)
+Implement ONLY after:
+1. ✅ Phase 1: Foundation
+2. ✅ Phase 2: Search
+3. ✅ Phase 3: Judge
+4. ✅ Phase 4: Orchestrator + UI (MVP SHIPPED)
+If hackathon deadline is approaching, **SKIP Phase 5**. Ship the MVP.
+---
+## 9. Benefits of This Design
+1. **No breaking changes** - Phase 4 code works unchanged
+2. **Same API** - `run()` returns `AsyncGenerator[AgentEvent, None]`
+3. **Gradual adoption** - Optional dependency, factory fallback
+4. **Testable** - Each component can be tested independently
+5. **Aligns with Tonic's vision** - Uses Microsoft Agent Framework patterns
+---
+## 10. Reference
+- Microsoft Agent Framework: `reference_repos/agent-framework/`
+- Magentic samples: `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/magentic.py`
+- AgentProtocol: `reference_repos/agent-framework/python/packages/core/agent_framework/_agents.py`

docs/implementation/roadmap.md CHANGED Viewed

@@ -115,11 +115,26 @@ tests/
 ---
 ## Spec Documents
 1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)**
 2. **[Phase 2 Spec: Search Slice](02_phase_search.md)**
 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)**
 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)**
 *Start by reading Phase 1 Spec to initialize the repo.*

 ---
+### **Phase 5: Magentic Integration (OPTIONAL - Post-MVP)**
+*Goal: Upgrade orchestrator to use Microsoft Agent Framework patterns.*
+- [ ] Wrap SearchHandler as `AgentProtocol` (SearchAgent)
+- [ ] Wrap JudgeHandler as `AgentProtocol` (JudgeAgent)
+- [ ] Implement `MagenticOrchestrator` using `MagenticBuilder`
+- [ ] Create factory pattern for switching implementations
+- **Deliverable**: Same API, better multi-agent orchestration engine.
+**NOTE**: Only implement Phase 5 if time permits after MVP is shipped.
+---
 ## Spec Documents
 1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)**
 2. **[Phase 2 Spec: Search Slice](02_phase_search.md)**
 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)**
 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)**
+5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** *(Optional)*
 *Start by reading Phase 1 Spec to initialize the repo.*

pyproject.toml CHANGED Viewed

@@ -10,6 +10,10 @@ dependencies = [
     "pydantic-settings>=2.2",      # For BaseSettings (config)
     "pydantic-ai>=0.0.16",          # Agent framework
     # HTTP & Parsing
     "httpx>=0.27",                   # Async HTTP client
     "beautifulsoup4>=4.12",          # HTML parsing

     "pydantic-settings>=2.2",      # For BaseSettings (config)
     "pydantic-ai>=0.0.16",          # Agent framework
+    # AI Providers
+    "openai>=1.0.0",
+    "anthropic>=0.18.0",
     # HTTP & Parsing
     "httpx>=0.27",                   # Async HTTP client
     "beautifulsoup4>=4.12",          # HTML parsing

src/agent_factory/judges.py CHANGED Viewed

	@@ -0,0 +1,185 @@

+"""Judge handler for evidence assessment using PydanticAI."""
+from typing import Any, cast
+import structlog
+from pydantic_ai import Agent
+from pydantic_ai.models.anthropic import AnthropicModel
+from pydantic_ai.models.openai import OpenAIModel
+from src.prompts.judge import (
+    SYSTEM_PROMPT,
+    format_empty_evidence_prompt,
+    format_user_prompt,
+)
+from src.utils.config import settings
+from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
+logger = structlog.get_logger()
+def get_model() -> Any:
+    """Get the LLM model based on configuration."""
+    provider = settings.llm_provider
+    if provider == "anthropic":
+        return AnthropicModel(settings.anthropic_model)
+    return OpenAIModel(settings.openai_model)
+class JudgeHandler:
+    """
+    Handles evidence assessment using an LLM with structured output.
+    Uses PydanticAI to ensure responses match the JudgeAssessment schema.
+    """
+    def __init__(self, model: Any = None) -> None:
+        """
+        Initialize the JudgeHandler.
+        Args:
+            model: Optional PydanticAI model. If None, uses config default.
+        """
+        self.model = model or get_model()
+        self.agent = Agent(
+            model=self.model,
+            result_type=JudgeAssessment,
+            system_prompt=SYSTEM_PROMPT,
+            retries=3,
+        )
+    async def assess(
+        self,
+        question: str,
+        evidence: list[Evidence],
+    ) -> JudgeAssessment:
+        """
+        Assess evidence and determine if it's sufficient.
+        Args:
+            question: The user's research question
+            evidence: List of Evidence objects from search
+        Returns:
+            JudgeAssessment with evaluation results
+        Raises:
+            JudgeError: If assessment fails after retries
+        """
+        logger.info(
+            "Starting evidence assessment",
+            question=question[:100],
+            evidence_count=len(evidence),
+        )
+        # Format the prompt based on whether we have evidence
+        if evidence:
+            user_prompt = format_user_prompt(question, evidence)
+        else:
+            user_prompt = format_empty_evidence_prompt(question)
+        try:
+            # Run the agent with structured output
+            result = await self.agent.run(user_prompt)
+            assessment = cast(JudgeAssessment, result.data)
+            logger.info(
+                "Assessment complete",
+                sufficient=assessment.sufficient,
+                recommendation=assessment.recommendation,
+                confidence=assessment.confidence,
+            )
+            return assessment
+        except Exception as e:
+            logger.error("Assessment failed", error=str(e))
+            # Return a safe default assessment on failure
+            return self._create_fallback_assessment(question, str(e))
+    def _create_fallback_assessment(
+        self,
+        question: str,
+        error: str,
+    ) -> JudgeAssessment:
+        """
+        Create a fallback assessment when LLM fails.
+        Args:
+            question: The original question
+            error: The error message
+        Returns:
+            Safe fallback JudgeAssessment
+        """
+        return JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=0,
+                mechanism_reasoning="Assessment failed due to LLM error",
+                clinical_evidence_score=0,
+                clinical_reasoning="Assessment failed due to LLM error",
+                drug_candidates=[],
+                key_findings=[],
+            ),
+            sufficient=False,
+            confidence=0.0,
+            recommendation="continue",
+            next_search_queries=[
+                f"{question} mechanism",
+                f"{question} clinical trials",
+                f"{question} drug candidates",
+            ],
+            reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
+        )
+class MockJudgeHandler:
+    """
+    Mock JudgeHandler for testing without LLM calls.
+    Use this in unit tests to avoid API calls.
+    """
+    def __init__(self, mock_response: JudgeAssessment | None = None) -> None:
+        """
+        Initialize with optional mock response.
+        Args:
+            mock_response: The assessment to return. If None, uses default.
+        """
+        self.mock_response = mock_response
+        self.call_count = 0
+        self.last_question: str | None = None
+        self.last_evidence: list[Evidence] | None = None
+    async def assess(
+        self,
+        question: str,
+        evidence: list[Evidence],
+    ) -> JudgeAssessment:
+        """Return the mock response."""
+        self.call_count += 1
+        self.last_question = question
+        self.last_evidence = evidence
+        if self.mock_response:
+            return self.mock_response
+        min_evidence = 3
+        # Default mock response
+        return JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=7,
+                mechanism_reasoning="Mock assessment - good mechanism evidence",
+                clinical_evidence_score=6,
+                clinical_reasoning="Mock assessment - moderate clinical evidence",
+                drug_candidates=["Drug A", "Drug B"],
+                key_findings=["Finding 1", "Finding 2"],
+            ),
+            sufficient=len(evidence) >= min_evidence,
+            confidence=0.75,
+            recommendation="synthesize" if len(evidence) >= min_evidence else "continue",
+            next_search_queries=["query 1", "query 2"] if len(evidence) < min_evidence else [],
+            reasoning="Mock assessment for testing purposes",
+        )

src/prompts/judge.py ADDED Viewed

	@@ -0,0 +1,101 @@

+"""Judge prompts for evidence assessment."""
+from src.utils.models import Evidence
+SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
+Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to
+recommend drug candidates for a given condition.
+## Evaluation Criteria
+1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
+   - 0-3: No clear mechanism, speculative
+   - 4-6: Some mechanistic insight, but gaps exist
+   - 7-10: Clear, well-supported mechanism of action
+2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
+   - 0-3: No clinical data, only theoretical
+   - 4-6: Preclinical or early clinical data
+   - 7-10: Strong clinical evidence (trials, meta-analyses)
+3. **Sufficiency**: Evidence is sufficient when:
+   - Combined scores >= 12 AND
+   - At least one specific drug candidate identified AND
+   - Clear mechanistic rationale exists
+## Output Rules
+- Always output valid JSON matching the schema
+- Be conservative: only recommend "synthesize" when truly confident
+- If continuing, suggest specific, actionable search queries
+- Never hallucinate drug names or findings not in the evidence
+"""
+def format_user_prompt(question: str, evidence: list[Evidence]) -> str:
+    """
+    Format the user prompt with question and evidence.
+    Args:
+        question: The user's research question
+        evidence: List of Evidence objects from search
+    Returns:
+        Formatted prompt string
+    """
+    max_content_len = 1500
+    evidence_text = "\n\n".join(
+        [
+            f"### Evidence {i + 1}\n"
+            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
+            f"**URL**: {e.citation.url}\n"
+            f"**Date**: {e.citation.date}\n"
+            f"**Content**:\n{e.content[:max_content_len]}..."
+            if len(e.content) > max_content_len
+            else f"### Evidence {i + 1}\n"
+            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
+            f"**URL**: {e.citation.url}\n"
+            f"**Date**: {e.citation.date}\n"
+            f"**Content**:\n{e.content}"
+            for i, e in enumerate(evidence)
+        ]
+    )
+    return f"""## Research Question
+{question}
+## Available Evidence ({len(evidence)} sources)
+{evidence_text}
+## Your Task
+Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
+Respond with a JSON object matching the JudgeAssessment schema.
+"""
+def format_empty_evidence_prompt(question: str) -> str:
+    """
+    Format prompt when no evidence was found.
+    Args:
+        question: The user's research question
+    Returns:
+        Formatted prompt string
+    """
+    return f"""## Research Question
+{question}
+## Available Evidence
+No evidence was found from the search.
+## Your Task
+Since no evidence was found, recommend search queries that might yield better results.
+Set sufficient=False and recommendation=\"continue\".
+Suggest 3-5 specific search queries.
+"""

src/utils/models.py CHANGED Viewed

@@ -43,3 +43,50 @@ class SearchResult(BaseModel):
     sources_searched: list[Literal["pubmed", "web"]]
     total_found: int
     errors: list[str] = Field(default_factory=list)

     sources_searched: list[Literal["pubmed", "web"]]
     total_found: int
     errors: list[str] = Field(default_factory=list)
+class AssessmentDetails(BaseModel):
+    """Detailed assessment of evidence quality."""
+    mechanism_score: int = Field(
+        ...,
+        ge=0,
+        le=10,
+        description="How well does the evidence explain the mechanism? 0-10",
+    )
+    mechanism_reasoning: str = Field(
+        ..., min_length=10, description="Explanation of mechanism score"
+    )
+    clinical_evidence_score: int = Field(
+        ...,
+        ge=0,
+        le=10,
+        description="Strength of clinical/preclinical evidence. 0-10",
+    )
+    clinical_reasoning: str = Field(
+        ..., min_length=10, description="Explanation of clinical evidence score"
+    )
+    drug_candidates: list[str] = Field(
+        default_factory=list, description="List of specific drug candidates mentioned"
+    )
+    key_findings: list[str] = Field(
+        default_factory=list, description="Key findings from the evidence"
+    )
+class JudgeAssessment(BaseModel):
+    """Complete assessment from the Judge."""
+    details: AssessmentDetails
+    sufficient: bool = Field(..., description="Is evidence sufficient to provide a recommendation?")
+    confidence: float = Field(..., ge=0.0, le=1.0, description="Confidence in the assessment (0-1)")
+    recommendation: Literal["continue", "synthesize"] = Field(
+        ...,
+        description="continue = need more evidence, synthesize = ready to answer",
+    )
+    next_search_queries: list[str] = Field(
+        default_factory=list, description="If continue, what queries to search next"
+    )
+    reasoning: str = Field(
+        ..., min_length=20, description="Overall reasoning for the recommendation"
+    )

tests/unit/agent_factory/test_judges.py ADDED Viewed

	@@ -0,0 +1,211 @@

+"""Unit tests for JudgeHandler."""
+from unittest.mock import AsyncMock, MagicMock, patch
+import pytest
+from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
+from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
+class TestJudgeHandler:
+    """Tests for JudgeHandler."""
+    @pytest.mark.asyncio
+    async def test_assess_returns_assessment(self):
+        """JudgeHandler should return JudgeAssessment from LLM."""
+        # Create mock assessment
+        expected_confidence = 0.85
+        mock_assessment = JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=8,
+                mechanism_reasoning="Strong mechanistic evidence",
+                clinical_evidence_score=7,
+                clinical_reasoning="Good clinical support",
+                drug_candidates=["Metformin"],
+                key_findings=["Neuroprotective effects"],
+            ),
+            sufficient=True,
+            confidence=expected_confidence,
+            recommendation="synthesize",
+            next_search_queries=[],
+            reasoning="Evidence is sufficient for synthesis",
+        )
+        # Mock the PydanticAI agent
+        mock_result = MagicMock()
+        mock_result.data = mock_assessment
+        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
+            mock_agent = AsyncMock()
+            mock_agent.run = AsyncMock(return_value=mock_result)
+            mock_agent_class.return_value = mock_agent
+            handler = JudgeHandler()
+            # Replace the agent with our mock
+            handler.agent = mock_agent
+            evidence = [
+                Evidence(
+                    content="Metformin shows neuroprotective properties...",
+                    citation=Citation(
+                        source="pubmed",
+                        title="Metformin in AD",
+                        url="https://pubmed.ncbi.nlm.nih.gov/12345/",
+                        date="2024-01-01",
+                    ),
+                )
+            ]
+            result = await handler.assess("metformin alzheimer", evidence)
+            assert result.sufficient is True
+            assert result.recommendation == "synthesize"
+            assert result.confidence == expected_confidence
+            assert "Metformin" in result.details.drug_candidates
+    @pytest.mark.asyncio
+    async def test_assess_empty_evidence(self):
+        """JudgeHandler should handle empty evidence gracefully."""
+        mock_assessment = JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=0,
+                mechanism_reasoning="No evidence to assess",
+                clinical_evidence_score=0,
+                clinical_reasoning="No evidence to assess",
+                drug_candidates=[],
+                key_findings=[],
+            ),
+            sufficient=False,
+            confidence=0.0,
+            recommendation="continue",
+            next_search_queries=["metformin alzheimer mechanism"],
+            reasoning="No evidence found, need to search more",
+        )
+        mock_result = MagicMock()
+        mock_result.data = mock_assessment
+        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
+            mock_agent = AsyncMock()
+            mock_agent.run = AsyncMock(return_value=mock_result)
+            mock_agent_class.return_value = mock_agent
+            handler = JudgeHandler()
+            handler.agent = mock_agent
+            result = await handler.assess("metformin alzheimer", [])
+            assert result.sufficient is False
+            assert result.recommendation == "continue"
+            assert len(result.next_search_queries) > 0
+    @pytest.mark.asyncio
+    async def test_assess_handles_llm_failure(self):
+        """JudgeHandler should return fallback on LLM failure."""
+        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
+            mock_agent = AsyncMock()
+            mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
+            mock_agent_class.return_value = mock_agent
+            handler = JudgeHandler()
+            handler.agent = mock_agent
+            evidence = [
+                Evidence(
+                    content="Some content",
+                    citation=Citation(
+                        source="pubmed",
+                        title="Title",
+                        url="url",
+                        date="2024",
+                    ),
+                )
+            ]
+            result = await handler.assess("test question", evidence)
+            # Should return fallback, not raise
+            assert result.sufficient is False
+            assert result.recommendation == "continue"
+            assert "failed" in result.reasoning.lower()
+class TestMockJudgeHandler:
+    """Tests for MockJudgeHandler."""
+    @pytest.mark.asyncio
+    async def test_mock_handler_returns_default(self):
+        """MockJudgeHandler should return default assessment."""
+        handler = MockJudgeHandler()
+        evidence = [
+            Evidence(
+                content="Content 1",
+                citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
+            ),
+            Evidence(
+                content="Content 2",
+                citation=Citation(source="web", title="T2", url="u2", date="2024"),
+            ),
+        ]
+        result = await handler.assess("test", evidence)
+        expected_mech_score = 7
+        expected_evidence_len = 2
+        assert handler.call_count == 1
+        assert handler.last_question == "test"
+        assert handler.last_evidence is not None
+        assert len(handler.last_evidence) == expected_evidence_len
+        assert result.details.mechanism_score == expected_mech_score
+    @pytest.mark.asyncio
+    async def test_mock_handler_custom_response(self):
+        """MockJudgeHandler should return custom response when provided."""
+        expected_score = 10
+        custom_assessment = JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=expected_score,
+                mechanism_reasoning="Custom reasoning",
+                clinical_evidence_score=expected_score,
+                clinical_reasoning="Custom clinical",
+                drug_candidates=["CustomDrug"],
+                key_findings=["Custom finding"],
+            ),
+            sufficient=True,
+            confidence=1.0,
+            recommendation="synthesize",
+            next_search_queries=[],
+            reasoning="Custom assessment logic for testing purposes must be at least 20 chars long",
+        )
+        handler = MockJudgeHandler(mock_response=custom_assessment)
+        result = await handler.assess("test", [])
+        assert result.details.mechanism_score == expected_score
+        assert result.details.drug_candidates == ["CustomDrug"]
+    @pytest.mark.asyncio
+    async def test_mock_handler_insufficient_with_few_evidence(self):
+        """MockJudgeHandler should recommend continue with < 3 evidence."""
+        handler = MockJudgeHandler()
+        # Only 2 pieces of evidence
+        evidence = [
+            Evidence(
+                content="Content",
+                citation=Citation(source="pubmed", title="T", url="u", date="2024"),
+            ),
+            Evidence(
+                content="Content 2",
+                citation=Citation(source="web", title="T2", url="u2", date="2024"),
+            ),
+        ]
+        result = await handler.assess("test", evidence)
+        assert result.sufficient is False
+        assert result.recommendation == "continue"
+        assert len(result.next_search_queries) > 0

uv.lock CHANGED Viewed

@@ -657,10 +657,12 @@ name = "deepcritical"
 version = "0.1.0"
 source = { editable = "." }
 dependencies = [
     { name = "beautifulsoup4" },
     { name = "duckduckgo-search" },
     { name = "gradio" },
     { name = "httpx" },
     { name = "pydantic" },
     { name = "pydantic-ai" },
     { name = "pydantic-settings" },
@@ -685,11 +687,13 @@ dev = [
 [package.metadata]
 requires-dist = [
     { name = "beautifulsoup4", specifier = ">=4.12" },
     { name = "duckduckgo-search", specifier = ">=6.0" },
     { name = "gradio", specifier = ">=5.0" },
     { name = "httpx", specifier = ">=0.27" },
     { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.10" },
     { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7" },
     { name = "pydantic", specifier = ">=2.7" },
     { name = "pydantic-ai", specifier = ">=0.0.16" },

 version = "0.1.0"
 source = { editable = "." }
 dependencies = [
+    { name = "anthropic" },
     { name = "beautifulsoup4" },
     { name = "duckduckgo-search" },
     { name = "gradio" },
     { name = "httpx" },
+    { name = "openai" },
     { name = "pydantic" },
     { name = "pydantic-ai" },
     { name = "pydantic-settings" },
 [package.metadata]
 requires-dist = [
+    { name = "anthropic", specifier = ">=0.18.0" },
     { name = "beautifulsoup4", specifier = ">=4.12" },
     { name = "duckduckgo-search", specifier = ">=6.0" },
     { name = "gradio", specifier = ">=5.0" },
     { name = "httpx", specifier = ">=0.27" },
     { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.10" },
+    { name = "openai", specifier = ">=1.0.0" },
     { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7" },
     { name = "pydantic", specifier = ">=2.7" },
     { name = "pydantic-ai", specifier = ">=0.0.16" },