VibecoderMcSwaggins committed on
Commit b115229 · unverified · 1 Parent(s): 2e4a760

feat: HFInferenceJudgeHandler - Free AI analysis for hackathon judges (#36)


* feat: implement HFInferenceJudgeHandler for free-tier AI analysis

Replace MockJudgeHandler with real AI analysis using HuggingFace Inference API:

- Add HFInferenceJudgeHandler with chat_completion API
- Model fallback chain: Llama 3.1 → Mistral → Zephyr (ungated)
- Robust JSON extraction (handles markdown blocks, nested braces)
- Tenacity retry with exponential backoff for rate limits
- Fix app.py to use HF Inference when no paid API keys are present

Priority: User API key → Env API key → HF Inference (free)

Hackathon judges now get real AI analysis without needing API keys.
Set HF_TOKEN as Space secret for best model (Llama 3.1).
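
A minimal illustrative sketch of that priority order (not the actual `app.py` code; see `create_orchestrator` in the diff below):

```python
import os

def pick_judge_backend(user_api_key: str | None = None) -> str:
    """Mirror the selection priority: user key -> env key -> free HF Inference."""
    if user_api_key:  # 1. User-provided API key (BYOK)
        return "JudgeHandler (user key)"
    if os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"):
        return "JudgeHandler (environment key)"  # 2. Key configured as env var / Space secret
    return "HFInferenceJudgeHandler (free HF Inference)"  # 3. No key needed
```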

* feat: add documentation for Magentic mode bug and implementation spec

- Added a bug report for Magentic mode describing why it does not currently work and identifying the root causes.
- Updated the Magentic integration implementation spec, emphasizing the architecture, critical insights, and the changes needed for agent coordination.
- Clarified the roles of the various agents and how they interact within the Magentic workflow.
- Recommended fixing or abandoning Magentic mode based on the observed issues.

This commit improves understanding and troubleshooting of Magentic mode within the project.

* feat: implement Magentic ChatAgent pattern with semantic state management

- Add src/agents/state.py: Thread-safe MagenticState with contextvars
- Evidence store for structured citation access
- EmbeddingService integration for semantic deduplication

- Add src/agents/tools.py: AIFunction tools that update shared state
- search_pubmed, search_clinical_trials, search_preprints
- get_bibliography for ReportAgent citations
- Tools return strings to LLM AND update state

- Add src/agents/magentic_agents.py: ChatAgent factories
- SearchAgent with search tools
- JudgeAgent, HypothesisAgent, ReportAgent
- Each agent has internal OpenAIChatClient

- Update src/orchestrator_magentic.py: Use ChatAgent pattern
- Initialize MagenticState at workflow start
- Properly stream events from MagenticBuilder

- Fix type errors for pre-commit mypy compatibility

Implements Phase 5 spec for correct Microsoft Agent Framework integration.
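
A minimal, hypothetical sketch of the shared-state idea described above (names and fields are illustrative, not the actual `src/agents/state.py`): tools return a plain string to the calling LLM while also recording structured evidence in a contextvars-scoped store.

```python
from contextvars import ContextVar
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    evidence: list[dict] = field(default_factory=list)  # structured citations

_state: ContextVar[ResearchState] = ContextVar("research_state")

def init_state() -> ResearchState:
    """Initialize the per-workflow state at workflow start."""
    state = ResearchState()
    _state.set(state)
    return state

async def search_pubmed(query: str) -> str:
    """Illustrative tool: updates shared state AND returns a string to the LLM."""
    results = [{"title": f"Placeholder result for {query}", "url": "https://example.org"}]
    _state.get().evidence.extend(results)  # requires init_state() to have been called
    return f"Found {len(results)} PubMed result(s) for '{query}'"
```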

* docs: add P0 blockers documentation for Magentic mode implementation

- Introduced a new markdown document outlining critical blockers in the Magentic mode implementation.
- Highlighted issues such as hardcoded OpenAI models, dependency source ambiguity, and the lack of a "Free Tier" for users.
- Provided detailed impacts and required fixes for each identified issue to ensure a stable deployment.

This documentation aims to facilitate resolution of critical issues and improve the overall user experience in Magentic mode.

* fix: address CodeRabbit feedback and P0 blockers

Code Fixes (HIGH priority):
- Add API key/provider validation to prevent silent auth failures
- Fix hardcoded manager model in orchestrator_magentic.py (now uses settings.openai_model)
- Add bounds checking to JSON extraction in judges.py (prevents IndexError)
- Fix fragile test assertion in test_judges_hf.py

Code Quality (MEDIUM priority):
- Add explicit type annotation for models_to_try: list[str]
- Fix structured logging (f-string → structured params)
- Align fallback query count (3 queries) between handlers

Test Improvements:
- Add @pytest.mark.unit decorator to TestHFInferenceJudgeHandler

Documentation Sync:
- Update Phase 3 docs to match actual implementation:
- __init__ signature (simplified, no inline imports)
- _extract_json (string split with bounds checking)
- _call_with_retry (tenacity decorator, asyncio.get_running_loop())
- assess method (simplified model loop)
- Update Phase 4 docs with ChatInterface additional_inputs for BYOK

All 104 tests pass.

* fix: pin agent-framework-core and remove resolved bug doc

- Pin agent-framework-core>=1.0.0b251120,<2.0.0 to prevent breaking changes
- Remove docs/bugs/007_magentic_p0_blockers.md - all issues resolved:
- Issue 1 (hardcoded models): Already fixed in previous commit
- Issue 2 (dependency unpinned): Fixed in this commit
- Issue 3 (no free tier): Working as Designed

* chore: remove resolved bug documentation

- Delete 005_services_not_integrated.md - embeddings now wired to simple orchestrator
(enable_embeddings=True is the default in orchestrator.py)
- Delete 006_magentic_mode_broken.md - magentic mode is experimental/optional,
documented as requiring OpenAI (not a bug)

.env.example CHANGED
@@ -11,6 +11,20 @@ ANTHROPIC_API_KEY=sk-ant-your-key-here
  OPENAI_MODEL=gpt-5.1
  ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
 
+ # ============== HUGGINGFACE (FREE TIER) ==============
+
+ # HuggingFace Token - enables Llama 3.1 (best quality free model)
+ # Get yours at: https://huggingface.co/settings/tokens
+ #
+ # WITHOUT HF_TOKEN: Falls back to ungated models (zephyr-7b-beta)
+ # WITH HF_TOKEN: Uses Llama 3.1 8B Instruct (requires accepting license)
+ #
+ # For HuggingFace Spaces deployment:
+ # Set this as a "Secret" in Space Settings → Variables and secrets
+ # Users/judges don't need their own token - the Space secret is used
+ #
+ HF_TOKEN=hf_your-token-here
+
  # ============== AGENT CONFIGURATION ==============
 
  MAX_ITERATIONS=10
docs/bugs/005_services_not_integrated.md DELETED
@@ -1,142 +0,0 @@
- # Bug 005: Embedding Services Built But Not Wired to Default Orchestrator
-
- **Date:** November 26, 2025
- **Severity:** CRITICAL
- **Status:** Open
-
- ## 1. The Problem
-
- Two complete semantic search services exist but are **NOT USED** by the default orchestrator:
-
- | Service | Location | Status |
- | ------- | -------- | ------ |
- | EmbeddingService | `src/services/embeddings.py` | BUILT, not wired to simple mode |
- | LlamaIndexRAGService | `src/services/llamaindex_rag.py` | BUILT, not wired to simple mode |
-
- ## 2. Root Cause: Two Orchestrators
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │ orchestrator.py (SIMPLE MODE - DEFAULT)                         │
- │  - Basic search → judge → loop                                  │
- │  - NO embeddings                                                │
- │  - NO semantic search                                           │
- │  - Hand-rolled keyword matching                                 │
- └─────────────────────────────────────────────────────────────────┘
-
- ┌─────────────────────────────────────────────────────────────────┐
- │ orchestrator_magentic.py (MAGENTIC MODE)                        │
- │  - Multi-agent architecture                                     │
- │  - USES EmbeddingService                                        │
- │  - USES semantic search                                         │
- │  - Requires agent-framework (optional dep)                      │
- │  - OpenAI only                                                  │
- └─────────────────────────────────────────────────────────────────┘
- ```
-
- **The UI defaults to simple mode**, which bypasses all the semantic search infrastructure.
-
- ## 3. What's Built (Not Wired)
-
- ### EmbeddingService (NO API KEY NEEDED)
-
- ```python
- # src/services/embeddings.py
- class EmbeddingService:
-     async def embed(text) -> list[float]
-     async def search_similar(query) -> list[dict]  # SEMANTIC SEARCH
-     async def deduplicate(evidence) -> list  # DEDUPLICATION
- ```
-
- - Uses local sentence-transformers
- - ChromaDB vector store
- - **Works without API keys**
-
- ### LlamaIndexRAGService
-
- ```python
- # src/services/llamaindex_rag.py
- class LlamaIndexRAGService:
-     def ingest_evidence(evidence_list)
-     def retrieve(query) -> list[dict]  # Semantic retrieval
-     def query(query_str) -> str  # Synthesized response
- ```
-
- ## 4. Where Services ARE Used
-
- ```
- src/orchestrator_magentic.py   ← Uses EmbeddingService
- src/agents/search_agent.py     ← Uses EmbeddingService
- src/agents/report_agent.py     ← Uses EmbeddingService
- src/agents/hypothesis_agent.py ← Uses EmbeddingService
- src/agents/analysis_agent.py   ← Uses EmbeddingService
- ```
-
- All in magentic mode agents, NOT in simple orchestrator.
-
- ## 5. The Fix Options
-
- ### Option A: Add Embeddings to Simple Orchestrator (RECOMMENDED)
-
- Modify `src/orchestrator.py` to optionally use EmbeddingService:
-
- ```python
- class Orchestrator:
-     def __init__(self, ..., use_embeddings: bool = True):
-         if use_embeddings:
-             from src.services.embeddings import get_embedding_service
-             self.embeddings = get_embedding_service()
-         else:
-             self.embeddings = None
-
-     async def run(self, query):
-         # ... search phase ...
-
-         if self.embeddings:
-             # Semantic ranking
-             all_evidence = await self._rank_by_relevance(all_evidence, query)
-             # Deduplication
-             all_evidence = await self.embeddings.deduplicate(all_evidence)
- ```
-
- ### Option B: Make Magentic Mode Default
-
- Change app.py to default to "magentic" mode when deps available.
-
- ### Option C: Merge Best of Both
-
- Create a new orchestrator that:
- - Has the simplicity of simple mode
- - Uses embeddings for ranking/dedup
- - Doesn't require agent-framework
-
- ## 6. Implementation Plan
-
- ### Phase 1: Wire EmbeddingService to Simple Orchestrator
-
- 1. Import EmbeddingService in orchestrator.py
- 2. Add semantic ranking after search
- 3. Add deduplication before judge
- 4. Test end-to-end
-
- ### Phase 2: Add Relevance to Evidence
-
- 1. Use embedding similarity as relevance score
- 2. Sort evidence by relevance
- 3. Only send top-K to judge
-
- ## 7. Files to Modify
-
- ```
- src/orchestrator.py          ← Add embedding integration
- src/orchestrator_factory.py  ← Pass embeddings flag
- src/app.py                   ← Enable embeddings by default
- ```
-
- ## 8. Success Criteria
-
- - [ ] Default mode uses semantic search
- - [ ] Evidence ranked by relevance
- - [ ] Duplicates removed
- - [ ] No new API keys required (sentence-transformers is local)
- - [ ] Magentic mode still works as before
docs/implementation/03_phase_judge.md CHANGED
@@ -350,20 +350,228 @@ class JudgeHandler:
350
  )
351
 
352
 
353
- class MockJudgeHandler:
354
  """
355
- Mock JudgeHandler for testing without LLM calls.
 
 
 
 
 
 
 
 
356
 
357
- Use this in unit tests to avoid API calls.
 
 
358
  """
359
 
360
- def __init__(self, mock_response: JudgeAssessment | None = None):
 
 
 
 
 
 
 
361
  """
362
- Initialize with optional mock response.
363
 
364
  Args:
365
- mock_response: The assessment to return. If None, uses default.
366
  """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
367
  self.mock_response = mock_response
368
  self.call_count = 0
369
  self.last_question = None
@@ -374,7 +582,7 @@ class MockJudgeHandler:
374
  question: str,
375
  evidence: List[Evidence],
376
  ) -> JudgeAssessment:
377
- """Return the mock response."""
378
  self.call_count += 1
379
  self.last_question = question
380
  self.last_evidence = evidence
@@ -382,21 +590,21 @@ class MockJudgeHandler:
382
  if self.mock_response:
383
  return self.mock_response
384
 
385
- # Default mock response
386
  return JudgeAssessment(
387
  details=AssessmentDetails(
388
  mechanism_score=7,
389
- mechanism_reasoning="Mock assessment - good mechanism evidence",
390
  clinical_evidence_score=6,
391
- clinical_reasoning="Mock assessment - moderate clinical evidence",
392
- drug_candidates=["Drug A", "Drug B"],
393
- key_findings=["Finding 1", "Finding 2"],
394
  ),
395
  sufficient=len(evidence) >= 3,
396
  confidence=0.75,
397
  recommendation="synthesize" if len(evidence) >= 3 else "continue",
398
  next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
399
- reasoning="Mock assessment for testing purposes",
400
  )
401
  ```
402
 
@@ -547,8 +755,89 @@ class TestJudgeHandler:
547
  assert "failed" in result.reasoning.lower()
548
 
549
550
  class TestMockJudgeHandler:
551
- """Tests for MockJudgeHandler."""
552
 
553
  @pytest.mark.asyncio
554
  async def test_mock_handler_returns_default(self):
@@ -641,9 +930,15 @@ dependencies = [
641
  "pydantic-ai>=0.0.16",
642
  "openai>=1.0.0",
643
  "anthropic>=0.18.0",
 
644
  ]
645
  ```
646
 
 
 
 
 
 
647
  ---
648
 
649
  ## 7. Configuration (`src/utils/config.py`)
 
350
  )
351
 
352
 
353
+ class HFInferenceJudgeHandler:
354
  """
355
+ JudgeHandler using HuggingFace Inference API for FREE LLM calls.
356
+
357
+ This is the DEFAULT for demo mode - provides real AI analysis without
358
+ requiring users to have OpenAI/Anthropic API keys.
359
+
360
+ Model Fallback Chain (handles gated models and rate limits):
361
+ 1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN)
362
+ 2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token)
363
+ 3. HuggingFaceH4/zephyr-7b-beta (ungated, always works)
364
 
365
+ Rate Limit Handling:
366
+ - Exponential backoff with 3 retries
367
+ - Falls back to next model on persistent 429/503 errors
368
  """
369
 
370
+ # Model fallback chain: gated (best) → ungated (fallback)
371
+ FALLBACK_MODELS = [
372
+ "meta-llama/Llama-3.1-8B-Instruct", # Best quality (gated)
373
+ "mistralai/Mistral-7B-Instruct-v0.3", # Good quality
374
+ "HuggingFaceH4/zephyr-7b-beta", # Ungated fallback
375
+ ]
376
+
377
+ def __init__(self, model_id: str | None = None) -> None:
378
  """
379
+ Initialize with HF Inference client.
380
 
381
  Args:
382
+ model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
383
+ """
384
+ self.model_id = model_id
385
+ # Will automatically use HF_TOKEN from env if available
386
+ self.client = InferenceClient()
387
+ self.call_count = 0
388
+ self.last_question: str | None = None
389
+ self.last_evidence: list[Evidence] | None = None
390
+
391
+ def _extract_json(self, text: str) -> dict[str, Any] | None:
392
+ """
393
+ Robust JSON extraction that handles markdown blocks and nested braces.
394
+ """
395
+ text = text.strip()
396
+
397
+ # Remove markdown code blocks if present (with bounds checking)
398
+ if "```json" in text:
399
+ parts = text.split("```json", 1)
400
+ if len(parts) > 1:
401
+ inner_parts = parts[1].split("```", 1)
402
+ text = inner_parts[0]
403
+ elif "```" in text:
404
+ parts = text.split("```", 1)
405
+ if len(parts) > 1:
406
+ inner_parts = parts[1].split("```", 1)
407
+ text = inner_parts[0]
408
+
409
+ text = text.strip()
410
+
411
+ # Find first '{'
412
+ start_idx = text.find("{")
413
+ if start_idx == -1:
414
+ return None
415
+
416
+ # Stack-based parsing ignoring chars in strings
417
+ count = 0
418
+ in_string = False
419
+ escape = False
420
+
421
+ for i, char in enumerate(text[start_idx:], start=start_idx):
422
+ if in_string:
423
+ if escape:
424
+ escape = False
425
+ elif char == "\\":
426
+ escape = True
427
+ elif char == '"':
428
+ in_string = False
429
+ elif char == '"':
430
+ in_string = True
431
+ elif char == "{":
432
+ count += 1
433
+ elif char == "}":
434
+ count -= 1
435
+ if count == 0:
436
+ try:
437
+ result = json.loads(text[start_idx : i + 1])
438
+ if isinstance(result, dict):
439
+ return result
440
+ return None
441
+ except json.JSONDecodeError:
442
+ return None
443
+
444
+ return None
445
+
446
+ @retry(
447
+ stop=stop_after_attempt(3),
448
+ wait=wait_exponential(multiplier=1, min=1, max=4),
449
+ retry=retry_if_exception_type(Exception),
450
+ reraise=True,
451
+ )
452
+ async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
453
+ """Make API call with retry logic using chat_completion."""
454
+ loop = asyncio.get_running_loop()
455
+
456
+ # Build messages for chat_completion (model-agnostic)
457
+ messages = [
458
+ {
459
+ "role": "system",
460
+ "content": f"""{SYSTEM_PROMPT}
461
+
462
+ IMPORTANT: Respond with ONLY valid JSON matching this schema:
463
+ {{
464
+ "details": {{
465
+ "mechanism_score": <int 0-10>,
466
+ "mechanism_reasoning": "<string>",
467
+ "clinical_evidence_score": <int 0-10>,
468
+ "clinical_reasoning": "<string>",
469
+ "drug_candidates": ["<string>", ...],
470
+ "key_findings": ["<string>", ...]
471
+ }},
472
+ "sufficient": <bool>,
473
+ "confidence": <float 0-1>,
474
+ "recommendation": "continue" | "synthesize",
475
+ "next_search_queries": ["<string>", ...],
476
+ "reasoning": "<string>"
477
+ }}""",
478
+ },
479
+ {"role": "user", "content": prompt},
480
+ ]
481
+
482
+ # Use chat_completion (conversational task - supported by all models)
483
+ response = await loop.run_in_executor(
484
+ None,
485
+ lambda: self.client.chat_completion(
486
+ messages=messages,
487
+ model=model,
488
+ max_tokens=1024,
489
+ temperature=0.1,
490
+ ),
491
+ )
492
+
493
+ # Extract content from response
494
+ content = response.choices[0].message.content
495
+ if not content:
496
+ raise ValueError("Empty response from model")
497
+
498
+ # Extract and parse JSON
499
+ json_data = self._extract_json(content)
500
+ if not json_data:
501
+ raise ValueError("No valid JSON found in response")
502
+
503
+ return JudgeAssessment(**json_data)
504
+
505
+ async def assess(
506
+ self,
507
+ question: str,
508
+ evidence: list[Evidence],
509
+ ) -> JudgeAssessment:
510
  """
511
+ Assess evidence using HuggingFace Inference API.
512
+ Attempts models in order until one succeeds.
513
+ """
514
+ self.call_count += 1
515
+ self.last_question = question
516
+ self.last_evidence = evidence
517
+
518
+ # Format the user prompt
519
+ if evidence:
520
+ user_prompt = format_user_prompt(question, evidence)
521
+ else:
522
+ user_prompt = format_empty_evidence_prompt(question)
523
+
524
+ models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
525
+ last_error: Exception | None = None
526
+
527
+ for model in models_to_try:
528
+ try:
529
+ return await self._call_with_retry(model, user_prompt, question)
530
+ except Exception as e:
531
+ logger.warning("Model failed", model=model, error=str(e))
532
+ last_error = e
533
+ continue
534
+
535
+ # All models failed
536
+ logger.error("All HF models failed", error=str(last_error))
537
+ return self._create_fallback_assessment(question, str(last_error))
538
+
539
+ def _create_fallback_assessment(
540
+ self,
541
+ question: str,
542
+ error: str,
543
+ ) -> JudgeAssessment:
544
+ """Create a fallback assessment when inference fails."""
545
+ return JudgeAssessment(
546
+ details=AssessmentDetails(
547
+ mechanism_score=0,
548
+ mechanism_reasoning=f"Assessment failed: {error}",
549
+ clinical_evidence_score=0,
550
+ clinical_reasoning=f"Assessment failed: {error}",
551
+ drug_candidates=[],
552
+ key_findings=[],
553
+ ),
554
+ sufficient=False,
555
+ confidence=0.0,
556
+ recommendation="continue",
557
+ next_search_queries=[
558
+ f"{question} mechanism",
559
+ f"{question} clinical trials",
560
+ f"{question} drug candidates",
561
+ ],
562
+ reasoning=f"HF Inference failed: {error}. Recommend retrying.",
563
+ )
564
+
565
+
566
+ class MockJudgeHandler:
567
+ """
568
+ Mock JudgeHandler for UNIT TESTING ONLY.
569
+
570
+ NOT for production use. Use HFInferenceJudgeHandler for demo mode.
571
+ """
572
+
573
+ def __init__(self, mock_response: JudgeAssessment | None = None):
574
+ """Initialize with optional mock response for testing."""
575
  self.mock_response = mock_response
576
  self.call_count = 0
577
  self.last_question = None
 
582
  question: str,
583
  evidence: List[Evidence],
584
  ) -> JudgeAssessment:
585
+ """Return the mock response (for testing only)."""
586
  self.call_count += 1
587
  self.last_question = question
588
  self.last_evidence = evidence
 
590
  if self.mock_response:
591
  return self.mock_response
592
 
593
+ # Default mock response for tests
594
  return JudgeAssessment(
595
  details=AssessmentDetails(
596
  mechanism_score=7,
597
+ mechanism_reasoning="Mock assessment for testing",
598
  clinical_evidence_score=6,
599
+ clinical_reasoning="Mock assessment for testing",
600
+ drug_candidates=["TestDrug"],
601
+ key_findings=["Test finding"],
602
  ),
603
  sufficient=len(evidence) >= 3,
604
  confidence=0.75,
605
  recommendation="synthesize" if len(evidence) >= 3 else "continue",
606
  next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
607
+ reasoning="Mock assessment for unit testing only",
608
  )
609
  ```
610
 
 
755
  assert "failed" in result.reasoning.lower()
756
 
757
 
758
+ class TestHFInferenceJudgeHandler:
759
+ """Tests for HFInferenceJudgeHandler."""
760
+
761
+ @pytest.mark.asyncio
762
+ async def test_extract_json_raw(self):
763
+ """Should extract raw JSON."""
764
+ from src.agent_factory.judges import HFInferenceJudgeHandler
765
+
766
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
767
+ # Bypass __init__ for unit testing extraction
768
+
769
+ result = handler._extract_json('{"key": "value"}')
770
+ assert result == {"key": "value"}
771
+
772
+ @pytest.mark.asyncio
773
+ async def test_extract_json_markdown_block(self):
774
+ """Should extract JSON from markdown code block."""
775
+ from src.agent_factory.judges import HFInferenceJudgeHandler
776
+
777
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
778
+
779
+ response = '''Here is the assessment:
780
+ ```json
781
+ {"key": "value", "nested": {"inner": 1}}
782
+ ```
783
+ '''
784
+ result = handler._extract_json(response)
785
+ assert result == {"key": "value", "nested": {"inner": 1}}
786
+
787
+ @pytest.mark.asyncio
788
+ async def test_extract_json_with_preamble(self):
789
+ """Should extract JSON with preamble text."""
790
+ from src.agent_factory.judges import HFInferenceJudgeHandler
791
+
792
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
793
+
794
+ response = 'Here is your JSON response:\n{"sufficient": true, "confidence": 0.85}'
795
+ result = handler._extract_json(response)
796
+ assert result == {"sufficient": True, "confidence": 0.85}
797
+
798
+ @pytest.mark.asyncio
799
+ async def test_extract_json_nested_braces(self):
800
+ """Should handle nested braces correctly."""
801
+ from src.agent_factory.judges import HFInferenceJudgeHandler
802
+
803
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
804
+
805
+ response = '{"details": {"mechanism_score": 8}, "reasoning": "test"}'
806
+ result = handler._extract_json(response)
807
+ assert result["details"]["mechanism_score"] == 8
808
+
809
+ @pytest.mark.asyncio
810
+ async def test_hf_handler_uses_fallback_models(self):
811
+ """HFInferenceJudgeHandler should have fallback model chain."""
812
+ from src.agent_factory.judges import HFInferenceJudgeHandler
813
+
814
+ # Check class has fallback models defined
815
+ assert len(HFInferenceJudgeHandler.FALLBACK_MODELS) >= 3
816
+ assert "zephyr-7b-beta" in HFInferenceJudgeHandler.FALLBACK_MODELS[-1]
817
+
818
+ @pytest.mark.asyncio
819
+ async def test_hf_handler_fallback_on_auth_error(self):
820
+ """Should fall back to ungated model on auth error."""
821
+ from src.agent_factory.judges import HFInferenceJudgeHandler
822
+ from unittest.mock import MagicMock, patch
823
+
824
+ with patch("src.agent_factory.judges.InferenceClient") as mock_client_class:
825
+ # First call raises 403, second succeeds
826
+ mock_client = MagicMock()
827
+ mock_client.chat_completion.side_effect = [
828
+ Exception("403 Forbidden: gated model"),
829
+ MagicMock(choices=[MagicMock(message=MagicMock(content='{"sufficient": false}'))])
830
+ ]
831
+ mock_client_class.return_value = mock_client
832
+
833
+ handler = HFInferenceJudgeHandler()
834
+ # Manually trigger fallback test
835
+ assert handler._try_fallback_model() is True
836
+ assert handler.model_id != "meta-llama/Llama-3.1-8B-Instruct"
837
+
838
+
839
  class TestMockJudgeHandler:
840
+ """Tests for MockJudgeHandler (UNIT TESTING ONLY)."""
841
 
842
  @pytest.mark.asyncio
843
  async def test_mock_handler_returns_default(self):
 
930
  "pydantic-ai>=0.0.16",
931
  "openai>=1.0.0",
932
  "anthropic>=0.18.0",
933
+ "huggingface-hub>=0.20.0", # For HFInferenceJudgeHandler (FREE LLM)
934
  ]
935
  ```
936
 
937
+ **Note**: `huggingface-hub` is required for the free tier to work. It:
938
+ - Provides `InferenceClient` for API calls
939
+ - Auto-reads `HF_TOKEN` from environment (optional, for gated models)
940
+ - Works without any token for ungated models like `zephyr-7b-beta`
941
+
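
As a quick sanity check (a sketch, not part of the spec), the free tier can be exercised directly with `InferenceClient`; without `HF_TOKEN` this still works for ungated models such as `zephyr-7b-beta`:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # reads HF_TOKEN from the environment if present
response = client.chat_completion(
    messages=[{"role": "user", "content": 'Reply with only the JSON {"ok": true}'}],
    model="HuggingFaceH4/zephyr-7b-beta",  # ungated fallback model
    max_tokens=64,
    temperature=0.1,
)
print(response.choices[0].message.content)
```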
942
  ---
943
 
944
  ## 7. Configuration (`src/utils/config.py`)
docs/implementation/04_phase_ui.md CHANGED
@@ -408,33 +408,65 @@ from typing import AsyncGenerator
408
 
409
  from src.orchestrator import Orchestrator
410
  from src.tools.pubmed import PubMedTool
411
- from src.tools.websearch import WebTool
 
412
  from src.tools.search_handler import SearchHandler
413
- from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
414
  from src.utils.models import OrchestratorConfig, AgentEvent
415
 
416
 
417
- def create_orchestrator(use_mock: bool = False) -> Orchestrator:
 
 
 
418
  """
419
  Create an orchestrator instance.
420
 
421
  Args:
422
- use_mock: If True, use MockJudgeHandler (no API key needed)
 
423
 
424
  Returns:
425
- Configured Orchestrator instance
 
 
 
 
 
 
 
 
 
 
426
  """
 
 
427
  # Create search tools
428
  search_handler = SearchHandler(
429
- tools=[PubMedTool(), WebTool()],
430
  timeout=30.0,
431
  )
432
 
433
- # Create judge (mock or real)
434
- if use_mock:
435
- judge_handler = MockJudgeHandler()
 
 
 
 
 
 
 
 
 
 
436
  else:
437
- judge_handler = JudgeHandler()
 
 
 
 
 
438
 
439
  # Create orchestrator
440
  config = OrchestratorConfig(
@@ -446,12 +478,14 @@ def create_orchestrator(use_mock: bool = False) -> Orchestrator:
446
  search_handler=search_handler,
447
  judge_handler=judge_handler,
448
  config=config,
449
- )
450
 
451
 
452
  async def research_agent(
453
  message: str,
454
  history: list[dict],
 
 
455
  ) -> AsyncGenerator[str, None]:
456
  """
457
  Gradio chat function that runs the research agent.
@@ -459,6 +493,8 @@ async def research_agent(
459
  Args:
460
  message: User's research question
461
  history: Chat history (Gradio format)
 
 
462
 
463
  Yields:
464
  Markdown-formatted responses for streaming
@@ -467,10 +503,31 @@ async def research_agent(
467
  yield "Please enter a research question."
468
  return
469
 
470
- # Create orchestrator (use mock if no API key)
471
  import os
472
- use_mock = not (os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"))
473
- orchestrator = create_orchestrator(use_mock=use_mock)
474
 
475
  # Run the agent and stream events
476
  response_parts = []
@@ -516,19 +573,43 @@ def create_demo() -> gr.Blocks:
516
  - "What existing medications show promise for Long COVID?"
517
  """)
518
 
519
- chatbot = gr.ChatInterface(
 
520
  fn=research_agent,
521
- type="messages",
522
- title="",
523
  examples=[
524
- "What drugs could be repurposed for Alzheimer's disease?",
525
- "Is metformin effective for treating cancer?",
526
- "What medications show promise for Long COVID treatment?",
527
- "Can statins be repurposed for neurological conditions?",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
528
  ],
529
- retry_btn="🔄 Retry",
530
- undo_btn="↩️ Undo",
531
- clear_btn="🗑️ Clear",
532
  )
533
 
534
  gr.Markdown("""
@@ -952,15 +1033,22 @@ uv run python -m src.app
952
  import asyncio
953
  from src.orchestrator import Orchestrator
954
  from src.tools.pubmed import PubMedTool
955
- from src.tools.websearch import WebTool
 
956
  from src.tools.search_handler import SearchHandler
957
- from src.agent_factory.judges import MockJudgeHandler
958
  from src.utils.models import OrchestratorConfig
959
 
960
  async def test_full_flow():
961
  # Create components
962
- search_handler = SearchHandler([PubMedTool(), WebTool()])
963
- judge_handler = MockJudgeHandler() # Use mock for testing
 
 
 
 
 
 
964
  config = OrchestratorConfig(max_iterations=3)
965
 
966
  # Create orchestrator
@@ -980,6 +1068,8 @@ async def test_full_flow():
980
  asyncio.run(test_full_flow())
981
  ```
982
 
 
 
983
  ---
984
 
985
  ## 10. Deployment Verification
 
408
 
409
  from src.orchestrator import Orchestrator
410
  from src.tools.pubmed import PubMedTool
411
+ from src.tools.clinicaltrials import ClinicalTrialsTool
412
+ from src.tools.biorxiv import BioRxivTool
413
  from src.tools.search_handler import SearchHandler
414
+ from src.agent_factory.judges import JudgeHandler, HFInferenceJudgeHandler
415
  from src.utils.models import OrchestratorConfig, AgentEvent
416
 
417
 
418
+ def create_orchestrator(
419
+ user_api_key: str | None = None,
420
+ api_provider: str = "openai",
421
+ ) -> tuple[Orchestrator, str]:
422
  """
423
  Create an orchestrator instance.
424
 
425
  Args:
426
+ user_api_key: Optional user-provided API key (BYOK)
427
+ api_provider: API provider ("openai" or "anthropic")
428
 
429
  Returns:
430
+ Tuple of (Configured Orchestrator instance, backend_name)
431
+
432
+ Priority:
433
+ 1. User-provided API key → JudgeHandler (OpenAI/Anthropic)
434
+ 2. Environment API key → JudgeHandler (OpenAI/Anthropic)
435
+ 3. No key → HFInferenceJudgeHandler (FREE, automatic fallback chain)
436
+
437
+ HF Inference Fallback Chain:
438
+ 1. Llama 3.1 8B (requires HF_TOKEN for gated model)
439
+ 2. Mistral 7B (may require token)
440
+ 3. Zephyr 7B (ungated, always works)
441
  """
442
+ import os
443
+
444
  # Create search tools
445
  search_handler = SearchHandler(
446
+ tools=[PubMedTool(), ClinicalTrialsTool(), BioRxivTool()],
447
  timeout=30.0,
448
  )
449
 
450
+ # Determine which judge to use
451
+ has_env_key = bool(os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"))
452
+ has_user_key = bool(user_api_key)
453
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
454
+
455
+ if has_user_key:
456
+ # User provided their own key
457
+ judge_handler = JudgeHandler(model=None)
458
+ backend_name = f"your {api_provider.upper()} API key"
459
+ elif has_env_key:
460
+ # Environment has API key configured
461
+ judge_handler = JudgeHandler(model=None)
462
+ backend_name = "configured API key"
463
  else:
464
+ # Use FREE HuggingFace Inference with automatic fallback
465
+ judge_handler = HFInferenceJudgeHandler()
466
+ if has_hf_token:
467
+ backend_name = "HuggingFace Inference (Llama 3.1)"
468
+ else:
469
+ backend_name = "HuggingFace Inference (free tier)"
470
 
471
  # Create orchestrator
472
  config = OrchestratorConfig(
 
478
  search_handler=search_handler,
479
  judge_handler=judge_handler,
480
  config=config,
481
+ ), backend_name
482
 
483
 
484
  async def research_agent(
485
  message: str,
486
  history: list[dict],
487
+ api_key: str = "",
488
+ api_provider: str = "openai",
489
  ) -> AsyncGenerator[str, None]:
490
  """
491
  Gradio chat function that runs the research agent.
 
493
  Args:
494
  message: User's research question
495
  history: Chat history (Gradio format)
496
+ api_key: Optional user-provided API key (BYOK)
497
+ api_provider: API provider ("openai" or "anthropic")
498
 
499
  Yields:
500
  Markdown-formatted responses for streaming
 
503
  yield "Please enter a research question."
504
  return
505
 
 
506
  import os
507
+
508
+ # Clean user-provided API key
509
+ user_api_key = api_key.strip() if api_key else None
510
+
511
+ # Create orchestrator with appropriate judge
512
+ orchestrator, backend_name = create_orchestrator(
513
+ user_api_key=user_api_key,
514
+ api_provider=api_provider,
515
+ )
516
+
517
+ # Determine icon based on backend
518
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
519
+ if "HuggingFace" in backend_name:
520
+ icon = "🤗"
521
+ extra_note = (
522
+ "\n*For premium analysis, enter an OpenAI or Anthropic API key.*"
523
+ if not has_hf_token else ""
524
+ )
525
+ else:
526
+ icon = "🔑"
527
+ extra_note = ""
528
+
529
+ # Inform user which backend is being used
530
+ yield f"{icon} **Using {backend_name}**{extra_note}\n\n"
531
 
532
  # Run the agent and stream events
533
  response_parts = []
 
573
  - "What existing medications show promise for Long COVID?"
574
  """)
575
 
576
+ # Note: additional_inputs render in an accordion below the chat input
577
+ gr.ChatInterface(
578
  fn=research_agent,
 
 
579
  examples=[
580
+ [
581
+ "What drugs could be repurposed for Alzheimer's disease?",
582
+ "simple",
583
+ "",
584
+ "openai",
585
+ ],
586
+ [
587
+ "Is metformin effective for treating cancer?",
588
+ "simple",
589
+ "",
590
+ "openai",
591
+ ],
592
+ ],
593
+ additional_inputs=[
594
+ gr.Radio(
595
+ choices=["simple", "magentic"],
596
+ value="simple",
597
+ label="Orchestrator Mode",
598
+ info="Simple: Linear | Magentic: Multi-Agent (OpenAI)",
599
+ ),
600
+ gr.Textbox(
601
+ label="API Key (Optional - Bring Your Own Key)",
602
+ placeholder="sk-... or sk-ant-...",
603
+ type="password",
604
+ info="Enter your own API key for full AI analysis. Never stored.",
605
+ ),
606
+ gr.Radio(
607
+ choices=["openai", "anthropic"],
608
+ value="openai",
609
+ label="API Provider",
610
+ info="Select the provider for your API key",
611
+ ),
612
  ],
 
 
 
613
  )
614
 
615
  gr.Markdown("""
 
1033
  import asyncio
1034
  from src.orchestrator import Orchestrator
1035
  from src.tools.pubmed import PubMedTool
1036
+ from src.tools.biorxiv import BioRxivTool
1037
+ from src.tools.clinicaltrials import ClinicalTrialsTool
1038
  from src.tools.search_handler import SearchHandler
1039
+ from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler
1040
  from src.utils.models import OrchestratorConfig
1041
 
1042
  async def test_full_flow():
1043
  # Create components
1044
+ search_handler = SearchHandler([PubMedTool(), ClinicalTrialsTool(), BioRxivTool()])
1045
+
1046
+ # Option 1: Use FREE HuggingFace Inference (real AI analysis)
1047
+ judge_handler = HFInferenceJudgeHandler()
1048
+
1049
+ # Option 2: Use MockJudgeHandler for UNIT TESTING ONLY
1050
+ # judge_handler = MockJudgeHandler()
1051
+
1052
  config = OrchestratorConfig(max_iterations=3)
1053
 
1054
  # Create orchestrator
 
1068
  asyncio.run(test_full_flow())
1069
  ```
1070
 
1071
+ **Important**: `MockJudgeHandler` is for **unit testing only**. For actual demo/production use, always use `HFInferenceJudgeHandler` (free) or `JudgeHandler` (with API key).
1072
+
1073
  ---
1074
 
1075
  ## 10. Deployment Verification
docs/implementation/05_phase_magentic.md CHANGED
@@ -1,4 +1,4 @@
1
- # Phase 5 Implementation Spec: Magentic Integration (Optional)
2
 
3
  **Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
4
  **Philosophy**: "Same API, Better Engine."
@@ -15,385 +15,744 @@ Magentic-One provides:
15
  - **Event streaming** for real-time UI updates
16
  - **Multi-agent coordination** with round limits and reset logic
17
 
18
- This is **NOT required for MVP**. Only implement if time permits after Phase 4.
19
-
20
  ---
21
 
22
- ## 2. Architecture Alignment
23
 
24
- ### Current Phase 4 Architecture
25
- ```
26
- User Query
27
-
28
- Orchestrator (while loop)
29
- ├── SearchHandler.execute() → Evidence
30
- ├── JudgeHandler.assess() → JudgeAssessment
31
- └── Loop/Synthesize decision
32
-
33
- Research Report
34
- ```
35
 
36
- ### Phase 5 Magentic Architecture
37
  ```
38
- User Query
39
-
40
- MagenticBuilder
41
- ├── SearchAgent (wraps SearchHandler)
42
- ├── JudgeAgent (wraps JudgeHandler)
43
- └── StandardMagenticManager (LLM coordinator)
44
-
45
- Research Report (same output format)
46
  ```
47
 
48
- **Key Insight**: We wrap existing handlers as `AgentProtocol` implementations. The domain logic stays the same.
49
 
50
  ---
51
 
52
- ## 3. Design for Seamless Integration
53
 
54
- ### 3.1 Protocol-Based Design (Phase 4 prep)
55
 
56
- In Phase 4, define handlers using Protocols so they can be wrapped later:
 
 
57
 
58
  ```python
59
- # src/orchestrator.py (Phase 4)
60
- from typing import Protocol, List
61
- from src.utils.models import Evidence, SearchResult, JudgeAssessment
62
 
 
 
 
 
 
63
 
64
- class SearchHandlerProtocol(Protocol):
65
- """Protocol for search handler - can be wrapped as Agent later."""
66
- async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
67
- ...
68
 
 
69
 
70
- class JudgeHandlerProtocol(Protocol):
71
- """Protocol for judge handler - can be wrapped as Agent later."""
72
- async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment:
73
- ...
74
 
 
75
 
76
- class OrchestratorProtocol(Protocol):
77
- """Protocol for orchestrator - allows swapping implementations."""
78
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
79
- ...
80
- ```
81
 
82
- ### 3.2 Facade Pattern
83
 
84
- The `Orchestrator` class is a facade. In Phase 5, we create `MagenticOrchestrator` with the same interface:
 
85
 
86
- ```python
87
- # Phase 4: Simple orchestrator
88
- orchestrator = Orchestrator(search_handler, judge_handler)
 
89
 
90
- # Phase 5: Magentic orchestrator (SAME API)
91
- orchestrator = MagenticOrchestrator(search_handler, judge_handler)
 
 
92
 
93
- # Usage is identical
94
- async for event in orchestrator.run("metformin alzheimer"):
95
- print(event.to_markdown())
96
- ```
 
 
 
 
 
 
97
 
98
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
99
 
100
- ## 4. Phase 5 Implementation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
101
 
102
- ### 4.1 Install Agent Framework
 
 
 
103
 
104
- Add to `pyproject.toml`:
105
 
106
- ```toml
107
- [project.optional-dependencies]
108
- magentic = [
109
- "agent-framework-core>=0.1.0",
110
- ]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
  ```
112
 
113
- ### 4.2 Agent Wrappers (`src/agents/search_agent.py`)
114
 
115
- Wrap `SearchHandler` as an `AgentProtocol`.
116
- **Note**: `AgentProtocol` requires `id`, `name`, `display_name`, `description`, `run`, `run_stream`, and `get_new_thread`.
117
 
118
  ```python
119
- """Search agent wrapper for Magentic integration."""
120
- from typing import Any, AsyncIterable
121
- from agent_framework import AgentProtocol, AgentRunResponse, AgentRunResponseUpdate, ChatMessage, Role, AgentThread
122
 
123
- from src.tools.search_handler import SearchHandler
124
- from src.utils.models import SearchResult
 
125
 
 
 
 
126
 
127
- class SearchAgent:
128
- """Wraps SearchHandler as an AgentProtocol for Magentic."""
 
 
129
 
130
- def __init__(self, search_handler: SearchHandler):
131
- self._handler = search_handler
132
- self._id = "search-agent"
133
- self._name = "SearchAgent"
134
- self._description = "Searches PubMed and web for drug repurposing evidence"
135
 
136
- @property
137
- def id(self) -> str:
138
- return self._id
139
 
140
- @property
141
- def name(self) -> str | None:
142
- return self._name
 
143
 
144
- @property
145
- def display_name(self) -> str:
146
- return self._name
 
 
 
147
 
148
- @property
149
- def description(self) -> str | None:
150
- return self._description
151
 
152
- async def run(
153
- self,
154
- messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
155
- *,
156
- thread: AgentThread | None = None,
157
- **kwargs: Any,
158
- ) -> AgentRunResponse:
159
- """Execute search based on the last user message."""
160
- # Extract query from messages
161
- query = ""
162
- if isinstance(messages, list):
163
- for msg in reversed(messages):
164
- if isinstance(msg, ChatMessage) and msg.role == Role.USER and msg.text:
165
- query = msg.text
166
- break
167
- elif isinstance(msg, str):
168
- query = msg
169
- break
170
- elif isinstance(messages, str):
171
- query = messages
172
-
173
- if not query:
174
- return AgentRunResponse(
175
- messages=[ChatMessage(role=Role.ASSISTANT, text="No query provided")],
176
- response_id="search-no-query",
177
- )
178
 
179
- # Execute search
180
- result: SearchResult = await self._handler.execute(query, max_results_per_tool=10)
 
181
 
182
- # Format response
183
- evidence_text = "\n".join([
184
- f"- [{e.citation.title}]({e.citation.url}): {e.content[:200]}..."
185
- for e in result.evidence[:5]
186
- ])
187
 
188
- response_text = f"Found {result.total_found} sources:\n\n{evidence_text}"
 
 
189
 
190
- return AgentRunResponse(
191
- messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
192
- response_id=f"search-{result.total_found}",
193
- additional_properties={"evidence": [e.model_dump() for e in result.evidence]},
194
- )
195
 
196
- async def run_stream(
197
- self,
198
- messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
199
- *,
200
- thread: AgentThread | None = None,
201
- **kwargs: Any,
202
- ) -> AsyncIterable[AgentRunResponseUpdate]:
203
- """Streaming wrapper for search (search itself isn't streaming)."""
204
- result = await self.run(messages, thread=thread, **kwargs)
205
- # Yield single update with full result
206
- yield AgentRunResponseUpdate(
207
- messages=result.messages,
208
- response_id=result.response_id
 
 
209
  )
210
 
211
- def get_new_thread(self, **kwargs: Any) -> AgentThread:
212
- """Create a new thread."""
213
- return AgentThread(**kwargs)
214
  ```
215
 
216
- ### 4.3 Judge Agent Wrapper (`src/agents/judge_agent.py`)
217
 
218
  ```python
219
- """Judge agent wrapper for Magentic integration."""
220
- from typing import Any, List, AsyncIterable
221
- from agent_framework import AgentProtocol, AgentRunResponse, AgentRunResponseUpdate, ChatMessage, Role, AgentThread
222
 
223
- from src.agent_factory.judges import JudgeHandler
224
- from src.utils.models import Evidence, JudgeAssessment
 
 
 
 
 
 
225
 
226
 
227
- class JudgeAgent:
228
- """Wraps JudgeHandler as an AgentProtocol for Magentic."""
229
 
230
- def __init__(self, judge_handler: JudgeHandler, evidence_store: dict[str, List[Evidence]]):
231
- self._handler = judge_handler
232
- self._evidence_store = evidence_store # Shared state for evidence
233
- self._id = "judge-agent"
234
- self._name = "JudgeAgent"
235
- self._description = "Evaluates evidence quality and determines if sufficient for synthesis"
236
 
237
- @property
238
- def id(self) -> str:
239
- return self._id
 
 
 
 
240
 
241
- @property
242
- def name(self) -> str | None:
243
- return self._name
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
244
 
245
- @property
246
- def display_name(self) -> str:
247
- return self._name
248
 
249
- @property
250
- def description(self) -> str | None:
251
- return self._description
252
 
253
- async def run(
254
- self,
255
- messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
256
- *,
257
- thread: AgentThread | None = None,
258
- **kwargs: Any,
259
- ) -> AgentRunResponse:
260
- """Assess evidence quality."""
261
- # Extract original question from messages
262
- question = ""
263
- if isinstance(messages, list):
264
- for msg in reversed(messages):
265
- if isinstance(msg, ChatMessage) and msg.role == Role.USER and msg.text:
266
- question = msg.text
267
- break
268
- elif isinstance(msg, str):
269
- question = msg
270
- break
271
- elif isinstance(messages, str):
272
- question = messages
273
-
274
- # Get evidence from shared store
275
- evidence = self._evidence_store.get("current", [])
276
-
277
- # Assess
278
- assessment: JudgeAssessment = await self._handler.assess(question, evidence)
279
-
280
- # Format response
281
- response_text = f"""## Assessment
282
-
283
- **Sufficient**: {assessment.sufficient}
284
- **Confidence**: {assessment.confidence:.0%}
285
- **Recommendation**: {assessment.recommendation}
286
-
287
- ### Scores
288
- - Mechanism: {assessment.details.mechanism_score}/10
289
- - Clinical: {assessment.details.clinical_evidence_score}/10
290
-
291
- ### Reasoning
292
- {assessment.reasoning}
293
- """
294
 
295
- if assessment.next_search_queries:
296
- response_text += f"\n### Next Queries\n" + "\n".join(
297
- f"- {q}" for q in assessment.next_search_queries
298
- )
 
 
 
299
 
300
- return AgentRunResponse(
301
- messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
302
- response_id=f"judge-{assessment.recommendation}",
303
- additional_properties={"assessment": assessment.model_dump()},
304
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
305
 
306
- async def run_stream(
307
- self,
308
- messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
309
- *,
310
- thread: AgentThread | None = None,
311
- **kwargs: Any,
312
- ) -> AsyncIterable[AgentRunResponseUpdate]:
313
- """Streaming wrapper for judge."""
314
- result = await self.run(messages, thread=thread, **kwargs)
315
- yield AgentRunResponseUpdate(
316
- messages=result.messages,
317
- response_id=result.response_id
318
- )
319
-
320
- def get_new_thread(self, **kwargs: Any) -> AgentThread:
321
- """Create a new thread."""
322
- return AgentThread(**kwargs)
 
 
323
  ```
324
 
325
- ### 4.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
326
 
327
  ```python
328
- """Magentic-based orchestrator for DeepCritical."""
329
- from typing import AsyncGenerator, List
330
- import structlog
331
 
 
332
  from agent_framework import (
 
 
333
  MagenticBuilder,
334
  MagenticFinalResultEvent,
335
- MagenticAgentMessageEvent,
336
  MagenticOrchestratorMessageEvent,
337
- MagenticAgentDeltaEvent,
338
  WorkflowOutputEvent,
339
  )
340
  from agent_framework.openai import OpenAIChatClient
341
 
342
- from src.agents.search_agent import SearchAgent
343
- from src.agents.judge_agent import JudgeAgent
344
- from src.tools.search_handler import SearchHandler
345
- from src.agent_factory.judges import JudgeHandler
346
- from src.utils.models import AgentEvent, Evidence
 
 
 
 
 
347
 
348
  logger = structlog.get_logger()
349
 
350
 
351
  class MagenticOrchestrator:
352
  """
353
- Magentic-based orchestrator - same API as Orchestrator.
354
 
355
- Uses Microsoft Agent Framework's MagenticBuilder for multi-agent coordination.
 
356
  """
357
 
358
  def __init__(
359
  self,
360
- search_handler: SearchHandler,
361
- judge_handler: JudgeHandler,
362
  max_rounds: int = 10,
363
- ):
364
- self._search_handler = search_handler
365
- self._judge_handler = judge_handler
366
- self._max_rounds = max_rounds
367
- self._evidence_store: dict[str, List[Evidence]] = {"current": []}
368
-
369
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
370
- """
371
- Run the Magentic workflow - same API as simple Orchestrator.
372
 
373
- Yields AgentEvent objects for real-time UI updates.
 
 
374
  """
375
- logger.info("Starting Magentic orchestrator", query=query)
 
 
 
 
376
 
377
- yield AgentEvent(
378
- type="started",
379
- message=f"Starting research (Magentic mode): {query}",
380
- iteration=0,
 
 
 
 
 
 
 
 
 
 
 
381
  )
382
 
383
- # Create agent wrappers
384
- search_agent = SearchAgent(self._search_handler)
385
- judge_agent = JudgeAgent(self._judge_handler, self._evidence_store)
386
-
387
- # Build Magentic workflow
388
- # Note: MagenticBuilder.participants takes named arguments for agent instances
389
- workflow = (
390
  MagenticBuilder()
391
  .participants(
392
  searcher=search_agent,
 
393
  judge=judge_agent,
 
394
  )
395
  .with_standard_manager(
396
- chat_client=OpenAIChatClient(),
397
  max_round_count=self._max_rounds,
398
  max_stall_count=3,
399
  max_reset_count=2,
@@ -401,139 +760,173 @@ class MagenticOrchestrator:
401
  .build()
402
  )
403
 
404
- # Task instruction for the manager
 
 
405
  task = f"""Research drug repurposing opportunities for: {query}
406
 
407
- Instructions:
408
- 1. Use SearcherAgent to find evidence from PubMed and web
409
- 2. Use JudgeAgent to evaluate if evidence is sufficient
410
- 3. If JudgeAgent says "continue", search with refined queries
411
- 4. If JudgeAgent says "synthesize", provide final synthesis
412
- 5. Stop when synthesis is ready or max rounds reached
413
-
414
- Focus on finding:
415
- - Mechanism of action evidence
416
- - Clinical/preclinical studies
417
- - Specific drug candidates
418
- """
 
419
 
420
  iteration = 0
421
  try:
422
- # workflow.run_stream returns an async generator of workflow events
423
  async for event in workflow.run_stream(task):
424
- if isinstance(event, MagenticOrchestratorMessageEvent):
425
- # Manager events (planning, instruction, ledger)
426
- message_text = event.message.text if event.message else ""
427
- yield AgentEvent(
428
- type="judging",
429
- message=f"Manager ({event.kind}): {message_text[:100]}...",
430
- iteration=iteration,
431
- )
432
-
433
- elif isinstance(event, MagenticAgentMessageEvent):
434
- # Complete agent response
435
- iteration += 1
436
- agent_name = event.agent_id or "unknown"
437
- msg_text = event.message.text if event.message else ""
438
-
439
- if "search" in agent_name.lower():
440
- # Check if we found evidence (based on SearchAgent logic)
441
- # In a real implementation we might extract metadata
442
- yield AgentEvent(
443
- type="search_complete",
444
- message=f"Search agent: {msg_text[:100]}...",
445
- iteration=iteration,
446
- )
447
- elif "judge" in agent_name.lower():
448
- yield AgentEvent(
449
- type="judge_complete",
450
- message=f"Judge agent: {msg_text[:100]}...",
451
- iteration=iteration,
452
- )
453
-
454
- elif isinstance(event, MagenticFinalResultEvent):
455
- # Final workflow result
456
- final_text = event.message.text if event.message else "No result"
457
- yield AgentEvent(
458
- type="complete",
459
- message=final_text,
460
- data={"iterations": iteration},
461
- iteration=iteration,
462
- )
463
-
464
- elif isinstance(event, MagenticAgentDeltaEvent):
465
- # Streaming token chunks from agents (optional "typing" effect)
466
- # Only emit if we have actual text content
467
- if event.text:
468
- yield AgentEvent(
469
- type="streaming",
470
- message=event.text,
471
- data={"agent_id": event.agent_id},
472
- iteration=iteration,
473
- )
474
-
475
- elif isinstance(event, WorkflowOutputEvent):
476
- # Alternative final output event
477
- if event.data:
478
- yield AgentEvent(
479
- type="complete",
480
- message=str(event.data),
481
- iteration=iteration,
482
- )
483
 
484
  except Exception as e:
485
  logger.error("Magentic workflow failed", error=str(e))
486
  yield AgentEvent(
487
  type="error",
488
- message=f"Workflow error: {str(e)}",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
489
  iteration=iteration,
490
  )
491
- ```
492
 
493
- ### 4.5 Factory Pattern (`src/orchestrator_factory.py`)
 
 
494
 
495
- Allow switching between implementations:
496
 
497
  ```python
498
  """Factory for creating orchestrators."""
499
- from typing import Literal
500
 
501
- from src.orchestrator import Orchestrator
502
- from src.tools.search_handler import SearchHandler
503
- from src.agent_factory.judges import JudgeHandler
504
  from src.utils.models import OrchestratorConfig
505
 
506
 
507
  def create_orchestrator(
508
- search_handler: SearchHandler,
509
- judge_handler: JudgeHandler,
510
  config: OrchestratorConfig | None = None,
511
  mode: Literal["simple", "magentic"] = "simple",
512
- ):
513
  """
514
  Create an orchestrator instance.
515
 
516
  Args:
517
- search_handler: The search handler
518
- judge_handler: The judge handler
519
  config: Optional configuration
520
- mode: "simple" for Phase 4 loop, "magentic" for Phase 5 multi-agent
521
 
522
  Returns:
523
- Orchestrator instance (same interface regardless of mode)
 
 
 
 
524
  """
525
  if mode == "magentic":
526
  try:
527
  from src.orchestrator_magentic import MagenticOrchestrator
 
528
  return MagenticOrchestrator(
529
- search_handler=search_handler,
530
- judge_handler=judge_handler,
531
  max_rounds=config.max_iterations if config else 10,
532
  )
533
  except ImportError:
534
  # Fallback to simple if agent-framework not installed
535
  pass
536
 
 
 
 
 
537
  return Orchestrator(
538
  search_handler=search_handler,
539
  judge_handler=judge_handler,
@@ -543,96 +936,156 @@ def create_orchestrator(
543
 
544
  ---
545
 
546
- ## 5. Directory Structure After Phase 5
 
 
547
 
548
  ```
549
- src/
550
- ├── app.py # Gradio UI (unchanged)
551
- ├── orchestrator.py # Phase 4 simple orchestrator
552
- ├── orchestrator_magentic.py # Phase 5 Magentic orchestrator
553
- ├── orchestrator_factory.py # Factory to switch implementations
554
- ├── agents/ # NEW: Agent wrappers
555
- │ ├── __init__.py
556
- │ ├── search_agent.py # SearchHandler as AgentProtocol
557
- │ └── judge_agent.py # JudgeHandler as AgentProtocol
558
- ├── agent_factory/
559
- │ └── judges.py # JudgeHandler (unchanged)
560
- ├── tools/
561
- │ ├── pubmed.py # PubMed tool (unchanged)
562
- │ ├── websearch.py # Web tool (unchanged)
563
- │ └── search_handler.py # SearchHandler (unchanged)
564
- └── utils/
565
- └── models.py # Models (unchanged)
 
 
 
 
 
 
 
 
566
  ```
567
 
568
  ---
569
 
570
- ## 6. Implementation Checklist
571
 
572
- - [ ] Ensure Phase 4 uses Protocol-based handler interfaces
573
- - [ ] Add `agent-framework-core` to optional dependencies
574
- - [ ] Create `src/agents/` directory
575
- - [ ] Implement `SearchAgent` wrapper
576
- - [ ] Implement `JudgeAgent` wrapper
577
- - [ ] Implement `MagenticOrchestrator`
578
- - [ ] Implement `orchestrator_factory.py`
579
- - [ ] Add tests for agent wrappers
580
- - [ ] Test Magentic flow end-to-end
581
- - [ ] Update `src/app.py` to use factory with mode toggle
 
 
 
582
 
583
  ---
584
 
585
- ## 7. Definition of Done
586
 
587
- Phase 5 is **COMPLETE** when:
 
 
 
 
 
 
 
 
 
 
588
 
589
- 1. All Phase 4 tests still pass (no regression)
590
- 2. `MagenticOrchestrator` has same API as `Orchestrator`
591
- 3. Can switch between modes via factory:
592
 
593
- ```python
594
- # Simple mode (Phase 4)
595
- orchestrator = create_orchestrator(search, judge, mode="simple")
 
596
 
597
- # Magentic mode (Phase 5)
598
- orchestrator = create_orchestrator(search, judge, mode="magentic")
599
 
600
- # Same usage!
601
- async for event in orchestrator.run("metformin alzheimer"):
602
- print(event.to_markdown())
603
- ```
604
 
605
- 4. UI works with both modes
606
- 5. Graceful fallback if agent-framework not installed
607
 
608
- ---
 
 
 
 
 
 
 
 
 
609
 
610
- ## 8. When to Implement
611
 
612
- **Priority**: LOW (optional enhancement)
613
 
614
- Implement ONLY after:
615
- 1. ✅ Phase 1: Foundation
616
- 2. ✅ Phase 2: Search
617
- 3. ✅ Phase 3: Judge
618
- 4. ✅ Phase 4: Orchestrator + UI (MVP SHIPPED)
619
 
620
- If hackathon deadline is approaching, **SKIP Phase 5**. Ship the MVP.
 
 
 
 
 
621
 
622
  ---
623
 
624
- ## 9. Benefits of This Design
625
 
626
- 1. **No breaking changes** - Phase 4 code works unchanged
627
- 2. **Same API** - `run()` returns `AsyncGenerator[AgentEvent, None]`
628
- 3. **Gradual adoption** - Optional dependency, factory fallback
629
- 4. **Testable** - Each component can be tested independently
630
- 5. **Aligns with Tonic's vision** - Uses Microsoft Agent Framework patterns
631
 
632
- ---
 
 
 
 
 
 
 
633
 
634
- ## 10. Reference
 
 
 
 
 
 
 
 
 
 
 
635
 
636
- - Microsoft Agent Framework: `reference_repos/agent-framework/`
637
- - Magentic samples: `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/magentic.py`
638
- - AgentProtocol: `reference_repos/agent-framework/python/packages/core/agent_framework/_agents.py`
1
+ # Phase 5 Implementation Spec: Magentic Integration
2
 
3
  **Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
4
  **Philosophy**: "Same API, Better Engine."
 
15
  - **Event streaming** for real-time UI updates
16
  - **Multi-agent coordination** with round limits and reset logic
17
 
 
 
18
  ---
19
 
20
+ ## 2. Critical Architecture Understanding
21
 
22
+ ### 2.1 How Magentic Actually Works
 
23
 
 
24
  ```
25
+ ┌─────────────────────────────────────────────────────────────────────────┐
26
+ │ MagenticBuilder Workflow │
27
+ ├─────────────────────────────────────────────────────────────────────────┤
28
+ │ │
29
+ │ User Task: "Research drug repurposing for metformin alzheimer" │
30
+ │ ↓ │
31
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
32
+ │ │ StandardMagenticManager │ │
33
+ │ │ │ │
34
+ │ │ 1. plan() → LLM generates facts & plan │ │
35
+ │ │ 2. create_progress_ledger() → LLM decides: │ │
36
+ │ │ - is_request_satisfied? │ │
37
+ │ │ - next_speaker: "searcher" │ │
38
+ │ │ - instruction_or_question: "Search for clinical trials..." │ │
39
+ │ │ │ │
40
+ │ └──────────────────────────────────────────────────────────────────┘ │
41
+ │ ↓ │
42
+ │ NATURAL LANGUAGE INSTRUCTION sent to agent │
43
+ │ "Search for clinical trials about metformin..." │
44
+ │ ↓ │
45
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
46
+ │ │ ChatAgent (searcher) │ │
47
+ │ │ │ │
48
+ │ │ chat_client (INTERNAL LLM) ← understands instruction │ │
49
+ │ │ ↓ │ │
50
+ │ │ "I'll search for metformin alzheimer clinical trials" │ │
51
+ │ │ ↓ │ │
52
+ │ │ tools=[search_pubmed, search_clinicaltrials] ← calls tools │ │
53
+ │ │ ↓ │ │
54
+ │ │ Returns natural language response to manager │ │
55
+ │ │ │ │
56
+ │ └──────────────────────────────────────────────────────────────────┘ │
57
+ │ ↓ │
58
+ │ Manager evaluates response │
59
+ │ Decides next agent or completion │
60
+ │ │
61
+ └─────────────────────────────────────────────────────────────────────────┘
62
  ```
63
 
64
+ ### 2.2 The Critical Insight
65
+
66
+ **Microsoft's ChatAgent has an INTERNAL LLM (`chat_client`) that:**
67
+ 1. Receives natural language instructions from the manager
68
+ 2. Understands what action to take
69
+ 3. Calls attached tools (functions)
70
+ 4. Returns natural language responses
71
+
72
+ **Our previous implementation was WRONG because:**
73
+ - We wrapped handlers as bare `BaseAgent` subclasses
74
+ - No internal LLM to understand instructions
75
+ - Raw instruction text was passed directly to APIs (PubMed doesn't understand "Search for clinical trials...")
76
+
77
+ ### 2.3 Correct Pattern: ChatAgent with Tools
78
+
79
+ ```python
80
+ # CORRECT: Agent backed by LLM that calls tools
81
+ from agent_framework import ChatAgent, AIFunction
82
+ from agent_framework.openai import OpenAIChatClient
83
+
84
+ # Define tool that ChatAgent can call
85
+ @AIFunction
86
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
87
+ """Search PubMed for biomedical literature.
88
+
89
+ Args:
90
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
91
+ max_results: Maximum number of results to return
92
+ """
93
+ result = await pubmed_tool.search(query, max_results)
94
+ return format_results(result)
95
+
96
+ # ChatAgent with internal LLM + tools
97
+ search_agent = ChatAgent(
98
+ name="SearchAgent",
99
+ description="Searches biomedical databases for drug repurposing evidence",
100
+ instructions="You search PubMed, ClinicalTrials.gov, and bioRxiv for evidence.",
101
+ chat_client=OpenAIChatClient(model_id="gpt-4o-mini"), # INTERNAL LLM
102
+ tools=[search_pubmed, search_clinicaltrials, search_biorxiv], # TOOLS
103
+ )
104
+ ```
105
 
106
  ---
107
 
108
+ ## 3. Correct Implementation
109
 
110
+ ### 3.1 Shared State Module (`src/agents/state.py`)
111
 
112
+ **CRITICAL**: Tools must update shared state so:
113
+ 1. EmbeddingService can deduplicate across searches
114
+ 2. ReportAgent can access structured Evidence objects for citations
115
 
116
  ```python
117
+ """Shared state for Magentic agents.
 
 
118
 
119
+ This module provides global state that tools update as a side effect.
120
+ ChatAgent tools return strings to the LLM, but also update this state
121
+ for semantic deduplication and structured citation access.
122
+ """
123
+ from __future__ import annotations
124
 
125
+ from typing import TYPE_CHECKING
 
 
 
126
 
127
+ import structlog
128
 
129
+ if TYPE_CHECKING:
130
+ from src.services.embeddings import EmbeddingService
 
 
131
 
132
+ from src.utils.models import Evidence
133
 
134
+ logger = structlog.get_logger()
 
 
 
 
135
 
 
136
 
137
+ class MagenticState:
138
+ """Shared state container for Magentic workflow.
139
 
140
+ Maintains:
141
+ - evidence_store: All collected Evidence objects (for citations)
142
+ - embedding_service: Optional semantic search (for deduplication)
143
+ """
144
 
145
+ def __init__(self) -> None:
146
+ self.evidence_store: list[Evidence] = []
147
+ self.embedding_service: EmbeddingService | None = None
148
+ self._seen_urls: set[str] = set()
149
 
150
+ def init_embedding_service(self) -> None:
151
+ """Lazy-initialize embedding service if available."""
152
+ if self.embedding_service is not None:
153
+ return
154
+ try:
155
+ from src.services.embeddings import get_embedding_service
156
+ self.embedding_service = get_embedding_service()
157
+ logger.info("Embedding service enabled for Magentic mode")
158
+ except Exception as e:
159
+ logger.warning("Embedding service unavailable", error=str(e))
160
 
161
+ async def add_evidence(self, evidence_list: list[Evidence]) -> list[Evidence]:
162
+ """Add evidence with semantic deduplication.
163
+
164
+ Args:
165
+ evidence_list: New evidence from search
166
+
167
+ Returns:
168
+ List of unique evidence (not duplicates)
169
+ """
170
+ if not evidence_list:
171
+ return []
172
+
173
+ # URL-based deduplication first (fast)
174
+ url_unique = [
175
+ e for e in evidence_list
176
+ if e.citation.url not in self._seen_urls
177
+ ]
178
+
179
+ # Semantic deduplication if available
180
+ if self.embedding_service and url_unique:
181
+ try:
182
+ unique = await self.embedding_service.deduplicate(url_unique, threshold=0.85)
183
+ logger.info(
184
+ "Semantic deduplication",
185
+ before=len(url_unique),
186
+ after=len(unique),
187
+ )
188
+ except Exception as e:
189
+ logger.warning("Deduplication failed, using URL-based", error=str(e))
190
+ unique = url_unique
191
+ else:
192
+ unique = url_unique
193
+
194
+ # Update state
195
+ for e in unique:
196
+ self._seen_urls.add(e.citation.url)
197
+ self.evidence_store.append(e)
198
+
199
+ return unique
200
+
201
+ async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
202
+ """Find semantically related evidence from vector store.
203
+
204
+ Args:
205
+ query: Search query
206
+ n_results: Number of related items
207
+
208
+ Returns:
209
+ Related Evidence objects (reconstructed from vector store)
210
+ """
211
+ if not self.embedding_service:
212
+ return []
213
 
214
+ try:
215
+ from src.utils.models import Citation
216
+
217
+ related = await self.embedding_service.search_similar(query, n_results)
218
+ evidence = []
219
+
220
+ for item in related:
221
+ if item["id"] in self._seen_urls:
222
+ continue # Already in results
223
+
224
+ meta = item.get("metadata", {})
225
+ authors_str = meta.get("authors", "")
226
+ authors = [a.strip() for a in authors_str.split(",") if a.strip()]
227
+
228
+ ev = Evidence(
229
+ content=item["content"],
230
+ citation=Citation(
231
+ title=meta.get("title", "Related Evidence"),
232
+ url=item["id"],
233
+ source=meta.get("source", "pubmed"),
234
+ date=meta.get("date", "n.d."),
235
+ authors=authors,
236
+ ),
237
+ relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
238
+ )
239
+ evidence.append(ev)
240
+
241
+ return evidence
242
+ except Exception as e:
243
+ logger.warning("Related search failed", error=str(e))
244
+ return []
245
 
246
+ def reset(self) -> None:
247
+ """Reset state for new workflow run."""
248
+ self.evidence_store.clear()
249
+ self._seen_urls.clear()
250
 
 
251
 
252
+ # Global singleton for workflow
253
+ _state: MagenticState | None = None
254
+
255
+
256
+ def get_magentic_state() -> MagenticState:
257
+ """Get or create the global Magentic state."""
258
+ global _state
259
+ if _state is None:
260
+ _state = MagenticState()
261
+ return _state
262
+
263
+
264
+ def reset_magentic_state() -> None:
265
+ """Reset state for a fresh workflow run."""
266
+ global _state
267
+ if _state is not None:
268
+ _state.reset()
269
+ else:
270
+ _state = MagenticState()
271
  ```
272
 
273
+ ### 3.2 Tool Functions (`src/agents/tools.py`)
274
 
275
+ Tools call external APIs AND update shared state: they return formatted strings to the LLM, but also store structured Evidence objects.
 
276
 
277
  ```python
278
+ """Tool functions for Magentic agents.
 
 
279
 
280
+ IMPORTANT: These tools do TWO things:
281
+ 1. Return formatted strings to the ChatAgent's internal LLM
282
+ 2. Update shared state (evidence_store, embeddings) as a side effect
283
 
284
+ This preserves semantic deduplication and structured citation access.
285
+ """
286
+ from agent_framework import AIFunction
287
 
288
+ from src.agents.state import get_magentic_state
289
+ from src.tools.biorxiv import BioRxivTool
290
+ from src.tools.clinicaltrials import ClinicalTrialsTool
291
+ from src.tools.pubmed import PubMedTool
292
 
293
+ # Singleton tool instances
294
+ _pubmed = PubMedTool()
295
+ _clinicaltrials = ClinicalTrialsTool()
296
+ _biorxiv = BioRxivTool()
 
297
 
 
 
 
298
 
299
+ def _format_results(results: list, source_name: str, query: str) -> str:
300
+ """Format search results for LLM consumption."""
301
+ if not results:
302
+ return f"No {source_name} results found for: {query}"
303
 
304
+ output = [f"Found {len(results)} {source_name} results:\n"]
305
+ for i, r in enumerate(results[:10], 1):
306
+ output.append(f"{i}. **{r.citation.title}**")
307
+ output.append(f" Source: {r.citation.source} | Date: {r.citation.date}")
308
+ output.append(f" {r.content[:300]}...")
309
+ output.append(f" URL: {r.citation.url}\n")
310
 
311
+ return "\n".join(output)
 
 
312
 
313
 
314
+ @AIFunction
315
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
316
+ """Search PubMed for biomedical research papers.
317
 
318
+ Use this tool to find peer-reviewed scientific literature about
319
+ drugs, diseases, mechanisms of action, and clinical studies.
 
 
 
320
 
321
+ Args:
322
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
323
+ max_results: Maximum results to return (default 10)
324
 
325
+ Returns:
326
+ Formatted list of papers with titles, abstracts, and citations
327
+ """
328
+ # 1. Execute search
329
+ results = await _pubmed.search(query, max_results)
330
 
331
+ # 2. Update shared state (semantic dedup + evidence store)
332
+ state = get_magentic_state()
333
+ unique = await state.add_evidence(results)
334
+
335
+ # 3. Also get related evidence from vector store
336
+ related = await state.search_related(query, n_results=3)
337
+ if related:
338
+ await state.add_evidence(related)
339
+
340
+ # 4. Return formatted string for LLM
341
+ total_new = len(unique)
342
+ total_stored = len(state.evidence_store)
343
+
344
+ output = _format_results(results, "PubMed", query)
345
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
346
+
347
+ if related:
348
+ output += f"\n[Also found {len(related)} semantically related items from previous searches]"
349
+
350
+ return output
351
+
352
+
353
+ @AIFunction
354
+ async def search_clinical_trials(query: str, max_results: int = 10) -> str:
355
+ """Search ClinicalTrials.gov for clinical studies.
356
+
357
+ Use this tool to find ongoing and completed clinical trials
358
+ for drug repurposing candidates.
359
+
360
+ Args:
361
+ query: Search terms (e.g., "metformin cancer phase 3")
362
+ max_results: Maximum results to return (default 10)
363
+
364
+ Returns:
365
+ Formatted list of clinical trials with status and details
366
+ """
367
+ # 1. Execute search
368
+ results = await _clinicaltrials.search(query, max_results)
369
+
370
+ # 2. Update shared state
371
+ state = get_magentic_state()
372
+ unique = await state.add_evidence(results)
373
+
374
+ # 3. Return formatted string
375
+ total_new = len(unique)
376
+ total_stored = len(state.evidence_store)
377
+
378
+ output = _format_results(results, "ClinicalTrials.gov", query)
379
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
380
+
381
+ return output
382
+
383
+
384
+ @AIFunction
385
+ async def search_preprints(query: str, max_results: int = 10) -> str:
386
+ """Search bioRxiv/medRxiv for preprint papers.
387
+
388
+ Use this tool to find the latest research that hasn't been
389
+ peer-reviewed yet. Good for cutting-edge findings.
390
+
391
+ Args:
392
+ query: Search terms (e.g., "long covid treatment")
393
+ max_results: Maximum results to return (default 10)
394
+
395
+ Returns:
396
+ Formatted list of preprints with abstracts and links
397
+ """
398
+ # 1. Execute search
399
+ results = await _biorxiv.search(query, max_results)
400
+
401
+ # 2. Update shared state
402
+ state = get_magentic_state()
403
+ unique = await state.add_evidence(results)
404
+
405
+ # 3. Return formatted string
406
+ total_new = len(unique)
407
+ total_stored = len(state.evidence_store)
408
+
409
+ output = _format_results(results, "bioRxiv/medRxiv", query)
410
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
411
+
412
+ return output
413
+
414
+
415
+ @AIFunction
416
+ async def get_evidence_summary() -> str:
417
+ """Get summary of all collected evidence.
418
+
419
+ Use this tool when you need to review what evidence has been collected
420
+ before making an assessment or generating a report.
421
+
422
+ Returns:
423
+ Summary of evidence store with counts and key citations
424
+ """
425
+ state = get_magentic_state()
426
+ evidence = state.evidence_store
427
+
428
+ if not evidence:
429
+ return "No evidence collected yet."
430
+
431
+ # Group by source
432
+ by_source: dict[str, list] = {}
433
+ for e in evidence:
434
+ src = e.citation.source
435
+ if src not in by_source:
436
+ by_source[src] = []
437
+ by_source[src].append(e)
438
+
439
+ output = [f"**Evidence Store Summary** ({len(evidence)} total items)\n"]
440
+
441
+ for source, items in by_source.items():
442
+ output.append(f"\n### {source.upper()} ({len(items)} items)")
443
+ for e in items[:5]: # First 5 per source
444
+ output.append(f"- {e.citation.title[:80]}...")
445
+
446
+ return "\n".join(output)
447
+
448
+
449
+ @AIFunction
450
+ async def get_bibliography() -> str:
451
+ """Get full bibliography of all collected evidence.
452
+
453
+ Use this tool when generating a final report to get properly
454
+ formatted citations for all evidence.
455
+
456
+ Returns:
457
+ Numbered bibliography with full citation details
458
+ """
459
+ state = get_magentic_state()
460
+ evidence = state.evidence_store
461
+
462
+ if not evidence:
463
+ return "No evidence collected for bibliography."
464
+
465
+ output = ["## References\n"]
466
+
467
+ for i, e in enumerate(evidence, 1):
468
+ # Format: Authors (Year). Title. Source. URL
469
+ authors = ", ".join(e.citation.authors[:3]) if e.citation.authors else "Unknown"
470
+ if e.citation.authors and len(e.citation.authors) > 3:
471
+ authors += " et al."
472
+
473
+ year = e.citation.date[:4] if e.citation.date else "n.d."
474
+
475
+ output.append(
476
+ f"{i}. {authors} ({year}). {e.citation.title}. "
477
+ f"*{e.citation.source.upper()}*. [{e.citation.url}]({e.citation.url})"
478
  )
479
 
480
+ return "\n".join(output)
 
 
481
  ```
482
 
483
+ ### 3.3 ChatAgent-Based Agents (`src/agents/magentic_agents.py`)
484
 
485
  ```python
486
+ """Magentic-compatible agents using ChatAgent pattern."""
487
+ from agent_framework import ChatAgent
488
+ from agent_framework.openai import OpenAIChatClient
489
 
490
+ from src.agents.tools import (
491
+ get_bibliography,
492
+ get_evidence_summary,
493
+ search_clinical_trials,
494
+ search_preprints,
495
+ search_pubmed,
496
+ )
497
+ from src.utils.config import settings
498
 
499
 
500
+ def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
501
+ """Create a search agent with internal LLM and search tools.
502
 
503
+ Args:
504
+ chat_client: Optional custom chat client. If None, uses default.
 
 
 
 
505
 
506
+ Returns:
507
+ ChatAgent configured for biomedical search
508
+ """
509
+ client = chat_client or OpenAIChatClient(
510
+ model_id="gpt-4o-mini", # Fast, cheap for tool orchestration
511
+ api_key=settings.openai_api_key,
512
+ )
513
 
514
+ return ChatAgent(
515
+ name="SearchAgent",
516
+ description="Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) for drug repurposing evidence",
517
+ instructions="""You are a biomedical search specialist. When asked to find evidence:
518
+
519
+ 1. Analyze the request to determine what to search for
520
+ 2. Extract key search terms (drug names, disease names, mechanisms)
521
+ 3. Use the appropriate search tools:
522
+ - search_pubmed for peer-reviewed papers
523
+ - search_clinical_trials for clinical studies
524
+ - search_preprints for cutting-edge findings
525
+ 4. Summarize what you found and highlight key evidence
526
+
527
+ Be thorough - search multiple databases when appropriate.
528
+ Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
529
+ chat_client=client,
530
+ tools=[search_pubmed, search_clinical_trials, search_preprints],
531
+ temperature=0.3, # More deterministic for tool use
532
+ )
533
 
 
 
 
534
 
535
+ def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
536
+ """Create a judge agent that evaluates evidence quality.
 
537
 
538
+ Args:
539
+ chat_client: Optional custom chat client. If None, uses default.
540
 
541
+ Returns:
542
+ ChatAgent configured for evidence assessment
543
+ """
544
+ client = chat_client or OpenAIChatClient(
545
+ model_id="gpt-4o", # Better model for nuanced judgment
546
+ api_key=settings.openai_api_key,
547
+ )
548
 
549
+ return ChatAgent(
550
+ name="JudgeAgent",
551
+ description="Evaluates evidence quality and determines if sufficient for synthesis",
552
+ instructions="""You are an evidence quality assessor. When asked to evaluate:
553
+
554
+ 1. First, call get_evidence_summary() to see all collected evidence
555
+ 2. Score on two dimensions (0-10 each):
556
+ - Mechanism Score: How well is the biological mechanism explained?
557
+ - Clinical Score: How strong is the clinical/preclinical evidence?
558
+ 3. Determine if evidence is SUFFICIENT for a final report:
559
+ - Sufficient: Clear mechanism + supporting clinical data
560
+ - Insufficient: Gaps in mechanism OR weak clinical evidence
561
+ 4. If insufficient, suggest specific search queries to fill gaps
562
+
563
+ Be rigorous but fair. Look for:
564
+ - Molecular targets and pathways
565
+ - Animal model studies
566
+ - Human clinical trials
567
+ - Safety data
568
+ - Drug-drug interactions""",
569
+ chat_client=client,
570
+ tools=[get_evidence_summary], # Can review collected evidence
571
+ temperature=0.2, # Consistent judgments
572
+ )
573
 
574
+
575
+ def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
576
+ """Create a hypothesis generation agent.
577
+
578
+ Args:
579
+ chat_client: Optional custom chat client. If None, uses default.
580
+
581
+ Returns:
582
+ ChatAgent configured for hypothesis generation
583
+ """
584
+ client = chat_client or OpenAIChatClient(
585
+ model_id="gpt-4o",
586
+ api_key=settings.openai_api_key,
587
+ )
588
+
589
+ return ChatAgent(
590
+ name="HypothesisAgent",
591
+ description="Generates mechanistic hypotheses for drug repurposing",
592
+ instructions="""You are a biomedical hypothesis generator. Based on evidence:
593
+
594
+ 1. Identify the key molecular targets involved
595
+ 2. Map the biological pathways affected
596
+ 3. Generate testable hypotheses in this format:
597
+
598
+ DRUG → TARGET → PATHWAY → THERAPEUTIC EFFECT
599
+
600
+ Example:
601
+ Metformin → AMPK activation → mTOR inhibition → Reduced tau phosphorylation
602
+
603
+ 4. Explain the rationale for each hypothesis
604
+ 5. Suggest what additional evidence would support or refute it
605
+
606
+ Focus on mechanistic plausibility and existing evidence.""",
607
+ chat_client=client,
608
+ temperature=0.5, # Some creativity for hypothesis generation
609
+ )
610
+
611
+
612
+ def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
613
+ """Create a report synthesis agent.
614
+
615
+ Args:
616
+ chat_client: Optional custom chat client. If None, uses default.
617
+
618
+ Returns:
619
+ ChatAgent configured for report generation
620
+ """
621
+ client = chat_client or OpenAIChatClient(
622
+ model_id="gpt-4o",
623
+ api_key=settings.openai_api_key,
624
+ )
625
+
626
+ return ChatAgent(
627
+ name="ReportAgent",
628
+ description="Synthesizes research findings into structured reports",
629
+ instructions="""You are a scientific report writer. When asked to synthesize:
630
+
631
+ 1. First, call get_evidence_summary() to review all collected evidence
632
+ 2. Then call get_bibliography() to get properly formatted citations
633
+
634
+ Generate a structured report with these sections:
635
+
636
+ ## Executive Summary
637
+ Brief overview of findings and recommendation
638
+
639
+ ## Methodology
640
+ Databases searched, queries used, evidence reviewed
641
+
642
+ ## Key Findings
643
+ ### Mechanism of Action
644
+ - Molecular targets
645
+ - Biological pathways
646
+ - Proposed mechanism
647
+
648
+ ### Clinical Evidence
649
+ - Preclinical studies
650
+ - Clinical trials
651
+ - Safety profile
652
+
653
+ ## Drug Candidates
654
+ List specific drugs with repurposing potential
655
+
656
+ ## Limitations
657
+ Gaps in evidence, conflicting data, caveats
658
+
659
+ ## Conclusion
660
+ Final recommendation with confidence level
661
+
662
+ ## References
663
+ Use the output from get_bibliography() - do not make up citations!
664
+
665
+ Be comprehensive but concise. Cite evidence for all claims.""",
666
+ chat_client=client,
667
+ tools=[get_evidence_summary, get_bibliography], # Access to collected evidence
668
+ temperature=0.3,
669
+ )
670
  ```
671
 
672
+ ### 3.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
673
 
674
  ```python
675
+ """Magentic-based orchestrator using ChatAgent pattern."""
676
+ from collections.abc import AsyncGenerator
677
+ from typing import Any
678
 
679
+ import structlog
680
  from agent_framework import (
681
+ MagenticAgentDeltaEvent,
682
+ MagenticAgentMessageEvent,
683
  MagenticBuilder,
684
  MagenticFinalResultEvent,
 
685
  MagenticOrchestratorMessageEvent,
 
686
  WorkflowOutputEvent,
687
  )
688
  from agent_framework.openai import OpenAIChatClient
689
 
690
+ from src.agents.magentic_agents import (
691
+ create_hypothesis_agent,
692
+ create_judge_agent,
693
+ create_report_agent,
694
+ create_search_agent,
695
+ )
696
+ from src.agents.state import get_magentic_state, reset_magentic_state
697
+ from src.utils.config import settings
698
+ from src.utils.exceptions import ConfigurationError
699
+ from src.utils.models import AgentEvent
700
 
701
  logger = structlog.get_logger()
702
 
703
 
704
  class MagenticOrchestrator:
705
  """
706
+ Magentic-based orchestrator using ChatAgent pattern.
707
 
708
+ Each agent has an internal LLM that understands natural language
709
+ instructions from the manager and can call tools appropriately.
710
  """
711
 
712
  def __init__(
713
  self,
 
 
714
  max_rounds: int = 10,
715
+ chat_client: OpenAIChatClient | None = None,
716
+ ) -> None:
717
+ """Initialize orchestrator.
718
 
719
+ Args:
720
+ max_rounds: Maximum coordination rounds
721
+ chat_client: Optional shared chat client for agents
722
  """
723
+ if not settings.openai_api_key:
724
+ raise ConfigurationError(
725
+ "Magentic mode requires OPENAI_API_KEY. "
726
+ "Set the key or use mode='simple'."
727
+ )
728
 
729
+ self._max_rounds = max_rounds
730
+ self._chat_client = chat_client
731
+
732
+ def _build_workflow(self) -> Any:
733
+ """Build the Magentic workflow with ChatAgent participants."""
734
+ # Create agents with internal LLMs
735
+ search_agent = create_search_agent(self._chat_client)
736
+ judge_agent = create_judge_agent(self._chat_client)
737
+ hypothesis_agent = create_hypothesis_agent(self._chat_client)
738
+ report_agent = create_report_agent(self._chat_client)
739
+
740
+ # Manager chat client (orchestrates the agents)
741
+ manager_client = OpenAIChatClient(
742
+ model_id="gpt-4o", # Good model for planning/coordination
743
+ api_key=settings.openai_api_key,
744
  )
745
 
746
+ return (
747
  MagenticBuilder()
748
  .participants(
749
  searcher=search_agent,
750
+ hypothesizer=hypothesis_agent,
751
  judge=judge_agent,
752
+ reporter=report_agent,
753
  )
754
  .with_standard_manager(
755
+ chat_client=manager_client,
756
  max_round_count=self._max_rounds,
757
  max_stall_count=3,
758
  max_reset_count=2,
 
760
  .build()
761
  )
762
 
763
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
764
+ """
765
+ Run the Magentic workflow.
766
+
767
+ Args:
768
+ query: User's research question
769
+
770
+ Yields:
771
+ AgentEvent objects for real-time UI updates
772
+ """
773
+ logger.info("Starting Magentic orchestrator", query=query)
774
+
775
+ # CRITICAL: Reset state for fresh workflow run
776
+ reset_magentic_state()
777
+
778
+ # Initialize embedding service if available
779
+ state = get_magentic_state()
780
+ state.init_embedding_service()
781
+
782
+ yield AgentEvent(
783
+ type="started",
784
+ message=f"Starting research (Magentic mode): {query}",
785
+ iteration=0,
786
+ )
787
+
788
+ workflow = self._build_workflow()
789
+
790
  task = f"""Research drug repurposing opportunities for: {query}
791
 
792
+ Workflow:
793
+ 1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
794
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
795
+ 3. JudgeAgent: Evaluate if evidence is sufficient
796
+ 4. If insufficient, SearchAgent refines the search based on gaps
797
+ 5. If sufficient, ReportAgent synthesizes the final report
798
+
799
+ Focus on:
800
+ - Identifying specific molecular targets
801
+ - Understanding mechanism of action
802
+ - Finding clinical evidence supporting hypotheses
803
+
804
+ The final output should be a structured research report."""
805
 
806
  iteration = 0
807
  try:
 
808
  async for event in workflow.run_stream(task):
809
+ agent_event = self._process_event(event, iteration)
810
+ if agent_event:
811
+ if isinstance(event, MagenticAgentMessageEvent):
812
+ iteration += 1
813
+ yield agent_event
814
 
815
  except Exception as e:
816
  logger.error("Magentic workflow failed", error=str(e))
817
  yield AgentEvent(
818
  type="error",
819
+ message=f"Workflow error: {e!s}",
820
+ iteration=iteration,
821
+ )
822
+
823
+ def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
824
+ """Process workflow event into AgentEvent."""
825
+ if isinstance(event, MagenticOrchestratorMessageEvent):
826
+ text = event.message.text if event.message else ""
827
+ if text:
828
+ return AgentEvent(
829
+ type="judging",
830
+ message=f"Manager ({event.kind}): {text[:200]}...",
831
+ iteration=iteration,
832
+ )
833
+
834
+ elif isinstance(event, MagenticAgentMessageEvent):
835
+ agent_name = event.agent_id or "unknown"
836
+ text = event.message.text if event.message else ""
837
+
838
+ event_type = "judging"
839
+ if "search" in agent_name.lower():
840
+ event_type = "search_complete"
841
+ elif "judge" in agent_name.lower():
842
+ event_type = "judge_complete"
843
+ elif "hypothes" in agent_name.lower():
844
+ event_type = "hypothesizing"
845
+ elif "report" in agent_name.lower():
846
+ event_type = "synthesizing"
847
+
848
+ return AgentEvent(
849
+ type=event_type,
850
+ message=f"{agent_name}: {text[:200]}...",
851
+ iteration=iteration + 1,
852
+ )
853
+
854
+ elif isinstance(event, MagenticFinalResultEvent):
855
+ text = event.message.text if event.message else "No result"
856
+ return AgentEvent(
857
+ type="complete",
858
+ message=text,
859
+ data={"iterations": iteration},
860
  iteration=iteration,
861
  )
 
862
 
863
+ elif isinstance(event, MagenticAgentDeltaEvent):
864
+ if event.text:
865
+ return AgentEvent(
866
+ type="streaming",
867
+ message=event.text,
868
+ data={"agent_id": event.agent_id},
869
+ iteration=iteration,
870
+ )
871
+
872
+ elif isinstance(event, WorkflowOutputEvent):
873
+ if event.data:
874
+ return AgentEvent(
875
+ type="complete",
876
+ message=str(event.data),
877
+ iteration=iteration,
878
+ )
879
+
880
+ return None
881
+ ```
882
 
883
+ ### 3.5 Updated Factory (`src/orchestrator_factory.py`)
884
 
885
  ```python
886
  """Factory for creating orchestrators."""
887
+ from typing import Any, Literal
888
 
889
+ from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
 
 
890
  from src.utils.models import OrchestratorConfig
891
 
892
 
893
  def create_orchestrator(
894
+ search_handler: SearchHandlerProtocol | None = None,
895
+ judge_handler: JudgeHandlerProtocol | None = None,
896
  config: OrchestratorConfig | None = None,
897
  mode: Literal["simple", "magentic"] = "simple",
898
+ ) -> Any:
899
  """
900
  Create an orchestrator instance.
901
 
902
  Args:
903
+ search_handler: The search handler (required for simple mode)
904
+ judge_handler: The judge handler (required for simple mode)
905
  config: Optional configuration
906
+ mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent
907
 
908
  Returns:
909
+ Orchestrator instance
910
+
911
+ Note:
912
+ Magentic mode does NOT use search_handler/judge_handler.
913
+ It creates ChatAgent instances with internal LLMs that call tools directly.
914
  """
915
  if mode == "magentic":
916
  try:
917
  from src.orchestrator_magentic import MagenticOrchestrator
918
+
919
  return MagenticOrchestrator(
 
 
920
  max_rounds=config.max_iterations if config else 10,
921
  )
922
  except ImportError:
923
  # Fallback to simple if agent-framework not installed
924
  pass
925
 
926
+ # Simple mode requires handlers
927
+ if search_handler is None or judge_handler is None:
928
+ raise ValueError("Simple mode requires search_handler and judge_handler")
929
+
930
  return Orchestrator(
931
  search_handler=search_handler,
932
  judge_handler=judge_handler,
 
936
 
937
  ---
938
 
939
+ ## 4. Why This Works
940
+
941
+ ### 4.1 The Manager → Agent Communication
942
 
943
  ```
944
+ Manager LLM decides: "Tell SearchAgent to find clinical trials for metformin"
945
+
946
+ Sends instruction: "Search for clinical trials about metformin and cancer"
947
+
948
+ SearchAgent's INTERNAL LLM receives this
949
+
950
+ Internal LLM understands: "I should call search_clinical_trials('metformin cancer')"
951
+
952
+ Tool executes: ClinicalTrials.gov API
953
+
954
+ Internal LLM formats response: "I found 15 trials. Here are the key ones..."
955
+
956
+ Manager receives natural language response
957
+ ```
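+ 
+ A minimal sketch of that hand-off when driving a single agent directly (illustrative only: the exact `ChatAgent` invocation API and response attributes may differ in agent-framework):
+ 
+ ```python
+ # Hypothetical sketch of the manager -> agent hand-off described above.
+ # Assumes create_search_agent() from section 3.3 and an awaitable agent call.
+ from src.agents.magentic_agents import create_search_agent
+ 
+ async def demo() -> None:
+     agent = create_search_agent()
+     # The manager would send an instruction like this; the agent's internal LLM
+     # translates it into tool calls such as search_clinical_trials("metformin cancer").
+     response = await agent.run("Search for clinical trials about metformin and cancer")
+     print(response.text)  # natural-language summary returned to the manager
+ ```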
958
+
959
+ ### 4.2 Why Our Old Implementation Failed
960
+
961
+ ```
962
+ Manager sends: "Search for clinical trials about metformin..."
963
+
964
+ OLD SearchAgent.run() extracts: query = "Search for clinical trials about metformin..."
965
+
966
+ Passes to PubMed: pubmed.search("Search for clinical trials about metformin...")
967
+
968
+ PubMed doesn't understand English instructions → garbage results or error
969
  ```
970
 
971
  ---
972
 
973
+ ## 5. Directory Structure
974
 
975
+ ```text
976
+ src/
977
+ ├── agents/
978
+ │ ├── __init__.py
979
+ │ ├── state.py # MagenticState (evidence_store + embeddings)
980
+ │ ├── tools.py # AIFunction tool definitions (update state)
981
+ │ └── magentic_agents.py # ChatAgent factory functions
982
+ ├── services/
983
+ │ └── embeddings.py # EmbeddingService (semantic dedup)
984
+ ├── orchestrator.py # Simple mode (unchanged)
985
+ ├── orchestrator_magentic.py # Magentic mode with ChatAgents
986
+ └── orchestrator_factory.py # Mode selection
987
+ ```
988
 
989
  ---
990
 
991
+ ## 6. Dependencies
992
 
993
+ ```toml
994
+ [project.optional-dependencies]
995
+ magentic = [
996
+ "agent-framework-core>=1.0.0b",
997
+ "agent-framework-openai>=1.0.0b", # For OpenAIChatClient
998
+ ]
999
+ embeddings = [
1000
+ "chromadb>=0.4.0",
1001
+ "sentence-transformers>=2.2.0",
1002
+ ]
1003
+ ```
1004
 
1005
+ **IMPORTANT: Magentic mode REQUIRES an OpenAI API key.**
 
 
1006
 
1007
+ The Microsoft Agent Framework's standard manager and ChatAgent use OpenAIChatClient internally.
1008
+ There is no AnthropicChatClient in the framework. If only `ANTHROPIC_API_KEY` is set:
1009
+ - `mode="simple"` works fine
1010
+ - `mode="magentic"` throws `ConfigurationError`
1011
 
1012
+ This is enforced in `MagenticOrchestrator.__init__`.
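+ 
+ A small sketch of how callers can degrade gracefully when no OpenAI key is configured (illustrative; assumes `search_handler`/`judge_handler` were built as in earlier phases):
+ 
+ ```python
+ # Hedged sketch: pick the orchestrator mode based on available keys.
+ from src.orchestrator_factory import create_orchestrator
+ from src.utils.config import settings
+ 
+ mode = "magentic" if settings.openai_api_key else "simple"
+ orchestrator = create_orchestrator(
+     search_handler=search_handler,  # only used by simple mode
+     judge_handler=judge_handler,
+     mode=mode,
+ )
+ ```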
 
1013
 
1014
+ ---
 
 
 
1015
 
1016
+ ## 7. Implementation Checklist
 
1017
 
1018
+ - [ ] Create `src/agents/state.py` with MagenticState class
1019
+ - [ ] Create `src/agents/tools.py` with AIFunction search tools + state updates
1020
+ - [ ] Create `src/agents/magentic_agents.py` with ChatAgent factories
1021
+ - [ ] Rewrite `src/orchestrator_magentic.py` to use ChatAgent pattern
1022
+ - [ ] Update `src/orchestrator_factory.py` for new signature
1023
+ - [ ] Test with real OpenAI API
1024
+ - [ ] Verify manager properly coordinates agents
1025
+ - [ ] Ensure tools are called with correct parameters
1026
+ - [ ] Verify semantic deduplication works (evidence_store populates)
1027
+ - [ ] Verify bibliography generation in final reports
1028
 
1029
+ ---
1030
 
1031
+ ## 8. Definition of Done
1032
 
1033
+ Phase 5 is **COMPLETE** when:
 
 
 
 
1034
 
1035
+ 1. Magentic mode runs without hanging
1036
+ 2. Manager successfully coordinates agents via natural language
1037
+ 3. SearchAgent calls tools with proper search keywords (not raw instructions)
1038
+ 4. JudgeAgent evaluates evidence from conversation history
1039
+ 5. ReportAgent generates structured final report
1040
+ 6. Events stream to UI correctly
1041
 
1042
  ---
1043
 
1044
+ ## 9. Testing Magentic Mode
1045
 
1046
+ ```bash
1047
+ # Test with real API
1048
+ OPENAI_API_KEY=sk-... uv run python -c "
1049
+ import asyncio
1050
+ from src.orchestrator_factory import create_orchestrator
1051
 
1052
+ async def test():
1053
+ orch = create_orchestrator(mode='magentic')
1054
+ async for event in orch.run('metformin alzheimer'):
1055
+ print(f'[{event.type}] {event.message[:100]}')
1056
+
1057
+ asyncio.run(test())
1058
+ "
1059
+ ```
1060
 
1061
+ Expected output:
1062
+ ```
1063
+ [started] Starting research (Magentic mode): metformin alzheimer
1064
+ [judging] Manager (plan): I will coordinate the agents to research...
1065
+ [search_complete] SearchAgent: Found 25 PubMed results for metformin alzheimer...
1066
+ [hypothesizing] HypothesisAgent: Based on the evidence, I propose...
1067
+ [judge_complete] JudgeAgent: Mechanism Score: 7/10, Clinical Score: 6/10...
1068
+ [synthesizing] ReportAgent: ## Executive Summary...
1069
+ [complete] <full research report>
1070
+ ```
1071
+
1072
+ ---
1073
 
1074
+ ## 10. Key Differences from Old Spec
1075
+
1076
+ | Aspect | OLD (Wrong) | NEW (Correct) |
1077
+ |--------|-------------|---------------|
1078
+ | Agent type | `BaseAgent` subclass | `ChatAgent` with `chat_client` |
1079
+ | Internal LLM | None | OpenAIChatClient |
1080
+ | How tools work | Handler.execute(raw_instruction) | LLM understands instruction, calls AIFunction |
1081
+ | Message handling | Extract text → pass to API | LLM interprets → extracts keywords → calls tool |
1082
+ | State management | Passed to agent constructors | Global MagenticState singleton |
1083
+ | Evidence storage | In agent instance | In MagenticState.evidence_store |
1084
+ | Semantic search | Coupled to agents | Tools call state.add_evidence() |
1085
+ | Citations for report | From agent's store | Via get_bibliography() tool |
1086
+
1087
+ **Key Insights:**
1088
+ 1. Magentic agents must have internal LLMs to understand natural language instructions
1089
+ 2. Tools must update shared state as a side effect (return strings, but also store Evidence)
1090
+ 3. ReportAgent uses `get_bibliography()` tool to access structured citations
1091
+ 4. State is reset at start of each workflow run via `reset_magentic_state()`
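+ 
+ Taken together, the tool/state contract above looks roughly like this (a sketch against the section 3.1 singleton API; the committed `src/agents/state.py` uses contextvars instead, so names differ slightly):
+ 
+ ```python
+ # Illustrative sketch only - mirrors the contract described above.
+ from src.agents.state import get_magentic_state, reset_magentic_state
+ from src.utils.models import Evidence
+ 
+ async def one_workflow_run(new_results: list[Evidence]) -> None:
+     reset_magentic_state()                 # insight 4: fresh state per run
+     state = get_magentic_state()
+     state.init_embedding_service()         # optional semantic dedup
+ 
+     # Tools do this as a side effect after each search (insight 2)
+     unique = await state.add_evidence(new_results)
+ 
+     # ReportAgent later reads state.evidence_store via get_bibliography() (insight 3)
+     print(f"{len(unique)} new items, {len(state.evidence_store)} total stored")
+ ```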
pyproject.toml CHANGED
@@ -16,6 +16,7 @@ dependencies = [
16
  "httpx>=0.27", # Async HTTP client (PubMed)
17
  "beautifulsoup4>=4.12", # HTML parsing
18
  "xmltodict>=0.13", # PubMed XML -> dict
 
19
  # UI
20
  "gradio[mcp]>=6.0.0", # Chat interface with MCP server support (6.0 required for css in launch())
21
  # Utils
@@ -42,7 +43,7 @@ dev = [
42
  "pre-commit>=3.7",
43
  ]
44
  magentic = [
45
- "agent-framework-core",
46
  ]
47
  embeddings = [
48
  "chromadb>=0.4.0",
@@ -132,5 +133,5 @@ exclude_lines = [
132
  "raise NotImplementedError",
133
  ]
134
 
135
- # Note: agent-framework-core is optional and installed locally for magentic mode
136
- # CI skips tests that require it via pytest.importorskip
 
16
  "httpx>=0.27", # Async HTTP client (PubMed)
17
  "beautifulsoup4>=4.12", # HTML parsing
18
  "xmltodict>=0.13", # PubMed XML -> dict
19
+ "huggingface-hub>=0.20.0", # Hugging Face Inference API
20
  # UI
21
  "gradio[mcp]>=6.0.0", # Chat interface with MCP server support (6.0 required for css in launch())
22
  # Utils
 
43
  "pre-commit>=3.7",
44
  ]
45
  magentic = [
46
+ "agent-framework-core>=1.0.0b251120,<2.0.0", # Pin to avoid breaking changes
47
  ]
48
  embeddings = [
49
  "chromadb>=0.4.0",
 
133
  "raise NotImplementedError",
134
  ]
135
 
136
+ # Note: agent-framework-core is optional for magentic mode (multi-agent orchestration)
137
+ # Version pinned to 1.0.0b* to avoid breaking changes. CI skips tests via pytest.importorskip
src/agent_factory/judges.py CHANGED
@@ -1,13 +1,17 @@
1
  """Judge handler for evidence assessment using PydanticAI."""
2
 
3
- from typing import Any
 
 
4
 
5
  import structlog
 
6
  from pydantic_ai import Agent
7
  from pydantic_ai.models.anthropic import AnthropicModel
8
  from pydantic_ai.models.openai import OpenAIModel
9
  from pydantic_ai.providers.anthropic import AnthropicProvider
10
  from pydantic_ai.providers.openai import OpenAIProvider
 
11
 
12
  from src.prompts.judge import (
13
  SYSTEM_PROMPT,
@@ -146,6 +150,207 @@ class JudgeHandler:
146
  )
147
 
148
 
149
  class MockJudgeHandler:
150
  """
151
  Mock JudgeHandler for demo mode without LLM calls.
 
1
  """Judge handler for evidence assessment using PydanticAI."""
2
 
3
+ import asyncio
4
+ import json
5
+ from typing import Any, ClassVar
6
 
7
  import structlog
8
+ from huggingface_hub import InferenceClient
9
  from pydantic_ai import Agent
10
  from pydantic_ai.models.anthropic import AnthropicModel
11
  from pydantic_ai.models.openai import OpenAIModel
12
  from pydantic_ai.providers.anthropic import AnthropicProvider
13
  from pydantic_ai.providers.openai import OpenAIProvider
14
+ from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
15
 
16
  from src.prompts.judge import (
17
  SYSTEM_PROMPT,
 
150
  )
151
 
152
 
153
+ class HFInferenceJudgeHandler:
154
+ """
155
+ JudgeHandler using HuggingFace Inference API for FREE LLM calls.
156
+ Defaults to Llama-3.1-8B-Instruct (requires HF_TOKEN) or falls back to public models.
157
+ """
158
+
159
+ FALLBACK_MODELS: ClassVar[list[str]] = [
160
+ "meta-llama/Llama-3.1-8B-Instruct", # Primary (Gated)
161
+ "mistralai/Mistral-7B-Instruct-v0.3", # Secondary
162
+ "HuggingFaceH4/zephyr-7b-beta", # Fallback (Ungated)
163
+ ]
164
+
165
+ def __init__(self, model_id: str | None = None) -> None:
166
+ """
167
+ Initialize with HF Inference client.
168
+
169
+ Args:
170
+ model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
171
+ """
172
+ self.model_id = model_id
173
+ # Will automatically use HF_TOKEN from env if available
174
+ self.client = InferenceClient()
175
+ self.call_count = 0
176
+ self.last_question: str | None = None
177
+ self.last_evidence: list[Evidence] | None = None
178
+
179
+ async def assess(
180
+ self,
181
+ question: str,
182
+ evidence: list[Evidence],
183
+ ) -> JudgeAssessment:
184
+ """
185
+ Assess evidence using HuggingFace Inference API.
186
+ Attempts models in order until one succeeds.
187
+ """
188
+ self.call_count += 1
189
+ self.last_question = question
190
+ self.last_evidence = evidence
191
+
192
+ # Format the user prompt
193
+ if evidence:
194
+ user_prompt = format_user_prompt(question, evidence)
195
+ else:
196
+ user_prompt = format_empty_evidence_prompt(question)
197
+
198
+ models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
199
+ last_error: Exception | None = None
200
+
201
+ for model in models_to_try:
202
+ try:
203
+ return await self._call_with_retry(model, user_prompt, question)
204
+ except Exception as e:
205
+ logger.warning("Model failed", model=model, error=str(e))
206
+ last_error = e
207
+ continue
208
+
209
+ # All models failed
210
+ logger.error("All HF models failed", error=str(last_error))
211
+ return self._create_fallback_assessment(question, str(last_error))
212
+
213
+ @retry(
214
+ stop=stop_after_attempt(3),
215
+ wait=wait_exponential(multiplier=1, min=1, max=4),
216
+ retry=retry_if_exception_type(Exception),
217
+ reraise=True,
218
+ )
219
+ async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
220
+ """Make API call with retry logic using chat_completion."""
221
+ loop = asyncio.get_running_loop()
222
+
223
+ # Build messages for chat_completion (model-agnostic)
224
+ messages = [
225
+ {
226
+ "role": "system",
227
+ "content": f"""{SYSTEM_PROMPT}
228
+
229
+ IMPORTANT: Respond with ONLY valid JSON matching this schema:
230
+ {{
231
+ "details": {{
232
+ "mechanism_score": <int 0-10>,
233
+ "mechanism_reasoning": "<string>",
234
+ "clinical_evidence_score": <int 0-10>,
235
+ "clinical_reasoning": "<string>",
236
+ "drug_candidates": ["<string>", ...],
237
+ "key_findings": ["<string>", ...]
238
+ }},
239
+ "sufficient": <bool>,
240
+ "confidence": <float 0-1>,
241
+ "recommendation": "continue" | "synthesize",
242
+ "next_search_queries": ["<string>", ...],
243
+ "reasoning": "<string>"
244
+ }}""",
245
+ },
246
+ {"role": "user", "content": prompt},
247
+ ]
248
+
249
+ # Use chat_completion (conversational task - supported by all models)
250
+ response = await loop.run_in_executor(
251
+ None,
252
+ lambda: self.client.chat_completion(
253
+ messages=messages,
254
+ model=model,
255
+ max_tokens=1024,
256
+ temperature=0.1,
257
+ ),
258
+ )
259
+
260
+ # Extract content from response
261
+ content = response.choices[0].message.content
262
+ if not content:
263
+ raise ValueError("Empty response from model")
264
+
265
+ # Extract and parse JSON
266
+ json_data = self._extract_json(content)
267
+ if not json_data:
268
+ raise ValueError("No valid JSON found in response")
269
+
270
+ return JudgeAssessment(**json_data)
271
+
272
+ def _extract_json(self, text: str) -> dict[str, Any] | None:
273
+ """
274
+ Robust JSON extraction that handles markdown blocks and nested braces.
275
+ """
276
+ text = text.strip()
277
+
278
+ # Remove markdown code blocks if present (with bounds checking)
279
+ if "```json" in text:
280
+ parts = text.split("```json", 1)
281
+ if len(parts) > 1:
282
+ inner_parts = parts[1].split("```", 1)
283
+ text = inner_parts[0]
284
+ elif "```" in text:
285
+ parts = text.split("```", 1)
286
+ if len(parts) > 1:
287
+ inner_parts = parts[1].split("```", 1)
288
+ text = inner_parts[0]
289
+
290
+ text = text.strip()
291
+
292
+ # Find first '{'
293
+ start_idx = text.find("{")
294
+ if start_idx == -1:
295
+ return None
296
+
297
+ # Stack-based parsing ignoring chars in strings
298
+ count = 0
299
+ in_string = False
300
+ escape = False
301
+
302
+ for i, char in enumerate(text[start_idx:], start=start_idx):
303
+ if in_string:
304
+ if escape:
305
+ escape = False
306
+ elif char == "\\":
307
+ escape = True
308
+ elif char == '"':
309
+ in_string = False
310
+ elif char == '"':
311
+ in_string = True
312
+ elif char == "{":
313
+ count += 1
314
+ elif char == "}":
315
+ count -= 1
316
+ if count == 0:
317
+ try:
318
+ result = json.loads(text[start_idx : i + 1])
319
+ if isinstance(result, dict):
320
+ return result
321
+ return None
322
+ except json.JSONDecodeError:
323
+ return None
324
+
325
+ return None
326
+
327
+ def _create_fallback_assessment(
328
+ self,
329
+ question: str,
330
+ error: str,
331
+ ) -> JudgeAssessment:
332
+ """Create a fallback assessment when inference fails."""
333
+ return JudgeAssessment(
334
+ details=AssessmentDetails(
335
+ mechanism_score=0,
336
+ mechanism_reasoning=f"Assessment failed: {error}",
337
+ clinical_evidence_score=0,
338
+ clinical_reasoning=f"Assessment failed: {error}",
339
+ drug_candidates=[],
340
+ key_findings=[],
341
+ ),
342
+ sufficient=False,
343
+ confidence=0.0,
344
+ recommendation="continue",
345
+ next_search_queries=[
346
+ f"{question} mechanism",
347
+ f"{question} clinical trials",
348
+ f"{question} drug candidates",
349
+ ],
350
+ reasoning=f"HF Inference failed: {error}. Recommend configuring OpenAI/Anthropic key.",
351
+ )
352
+
353
+
354
  class MockJudgeHandler:
355
  """
356
  Mock JudgeHandler for demo mode without LLM calls.
src/agents/magentic_agents.py ADDED
@@ -0,0 +1,184 @@
1
+ """Magentic-compatible agents using ChatAgent pattern."""
2
+
3
+ from agent_framework import ChatAgent
4
+ from agent_framework.openai import OpenAIChatClient
5
+
6
+ from src.agents.tools import (
7
+ get_bibliography,
8
+ search_clinical_trials,
9
+ search_preprints,
10
+ search_pubmed,
11
+ )
12
+ from src.utils.config import settings
13
+
14
+
15
+ def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
16
+ """Create a search agent with internal LLM and search tools.
17
+
18
+ Args:
19
+ chat_client: Optional custom chat client. If None, uses default.
20
+
21
+ Returns:
22
+ ChatAgent configured for biomedical search
23
+ """
24
+ client = chat_client or OpenAIChatClient(
25
+ model_id=settings.openai_model, # Use configured model
26
+ api_key=settings.openai_api_key,
27
+ )
28
+
29
+ return ChatAgent(
30
+ name="SearchAgent",
31
+ description=(
32
+ "Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) "
33
+ "for drug repurposing evidence"
34
+ ),
35
+ instructions="""You are a biomedical search specialist. When asked to find evidence:
36
+
37
+ 1. Analyze the request to determine what to search for
38
+ 2. Extract key search terms (drug names, disease names, mechanisms)
39
+ 3. Use the appropriate search tools:
40
+ - search_pubmed for peer-reviewed papers
41
+ - search_clinical_trials for clinical studies
42
+ - search_preprints for cutting-edge findings
43
+ 4. Summarize what you found and highlight key evidence
44
+
45
+ Be thorough - search multiple databases when appropriate.
46
+ Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
47
+ chat_client=client,
48
+ tools=[search_pubmed, search_clinical_trials, search_preprints],
49
+ temperature=0.3, # More deterministic for tool use
50
+ )
51
+
52
+
53
+ def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
54
+ """Create a judge agent that evaluates evidence quality.
55
+
56
+ Args:
57
+ chat_client: Optional custom chat client. If None, uses default.
58
+
59
+ Returns:
60
+ ChatAgent configured for evidence assessment
61
+ """
62
+ client = chat_client or OpenAIChatClient(
63
+ model_id=settings.openai_model,
64
+ api_key=settings.openai_api_key,
65
+ )
66
+
67
+ return ChatAgent(
68
+ name="JudgeAgent",
69
+ description="Evaluates evidence quality and determines if sufficient for synthesis",
70
+ instructions="""You are an evidence quality assessor. When asked to evaluate:
71
+
72
+ 1. Review all evidence presented in the conversation
73
+ 2. Score on two dimensions (0-10 each):
74
+ - Mechanism Score: How well is the biological mechanism explained?
75
+ - Clinical Score: How strong is the clinical/preclinical evidence?
76
+ 3. Determine if evidence is SUFFICIENT for a final report:
77
+ - Sufficient: Clear mechanism + supporting clinical data
78
+ - Insufficient: Gaps in mechanism OR weak clinical evidence
79
+ 4. If insufficient, suggest specific search queries to fill gaps
80
+
81
+ Be rigorous but fair. Look for:
82
+ - Molecular targets and pathways
83
+ - Animal model studies
84
+ - Human clinical trials
85
+ - Safety data
86
+ - Drug-drug interactions""",
87
+ chat_client=client,
88
+ temperature=0.2, # Consistent judgments
89
+ )
90
+
91
+
92
+ def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
93
+ """Create a hypothesis generation agent.
94
+
95
+ Args:
96
+ chat_client: Optional custom chat client. If None, uses default.
97
+
98
+ Returns:
99
+ ChatAgent configured for hypothesis generation
100
+ """
101
+ client = chat_client or OpenAIChatClient(
102
+ model_id=settings.openai_model,
103
+ api_key=settings.openai_api_key,
104
+ )
105
+
106
+ return ChatAgent(
107
+ name="HypothesisAgent",
108
+ description="Generates mechanistic hypotheses for drug repurposing",
109
+ instructions="""You are a biomedical hypothesis generator. Based on evidence:
110
+
111
+ 1. Identify the key molecular targets involved
112
+ 2. Map the biological pathways affected
113
+ 3. Generate testable hypotheses in this format:
114
+
115
+ DRUG -> TARGET -> PATHWAY -> THERAPEUTIC EFFECT
116
+
117
+ Example:
118
+ Metformin -> AMPK activation -> mTOR inhibition -> Reduced tau phosphorylation
119
+
120
+ 4. Explain the rationale for each hypothesis
121
+ 5. Suggest what additional evidence would support or refute it
122
+
123
+ Focus on mechanistic plausibility and existing evidence.""",
124
+ chat_client=client,
125
+ temperature=0.5, # Some creativity for hypothesis generation
126
+ )
127
+
128
+
129
+ def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
130
+ """Create a report synthesis agent.
131
+
132
+ Args:
133
+ chat_client: Optional custom chat client. If None, uses default.
134
+
135
+ Returns:
136
+ ChatAgent configured for report generation
137
+ """
138
+ client = chat_client or OpenAIChatClient(
139
+ model_id=settings.openai_model,
140
+ api_key=settings.openai_api_key,
141
+ )
142
+
143
+ return ChatAgent(
144
+ name="ReportAgent",
145
+ description="Synthesizes research findings into structured reports",
146
+ instructions="""You are a scientific report writer. When asked to synthesize:
147
+
148
+ Generate a structured report with these sections:
149
+
150
+ ## Executive Summary
151
+ Brief overview of findings and recommendation
152
+
153
+ ## Methodology
154
+ Databases searched, queries used, evidence reviewed
155
+
156
+ ## Key Findings
157
+ ### Mechanism of Action
158
+ - Molecular targets
159
+ - Biological pathways
160
+ - Proposed mechanism
161
+
162
+ ### Clinical Evidence
163
+ - Preclinical studies
164
+ - Clinical trials
165
+ - Safety profile
166
+
167
+ ## Drug Candidates
168
+ List specific drugs with repurposing potential
169
+
170
+ ## Limitations
171
+ Gaps in evidence, conflicting data, caveats
172
+
173
+ ## Conclusion
174
+ Final recommendation with confidence level
175
+
176
+ ## References
177
+ Use the 'get_bibliography' tool to fetch the complete list of citations.
178
+ Format them as a numbered list.
179
+
180
+ Be comprehensive but concise. Cite evidence for all claims.""",
181
+ chat_client=client,
182
+ tools=[get_bibliography],
183
+ temperature=0.3,
184
+ )
src/agents/state.py ADDED
@@ -0,0 +1,90 @@
1
+ """Thread-safe state management for Magentic agents.
2
+
3
+ Uses contextvars to ensure isolation between concurrent requests (e.g., multiple users
4
+ searching simultaneously via Gradio).
5
+ """
6
+
7
+ from contextvars import ContextVar
8
+ from typing import TYPE_CHECKING, Any
9
+
10
+ from pydantic import BaseModel, Field
11
+
12
+ from src.utils.models import Citation, Evidence
13
+
14
+ if TYPE_CHECKING:
15
+ from src.services.embeddings import EmbeddingService
16
+
17
+
18
+ class MagenticState(BaseModel):
19
+ """Mutable state for a Magentic workflow session."""
20
+
21
+ evidence: list[Evidence] = Field(default_factory=list)
22
+ # Type as Any to avoid circular imports/runtime resolution issues
23
+ # The actual object injected will be an EmbeddingService instance
24
+ embedding_service: Any = None
25
+
26
+ model_config = {"arbitrary_types_allowed": True}
27
+
28
+ def add_evidence(self, new_evidence: list[Evidence]) -> int:
29
+ """Add new evidence, deduplicating by URL.
30
+
31
+ Returns:
32
+ Number of *new* items added.
33
+ """
34
+ existing_urls = {e.citation.url for e in self.evidence}
35
+ count = 0
36
+ for item in new_evidence:
37
+ if item.citation.url not in existing_urls:
38
+ self.evidence.append(item)
39
+ existing_urls.add(item.citation.url)
40
+ count += 1
41
+ return count
42
+
43
+ async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
44
+ """Search for semantically related evidence using the embedding service."""
45
+ if not self.embedding_service:
46
+ return []
47
+
48
+ results = await self.embedding_service.search_similar(query, n_results=n_results)
49
+
50
+ # Convert dict results back to Evidence objects
51
+ evidence_list = []
52
+ for item in results:
53
+ meta = item.get("metadata", {})
54
+ authors_str = meta.get("authors", "")
55
+ authors = [a.strip() for a in authors_str.split(",") if a.strip()]
56
+
57
+ ev = Evidence(
58
+ content=item["content"],
59
+ citation=Citation(
60
+ title=meta.get("title", "Related Evidence"),
61
+ url=item["id"],
62
+ source="pubmed", # Defaulting to pubmed if unknown
63
+ date=meta.get("date", "n.d."),
64
+ authors=authors,
65
+ ),
66
+ relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
67
+ )
68
+ evidence_list.append(ev)
69
+
70
+ return evidence_list
71
+
72
+
73
+ # The ContextVar holds the MagenticState for the current execution context
74
+ _magentic_state_var: ContextVar[MagenticState | None] = ContextVar("magentic_state", default=None)
75
+
76
+
77
+ def init_magentic_state(embedding_service: "EmbeddingService | None" = None) -> MagenticState:
78
+ """Initialize a new state for the current context."""
79
+ state = MagenticState(embedding_service=embedding_service)
80
+ _magentic_state_var.set(state)
81
+ return state
82
+
83
+
84
+ def get_magentic_state() -> MagenticState:
85
+ """Get the current state. Raises RuntimeError if not initialized."""
86
+ state = _magentic_state_var.get()
87
+ if state is None:
88
+ # Auto-initialize if missing (e.g. during tests or simple scripts)
89
+ return init_magentic_state()
90
+ return state
src/agents/tools.py ADDED
@@ -0,0 +1,175 @@
1
+ """Tool functions for Magentic agents.
2
+
3
+ These functions are decorated with @ai_function to be callable by the ChatAgent's internal LLM.
4
+ They also interact with the thread-safe MagenticState to persist evidence.
5
+ """
6
+
7
+ from agent_framework import ai_function
8
+
9
+ from src.agents.state import get_magentic_state
10
+ from src.tools.biorxiv import BioRxivTool
11
+ from src.tools.clinicaltrials import ClinicalTrialsTool
12
+ from src.tools.pubmed import PubMedTool
13
+
14
+ # Singleton tool instances (stateless wrappers)
15
+ _pubmed = PubMedTool()
16
+ _clinicaltrials = ClinicalTrialsTool()
17
+ _biorxiv = BioRxivTool()
18
+
19
+
20
+ @ai_function # type: ignore[arg-type, misc]
21
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
22
+ """Search PubMed for biomedical research papers.
23
+
24
+ Use this tool to find peer-reviewed scientific literature about
25
+ drugs, diseases, mechanisms of action, and clinical studies.
26
+
27
+ Args:
28
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
29
+ max_results: Maximum results to return (default 10)
30
+
31
+ Returns:
32
+ Formatted list of papers with titles, abstracts, and citations
33
+ """
34
+ state = get_magentic_state()
35
+
36
+ # 1. Execute raw search
37
+ results = await _pubmed.search(query, max_results)
38
+ if not results:
39
+ return f"No PubMed results found for: {query}"
40
+
41
+ # 2. Semantic Deduplication & Expansion (The "Digital Twin" Brain)
42
+ display_results = results
43
+ if state.embedding_service:
44
+ # Drop results that semantically duplicate evidence already in the vector DB
45
+ unique_results = await state.embedding_service.deduplicate(results)
46
+
47
+ # Search for related context in the vector DB (previous searches)
48
+ related = await state.search_related(query, n_results=3)
49
+
50
+ # Combine unique new results + relevant historical results
51
+ display_results = unique_results + related
52
+
53
+ # 3. Update State (Persist for ReportAgent)
54
+ # We add *all* found results to state, not just the displayed ones
55
+ new_count = state.add_evidence(results)
56
+
57
+ # 4. Format Output for LLM
58
+ output = [f"Found {len(results)} results ({new_count} new stored):\n"]
59
+
60
+ # Limit display to avoid context window overflow, but state has everything
61
+ limit = min(len(display_results), max_results)
62
+
63
+ for i, r in enumerate(display_results[:limit], 1):
64
+ title = r.citation.title
65
+ date = r.citation.date
66
+ source = r.citation.source
67
+ content_clean = r.content[:300].replace("\n", " ")
68
+ url = r.citation.url
69
+
70
+ output.append(f"{i}. **{title}** ({date})")
71
+ output.append(f" Source: {source} | {url}")
72
+ output.append(f" {content_clean}...")
73
+ output.append("")
74
+
75
+ return "\n".join(output)
76
+
77
+
78
+ @ai_function # type: ignore[arg-type, misc]
79
+ async def search_clinical_trials(query: str, max_results: int = 10) -> str:
80
+ """Search ClinicalTrials.gov for clinical studies.
81
+
82
+ Use this tool to find ongoing and completed clinical trials
83
+ for drug repurposing candidates.
84
+
85
+ Args:
86
+ query: Search terms (e.g., "metformin cancer phase 3")
87
+ max_results: Maximum results to return (default 10)
88
+
89
+ Returns:
90
+ Formatted list of clinical trials with status and details
91
+ """
92
+ state = get_magentic_state()
93
+
94
+ results = await _clinicaltrials.search(query, max_results)
95
+ if not results:
96
+ return f"No clinical trials found for: {query}"
97
+
98
+ # Update state
99
+ new_count = state.add_evidence(results)
100
+
101
+ output = [f"Found {len(results)} clinical trials ({new_count} new stored):\n"]
102
+ for i, r in enumerate(results[:max_results], 1):
103
+ title = r.citation.title
104
+ date = r.citation.date
105
+ source = r.citation.source
106
+ content_clean = r.content[:300].replace("\n", " ")
107
+ url = r.citation.url
108
+
109
+ output.append(f"{i}. **{title}**")
110
+ output.append(f" Status: {source} | Date: {date}")
111
+ output.append(f" {content_clean}...")
112
+ output.append(f" URL: {url}\n")
113
+
114
+ return "\n".join(output)
115
+
116
+
117
+ @ai_function # type: ignore[arg-type, misc]
118
+ async def search_preprints(query: str, max_results: int = 10) -> str:
119
+ """Search bioRxiv/medRxiv for preprint papers.
120
+
121
+ Use this tool to find the latest research that hasn't been
122
+ peer-reviewed yet. Good for cutting-edge findings.
123
+
124
+ Args:
125
+ query: Search terms (e.g., "long covid treatment")
126
+ max_results: Maximum results to return (default 10)
127
+
128
+ Returns:
129
+ Formatted list of preprints with abstracts and links
130
+ """
131
+ state = get_magentic_state()
132
+
133
+ results = await _biorxiv.search(query, max_results)
134
+ if not results:
135
+ return f"No preprints found for: {query}"
136
+
137
+ # Update state
138
+ new_count = state.add_evidence(results)
139
+
140
+ output = [f"Found {len(results)} preprints ({new_count} new stored):\n"]
141
+ for i, r in enumerate(results[:max_results], 1):
142
+ title = r.citation.title
143
+ date = r.citation.date
144
+ source = r.citation.source
145
+ content_clean = r.content[:300].replace("\n", " ")
146
+ url = r.citation.url
147
+
148
+ output.append(f"{i}. **{title}**")
149
+ output.append(f" Server: {source} | Date: {date}")
150
+ output.append(f" {content_clean}...")
151
+ output.append(f" URL: {url}\n")
152
+
153
+ return "\n".join(output)
154
+
155
+
156
+ @ai_function # type: ignore[arg-type, misc]
157
+ async def get_bibliography() -> str:
158
+ """Get the full list of collected evidence for the bibliography.
159
+
160
+ Use this tool when generating the final report to get the complete
161
+ list of references.
162
+
163
+ Returns:
164
+ Formatted bibliography string.
165
+ """
166
+ state = get_magentic_state()
167
+ if not state.evidence:
168
+ return "No evidence collected."
169
+
170
+ output = ["## References"]
171
+ for i, ev in enumerate(state.evidence, 1):
172
+ output.append(f"{i}. {ev.citation.formatted}")
173
+ output.append(f" URL: {ev.citation.url}")
174
+
175
+ return "\n".join(output)
src/app.py CHANGED
@@ -10,7 +10,7 @@ from pydantic_ai.models.openai import OpenAIModel
10
  from pydantic_ai.providers.anthropic import AnthropicProvider
11
  from pydantic_ai.providers.openai import OpenAIProvider
12
 
13
- from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
14
  from src.mcp_tools import (
15
  analyze_hypothesis,
16
  search_all_sources,
@@ -32,7 +32,7 @@ def configure_orchestrator(
32
  mode: str = "simple",
33
  user_api_key: str | None = None,
34
  api_provider: str = "openai",
35
- ) -> Any:
36
  """
37
  Create an orchestrator instance.
38
 
@@ -43,7 +43,7 @@ def configure_orchestrator(
43
  api_provider: API provider ("openai" or "anthropic")
44
 
45
  Returns:
46
- Configured Orchestrator instance
47
  """
48
  # Create orchestrator config
49
  config = OrchestratorConfig(
@@ -57,31 +57,57 @@ def configure_orchestrator(
57
  timeout=config.search_timeout,
58
  )
59
 
60
- # Create judge (mock or real)
61
- judge_handler: JudgeHandler | MockJudgeHandler
 
 
 
62
  if use_mock:
63
  judge_handler = MockJudgeHandler()
64
- else:
65
- # Create model with user's API key if provided
 
 
 
 
 
 
66
  model: AnthropicModel | OpenAIModel | None = None
67
  if user_api_key:
 
 
 
 
 
 
 
 
68
  if api_provider == "anthropic":
69
  anthropic_provider = AnthropicProvider(api_key=user_api_key)
70
  model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
71
  elif api_provider == "openai":
72
  openai_provider = OpenAIProvider(api_key=user_api_key)
73
  model = OpenAIModel(settings.openai_model, provider=openai_provider)
74
- else:
75
- raise ValueError(f"Unsupported API provider: {api_provider}")
 
 
76
  judge_handler = JudgeHandler(model=model)
77
 
78
- return create_orchestrator(
 
 
 
 
 
79
  search_handler=search_handler,
80
  judge_handler=judge_handler,
81
  config=config,
82
  mode=mode, # type: ignore
83
  )
84
 
 
 
85
 
86
  async def research_agent(
87
  message: str,
@@ -110,54 +136,47 @@ async def research_agent(
110
  # Clean user-provided API key
111
  user_api_key = api_key.strip() if api_key else None
112
 
113
- # Decide whether to use real LLMs or mock based on mode and available keys
114
  has_openai = bool(os.getenv("OPENAI_API_KEY"))
115
  has_anthropic = bool(os.getenv("ANTHROPIC_API_KEY"))
116
  has_user_key = bool(user_api_key)
 
117
 
118
- if mode == "magentic":
119
- # Magentic currently supports OpenAI only
120
- use_mock = not (has_openai or (has_user_key and api_provider == "openai"))
121
- else:
122
- # Simple mode can work with either provider
123
- use_mock = not (has_openai or has_anthropic or has_user_key)
124
-
125
- # If magentic mode requested but no OpenAI key, fallback/warn
126
- if mode == "magentic" and use_mock:
127
  yield (
128
- "⚠️ **Warning**: Magentic mode requires OpenAI API key. "
129
- "Falling back to demo mode.\n\n"
130
  )
131
  mode = "simple"
132
 
133
  # Inform user about their key being used
134
- if has_user_key and not use_mock:
135
  yield (
136
  f"🔑 **Using your {api_provider.upper()} API key** - "
137
  "Your key is used only for this session and is never stored.\n\n"
138
  )
139
-
140
- # Warn users when running in demo mode (no LLM keys)
141
- if use_mock:
142
  yield (
143
- "🔬 **Demo Mode**: Running with real biomedical searches but without "
144
- "LLM-powered analysis.\n\n"
145
- "**To unlock full AI analysis:**\n"
146
- "- Enter your OpenAI or Anthropic API key below, OR\n"
147
- "- Configure secrets in HuggingFace Space settings\n\n"
148
- "---\n\n"
149
  )
150
 
151
  # Run the agent and stream events
152
  response_parts: list[str] = []
153
 
154
  try:
155
- orchestrator = configure_orchestrator(
156
- use_mock=use_mock,
 
 
157
  mode=mode,
158
  user_api_key=user_api_key,
159
  api_provider=api_provider,
160
  )
 
 
 
161
  async for event in orchestrator.run(message):
162
  # Format event as markdown
163
  event_md = event.to_markdown()
 
10
  from pydantic_ai.providers.anthropic import AnthropicProvider
11
  from pydantic_ai.providers.openai import OpenAIProvider
12
 
13
+ from src.agent_factory.judges import HFInferenceJudgeHandler, JudgeHandler, MockJudgeHandler
14
  from src.mcp_tools import (
15
  analyze_hypothesis,
16
  search_all_sources,
 
32
  mode: str = "simple",
33
  user_api_key: str | None = None,
34
  api_provider: str = "openai",
35
+ ) -> tuple[Any, str]:
36
  """
37
  Create an orchestrator instance.
38
 
 
43
  api_provider: API provider ("openai" or "anthropic")
44
 
45
  Returns:
46
+ Tuple of (Orchestrator instance, backend_name)
47
  """
48
  # Create orchestrator config
49
  config = OrchestratorConfig(
 
57
  timeout=config.search_timeout,
58
  )
59
 
60
+ # Create judge (mock, real, or free tier)
61
+ judge_handler: JudgeHandler | MockJudgeHandler | HFInferenceJudgeHandler
62
+ backend_info = "Unknown"
63
+
64
+ # 1. Forced Mock (Unit Testing)
65
  if use_mock:
66
  judge_handler = MockJudgeHandler()
67
+ backend_info = "Mock (Testing)"
68
+
69
+ # 2. Paid API Key (User provided or Env)
70
+ elif (
71
+ user_api_key
72
+ or (api_provider == "openai" and os.getenv("OPENAI_API_KEY"))
73
+ or (api_provider == "anthropic" and os.getenv("ANTHROPIC_API_KEY"))
74
+ ):
75
  model: AnthropicModel | OpenAIModel | None = None
76
  if user_api_key:
77
+ # Validate key/provider match to prevent silent auth failures
78
+ if api_provider == "openai" and user_api_key.startswith("sk-ant-"):
79
+ raise ValueError("Anthropic key provided but OpenAI provider selected")
80
+ is_openai_key = user_api_key.startswith("sk-") and not user_api_key.startswith(
81
+ "sk-ant-"
82
+ )
83
+ if api_provider == "anthropic" and is_openai_key:
84
+ raise ValueError("OpenAI key provided but Anthropic provider selected")
85
  if api_provider == "anthropic":
86
  anthropic_provider = AnthropicProvider(api_key=user_api_key)
87
  model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
88
  elif api_provider == "openai":
89
  openai_provider = OpenAIProvider(api_key=user_api_key)
90
  model = OpenAIModel(settings.openai_model, provider=openai_provider)
91
+ backend_info = f"Paid API ({api_provider.upper()})"
92
+ else:
93
+ backend_info = "Paid API (Env Config)"
94
+
95
  judge_handler = JudgeHandler(model=model)
96
 
97
+ # 3. Free Tier (HuggingFace Inference)
98
+ else:
99
+ judge_handler = HFInferenceJudgeHandler()
100
+ backend_info = "Free Tier (Llama 3.1 / Mistral)"
101
+
102
+ orchestrator = create_orchestrator(
103
  search_handler=search_handler,
104
  judge_handler=judge_handler,
105
  config=config,
106
  mode=mode, # type: ignore
107
  )
108
 
109
+ return orchestrator, backend_info
110
+
111
 
112
  async def research_agent(
113
  message: str,
 
136
  # Clean user-provided API key
137
  user_api_key = api_key.strip() if api_key else None
138
 
139
+ # Check available keys
140
  has_openai = bool(os.getenv("OPENAI_API_KEY"))
141
  has_anthropic = bool(os.getenv("ANTHROPIC_API_KEY"))
142
  has_user_key = bool(user_api_key)
143
+ has_paid_key = has_openai or has_anthropic or has_user_key
144
 
145
+ # Magentic mode requires OpenAI specifically
146
+ if mode == "magentic" and not (has_openai or (has_user_key and api_provider == "openai")):
 
 
 
 
 
 
 
147
  yield (
148
+ "⚠️ **Warning**: Magentic mode requires OpenAI API key. Falling back to simple mode.\n\n"
 
149
  )
150
  mode = "simple"
151
 
152
  # Inform user about their key being used
153
+ if has_user_key:
154
  yield (
155
  f"🔑 **Using your {api_provider.upper()} API key** - "
156
  "Your key is used only for this session and is never stored.\n\n"
157
  )
158
+ elif not has_paid_key:
159
+ # No paid keys - will use FREE HuggingFace Inference
 
160
  yield (
161
+ "🤗 **Free Tier**: Using HuggingFace Inference (Llama 3.1 / Mistral) for AI analysis.\n"
162
+ "For premium models, enter an OpenAI or Anthropic API key below.\n\n"
 
 
 
 
163
  )
164
 
165
  # Run the agent and stream events
166
  response_parts: list[str] = []
167
 
168
  try:
169
+ # use_mock=False - let configure_orchestrator decide based on available keys
170
+ # It will use: Paid API > HF Inference (free tier)
171
+ orchestrator, backend_name = configure_orchestrator(
172
+ use_mock=False, # Never use mock in production - HF Inference is the free fallback
173
  mode=mode,
174
  user_api_key=user_api_key,
175
  api_provider=api_provider,
176
  )
177
+
178
+ yield f"🧠 **Backend**: {backend_name}\n\n"
179
+
180
  async for event in orchestrator.run(message):
181
  # Format event as markdown
182
  event_md = event.to_markdown()
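The new priority order in `configure_orchestrator` (forced mock → user/env paid key → free HF tier) boils down to a small decision function; a sketch, where `pick_judge_backend` is an illustrative name rather than code in this repo:

```python
# Sketch of the backend priority now applied by configure_orchestrator:
# forced mock > user/env paid key > free HuggingFace Inference tier.
import os


def pick_judge_backend(use_mock: bool, user_api_key: str | None, api_provider: str) -> str:
    if use_mock:
        return "Mock (Testing)"
    env_var = "OPENAI_API_KEY" if api_provider == "openai" else "ANTHROPIC_API_KEY"
    if user_api_key:
        return f"Paid API ({api_provider.upper()})"
    if os.getenv(env_var):
        return "Paid API (Env Config)"
    return "Free Tier (Llama 3.1 / Mistral)"


assert pick_judge_backend(True, None, "openai") == "Mock (Testing)"
assert pick_judge_backend(False, "sk-placeholder", "openai") == "Paid API (OPENAI)"
```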
src/orchestrator_factory.py CHANGED
@@ -5,18 +5,10 @@ from typing import Any, Literal
5
  from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
6
  from src.utils.models import OrchestratorConfig
7
 
8
- # Define protocols again or import if they were in a shared place.
9
-
10
- # Since they are in src/orchestrator.py, we can import them?
11
-
12
- # But SearchHandler and JudgeHandler in arguments are concrete classes in the type hint,
13
-
14
- # which satisfy the protocol.
15
-
16
 
17
  def create_orchestrator(
18
- search_handler: SearchHandlerProtocol,
19
- judge_handler: JudgeHandlerProtocol,
20
  config: OrchestratorConfig | None = None,
21
  mode: Literal["simple", "magentic"] = "simple",
22
  ) -> Any:
@@ -24,27 +16,33 @@ def create_orchestrator(
24
  Create an orchestrator instance.
25
 
26
  Args:
27
- search_handler: The search handler
28
- judge_handler: The judge handler
29
  config: Optional configuration
30
- mode: "simple" for Phase 4 loop, "magentic" for Phase 5 multi-agent
31
 
32
  Returns:
33
- Orchestrator instance (same interface regardless of mode)
 
 
 
 
34
  """
35
  if mode == "magentic":
36
  try:
37
  from src.orchestrator_magentic import MagenticOrchestrator
38
 
39
  return MagenticOrchestrator(
40
- search_handler=search_handler,
41
- judge_handler=judge_handler,
42
  max_rounds=config.max_iterations if config else 10,
43
  )
44
  except ImportError:
45
  # Fallback to simple if agent-framework not installed
46
  pass
47
 
 
 
 
 
48
  return Orchestrator(
49
  search_handler=search_handler,
50
  judge_handler=judge_handler,
 
5
  from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
6
  from src.utils.models import OrchestratorConfig
7
 
 
 
 
 
 
 
 
 
8
 
9
  def create_orchestrator(
10
+ search_handler: SearchHandlerProtocol | None = None,
11
+ judge_handler: JudgeHandlerProtocol | None = None,
12
  config: OrchestratorConfig | None = None,
13
  mode: Literal["simple", "magentic"] = "simple",
14
  ) -> Any:
 
16
  Create an orchestrator instance.
17
 
18
  Args:
19
+ search_handler: The search handler (required for simple mode)
20
+ judge_handler: The judge handler (required for simple mode)
21
  config: Optional configuration
22
+ mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent
23
 
24
  Returns:
25
+ Orchestrator instance
26
+
27
+ Note:
28
+ Magentic mode does NOT use search_handler/judge_handler.
29
+ It creates ChatAgent instances with internal LLMs that call tools directly.
30
  """
31
  if mode == "magentic":
32
  try:
33
  from src.orchestrator_magentic import MagenticOrchestrator
34
 
35
  return MagenticOrchestrator(
 
 
36
  max_rounds=config.max_iterations if config else 10,
37
  )
38
  except ImportError:
39
  # Fallback to simple if agent-framework not installed
40
  pass
41
 
42
+ # Simple mode requires handlers
43
+ if search_handler is None or judge_handler is None:
44
+ raise ValueError("Simple mode requires search_handler and judge_handler")
45
+
46
  return Orchestrator(
47
  search_handler=search_handler,
48
  judge_handler=judge_handler,
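Because `create_orchestrator` now treats the handlers as optional, callers select a mode as shown in this sketch (the magentic line assumes `agent-framework-core` is installed and `OPENAI_API_KEY` is set):

```python
# Sketch of the new factory contract (assumes agent-framework-core installed
# and OPENAI_API_KEY set for the magentic branch).
from src.orchestrator_factory import create_orchestrator

# Magentic mode needs no handlers: ChatAgents call the search tools themselves.
magentic_orchestrator = create_orchestrator(mode="magentic")

# Simple mode without handlers now fails fast with a clear error.
try:
    create_orchestrator(mode="simple")
except ValueError as exc:
    print(exc)  # Simple mode requires search_handler and judge_handler
```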
src/orchestrator_magentic.py CHANGED
@@ -1,18 +1,9 @@
1
- """Magentic-based orchestrator for DeepCritical.
2
-
3
- NOTE: Magentic mode currently requires OpenAI API keys. The MagenticBuilder's
4
- standard manager uses OpenAIChatClient. Anthropic support may be added when
5
- the agent_framework provides an AnthropicChatClient.
6
- """
7
 
8
  from collections.abc import AsyncGenerator
9
  from typing import TYPE_CHECKING, Any
10
 
11
  import structlog
12
-
13
- if TYPE_CHECKING:
14
- from src.services.embeddings import EmbeddingService
15
-
16
  from agent_framework import (
17
  MagenticAgentDeltaEvent,
18
  MagenticAgentMessageEvent,
@@ -23,45 +14,49 @@ from agent_framework import (
23
  )
24
  from agent_framework.openai import OpenAIChatClient
25
 
26
- from src.agents.hypothesis_agent import HypothesisAgent
27
- from src.agents.judge_agent import JudgeAgent
28
- from src.agents.report_agent import ReportAgent
29
- from src.agents.search_agent import SearchAgent
30
- from src.orchestrator import JudgeHandlerProtocol, SearchHandlerProtocol
 
 
31
  from src.utils.config import settings
32
  from src.utils.exceptions import ConfigurationError
33
- from src.utils.models import AgentEvent, Evidence
34
-
35
- logger = structlog.get_logger()
36
 
 
 
37
 
38
- def _truncate(text: str, max_len: int = 100) -> str:
39
- """Truncate text with ellipsis only if needed."""
40
- return f"{text[:max_len]}..." if len(text) > max_len else text
41
 
42
 
43
  class MagenticOrchestrator:
44
  """
45
- Magentic-based orchestrator - same API as Orchestrator.
46
-
47
- Uses Microsoft Agent Framework's MagenticBuilder for multi-agent coordination.
48
 
49
- Note:
50
- Magentic mode requires OPENAI_API_KEY. The MagenticBuilder's standard
51
- manager currently only supports OpenAI. If you have only an Anthropic
52
- key, use the "simple" orchestrator mode instead.
53
  """
54
 
55
  def __init__(
56
  self,
57
- search_handler: SearchHandlerProtocol,
58
- judge_handler: JudgeHandlerProtocol,
59
  max_rounds: int = 10,
 
60
  ) -> None:
61
- self._search_handler = search_handler
62
- self._judge_handler = judge_handler
 
 
 
 
 
 
 
 
 
63
  self._max_rounds = max_rounds
64
- self._evidence_store: dict[str, list[Evidence]] = {"current": []}
65
 
66
  def _init_embedding_service(self) -> "EmbeddingService | None":
67
  """Initialize embedding service if available."""
@@ -77,19 +72,19 @@ class MagenticOrchestrator:
77
  logger.warning("Failed to initialize embedding service", error=str(e))
78
  return None
79
 
80
- def _build_workflow(
81
- self,
82
- search_agent: SearchAgent,
83
- hypothesis_agent: HypothesisAgent,
84
- judge_agent: JudgeAgent,
85
- report_agent: ReportAgent,
86
- ) -> Any:
87
- """Build the Magentic workflow with participants."""
88
- if not settings.openai_api_key:
89
- raise ConfigurationError(
90
- "Magentic mode requires OPENAI_API_KEY. "
91
- "Set the key or use mode='simple' with Anthropic."
92
- )
93
 
94
  return (
95
  MagenticBuilder()
@@ -100,9 +95,7 @@ class MagenticOrchestrator:
100
  reporter=report_agent,
101
  )
102
  .with_standard_manager(
103
- chat_client=OpenAIChatClient(
104
- model_id=settings.openai_model, api_key=settings.openai_api_key
105
- ),
106
  max_round_count=self._max_rounds,
107
  max_stall_count=3,
108
  max_reset_count=2,
@@ -110,46 +103,15 @@ class MagenticOrchestrator:
110
  .build()
111
  )
112
 
113
- def _format_task(self, query: str, has_embeddings: bool) -> str:
114
- """Format the task instruction for the manager."""
115
- semantic_note = ""
116
- if has_embeddings:
117
- semantic_note = """
118
- The system has semantic search enabled. When evidence is found:
119
- 1. Related concepts will be automatically surfaced
120
- 2. Duplicates are removed by meaning, not just URL
121
- 3. Use the surfaced related concepts to refine searches
122
- """
123
- return f"""Research drug repurposing opportunities for: {query}
124
- {semantic_note}
125
- Workflow:
126
- 1. SearcherAgent: Find initial evidence from PubMed and web. SEND ONLY A SIMPLE KEYWORD QUERY.
127
- 2. HypothesisAgent: Generate mechanistic hypotheses (Drug -> Target -> Pathway -> Effect).
128
- 3. SearcherAgent: Use hypothesis-suggested queries for targeted search.
129
- 4. JudgeAgent: Evaluate if evidence supports hypotheses.
130
- 5. If sufficient -> ReportAgent: Generate structured research report.
131
- 6. If not sufficient -> Repeat from step 1 with refined queries.
132
-
133
- Focus on:
134
- - Identifying specific molecular targets
135
- - Understanding mechanism of action
136
- - Finding supporting/contradicting evidence for hypotheses
137
-
138
- The final output should be a complete research report with:
139
- - Executive summary
140
- - Methodology
141
- - Hypotheses tested
142
- - Mechanistic and clinical findings
143
- - Drug candidates
144
- - Limitations
145
- - Conclusion with references
146
- """
147
-
148
  async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
149
  """
150
- Run the Magentic workflow - same API as simple Orchestrator.
 
 
 
151
 
152
- Yields AgentEvent objects for real-time UI updates.
 
153
  """
154
  logger.info("Starting Magentic orchestrator", query=query)
155
 
@@ -159,20 +121,27 @@ The final output should be a complete research report with:
159
  iteration=0,
160
  )
161
 
162
- # Initialize services and agents
163
  embedding_service = self._init_embedding_service()
164
- search_agent = SearchAgent(
165
- self._search_handler, self._evidence_store, embedding_service=embedding_service
166
- )
167
- judge_agent = JudgeAgent(self._judge_handler, self._evidence_store)
168
- hypothesis_agent = HypothesisAgent(
169
- self._evidence_store, embedding_service=embedding_service
170
- )
171
- report_agent = ReportAgent(self._evidence_store, embedding_service=embedding_service)
172
 
173
- # Build workflow and task
174
- workflow = self._build_workflow(search_agent, hypothesis_agent, judge_agent, report_agent)
175
- task = self._format_task(query, embedding_service is not None)
 
 
 
 
 
 
 
 
 
 
176
 
177
  iteration = 0
178
  try:
@@ -182,6 +151,7 @@ The final output should be a complete research report with:
182
  if isinstance(event, MagenticAgentMessageEvent):
183
  iteration += 1
184
  yield agent_event
 
185
  except Exception as e:
186
  logger.error("Magentic workflow failed", error=str(e))
187
  yield AgentEvent(
@@ -191,35 +161,41 @@ The final output should be a complete research report with:
191
  )
192
 
193
  def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
194
- """Process a workflow event and return an AgentEvent if applicable."""
195
  if isinstance(event, MagenticOrchestratorMessageEvent):
196
- message_text = (
197
- event.message.text if event.message and hasattr(event.message, "text") else ""
198
- )
199
- kind = getattr(event, "kind", "manager")
200
- if message_text:
201
  return AgentEvent(
202
  type="judging",
203
- message=f"Manager ({kind}): {_truncate(message_text)}",
204
  iteration=iteration,
205
  )
206
 
207
  elif isinstance(event, MagenticAgentMessageEvent):
208
  agent_name = event.agent_id or "unknown"
209
- msg_text = (
210
- event.message.text if event.message and hasattr(event.message, "text") else ""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
211
  )
212
- return self._agent_message_event(agent_name, msg_text, iteration + 1)
213
 
214
  elif isinstance(event, MagenticFinalResultEvent):
215
- final_text = (
216
- event.message.text
217
- if event.message and hasattr(event.message, "text")
218
- else "No result"
219
- )
220
  return AgentEvent(
221
  type="complete",
222
- message=final_text,
223
  data={"iterations": iteration},
224
  iteration=iteration,
225
  )
@@ -242,35 +218,3 @@ The final output should be a complete research report with:
242
  )
243
 
244
  return None
245
-
246
- def _agent_message_event(self, agent_name: str, msg_text: str, iteration: int) -> AgentEvent:
247
- """Create an AgentEvent for an agent message."""
248
- if "search" in agent_name.lower():
249
- return AgentEvent(
250
- type="search_complete",
251
- message=f"Search agent: {_truncate(msg_text)}",
252
- iteration=iteration,
253
- )
254
- elif "hypothes" in agent_name.lower():
255
- return AgentEvent(
256
- type="hypothesizing",
257
- message=f"Hypothesis agent: {_truncate(msg_text)}",
258
- iteration=iteration,
259
- )
260
- elif "judge" in agent_name.lower():
261
- return AgentEvent(
262
- type="judge_complete",
263
- message=f"Judge agent: {_truncate(msg_text)}",
264
- iteration=iteration,
265
- )
266
- elif "report" in agent_name.lower():
267
- return AgentEvent(
268
- type="synthesizing",
269
- message=f"Report agent: {_truncate(msg_text)}" if msg_text else "Report generated.",
270
- iteration=iteration,
271
- )
272
- return AgentEvent(
273
- type="judging",
274
- message=f"{agent_name}: {_truncate(msg_text)}",
275
- iteration=iteration,
276
- )
 
1
+ """Magentic-based orchestrator using ChatAgent pattern."""
 
 
 
 
 
2
 
3
  from collections.abc import AsyncGenerator
4
  from typing import TYPE_CHECKING, Any
5
 
6
  import structlog
 
 
 
 
7
  from agent_framework import (
8
  MagenticAgentDeltaEvent,
9
  MagenticAgentMessageEvent,
 
14
  )
15
  from agent_framework.openai import OpenAIChatClient
16
 
17
+ from src.agents.magentic_agents import (
18
+ create_hypothesis_agent,
19
+ create_judge_agent,
20
+ create_report_agent,
21
+ create_search_agent,
22
+ )
23
+ from src.agents.state import init_magentic_state
24
  from src.utils.config import settings
25
  from src.utils.exceptions import ConfigurationError
26
+ from src.utils.models import AgentEvent
 
 
27
 
28
+ if TYPE_CHECKING:
29
+ from src.services.embeddings import EmbeddingService
30
 
31
+ logger = structlog.get_logger()
 
 
32
 
33
 
34
  class MagenticOrchestrator:
35
  """
36
+ Magentic-based orchestrator using ChatAgent pattern.
 
 
37
 
38
+ Each agent has an internal LLM that understands natural language
39
+ instructions from the manager and can call tools appropriately.
 
 
40
  """
41
 
42
  def __init__(
43
  self,
 
 
44
  max_rounds: int = 10,
45
+ chat_client: OpenAIChatClient | None = None,
46
  ) -> None:
47
+ """Initialize orchestrator.
48
+
49
+ Args:
50
+ max_rounds: Maximum coordination rounds
51
+ chat_client: Optional shared chat client for agents
52
+ """
53
+ if not settings.openai_api_key:
54
+ raise ConfigurationError(
55
+ "Magentic mode requires OPENAI_API_KEY. " "Set the key or use mode='simple'."
56
+ )
57
+
58
  self._max_rounds = max_rounds
59
+ self._chat_client = chat_client
60
 
61
  def _init_embedding_service(self) -> "EmbeddingService | None":
62
  """Initialize embedding service if available."""
 
72
  logger.warning("Failed to initialize embedding service", error=str(e))
73
  return None
74
 
75
+ def _build_workflow(self) -> Any:
76
+ """Build the Magentic workflow with ChatAgent participants."""
77
+ # Create agents with internal LLMs
78
+ search_agent = create_search_agent(self._chat_client)
79
+ judge_agent = create_judge_agent(self._chat_client)
80
+ hypothesis_agent = create_hypothesis_agent(self._chat_client)
81
+ report_agent = create_report_agent(self._chat_client)
82
+
83
+ # Manager chat client (orchestrates the agents)
84
+ manager_client = OpenAIChatClient(
85
+ model_id=settings.openai_model, # Use configured model
86
+ api_key=settings.openai_api_key,
87
+ )
88
 
89
  return (
90
  MagenticBuilder()
 
95
  reporter=report_agent,
96
  )
97
  .with_standard_manager(
98
+ chat_client=manager_client,
 
 
99
  max_round_count=self._max_rounds,
100
  max_stall_count=3,
101
  max_reset_count=2,
 
103
  .build()
104
  )
105
 
106
  async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
107
  """
108
+ Run the Magentic workflow.
109
+
110
+ Args:
111
+ query: User's research question
112
 
113
+ Yields:
114
+ AgentEvent objects for real-time UI updates
115
  """
116
  logger.info("Starting Magentic orchestrator", query=query)
117
 
 
121
  iteration=0,
122
  )
123
 
124
+ # Initialize context state
125
  embedding_service = self._init_embedding_service()
126
+ init_magentic_state(embedding_service)
127
+
128
+ workflow = self._build_workflow()
129
+
130
+ task = f"""Research drug repurposing opportunities for: {query}
 
 
 
131
 
132
+ Workflow:
133
+ 1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
134
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug -> Target -> Pathway -> Effect)
135
+ 3. JudgeAgent: Evaluate if evidence is sufficient
136
+ 4. If insufficient -> SearchAgent refines search based on gaps
137
+ 5. If sufficient -> ReportAgent synthesizes final report
138
+
139
+ Focus on:
140
+ - Identifying specific molecular targets
141
+ - Understanding mechanism of action
142
+ - Finding clinical evidence supporting hypotheses
143
+
144
+ The final output should be a structured research report."""
145
 
146
  iteration = 0
147
  try:
 
151
  if isinstance(event, MagenticAgentMessageEvent):
152
  iteration += 1
153
  yield agent_event
154
+
155
  except Exception as e:
156
  logger.error("Magentic workflow failed", error=str(e))
157
  yield AgentEvent(
 
161
  )
162
 
163
  def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
164
+ """Process workflow event into AgentEvent."""
165
  if isinstance(event, MagenticOrchestratorMessageEvent):
166
+ text = event.message.text if event.message else ""
167
+ if text:
 
 
 
168
  return AgentEvent(
169
  type="judging",
170
+ message=f"Manager ({event.kind}): {text[:200]}...",
171
  iteration=iteration,
172
  )
173
 
174
  elif isinstance(event, MagenticAgentMessageEvent):
175
  agent_name = event.agent_id or "unknown"
176
+ text = event.message.text if event.message else ""
177
+
178
+ event_type = "judging"
179
+ if "search" in agent_name.lower():
180
+ event_type = "search_complete"
181
+ elif "judge" in agent_name.lower():
182
+ event_type = "judge_complete"
183
+ elif "hypothes" in agent_name.lower():
184
+ event_type = "hypothesizing"
185
+ elif "report" in agent_name.lower():
186
+ event_type = "synthesizing"
187
+
188
+ return AgentEvent(
189
+ type=event_type, # type: ignore[arg-type]
190
+ message=f"{agent_name}: {text[:200]}...",
191
+ iteration=iteration + 1,
192
  )
 
193
 
194
  elif isinstance(event, MagenticFinalResultEvent):
195
+ text = event.message.text if event.message else "No result"
 
 
 
 
196
  return AgentEvent(
197
  type="complete",
198
+ message=text,
199
  data={"iterations": iteration},
200
  iteration=iteration,
201
  )
 
218
  )
219
 
220
  return None
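Downstream consumers stream `AgentEvent`s from the rewritten orchestrator the same way the Gradio layer does; a minimal sketch, assuming `OPENAI_API_KEY` is configured and the magentic extra is installed:

```python
# Consumption sketch: stream AgentEvent objects from the ChatAgent-based
# orchestrator (assumes OPENAI_API_KEY and the magentic extra are available).
import asyncio

from src.orchestrator_magentic import MagenticOrchestrator


async def main() -> None:
    orchestrator = MagenticOrchestrator(max_rounds=5)
    async for event in orchestrator.run("metformin repurposing for Alzheimer's disease"):
        print(event.to_markdown())


asyncio.run(main())
```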
src/prompts/report.py CHANGED
@@ -124,13 +124,13 @@ async def format_report_prompt(
124
  {hypotheses_summary}
125
 
126
  ## Assessment Scores
127
- - Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10
128
- - Clinical Evidence Score: {assessment.get('clinical_score', 'N/A')}/10
129
- - Overall Confidence: {assessment.get('confidence', 0):.0%}
130
 
131
  ## Metadata
132
  - Sources Searched: {sources}
133
- - Search Iterations: {metadata.get('iterations', 0)}
134
 
135
  Generate a complete ResearchReport with all sections filled in.
136
 
 
124
  {hypotheses_summary}
125
 
126
  ## Assessment Scores
127
+ - Mechanism Score: {assessment.get("mechanism_score", "N/A")}/10
128
+ - Clinical Evidence Score: {assessment.get("clinical_score", "N/A")}/10
129
+ - Overall Confidence: {assessment.get("confidence", 0):.0%}
130
 
131
  ## Metadata
132
  - Sources Searched: {sources}
133
+ - Search Iterations: {metadata.get("iterations", 0)}
134
 
135
  Generate a complete ResearchReport with all sections filled in.
136
 
tests/unit/agent_factory/test_judges_hf.py ADDED
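One note on the quote flip above: reusing double quotes inside a double-quoted f-string is only legal on Python 3.12+ (PEP 701); on older interpreters the removed single-quoted form is required. A tiny illustration:

```python
# Both lines format identically; the first is a SyntaxError before Python 3.12.
assessment = {"mechanism_score": 8}

py312_style = f"- Mechanism Score: {assessment.get("mechanism_score", "N/A")}/10"
legacy_style = f"- Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10"
assert py312_style == legacy_style
```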
@@ -0,0 +1,138 @@
1
+ """Unit tests for HFInferenceJudgeHandler."""
2
+
3
+ from unittest.mock import AsyncMock, MagicMock, patch
4
+
5
+ import pytest
6
+
7
+ from src.agent_factory.judges import HFInferenceJudgeHandler
8
+ from src.utils.models import Citation, Evidence
9
+
10
+
11
+ @pytest.mark.unit
12
+ class TestHFInferenceJudgeHandler:
13
+ """Tests for HFInferenceJudgeHandler."""
14
+
15
+ @pytest.fixture
16
+ def mock_client(self):
17
+ """Mock HuggingFace InferenceClient."""
18
+ with patch("src.agent_factory.judges.InferenceClient") as mock:
19
+ client_instance = MagicMock()
20
+ mock.return_value = client_instance
21
+ yield client_instance
22
+
23
+ @pytest.fixture
24
+ def handler(self, mock_client):
25
+ """Create a handler instance with mocked client."""
26
+ return HFInferenceJudgeHandler()
27
+
28
+ @pytest.mark.asyncio
29
+ async def test_assess_success(self, handler, mock_client):
30
+ """Test successful assessment with primary model."""
31
+ import json
32
+
33
+ # Construct valid JSON payload
34
+ data = {
35
+ "details": {
36
+ "mechanism_score": 8,
37
+ "mechanism_reasoning": "Good mechanism",
38
+ "clinical_evidence_score": 7,
39
+ "clinical_reasoning": "Good clinical",
40
+ "drug_candidates": ["Drug A"],
41
+ "key_findings": ["Finding 1"],
42
+ },
43
+ "sufficient": True,
44
+ "confidence": 0.85,
45
+ "recommendation": "synthesize",
46
+ "next_search_queries": [],
47
+ "reasoning": (
48
+ "Sufficient evidence provided to support the hypothesis with high confidence."
49
+ ),
50
+ }
51
+
52
+ # Mock chat_completion response structure
53
+ mock_message = MagicMock()
54
+ mock_message.content = f"""Here is the analysis:
55
+ ```json
56
+ {json.dumps(data)}
57
+ ```"""
58
+ mock_choice = MagicMock()
59
+ mock_choice.message = mock_message
60
+ mock_response = MagicMock()
61
+ mock_response.choices = [mock_choice]
62
+
63
+ # Setup async mock for run_in_executor
64
+ with patch("asyncio.get_running_loop") as mock_loop:
65
+ mock_loop.return_value.run_in_executor = AsyncMock(return_value=mock_response)
66
+
67
+ evidence = [
68
+ Evidence(
69
+ content="test", citation=Citation(source="pubmed", title="t", url="u", date="d")
70
+ )
71
+ ]
72
+ result = await handler.assess("test question", evidence)
73
+
74
+ assert result.sufficient is True
75
+ assert result.confidence == 0.85
76
+ assert result.details.drug_candidates == ["Drug A"]
77
+
78
+ @pytest.mark.asyncio
79
+ async def test_assess_fallback_logic(self, handler, mock_client):
80
+ """Test fallback to secondary model when primary fails."""
81
+
82
+ # Make every model call fail so the full fallback chain is exercised
83
+ with patch("asyncio.get_running_loop"):
84
+ # We need to mock the _call_with_retry method directly to test the loop in assess
85
+ # but _call_with_retry is decorated with tenacity,
86
+ # which makes it harder to mock partial failures easily
87
+ # without triggering the tenacity retry loop first.
88
+ # Instead, patch _call_with_retry itself so each model fails immediately
89
+
90
+ # This is tricky because assess loops over models,
91
+ # and for each model _call_with_retry retries.
92
+ # Here we simulate all three models failing, so assess must return its error fallback.
93
+
94
+ # Let's patch _call_with_retry to avoid waiting for real retries
95
+ side_effect = [
96
+ Exception("Model 1 failed"),
97
+ Exception("Model 2 failed"),
98
+ Exception("Model 3 failed"),
99
+ ]
100
+ with patch.object(handler, "_call_with_retry", side_effect=side_effect) as mock_call:
101
+ evidence = []
102
+ result = await handler.assess("test", evidence)
103
+
104
+ # Should have tried all 3 fallback models
105
+ assert mock_call.call_count == 3
106
+ # Fallback assessment should indicate failure
107
+ assert result.sufficient is False
108
+ assert "failed" in result.reasoning.lower() or "error" in result.reasoning.lower()
109
+
110
+ def test_extract_json_robustness(self, handler):
111
+ """Test JSON extraction with various inputs."""
112
+
113
+ # 1. Clean JSON
114
+ assert handler._extract_json('{"a": 1}') == {"a": 1}
115
+
116
+ # 2. Markdown block
117
+ assert handler._extract_json('```json\n{"a": 1}\n```') == {"a": 1}
118
+
119
+ # 3. Text preamble/postamble
120
+ text = """
121
+ Sure, here is the JSON:
122
+ {
123
+ "a": 1,
124
+ "b": {
125
+ "c": 2
126
+ }
127
+ }
128
+ Hope that helps!
129
+ """
130
+ assert handler._extract_json(text) == {"a": 1, "b": {"c": 2}}
131
+
132
+ # 4. Nested braces
133
+ nested = '{"a": {"b": "}"}}'
134
+ assert handler._extract_json(nested) == {"a": {"b": "}"}}
135
+
136
+ # 5. Invalid JSON
137
+ assert handler._extract_json("Not JSON") is None
138
+ assert handler._extract_json("{Incomplete") is None
uv.lock CHANGED
@@ -1065,6 +1065,7 @@ dependencies = [
1065
  { name = "beautifulsoup4" },
1066
  { name = "gradio", extra = ["mcp"] },
1067
  { name = "httpx" },
 
1068
  { name = "openai" },
1069
  { name = "pydantic" },
1070
  { name = "pydantic-ai" },
@@ -1107,13 +1108,14 @@ modal = [
1107
 
1108
  [package.metadata]
1109
  requires-dist = [
1110
- { name = "agent-framework-core", marker = "extra == 'magentic'" },
1111
  { name = "anthropic", specifier = ">=0.18.0" },
1112
  { name = "beautifulsoup4", specifier = ">=4.12" },
1113
  { name = "chromadb", marker = "extra == 'embeddings'", specifier = ">=0.4.0" },
1114
  { name = "chromadb", marker = "extra == 'modal'", specifier = ">=0.4.0" },
1115
  { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" },
1116
  { name = "httpx", specifier = ">=0.27" },
 
1117
  { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" },
1118
  { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" },
1119
  { name = "llama-index-llms-openai", marker = "extra == 'modal'" },
 
1065
  { name = "beautifulsoup4" },
1066
  { name = "gradio", extra = ["mcp"] },
1067
  { name = "httpx" },
1068
+ { name = "huggingface-hub" },
1069
  { name = "openai" },
1070
  { name = "pydantic" },
1071
  { name = "pydantic-ai" },
 
1108
 
1109
  [package.metadata]
1110
  requires-dist = [
1111
+ { name = "agent-framework-core", marker = "extra == 'magentic'", specifier = ">=1.0.0b251120,<2.0.0" },
1112
  { name = "anthropic", specifier = ">=0.18.0" },
1113
  { name = "beautifulsoup4", specifier = ">=4.12" },
1114
  { name = "chromadb", marker = "extra == 'embeddings'", specifier = ">=0.4.0" },
1115
  { name = "chromadb", marker = "extra == 'modal'", specifier = ">=0.4.0" },
1116
  { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" },
1117
  { name = "httpx", specifier = ">=0.27" },
1118
+ { name = "huggingface-hub", specifier = ">=0.20.0" },
1119
  { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" },
1120
  { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" },
1121
  { name = "llama-index-llms-openai", marker = "extra == 'modal'" },