Spaces:

awellis
/

bfh-studadmin-assist

Running on CPU Upgrade

awellis Claude commited on Oct 20

Commit

344ba50

1 Parent(s): e20e6e8

Fix HuggingFace Spaces deployment and add GPT-5 support

Major changes:
1. Fixed PydanticAI structured outputs - Added output_type parameter to all agents
2. Fixed GPT-5 API parameters - Uses max_completion_tokens and reasoning_effort="minimal"
3. Lazy-load cross-encoder reranker - Prevents HF Spaces startup crash
4. Created unified app with mode toggle - Simple (fast) vs Multi-Agent (quality)
5. Fixed white-on-white text in chunk display
6. Upgraded OpenAI SDK to 2.3.0 for GPT-5 support

Async improvements:
- True parallelism in Phase 1 (intent + retrieval)
- 15.8% speedup with GPT-5 (75s → 65s)
- Configurable AGENT_TIMEOUT for different models

Documentation:
- RUN.md: Quick start guide
- MODELS.md: Model selection guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (11) hide show

MODELS.md +93 -0
RUN.md +86 -0
app.py +2 -2
app_unified.py +379 -0
requirements.txt +2 -2
src/agents/composer_agent.py +3 -1
src/agents/fact_checker_agent.py +3 -1
src/agents/intent_agent.py +7 -5
src/pipeline/memory_orchestrator.py +4 -4
src/retrieval/__init__.py +1 -1
src/retrieval/memory_retriever.py +26 -19

MODELS.md ADDED Viewed

	@@ -0,0 +1,93 @@

+# Model Configuration Guide
+## Recommended Models by Use Case
+### For Simple Mode (Fast Single LLM)
+| Model | Speed | Quality | Cost | Recommendation |
+|-------|-------|---------|------|----------------|
+| **gpt-5-mini** | ⚡️ Fast | ⭐️⭐️⭐️⭐️ | 💰 Low | ✅ **Best choice** - Fast with reasoning_effort=minimal |
+| gpt-4o-mini | ⚡️⚡️ Fastest | ⭐️⭐️⭐️ | 💰 Low | ✅ **Good fallback** - Most reliable |
+| gpt-5 | 🐌 Slower | ⭐️⭐️⭐️⭐️⭐️ | 💰💰 Medium | Optional for best quality |
+### For Multi-Agent Mode (Quality Pipeline)
+| Model | Speed | Quality | Reliability | Recommendation |
+|-------|-------|---------|-------------|----------------|
+| **gpt-4o-mini** | ⚡️ Fast | ⭐️⭐️⭐️⭐️ | ✅✅✅ Excellent | ✅ **Recommended** - Reliable structured outputs |
+| gpt-4o | 🐌 Slow | ⭐️⭐️⭐️⭐️⭐️ | ✅✅✅ Excellent | For enterprise quality |
+| gpt-5-mini | 🐌 Slower | ⭐️⭐️⭐️⭐️ | ⚠️ Needs tuning | Requires AGENT_TIMEOUT=120 |
+## GPT-5 Models - Special Configuration
+GPT-5 reasoning models have unique requirements:
+### Parameters
+- ❌ No custom `temperature` (fixed at 1.0)
+- ✅ Use `max_completion_tokens` (not `max_tokens`)
+- ✅ Set `reasoning_effort="minimal"` for speed
+- ⚠️ Requires `openai>=2.3.0`
+### Simple Mode (Works Great)
+```bash
+LLM_MODEL=gpt-5-mini
+```
+- Uses `reasoning_effort="minimal"` automatically
+- Fast responses (~5-15s)
+- Good quality
+### Multi-Agent Mode (Use with Caution)
+```bash
+LLM_MODEL=gpt-4o-mini  # Recommended instead
+AGENT_TIMEOUT=120      # Or increase timeout if using GPT-5
+```
+**Issue**: PydanticAI agents don't easily support `reasoning_effort` parameter yet, so GPT-5 models may:
+- Use more reasoning tokens than needed
+- Be slower than expected
+- Timeout with default settings
+**Solution**: Use `gpt-4o-mini` for multi-agent mode, or increase `AGENT_TIMEOUT=120`
+## Configuration Examples
+### Fast & Cheap Setup (Recommended)
+```bash
+# .env
+LLM_MODEL=gpt-5-mini     # Simple mode: Fast with reasoning
+USE_PARALLEL=true        # 15% speedup
+AGENT_TIMEOUT=60         # Standard timeout
+```
+For multi-agent, the app will still use gpt-5-mini but may be slower.
+### Most Reliable Setup
+```bash
+# .env
+LLM_MODEL=gpt-4o-mini    # Both modes: Fast & reliable
+USE_PARALLEL=true
+AGENT_TIMEOUT=60
+```
+### Best Quality Setup
+```bash
+# .env
+LLM_MODEL=gpt-4o         # Premium quality
+USE_PARALLEL=true
+AGENT_TIMEOUT=90         # Longer for complex reasoning
+```
+## Troubleshooting
+**"Pipeline execution timed out"**
+- Increase `AGENT_TIMEOUT` in `.env` (try 120 for GPT-5)
+- Or switch to `gpt-4o-mini`
+**"Empty response"** (GPT-5-nano only)
+- Switch to `gpt-5-mini` or `gpt-5`
+- Check `openai>=2.3.0` installed
+**Slow responses**
+- Simple mode: Use `gpt-5-mini` (with reasoning_effort=minimal)
+- Multi-agent: Use `gpt-4o-mini`
+- Or set `SKIP_FACT_CHECK=true`

RUN.md ADDED Viewed

	@@ -0,0 +1,86 @@

+# How to Run
+## Quick Start
+```bash
+# 1. Activate environment
+source .venv/bin/activate
+# 2. Run the app
+python app.py
+```
+Open browser to: **http://localhost:7860**
+## What You'll See
+### Two Modes:
+**1. Simple (Fast)** - Default
+- Single LLM call
+- ~5-15 seconds
+- Good for quick queries
+**2. Multi-Agent (Quality)**
+- Full pipeline: Intent + Compose + Fact-Check
+- ~40-75 seconds (depending on model)
+- 15.8% faster with async parallelism
+- Higher accuracy with fact-checking
+### Try These Queries:
+- `Wie kann ich mich exmatrikulieren?`
+- `What are the deadlines for leave of absence?`
+- `Wie ändere ich meinen Namen?`
+## Configuration
+Edit `.env`:
+```bash
+LLM_MODEL=gpt-5-mini          # gpt-5-mini, gpt-5, gpt-4o-mini, gpt-4o
+USE_PARALLEL=true             # Parallel async (15% speedup)
+AGENT_TIMEOUT=60              # Timeout per agent
+```
+## Model Comparison
+| Model | Speed | Quality | Cost | Notes |
+|-------|-------|---------|------|-------|
+| **gpt-5-mini** | ⚡️ Fast | ⭐️⭐️⭐️⭐️ Excellent | 💰 Low | **Recommended** - Uses minimal reasoning |
+| gpt-5 | 🐌 Slow | ⭐️⭐️⭐️⭐️⭐️ Best | 💰💰 Medium | Highest quality, slower |
+| gpt-4o-mini | ⚡️⚡️ Fastest | ⭐️⭐️⭐️ Good | 💰 Low | Good fallback option |
+| gpt-4o | 🐌 Slow | ⭐️⭐️⭐️⭐️⭐️ Best | 💰💰💰 High | Enterprise quality |
+### GPT-5 Models
+GPT-5 models use **reasoning tokens** internally:
+- Set `reasoning_effort="minimal"` for speed (default in app)
+- Don't support custom `temperature` (fixed at 1.0)
+- Use `max_completion_tokens` instead of `max_tokens`
+- Requires `openai>=2.3.0`
+## Troubleshooting
+**Slow responses**
+- Use Simple mode
+- Or switch to `gpt-4o-mini`
+- Or set `SKIP_FACT_CHECK=true`
+**Timeout errors**
+- Increase `AGENT_TIMEOUT` in `.env`
+- Or reduce `LLM_MAX_TOKENS`
+**Empty output with GPT-5**
+- Make sure `openai>=2.3.0` is installed
+- Check that `reasoning_effort` is set to `"minimal"`
+## Performance (Multi-Agent Mode)
+With GPT-5 and async parallelism:
+- Sequential: 75.23s 🐌
+- Parallel: 64.97s ⚡
+- **Speedup: 15.8%**
+Test it yourself:
+```bash
+python test_async_performance.py
+```

app.py CHANGED Viewed

@@ -1,7 +1,7 @@
 """Main application entry point for Hugging Face Spaces deployment."""
-# Use the fast simple version (no OpenSearch dependencies)
-from app_simple import demo
 if __name__ == "__main__":
     demo.launch()

 """Main application entry point for Hugging Face Spaces deployment."""
+# Use the unified version with mode toggle
+from app_unified import demo
 if __name__ == "__main__":
     demo.launch()

app_unified.py ADDED Viewed

	@@ -0,0 +1,379 @@

+"""Unified application with toggle between Simple and Multi-Agent modes."""
+import gradio as gr
+import asyncio
+import logging
+from pathlib import Path
+from typing import Tuple
+# Simple mode imports
+from openai import OpenAI
+from src.config import get_config
+from src.document_processing.loader import MarkdownDocumentLoader
+from src.document_processing.chunker import SemanticChunker
+from src.indexing.memory_indexer import MemoryDocumentIndexer
+from src.retrieval.memory_retriever import MemoryRetriever
+from src.ui.formatters import ChunkFormatter
+# Multi-agent mode imports (lazy loaded to avoid import errors on HF Spaces)
+try:
+    from src.pipeline.memory_orchestrator import MemoryRAGOrchestrator
+    MULTI_AGENT_AVAILABLE = True
+except ImportError as e:
+    logging.warning(f"Multi-agent mode not available: {e}")
+    MULTI_AGENT_AVAILABLE = False
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+class UnifiedAssistant:
+    """Unified assistant supporting both simple and multi-agent modes."""
+    @staticmethod
+    def _format_documents_html(documents) -> str:
+        """Format Haystack documents as simple HTML."""
+        if not documents:
+            return "<p>No documents retrieved.</p>"
+        html_parts = []
+        for i, doc in enumerate(documents, 1):
+            source = doc.meta.get("source_file", "Unknown") if hasattr(doc, 'meta') and doc.meta else "Unknown"
+            score = getattr(doc, 'score', 0.0)
+            content = doc.content if hasattr(doc, 'content') else str(doc)
+            # Truncate long content
+            preview = content[:200] + "..." if len(content) > 200 else content
+            html_parts.append(f"""
+            <div style="border: 1px solid #ddd; border-radius: 8px; padding: 16px; margin-bottom: 16px; background: #f9f9f9; color: #1f2937;">
+                <div style="margin-bottom: 12px; color: #1f2937;">
+                    <strong>#{i}</strong> |
+                    <span style="color: #2563eb;">📄 {source}</span> |
+                    <span style="color: #666;">Score: {score:.3f}</span>
+                </div>
+                <details>
+                    <summary style="cursor: pointer; padding: 8px; background: white; border-radius: 4px; margin-bottom: 8px; color: #1f2937;">
+                        <strong>Preview:</strong> {preview}
+                    </summary>
+                    <div style="padding: 12px; background: white; border-radius: 4px; margin-top: 8px; white-space: pre-wrap; font-size: 0.9em; color: #1f2937;">
+                        {content}
+                    </div>
+                </details>
+            </div>
+            """)
+        return "".join(html_parts)
+    def __init__(self):
+        self.config = get_config()
+        self.client = OpenAI(api_key=self.config.llm.api_key)
+        # Load documents (shared by both modes)
+        self.indexer = MemoryDocumentIndexer(llm_config=self.config.llm)
+        self._load_or_create_documents()
+        # Initialize retriever (for simple mode)
+        self.retriever = MemoryRetriever(
+            document_store=self.indexer.document_store,
+            llm_config=self.config.llm,
+            retrieval_config=self.config.retrieval,
+        )
+        # Initialize orchestrator (for multi-agent mode) - lazy
+        self.orchestrator = None
+    def _load_or_create_documents(self):
+        """Load documents from JSON or create fresh."""
+        import json
+        from haystack import Document as HaystackDoc
+        json_path = Path("data/embedded_documents.json")
+        if json_path.exists():
+            logger.info(f"Loading embedded documents from {json_path}...")
+            try:
+                with open(json_path, "r") as f:
+                    docs_data = json.load(f)
+                documents = []
+                for doc_data in docs_data:
+                    doc = HaystackDoc(
+                        id=doc_data.get("id"),
+                        content=doc_data["content"],
+                        embedding=doc_data.get("embedding"),
+                        meta=doc_data.get("meta", {})
+                    )
+                    documents.append(doc)
+                self.indexer.document_store.write_documents(documents)
+                logger.info(f"Loaded {len(documents)} documents with embeddings")
+                return
+            except Exception as e:
+                logger.warning(f"Failed to load documents: {e}")
+        # Create documents if not found
+        logger.info("Creating fresh document index...")
+        loader = MarkdownDocumentLoader(self.config.document_processing.documents_path)
+        documents = loader.load_documents()
+        chunker = SemanticChunker(
+            chunk_size=self.config.document_processing.chunk_size,
+            chunk_overlap=self.config.document_processing.chunk_overlap,
+            min_chunk_size=self.config.document_processing.min_chunk_size,
+        )
+        chunked_docs = chunker.chunk_documents(documents)
+        self.indexer.index_documents(chunked_docs)
+    def _get_orchestrator(self):
+        """Lazy load orchestrator for multi-agent mode."""
+        if self.orchestrator is None:
+            if not MULTI_AGENT_AVAILABLE:
+                raise RuntimeError("Multi-agent mode is not available")
+            self.orchestrator = MemoryRAGOrchestrator(
+                config=self.config,
+                document_indexer=self.indexer  # Correct parameter name
+            )
+        return self.orchestrator
+    def process_query_simple(self, query: str) -> Tuple[str, str, str]:
+        """Process query with simple single-LLM mode (fast)."""
+        logger.info(f"[SIMPLE MODE] Processing query: {query}")
+        # Retrieve documents
+        retrieved_docs = self.retriever.retrieve(query)
+        logger.info(f"Retrieved {len(retrieved_docs)} documents")
+        # Build context
+        max_docs = 2 if "gpt-5" in self.config.llm.model_name else 3
+        max_chars_per_doc = 800 if "gpt-5" in self.config.llm.model_name else 1500
+        context_parts = []
+        for i, doc in enumerate(retrieved_docs[:max_docs], 1):
+            source = doc.meta.get("source_file", "Unknown")
+            content = doc.content[:max_chars_per_doc]
+            context_parts.append(f"[Dokument {i}: {source}]\n{content}\n")
+        context = "\n".join(context_parts) if context_parts else "Keine relevanten Dokumente gefunden."
+        # Generate email with single LLM call
+        system_prompt = """Du bist ein hilfreicher Assistent für die Studienadministration der BFH.
+Deine Aufgabe ist es, professionelle E-Mail-Antworten auf Studentenanfragen zu verfassen.
+Richtlinien:
+- Antworte in der gleichen Sprache wie die Anfrage (Deutsch, Englisch oder Französisch)
+- Verwende einen professionellen aber freundlichen Ton
+- Sei klar, präzise und hilfreich
+- Beziehe dich auf konkrete Formulare, Fristen oder Verfahren wenn relevant
+- Gib klare nächste Schritte an
+- Wenn Informationen fehlen, sage dies ehrlich
+Für deutsche E-Mails:
+- Verwende die formelle "Sie"-Form
+- Grußformel: "Guten Tag" oder "Sehr geehrte/r..."
+- Schlussformel: "Freundliche Grüsse" oder "Mit freundlichen Grüssen"
+"""
+        user_prompt = f"""Beantworte die folgende Anfrage basierend auf den verfügbaren Informationen:
+Anfrage: {query}
+Verfügbare Informationen:
+{context}
+Verfasse eine vollständige professionelle E-Mail-Antwort."""
+        try:
+            # GPT-5 models have different parameter requirements
+            completion_params = {
+                "model": self.config.llm.model_name,
+                "messages": [
+                    {"role": "system", "content": system_prompt},
+                    {"role": "user", "content": user_prompt}
+                ],
+            }
+            # GPT-5 uses max_completion_tokens and supports reasoning_effort parameter
+            if "gpt-5" in self.config.llm.model_name:
+                completion_params["max_completion_tokens"] = self.config.llm.max_tokens
+                # Don't set temperature for GPT-5 (only supports default 1.0)
+                # Use minimal reasoning effort to get actual output instead of all reasoning tokens
+                completion_params["reasoning_effort"] = "minimal"
+            else:
+                completion_params["max_tokens"] = self.config.llm.max_tokens
+                completion_params["temperature"] = self.config.llm.temperature
+            response = self.client.chat.completions.create(**completion_params)
+            logger.info(f"[DEBUG] Response object: {response}")
+            logger.info(f"[DEBUG] Response.choices: {response.choices}")
+            email = response.choices[0].message.content
+            if email is None or email.strip() == "":
+                logger.error(f"LLM returned null or empty response!")
+                logger.error(f"[DEBUG] Full response: {response.model_dump()}")
+                email = "Error: The model returned an empty response. Please try again."
+        except Exception as e:
+            logger.error(f"Error generating email: {e}")
+            email = f"Error generating response: {str(e)}"
+        # Format chunks for display
+        chunks_html = self._format_documents_html(retrieved_docs)
+        # Create metadata
+        metadata = f"""**Mode**: Simple (Single LLM call)
+**Model**: {self.config.llm.model_name}
+**Documents Retrieved**: {len(retrieved_docs)}
+**Documents Used**: {min(len(retrieved_docs), max_docs)}
+"""
+        return email, chunks_html, metadata
+    async def process_query_multi_agent(self, query: str) -> Tuple[str, str, str]:
+        """Process query with multi-agent mode (high quality, async parallel)."""
+        logger.info(f"[MULTI-AGENT MODE] Processing query: {query}")
+        orchestrator = self._get_orchestrator()
+        result = await orchestrator.process_query(query)
+        # Format email
+        email = f"""Subject: {result.email_draft.subject}
+{result.email_draft.body}"""
+        # Format chunks - result.retrieved_docs are dicts, need to convert back
+        from haystack import Document as HaystackDoc
+        docs = []
+        for doc_dict in result.retrieved_docs:
+            if isinstance(doc_dict, dict):
+                doc = HaystackDoc(
+                    content=doc_dict.get('content', ''),
+                    meta=doc_dict.get('meta', {}),
+                    id=doc_dict.get('id')
+                )
+                if 'score' in doc_dict:
+                    doc.score = doc_dict['score']
+                docs.append(doc)
+        chunks_html = self._format_documents_html(docs)
+        # Create metadata
+        mode_type = "Parallel ⚡" if self.config.use_parallel_processing else "Sequential 🐌"
+        metadata = f"""**Mode**: Multi-Agent ({mode_type})
+**Model**: {self.config.llm.model_name}
+**Processing Time**: {result.processing_time:.1f}s
+**Documents Retrieved**: {len(result.retrieved_docs)}
+**Intent Detected**:
+- Action: {result.intent.action_type}
+- Topic: {result.intent.topic}
+- Language: {result.intent.language}
+**Fact Check**:
+- Accuracy: {result.fact_check.accuracy_score:.0%}
+- Status: {'✓ Accurate' if result.fact_check.is_accurate else '⚠ Issues Found'}
+- Issues: {len(result.fact_check.issues_found)}
+"""
+        if result.fact_check.issues_found:
+            metadata += "\n**Issues**:\n"
+            for issue in result.fact_check.issues_found[:3]:  # Show first 3
+                metadata += f"- {issue}\n"
+        return email, chunks_html, metadata
+    def process_query(self, query: str, mode: str) -> Tuple[str, str, str]:
+        """Process query with selected mode."""
+        if not query or not query.strip():
+            return "Please enter a query.", "", ""
+        try:
+            if mode == "Simple (Fast)":
+                return self.process_query_simple(query)
+            else:  # Multi-Agent
+                if not MULTI_AGENT_AVAILABLE:
+                    return (
+                        "Multi-agent mode is not available. Using simple mode instead.",
+                        "",
+                        "Error: Multi-agent dependencies not loaded"
+                    )
+                # Run async function
+                return asyncio.run(self.process_query_multi_agent(query))
+        except Exception as e:
+            logger.error(f"Error processing query: {e}", exc_info=True)
+            return f"Error: {str(e)}", "", ""
+# Initialize assistant
+logger.info("Initializing Unified Assistant...")
+assistant = UnifiedAssistant()
+logger.info("Assistant ready!")
+# Example queries
+EXAMPLE_QUERIES = [
+    "Wie kann ich mich exmatrikulieren?",
+    "What are the deadlines for leave of absence?",
+    "Wie ändere ich meinen Namen in den Studiendokumenten?",
+    "Welche Versicherungen brauche ich als Student?",
+]
+# Create Gradio interface
+with gr.Blocks(title="BFH Student Administration Assistant") as demo:
+    gr.Markdown("# 🎓 BFH Student Administration Email Assistant")
+    gr.Markdown("""
+    Ask questions about BFH student administration and receive professional email responses.
+    **Modes**:
+    - **Simple (Fast)**: Single LLM call (~5-10s) - Best for quick responses
+    - **Multi-Agent (Quality)**: Intent + Compose + Fact-Check (~60-75s) - Best for accuracy
+    """)
+    with gr.Row():
+        with gr.Column(scale=2):
+            mode_radio = gr.Radio(
+                choices=["Simple (Fast)", "Multi-Agent (Quality)"],
+                value="Simple (Fast)",
+                label="Processing Mode",
+                info="Simple mode is faster, Multi-Agent provides higher quality"
+            )
+            query_input = gr.Textbox(
+                label="Your Question",
+                placeholder="e.g., Wie kann ich mich exmatrikulieren?",
+                lines=3
+            )
+            submit_btn = gr.Button("Generate Email Response", variant="primary")
+            gr.Examples(
+                examples=EXAMPLE_QUERIES,
+                inputs=query_input,
+                label="Example Questions"
+            )
+        with gr.Column(scale=3):
+            email_output = gr.Textbox(
+                label="Generated Email",
+                lines=15,
+                show_copy_button=True
+            )
+            metadata_output = gr.Markdown(label="Processing Info")
+    with gr.Accordion("Retrieved Source Documents", open=False):
+        chunks_output = gr.HTML(label="Source Chunks")
+    submit_btn.click(
+        fn=assistant.process_query,
+        inputs=[query_input, mode_radio],
+        outputs=[email_output, chunks_output, metadata_output]
+    )
+if __name__ == "__main__":
+    demo.launch()

requirements.txt CHANGED Viewed

@@ -10,8 +10,8 @@ pydantic==2.11.10
 pydantic_core==2.33.2
 griffe>=1.5.0
-# OpenAI
-openai==2.2.0
 # Gradio UI
 gradio==5.49.0

 pydantic_core==2.33.2
 griffe>=1.5.0
+# OpenAI (>= 2.3.0 required for GPT-5 reasoning_effort parameter)
+openai==2.3.0
 # Gradio UI
 gradio==5.49.0

src/agents/composer_agent.py CHANGED Viewed

@@ -41,8 +41,10 @@ class ComposerAgent:
             api_key: OpenAI API key
             model: Model to use for composition
         """
-        self.agent = Agent[None, EmailDraft](
             model,
             system_prompt="""You are an expert email composer for BFH (Bern University of Applied Sciences) administrative staff.
 Your task is to compose professional, accurate, and helpful email responses to student inquiries based on:

             api_key: OpenAI API key
             model: Model to use for composition
         """
+        # Note: Must explicitly pass output_type parameter for structured outputs
+        self.agent = Agent(
             model,
+            output_type=EmailDraft,
             system_prompt="""You are an expert email composer for BFH (Bern University of Applied Sciences) administrative staff.
 Your task is to compose professional, accurate, and helpful email responses to student inquiries based on:

src/agents/fact_checker_agent.py CHANGED Viewed

@@ -51,8 +51,10 @@ class FactCheckerAgent:
             api_key: OpenAI API key
             model: Model to use for fact checking
         """
-        self.agent = Agent[None, FactCheckResult](
             model,
             system_prompt="""You are an expert fact-checker for university administrative communications.
 Your task is to verify the accuracy of email drafts against source documents from the knowledge base.

             api_key: OpenAI API key
             model: Model to use for fact checking
         """
+        # Note: Must explicitly pass output_type parameter for structured outputs
+        self.agent = Agent(
             model,
+            output_type=FactCheckResult,
             system_prompt="""You are an expert fact-checker for university administrative communications.
 Your task is to verify the accuracy of email drafts against source documents from the knowledge base.

src/agents/intent_agent.py CHANGED Viewed

@@ -44,8 +44,10 @@ class IntentAgent:
             api_key: OpenAI API key
             model: Model to use for intent extraction
         """
-        self.agent = Agent[None, IntentData](
             model,
             system_prompt="""You are an expert at analyzing user queries for a university administrative email assistant.
 Your task is to extract structured intent information from user queries. Analyze:
@@ -81,10 +83,10 @@ Provide accurate, structured intent extraction to help compose appropriate email
         try:
             result = await self.agent.run(query)
-            logger.debug(f"Agent result type: {type(result)}")
-            logger.debug(f"Result attributes: {dir(result)}")
-            logger.debug(f"Result.output type: {type(result.output)}")
-            logger.debug(f"Result.output: {result.output}")
             intent = result.output

             api_key: OpenAI API key
             model: Model to use for intent extraction
         """
+        # Note: Must explicitly pass output_type parameter for structured outputs
+        self.agent = Agent(
             model,
+            output_type=IntentData,
             system_prompt="""You are an expert at analyzing user queries for a university administrative email assistant.
 Your task is to extract structured intent information from user queries. Analyze:
         try:
             result = await self.agent.run(query)
+            logger.info(f"[DEBUG] Agent result type: {type(result)}")
+            logger.info(f"[DEBUG] Result.output type: {type(result.output)}")
+            logger.info(f"[DEBUG] Result.output content: {result.output}")
+            logger.info(f"[DEBUG] Result.output repr: {repr(result.output)}")
             intent = result.output

src/pipeline/memory_orchestrator.py CHANGED Viewed

@@ -106,8 +106,8 @@ class MemoryRAGOrchestrator:
             phase1_start = time.time()
             intent, retrieved_docs = await asyncio.gather(
-                self._extract_intent_with_timeout(query, timeout=10),
-                self._retrieve_with_timeout(query, timeout=10),
             )
             phase1_time = time.time() - phase1_start
@@ -122,7 +122,7 @@ class MemoryRAGOrchestrator:
                 query=query,
                 intent=intent,
                 context_docs=retrieved_docs,
-                timeout=20
             )
             phase2_time = time.time() - phase2_start
@@ -143,7 +143,7 @@ class MemoryRAGOrchestrator:
                 fact_check = await self._fact_check_with_timeout(
                     email_draft=email_draft,
                     source_docs=retrieved_docs,
-                    timeout=15
                 )
                 phase3_time = time.time() - phase3_start

             phase1_start = time.time()
             intent, retrieved_docs = await asyncio.gather(
+                self._extract_intent_with_timeout(query, timeout=self.config.agent_timeout),
+                self._retrieve_with_timeout(query, timeout=self.config.agent_timeout),
             )
             phase1_time = time.time() - phase1_start
                 query=query,
                 intent=intent,
                 context_docs=retrieved_docs,
+                timeout=self.config.agent_timeout
             )
             phase2_time = time.time() - phase2_start
                 fact_check = await self._fact_check_with_timeout(
                     email_draft=email_draft,
                     source_docs=retrieved_docs,
+                    timeout=self.config.agent_timeout
                 )
                 phase3_time = time.time() - phase3_start

src/retrieval/__init__.py CHANGED Viewed

@@ -17,4 +17,4 @@ try:
     from .query_rewriter import QueryRewriter, RewrittenQueries
     __all__.extend(["QueryRewriter", "RewrittenQueries"])
 except ImportError:
-    pass  # PydanticAI not installed

     from .query_rewriter import QueryRewriter, RewrittenQueries
     __all__.extend(["QueryRewriter", "RewrittenQueries"])
 except ImportError:
+    pass  # PydanticAI not installed

src/retrieval/memory_retriever.py CHANGED Viewed

@@ -51,25 +51,32 @@ class MemoryRetriever:
             model=llm_config.embedding_model,
         )
-        # Initialize cross-encoder reranker (public model, disable auth)
-        logger.info("Loading cross-encoder reranking model...")
-        # Explicitly disable token to avoid HF Spaces auto-injected invalid tokens
-        import os
-        # Temporarily remove HF tokens to prevent auth errors
-        old_token = os.environ.pop('HF_TOKEN', None)
-        old_hub_token = os.environ.pop('HUGGING_FACE_HUB_TOKEN', None)
-        try:
-            self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
-        finally:
-            # Restore tokens
-            if old_token:
-                os.environ['HF_TOKEN'] = old_token
-            if old_hub_token:
-                os.environ['HUGGING_FACE_HUB_TOKEN'] = old_hub_token
-        logger.info("Reranker loaded successfully")
     def retrieve(self, query: str) -> List[Document]:
         """

             model=llm_config.embedding_model,
         )
+        # Lazy-load reranker (only when first needed to avoid HF Spaces startup issues)
+        self._reranker = None
+    @property
+    def reranker(self):
+        """Lazy-load the cross-encoder reranker on first use."""
+        if self._reranker is None:
+            logger.info("Loading cross-encoder reranking model...")
+            # Explicitly disable token to avoid HF Spaces auto-injected invalid tokens
+            import os
+            old_token = os.environ.pop('HF_TOKEN', None)
+            old_hub_token = os.environ.pop('HUGGING_FACE_HUB_TOKEN', None)
+            try:
+                self._reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
+            finally:
+                # Restore tokens
+                if old_token:
+                    os.environ['HF_TOKEN'] = old_token
+                if old_hub_token:
+                    os.environ['HUGGING_FACE_HUB_TOKEN'] = old_hub_token
+            logger.info("Reranker loaded successfully")
+        return self._reranker
     def retrieve(self, query: str) -> List[Document]:
         """