GAIA_Agent / current_architecture.md
Delanoe Pirard


Current GAIA Multi-Agent Framework Architecture

This document summarizes the architecture of the GAIA multi-agent framework based on the provided Python source files.

Core Framework

  • Technology: The system is built using the llama_index.core.agent.workflow.AgentWorkflow from the LlamaIndex library.
  • Orchestration: app.py serves as the main entry point: it initializes a Gradio web interface, fetches benchmark questions from a specified API endpoint, handles the files (text, image, audio) attached to questions, runs the agent workflow on each question, and submits the answers back to the API.
  • Root Agent: The workflow designates planner_agent as the root_agent, meaning it receives the initial user request (question) and orchestrates the subsequent steps.
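Under these assumptions, the core wiring might look like the following sketch. Agent names are taken from this document; tool lists and LLM configuration are elided, and the exact constructor arguments depend on the installed LlamaIndex version:

```python
# Illustrative wiring only -- tools and LLM configuration elided; exact
# signatures depend on the installed LlamaIndex version.
from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent

planner_agent = FunctionAgent(
    name="planner_agent",
    description="Strategic planning, task decomposition, and final synthesis.",
    system_prompt="Break the objective into sub-steps and delegate them.",
    tools=[],  # generate_substeps, synthesize_and_respond in the real system
    can_handoff_to=["research_agent", "code_agent", "math_agent"],
)

workflow = AgentWorkflow(
    agents=[planner_agent],  # plus the other specialist agents
    root_agent="planner_agent",  # receives the initial question
)
```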

Agent Roster and Capabilities

The framework comprises several specialized agents, each designed for specific tasks:

  1. planner_agent (Root):

    • Purpose: Strategic planning, task decomposition, and final synthesis.
    • Tools: generate_substeps (breaks down objectives using an LLM), synthesize_and_respond (aggregates results into a final report using an LLM).
    • Workflow: Receives the initial objective, breaks it into sub-steps, delegates these steps to appropriate specialist agents, and finally synthesizes the collected results into a coherent answer.
    • Handoffs: Can delegate to code_agent, research_agent, math_agent, role_agent, image_analyzer_agent, text_analyzer_agent, verifier_agent, reasoning_agent.
  2. role_agent:

    • Purpose: Determines and sets the appropriate persona or context for the task.
    • Tools: role_prompt_retriever (uses a combination of vector search and BM25 retrieval on the fka/awesome-chatgpt-prompts dataset, followed by reranking, to find the best role/prompt).
    • Workflow: Interprets user intent, retrieves relevant role descriptions, selects the best fit, and provides the role/prompt.
    • Handoffs: Hands off to planner_agent after setting the role.
  3. code_agent:

    • Purpose: Generates and executes Python code.
    • Tools: python_code_generator (uses the OpenAI o4-mini model to generate code from a prompt), code_interpreter (uses LlamaIndex's code interpreter tool spec, likely for sandboxed execution), and a custom SimpleCodeExecutor (executes Python code via subprocess; not safe for production).
    • Workflow: Takes a description, generates code, executes/tests it, and returns the result or final code.
    • Handoffs: Hands off to planner_agent or reasoning_agent.
  4. math_agent:

    • Purpose: Performs mathematical computations.
    • Tools: A large suite of functions covering symbolic math (SymPy), matrix operations (NumPy), statistics (NumPy), numerical methods (NumPy, SciPy), vector math (NumPy), probability (SciPy), and potentially more (file was truncated). Also includes WolframAlpha integration.
    • Workflow: Executes specific mathematical operations based on requests.
    • Handoffs: (Inferred) Likely hands off to planner_agent or reasoning_agent.
  5. research_agent:

    • Purpose: Gathers information from the web and specialized sources.
    • Tools: Web search (Google, DuckDuckGo, Tavily), web browsing/interaction (Helium/Selenium: visit, get_text_by_css, get_page_html, click_element, search_item_ctrl_f, go_back, close_popups), Wikipedia search/loading, Yahoo Finance data retrieval, ArXiv paper search.
    • Workflow: Executes a plan-act-observe loop to find and extract information from various online sources.
    • Handoffs: Can delegate to code_agent, math_agent, analyzer_agent (presumably text_analyzer_agent or image_analyzer_agent), planner_agent, and reasoning_agent.
  6. text_analyzer_agent:

    • Purpose: Extracts text from PDFs and analyzes text content.
    • Tools: extract_text_from_pdf (uses PyPDF2, handles URLs and local files), analyze_text (uses an LLM to generate summary and key facts).
    • Workflow: If input is PDF, extracts text; then analyzes the text to produce a summary and list of facts.
    • Handoffs: Hands off to verifier_agent.
  7. image_analyzer_agent:

    • Purpose: Analyzes image content factually.
    • Tools: Relies directly on the multimodal capabilities of its underlying LLM (Gemini 1.5 Pro) to process image inputs provided via ChatMessage blocks. No specific image analysis tool is defined, but the system prompt dictates a detailed, structured analysis format.
    • Workflow: Receives an image, performs analysis according to a strict factual template.
    • Handoffs: Hands off to planner_agent, research_agent, or reasoning_agent.
  8. verifier_agent:

    • Purpose: Assesses the confidence of factual statements and detects contradictions.
    • Tools: verify_facts (uses Gemini 2.0 Flash to assign confidence scores to facts), find_contradictions (uses simple string matching to detect negation pairs).
    • Workflow: Takes a list of facts, scores them, checks for contradictions, and reports results.
    • Handoffs: Hands off to reasoning_agent or planner_agent.
  9. reasoning_agent:

    • Purpose: Performs explicit chain-of-thought reasoning.
    • Tools: reasoning_tool (uses the OpenAI o4-mini model with a detailed prompt to perform chain-of-thought reasoning over the provided context).
    • Workflow: Takes context, applies reasoning via the tool, and provides the structured reasoning output.
    • Handoffs: Hands off to planner_agent.
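role_agent's hybrid retrieval (item 2 above) combines lexical and vector scores before reranking. A minimal, library-free sketch of the lexical half, Okapi BM25 over tokenized prompts (the function name and data here are invented for illustration; the real agent fuses this with vector-search scores and a reranker):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: score each tokenized doc against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# Toy stand-in for the fka/awesome-chatgpt-prompts corpus:
prompts = [
    "act as a python interpreter".split(),
    "act as a travel guide".split(),
]
scores = bm25_scores(["python"], prompts)
best = scores.index(max(scores))  # index of the best-matching role prompt
```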
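The code_agent's SimpleCodeExecutor (item 3 above) runs generated code via subprocess, which the document flags as unsafe for production. A sketch along those lines (details assumed, not the actual implementation):

```python
import os
import subprocess
import sys
import tempfile

def execute_python(code: str, timeout: float = 10.0):
    """Run a code string in a fresh interpreter via subprocess.
    No sandboxing: matches the 'not safe for production' warning above."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout, proc.stderr, proc.returncode
    finally:
        os.remove(path)
```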
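The math_agent's suite (item 4 above) is a set of plain functions that the framework wraps as tools. Two representative sketches, one NumPy-backed and one pure-Python; the actual function names in the source are not known from this document:

```python
import math
import numpy as np

def solve_linear_system(a, b):
    """Matrix-operations tool: solve Ax = b with NumPy."""
    return np.linalg.solve(np.asarray(a, float), np.asarray(b, float)).tolist()

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability tool: normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```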
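verifier_agent's find_contradictions (item 8 above) relies on simple string matching for negation pairs rather than an LLM. A sketch of that idea (the exact pairing logic is assumed):

```python
def find_contradictions(facts):
    """Flag fact pairs where one looks like the negation of the other.
    Naive string matching, as described above: 'X is Y' vs 'X is not Y'."""
    contradictions = []
    normalized = [f.lower().strip().rstrip(".") for f in facts]
    for i, a in enumerate(normalized):
        for j in range(i + 1, len(normalized)):
            b = normalized[j]
            if (" is not " in a and a.replace(" is not ", " is ") == b) or (
                " is not " in b and b.replace(" is not ", " is ") == a
            ):
                contradictions.append((facts[i], facts[j]))
    return contradictions
```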

Workflow and Data Flow

  1. A question (potentially with associated files) arrives at app.py.
  2. app.py formats the input (e.g., ChatMessage with TextBlock, ImageBlock, AudioBlock) and passes it to the AgentWorkflow starting with planner_agent.
  3. planner_agent breaks down the task.
  4. It may call role_agent to set context.
  5. It delegates sub-tasks to specialized agents (research, code, math, text_analyzer, image_analyzer).
  6. Agents execute their tasks, potentially calling tools or other agents (e.g., text_analyzer calls verifier_agent).
  7. reasoning_agent might be called for complex logical steps or verification.
  8. Results flow back up, eventually reaching planner_agent.
  9. planner_agent synthesizes the final answer using synthesize_and_respond.
  10. app.py receives the final answer and submits it.
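Steps 1-2 and 10 live in app.py. A sketch of how the input might be packaged and the workflow invoked, using the block-based ChatMessage construction mentioned above (a rough sketch: error handling and the submission API call are elided, and exact signatures are version-dependent):

```python
# Sketch of the app.py driver described above; assumes a recent LlamaIndex.
import asyncio
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock

async def answer_question(workflow, question: str, image_path: str | None = None):
    blocks = [TextBlock(text=question)]
    if image_path:
        blocks.append(ImageBlock(path=image_path))  # AudioBlock handled similarly
    msg = ChatMessage(role="user", blocks=blocks)
    result = await workflow.run(user_msg=msg)  # workflow enters at planner_agent
    return str(result)  # final synthesized answer, submitted back to the API
```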

Technology Stack Summary

  • Core: Python, LlamaIndex
  • LLMs: Google Gemini (1.5 Pro, 2.0 Flash), OpenAI (o4-mini)
  • UI: Gradio
  • Web Interaction: Selenium, Helium
  • Data Handling: Pandas, PyPDF2, Requests
  • Search/Retrieval: HuggingFace Embeddings/Rerankers, Datasets, LlamaIndex Tool Specs (Google, Tavily, Wikipedia, DuckDuckGo, Yahoo Finance, ArXiv)
  • Math: SymPy, NumPy, SciPy, WolframAlpha
  • Code Execution: Subprocess (basic executor), LlamaIndex Code Interpreter