# GAIA Agent Development Plan

This document outlines a structured approach to developing an agent for the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.

**I. Understanding the Task & Data:**

1. **Analyze `common_questions.json`:**
   * **Structure:** Each entry has `task_id`, `Question`, `Level`, `Final answer`, and sometimes `file_name`.
   * **Question Types:** Identify patterns:
     * Direct information retrieval (e.g., "How many studio albums...").
     * Web search required (e.g., "On June 6, 2023, an article...").
     * File-based questions (audio, images, code; indicated by `file_name`).
     * Logic/reasoning puzzles (e.g., the table-based commutativity question, the reversed sentence).
     * Multi-step questions.
   * **Answer Format:** Observe the format of `Final answer` for each type. Note the guidance in `docs/submission_instructions.md` regarding formatting (numbers, few words, comma-separated lists).
   * **File Dependencies:** List all unique `file_name` extensions to understand which file-processing capabilities are needed (e.g., `.mp3`, `.png`, `.py`, `.xlsx`); a quick survey script is sketched after this list.
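A minimal survey sketch, assuming `common_questions.json` is a JSON list in the working directory with the fields noted above:

```python
import json
from collections import Counter
from pathlib import Path

# Tally question levels and attached file extensions so tool development
# can be prioritized; field names follow the structure noted above.
questions = json.loads(Path("common_questions.json").read_text())

levels = Counter(q["Level"] for q in questions)
extensions = Counter(
    Path(q["file_name"]).suffix for q in questions if q.get("file_name")
)

print("Questions per level:", dict(levels))
print("File extensions needed:", dict(extensions))
```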

2. **Review Project Context:**
   * **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method); a minimal skeleton follows this list.
   * **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
   * **Model:** The agent will likely use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
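A sketch of that interface, so later components all plug into the same shape (the method bodies are placeholders, not the real implementation):

```python
class BasicAgent:
    """Skeleton matching the interface expected by app.py."""

    def __init__(self):
        # Load the LLM client, tool registry, and prompts here.
        pass

    def __call__(self, question: str) -> str:
        # Analyze -> (optionally) run tools -> synthesize -> format.
        return "placeholder answer"
```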

**II. Agent Architecture Design (Conceptual):**

1. **Core Agent Loop (`MyAgent.answer` or `MyAgent.__call__`):**
   * **Input:** Question string (and `task_id`/`file_name` if passed separately or parsed from a richer input object).
   * **Step 1: Question Analysis & Planning:**
     * Use the LLM to understand the question.
     * Determine the type of question (web search, file processing, direct knowledge, etc.).
     * Identify whether any tools are needed.
     * Formulate a high-level plan (e.g., "Search the web for X, then extract Y from the page").
   * **Step 2: Tool Selection & Execution (if needed):**
     * Based on the plan, select and invoke the appropriate tools.
     * Pass the necessary parameters to each tool (e.g., search query, file path).
     * Collect tool outputs.
   * **Step 3: Information Synthesis & Answer Generation:**
     * Use the LLM to process tool outputs and any retrieved information.
     * Generate the final answer string.
   * **Step 4: Answer Formatting:**
     * Ensure the answer conforms to the expected format (using guidance from the `common_questions.json` examples and `docs/submission_instructions.md`). This might involve specific post-processing rules or prompting the LLM for a specific format.
   * **Output:** Return the formatted answer string. (A sketch of the four-step pipeline follows this list.)
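The loop might read like the sketch below; `self.llm.plan`, `self.llm.synthesize`, and `self.format_answer` are hypothetical helpers standing in for the modules described next, so this is pseudostructure rather than runnable code:

```python
def __call__(self, question: str) -> str:
    # Step 1: ask the LLM to classify the question and draft a plan.
    plan = self.llm.plan(question)  # hypothetical helper

    # Step 2: run whichever tools the plan calls for.
    observations = []
    for step in plan.tool_calls:
        tool = self.tools[step.name]  # tool registry keyed by tool name
        observations.append(tool(**step.arguments))

    # Step 3: let the LLM synthesize a draft answer from the observations.
    draft = self.llm.synthesize(question, observations)

    # Step 4: enforce the GAIA answer format before returning.
    return self.format_answer(draft)
```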

2. **Key Modules/Components:**

   * **LLM Interaction Module:**
     * Handles communication with the chosen LLM (e.g., GPT4All Llama 3).
     * Manages prompt construction (system prompts, user prompts, few-shot examples if useful).
     * Parses LLM responses.
   * **Tool Library:** A collection of functions/classes that the agent can call.
     * `WebSearchTool` (a stand-in sketch follows):
       * Input: Search query.
       * Action: Uses a search engine API (or simulates browsing if necessary, though a direct API is better).
       * Output: List of search results (titles, snippets, URLs) or page content.
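A stand-in sketch using Wikipedia's public `opensearch` endpoint just to demonstrate the query-in, results-out shape; a production agent would swap in a real search engine API:

```python
import requests

def web_search(query: str, limit: int = 5) -> list[dict]:
    """Return search results as {title, snippet, url} dicts."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "opensearch", "search": query,
                "limit": limit, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    # opensearch returns [query, [titles], [descriptions], [urls]].
    _, titles, snippets, urls = resp.json()
    return [{"title": t, "snippet": s, "url": u}
            for t, s, u in zip(titles, snippets, urls)]
```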
     * `FileReaderTool`:
       * Input: File path (derived from `file_name` and `task_id` to locate/fetch the file).
       * Action: Reads content based on file type (see the dispatch sketch after this list):
         * Text files (`.py`): Read as a string.
         * Spreadsheets (`.xlsx`): Parse relevant data (requires a library like `pandas` or `openpyxl`).
         * Audio files (`.mp3`): Transcribe to text (requires a speech-to-text model/API).
         * Image files (`.png`): Describe image content or extract text (requires a vision model/API or OCR).
       * Output: Processed content (text, structured data).
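A dispatch skeleton covering the text and spreadsheet cases; the audio and image branches are left as placeholders because the speech-to-text and vision choices are still open:

```python
from pathlib import Path

import pandas as pd

def read_file(path: str) -> str:
    """Return file content as text, dispatching on the extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".py", ".txt"}:
        return Path(path).read_text()
    if suffix == ".xlsx":
        # Flatten the sheet to CSV text so the LLM can consume it.
        return pd.read_excel(path).to_csv(index=False)
    if suffix == ".mp3":
        raise NotImplementedError("plug in a speech-to-text model here")
    if suffix == ".png":
        raise NotImplementedError("plug in a vision model or OCR here")
    raise ValueError(f"unsupported file type: {suffix}")
```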
     * `CodeInterpreterTool` (for `.py` files, like in task `f918266a-b3e0-4914-865d-4faa564f1aef`):
       * Input: Python code string.
       * Action: Executes the code in a sandboxed environment (sketched below).
       * Output: Captured stdout/stderr or the final expression value.
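A minimal sketch using a subprocess with a timeout; note that this limits runaway code but is not a true sandbox, so untrusted code would need a container or restricted runtime:

```python
import subprocess
import sys

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a Python code string in a child process and capture output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr
```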
     * *(Potentially)* `KnowledgeBaseTool`: If there is a way to pre-process or index relevant documents/FAQs for faster lookups (though most GAIA questions imply dynamic information retrieval).
   * **File Management/Access:**
     * Mechanism to locate/download files associated with `task_id` and `file_name`. The API endpoint `GET /files/{task_id}` from `docs/API.md` is relevant here (a download sketch follows). For local testing with `common_questions.json`, ensure these files are available locally.
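A download sketch against that endpoint; the base URL is a parameter because the actual host comes from `docs/API.md`, not from this plan:

```python
from pathlib import Path

import requests

def fetch_task_file(base_url: str, task_id: str, file_name: str,
                    dest_dir: str = "files") -> str:
    """Download a task's attachment via GET /files/{task_id}; return its local path."""
    dest = Path(dest_dir)
    dest.mkdir(exist_ok=True)
    resp = requests.get(f"{base_url}/files/{task_id}", timeout=30)
    resp.raise_for_status()
    out = dest / file_name
    out.write_bytes(resp.content)
    return str(out)
```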
   * **Prompt Engineering Strategy:**
     * Develop a set of system prompts to guide the agent's behavior (e.g., "You are a helpful AI assistant designed to answer questions from the GAIA benchmark...").
     * Develop task-specific prompts or prompt templates for different question types or tool usage.
     * Incorporate answer formatting instructions into prompts (see the example after this list).
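One possible system prompt; the formatting rules paraphrase the guidance in `docs/submission_instructions.md` and would be tuned during testing:

```python
# Draft system prompt for the LLM Interaction Module.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant designed to answer questions from the "
    "GAIA benchmark. Answer with a number, as few words as possible, or a "
    "comma-separated list. Do not add units, articles, or explanations "
    "unless the question asks for them."
)
```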

**III. Development & Testing Strategy:**

1. **Environment Setup:**
   * Install the necessary Python libraries for LLM interaction, web requests, and file processing (e.g., `requests`, `beautifulsoup4` for web scraping if needed, `pandas`, `Pillow` for images, a speech-recognition library, etc.).
2. **Iterative Implementation:**
   * **Phase 1: Basic LLM Agent:** Start with an agent that only uses the LLM for direct-answer questions (no tools).
   * **Phase 2: Web Search Integration:** Implement the `WebSearchTool` and integrate it for questions requiring web lookups.
   * **Phase 3: File Handling:**
     * Implement `FileReaderTool` for one file type at a time (e.g., start with `.txt` or `.py`, then `.mp3`, `.png`, `.xlsx`).
     * Implement `CodeInterpreterTool`.
   * **Phase 4: Complex Reasoning & Multi-step:** Refine the LLM's planning and synthesis capabilities to handle more complex, multi-step questions that may involve multiple tool uses.
3. **Testing:**
   * Use `common_questions.json` as the primary test set (a harness sketch follows this list).
   * Adapt the script from `docs/testing_recipe.md` (or use `utilities/evaluate_local.py` if suitable) to run the agent against these questions and compare outputs.
   * Focus on one question type or `task_id` at a time for debugging.
   * Log the agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
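A harness sketch, assuming `normalize` (from `docs/testing_recipe.md`) maps an answer string to its canonical form:

```python
import json
from pathlib import Path

def evaluate(agent, normalize) -> float:
    """Run the agent over common_questions.json and return its accuracy."""
    questions = json.loads(Path("common_questions.json").read_text())
    correct = 0
    for q in questions:
        predicted = agent(q["Question"])
        if normalize(predicted) == normalize(q["Final answer"]):
            correct += 1
        else:
            print(f"[MISS] {q['task_id']}: {predicted!r} != {q['Final answer']!r}")
    return correct / len(questions)
```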

**IV. Pre-computation/Pre-analysis (before coding):**

1. **Map Question Types to Tools:** For each question in `common_questions.json`, manually note which tool(s) would ideally be used. This helps prioritize tool development.
   * Example:
     * `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` (Mercedes Sosa albums): `WebSearchTool`
     * `cca530fc-4052-43b2-b130-b30968d8aa44` (chess): `FileReaderTool` (image) + vision/chess-engine tool (or a very advanced vision LLM)
     * `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (pie ingredients): `FileReaderTool` (audio) + speech-to-text
     * `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): `FileReaderTool` (code) + `CodeInterpreterTool`
2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool, as in the sketch below.
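One possible convention for those signatures, using a `typing.Protocol` so tools stay swappable without a shared base class:

```python
from typing import Protocol

class Tool(Protocol):
    """Interface every tool implements."""

    name: str  # key used by the agent's tool registry

    def __call__(self, **kwargs) -> str:
        """Run the tool and return its output as text for the LLM."""
        ...
```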

This structured approach should provide a solid foundation for developing the agent. The keys will be modularity, robust tool implementation, and effective prompt engineering to guide the LLM.