Romain Fayoux committed
Commit b284752 · 1 Parent(s): 3a7aaed
Cleaned custom evaluations from project (claude slop)

Browse files:
- GAIA_COMPARISON.md +0 -142
- app.py +0 -43
- comparison.py +0 -160
- debug_phoenix.py +0 -285
- debug_spans.py +0 -77
- phoenix_evaluator.py +0 -273
- test_comparison.py +0 -144
- test_phoenix_logging.py +0 -261
- test_phoenix_simple.py +0 -139
GAIA_COMPARISON.md
DELETED
@@ -1,142 +0,0 @@

# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading
- Loads correct answers from `data/metadata.jsonl`
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset

### 2. Answer Comparison
For each agent answer, the system calculates:
- **Exact Match**: Boolean indicating if answers match exactly (after normalization)
- **Similarity Score**: 0-1 score using difflib.SequenceMatcher
- **Contains Answer**: Boolean indicating if the correct answer is contained in the agent's response

### 3. Answer Normalization
Before comparison, answers are normalized by:
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces

### 4. Phoenix Integration
- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in Phoenix UI for historical tracking and analysis

## Usage

### In Your Agent App
The comparison happens automatically when you run your agent:

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - System compares answers to ground truth
3. **Enhanced results** - Results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking

### Results Display
Your results table now includes these additional columns:
- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False if the correct answer is contained

### Status Message
The status message now includes:
```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and AnswerComparator class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.

### Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores
Low similarity scores might indicate:
- Agent is providing verbose answers when short ones are expected
- Answer format doesn't match the expected format
- Agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py`:
- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify Phoenix logging format

## Performance

The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10ms
- Phoenix logging: ~10-50ms per evaluation

Total additional time: Usually < 5 seconds for 50 questions.
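For reference, a minimal sketch of the three metrics the deleted documentation describes (exact match after normalization, difflib.SequenceMatcher similarity, and substring containment). The helper names here are illustrative and not part of the removed code.

```python
import re
from difflib import SequenceMatcher


def normalize(answer: str) -> str:
    # Lowercase, strip common punctuation, and collapse whitespace, as described above.
    answer = str(answer).strip().lower()
    answer = re.sub(r'[.,;:!?"\']', '', answer)
    return re.sub(r'\s+', ' ', answer).strip()


def compare(predicted: str, actual: str) -> dict:
    # Compute the three metrics on normalized strings.
    p, a = normalize(predicted), normalize(actual)
    return {
        "exact_match": p == a,
        "similarity_score": SequenceMatcher(None, p, a).ratio(),
        "contains_answer": a in p,
    }


print(compare("The answer is 3.", "3"))
# -> {'exact_match': False, 'similarity_score': 0.125, 'contains_answer': True}
```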
app.py
CHANGED
@@ -7,10 +7,6 @@ from phoenix.otel import register
 from openinference.instrumentation.smolagents import SmolagentsInstrumentor
 from llm_only_agent import LLMOnlyAgent
 from multi_agent import MultiAgent
-from comparison import AnswerComparator
-from phoenix_evaluator import log_evaluations_to_phoenix
-import phoenix as px
-
 
 # (Keep Constants as is)
 # --- Constants ---
@@ -139,32 +135,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
         print("Agent did not produce any answers to submit.")
         return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
 
-    # 3.5. Compare with Ground Truth and Log to Phoenix
-    print("Comparing answers with ground truth...")
-    try:
-        # Initialize comparator
-        comparator = AnswerComparator()
-
-        # Evaluate answers
-        evaluations_df = comparator.evaluate_batch(answers_payload)
-
-        # Get summary statistics
-        summary_stats = comparator.get_summary_stats(evaluations_df)
-
-        # Enhance results log with comparison data
-        results_log = comparator.enhance_results_log(results_log)
-
-        # Log evaluations to Phoenix
-        log_evaluations_to_phoenix(evaluations_df)
-
-        print(
-            f"Ground truth comparison completed: {summary_stats['exact_matches']}/{summary_stats['total_questions']} exact matches"
-        )
-
-    except Exception as e:
-        print(f"Error during ground truth comparison: {e}")
-        summary_stats = {"error": str(e)}
-
     # 4. Prepare Submission
     submission_data = {
         "username": username.strip(),
@@ -172,19 +142,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
         "answers": answers_payload,
     }
     status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
-
-    # Add ground truth comparison to status
-    if "error" not in summary_stats:
-        status_update += f"\n\nGround Truth Comparison:\n"
-        status_update += f"Exact matches: {summary_stats['exact_matches']}/{summary_stats['total_questions']} ({summary_stats['exact_match_rate']:.1%})\n"
-        status_update += (
-            f"Average similarity: {summary_stats['average_similarity']:.3f}\n"
-        )
-        status_update += f"Contains correct answer: {summary_stats['contains_matches']}/{summary_stats['total_questions']} ({summary_stats['contains_match_rate']:.1%})\n"
-        status_update += f"Evaluations logged to Phoenix ✅"
-    else:
-        status_update += f"\n\nGround Truth Comparison Error: {summary_stats['error']}"
-
     print(status_update)
 
     # 5. Submit
comparison.py
DELETED
@@ -1,160 +0,0 @@

import json
import pandas as pd
from typing import Dict, List, Any
from difflib import SequenceMatcher
import re


class AnswerComparator:
    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        """Initialize the comparator with ground truth data."""
        self.ground_truth = self._load_ground_truth(metadata_path)
        print(f"Loaded ground truth for {len(self.ground_truth)} questions")

    def _load_ground_truth(self, metadata_path: str) -> Dict[str, str]:
        """Load ground truth answers from metadata.jsonl file."""
        ground_truth = {}
        try:
            with open(metadata_path, 'r', encoding='utf-8') as f:
                for line in f:
                    if line.strip():
                        data = json.loads(line)
                        task_id = data.get("task_id")
                        final_answer = data.get("Final answer")
                        if task_id and final_answer is not None:
                            ground_truth[task_id] = str(final_answer)
        except FileNotFoundError:
            print(f"Warning: Ground truth file {metadata_path} not found")
        except Exception as e:
            print(f"Error loading ground truth: {e}")

        return ground_truth

    def normalize_answer(self, answer: str) -> str:
        """Normalize answer for comparison."""
        if answer is None:
            return ""

        # Convert to string and strip whitespace
        answer = str(answer).strip()

        # Convert to lowercase for case-insensitive comparison
        answer = answer.lower()

        # Remove common punctuation that might not affect correctness
        answer = re.sub(r'[.,;:!?"\']', '', answer)

        # Normalize whitespace
        answer = re.sub(r'\s+', ' ', answer)

        return answer

    def exact_match(self, predicted: str, actual: str) -> bool:
        """Check if answers match exactly after normalization."""
        return self.normalize_answer(predicted) == self.normalize_answer(actual)

    def similarity_score(self, predicted: str, actual: str) -> float:
        """Calculate similarity score between predicted and actual answers."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        if not normalized_pred and not normalized_actual:
            return 1.0
        if not normalized_pred or not normalized_actual:
            return 0.0

        return SequenceMatcher(None, normalized_pred, normalized_actual).ratio()

    def contains_answer(self, predicted: str, actual: str) -> bool:
        """Check if the actual answer is contained in the predicted answer."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        return normalized_actual in normalized_pred

    def evaluate_answer(self, task_id: str, predicted_answer: str) -> Dict[str, Any]:
        """Evaluate a single answer against ground truth."""
        actual_answer = self.ground_truth.get(task_id)

        if actual_answer is None:
            return {
                "task_id": task_id,
                "predicted_answer": predicted_answer,
                "actual_answer": None,
                "exact_match": False,
                "similarity_score": 0.0,
                "contains_answer": False,
                "error": "No ground truth available"
            }

        return {
            "task_id": task_id,
            "predicted_answer": predicted_answer,
            "actual_answer": actual_answer,
            "exact_match": self.exact_match(predicted_answer, actual_answer),
            "similarity_score": self.similarity_score(predicted_answer, actual_answer),
            "contains_answer": self.contains_answer(predicted_answer, actual_answer),
            "error": None
        }

    def evaluate_batch(self, results: List[Dict[str, Any]]) -> pd.DataFrame:
        """Evaluate a batch of results."""
        evaluations = []

        for result in results:
            task_id = result.get("task_id") or result.get("Task ID")
            predicted_answer = result.get("submitted_answer") or result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)
                evaluations.append(evaluation)

        return pd.DataFrame(evaluations)

    def get_summary_stats(self, evaluations_df: pd.DataFrame) -> Dict[str, Any]:
        """Get summary statistics from evaluations."""
        if evaluations_df.empty:
            return {"error": "No evaluations available"}

        # Filter out entries without ground truth
        valid_evaluations = evaluations_df[evaluations_df['error'].isna()]

        if valid_evaluations.empty:
            return {"error": "No valid ground truth available"}

        total_questions = len(valid_evaluations)
        exact_matches = valid_evaluations['exact_match'].sum()
        avg_similarity = valid_evaluations['similarity_score'].mean()
        contains_matches = valid_evaluations['contains_answer'].sum()

        return {
            "total_questions": total_questions,
            "exact_matches": exact_matches,
            "exact_match_rate": exact_matches / total_questions,
            "average_similarity": avg_similarity,
            "contains_matches": contains_matches,
            "contains_match_rate": contains_matches / total_questions,
            "questions_with_ground_truth": total_questions
        }

    def enhance_results_log(self, results_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Add comparison columns to results log."""
        enhanced_results = []

        for result in results_log:
            task_id = result.get("Task ID")
            predicted_answer = result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)

                # Add comparison info to result
                enhanced_result = result.copy()
                enhanced_result["Ground Truth"] = evaluation["actual_answer"] or "N/A"
                enhanced_result["Exact Match"] = evaluation["exact_match"]
                enhanced_result["Similarity"] = f"{evaluation['similarity_score']:.3f}"
                enhanced_result["Contains Answer"] = evaluation["contains_answer"]

                enhanced_results.append(enhanced_result)

        return enhanced_results
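A short usage sketch of the removed AnswerComparator class as it was wired up before this commit, assuming `data/metadata.jsonl` is present; the sample task_id is taken from the examples in this repository.

```python
from comparison import AnswerComparator

# Loads ground truth from data/metadata.jsonl on construction
comparator = AnswerComparator()

# Batch evaluation over agent results keyed by task_id / submitted_answer
results = [{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"}]
evaluations_df = comparator.evaluate_batch(results)

# Aggregate metrics: exact_match_rate, average_similarity, contains_match_rate, ...
print(comparator.get_summary_stats(evaluations_df))
```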
debug_phoenix.py
DELETED
@@ -1,285 +0,0 @@

#!/usr/bin/env python3
"""
Enhanced debug script to check Phoenix status and evaluations.
"""

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import time
from datetime import datetime


def check_phoenix_connection():
    """Check if Phoenix is running and accessible."""
    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Try to get basic info
        try:
            spans_df = client.get_spans_dataframe()
            print(f"✅ Phoenix API working - can retrieve spans")
            return client
        except Exception as e:
            print(f"⚠️ Phoenix connected but API might have issues: {e}")
            return client

    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running. You should see a message like:")
        print("To view the Phoenix app in your browser, visit http://localhost:6006")
        return None


def check_spans(client):
    """Check spans in Phoenix."""
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) > 0:
            print("Recent spans:")
            for i, (_, span) in enumerate(spans_df.head(5).iterrows()):
                span_id = span.get('context.span_id', 'no-id')
                span_name = span.get('name', 'unnamed')
                start_time = span.get('start_time', 'unknown')
                print(f"  {i+1}. {span_name} ({span_id[:8]}...) - {start_time}")

            # Show input/output samples
            print("\nSpan content samples:")
            for i, (_, span) in enumerate(spans_df.head(3).iterrows()):
                input_val = str(span.get('input.value', ''))[:100]
                output_val = str(span.get('output.value', ''))[:100]
                print(f"  Span {i+1}:")
                print(f"    Input: {input_val}...")
                print(f"    Output: {output_val}...")

        else:
            print("⚠️ No spans found. Run your agent first to generate traces.")

        return spans_df

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return pd.DataFrame()


def check_evaluations(client):
    """Check evaluations in Phoenix."""
    try:
        # Try different methods to get evaluations
        print("Checking evaluations...")

        # Method 1: Direct evaluation dataframe
        try:
            evals_df = client.get_evaluations_dataframe()
            print(f"Found {len(evals_df)} evaluations in Phoenix")

            if len(evals_df) > 0:
                print("Evaluation breakdown:")
                eval_names = evals_df['name'].value_counts()
                for name, count in eval_names.items():
                    print(f"  - {name}: {count} evaluations")

                # Check for GAIA evaluations specifically
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
                if len(gaia_evals) > 0:
                    print(f"✅ Found {len(gaia_evals)} GAIA ground truth evaluations")

                    # Show sample evaluation
                    sample = gaia_evals.iloc[0]
                    print("Sample GAIA evaluation:")
                    print(f"  - Score: {sample.get('score', 'N/A')}")
                    print(f"  - Label: {sample.get('label', 'N/A')}")
                    print(f"  - Explanation: {sample.get('explanation', 'N/A')[:100]}...")

                    # Show metadata if available
                    metadata = sample.get('metadata', {})
                    if metadata:
                        print(f"  - Metadata keys: {list(metadata.keys())}")

                else:
                    print("❌ No GAIA ground truth evaluations found")
                    print("Available evaluation types:", list(eval_names.keys()))

            else:
                print("⚠️ No evaluations found in Phoenix")

            return evals_df

        except AttributeError as e:
            print(f"⚠️ get_evaluations_dataframe not available: {e}")
            print("This might be a Phoenix version issue")
            return pd.DataFrame()

    except Exception as e:
        print(f"❌ Error getting evaluations: {e}")
        return pd.DataFrame()


def test_evaluation_creation_and_logging():
    """Test creating and logging evaluations."""
    print("\nTesting evaluation creation and logging...")

    # Create sample evaluations
    sample_data = [
        {
            "task_id": "debug-test-1",
            "predicted_answer": "test answer 1",
            "actual_answer": "correct answer 1",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "debug-test-2",
            "predicted_answer": "exact match",
            "actual_answer": "exact match",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(sample_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    # Try to log to Phoenix
    try:
        print("Attempting to log evaluations to Phoenix...")
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Test evaluation logging successful")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Test evaluation logging failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Test evaluation logging error: {e}")
        import traceback
        traceback.print_exc()
        return False


def check_gaia_data():
    """Check GAIA ground truth data availability."""
    print("\nChecking GAIA ground truth data...")

    try:
        comparator = AnswerComparator()

        print(f"✅ Loaded {len(comparator.ground_truth)} GAIA ground truth answers")

        if len(comparator.ground_truth) > 0:
            # Show sample
            sample_task_id = list(comparator.ground_truth.keys())[0]
            sample_answer = comparator.ground_truth[sample_task_id]
            print(f"Sample: {sample_task_id} -> '{sample_answer}'")

            # Test evaluation
            test_eval = comparator.evaluate_answer(sample_task_id, "test answer")
            print(f"Test evaluation result: {test_eval}")

            return True
        else:
            print("❌ No GAIA ground truth data found")
            return False

    except Exception as e:
        print(f"❌ Error checking GAIA data: {e}")
        return False


def show_phoenix_ui_info():
    """Show information about Phoenix UI."""
    print("\nPhoenix UI Information:")
    print("-" * 30)
    print("Phoenix UI should be available at: http://localhost:6006")
    print("")
    print("In the Phoenix UI, look for:")
    print("  • 'Evaluations' tab or section")
    print("  • 'Evals' section")
    print("  • 'Annotations' tab")
    print("  • In 'Spans' view, look for evaluation badges on spans")
    print("")
    print("If you see evaluations, they should be named 'gaia_ground_truth'")
    print("Each evaluation should show:")
    print("  - Score (similarity score 0-1)")
    print("  - Label (correct/incorrect)")
    print("  - Explanation (predicted vs ground truth)")
    print("  - Metadata (task_id, exact_match, etc.)")


def main():
    """Main debug function."""
    print("Enhanced Phoenix Debug Script")
    print("=" * 50)

    # Check Phoenix connection
    client = check_phoenix_connection()
    if not client:
        print("\n❌ Cannot proceed without Phoenix connection")
        print("Make sure your agent app is running (it starts Phoenix)")
        return

    print("\nChecking Phoenix Data:")
    print("-" * 30)

    # Check spans
    spans_df = check_spans(client)

    # Check evaluations
    evals_df = check_evaluations(client)

    # Test evaluation creation
    test_success = test_evaluation_creation_and_logging()

    # Wait a moment and recheck evaluations
    if test_success:
        print("\nWaiting for evaluations to be processed...")
        time.sleep(3)

        print("Rechecking evaluations after test logging...")
        evals_df_after = check_evaluations(client)

        if len(evals_df_after) > len(evals_df):
            print("✅ New evaluations detected after test!")
        else:
            print("⚠️ No new evaluations detected")

    # Check GAIA data
    gaia_available = check_gaia_data()

    # Show Phoenix UI info
    show_phoenix_ui_info()

    # Final summary
    print("\n" + "=" * 50)
    print("Summary:")
    print(f"  • Phoenix connected: {'✅' if client else '❌'}")
    print(f"  • Spans available: {len(spans_df)} spans")
    print(f"  • Evaluations found: {len(evals_df)} evaluations")
    print(f"  • GAIA data available: {'✅' if gaia_available else '❌'}")
    print(f"  • Test logging worked: {'✅' if test_success else '❌'}")

    print("\nNext Steps:")
    if len(spans_df) == 0:
        print("  • Run your agent to generate traces first")
    if len(evals_df) == 0:
        print("  • Check if evaluations are being logged correctly")
        print("  • Verify Phoenix version compatibility")
    if not gaia_available:
        print("  • Check that data/metadata.jsonl exists and is readable")

    print(f"\nPhoenix UI: http://localhost:6006")


if __name__ == "__main__":
    main()
debug_spans.py
DELETED
@@ -1,77 +0,0 @@

#!/usr/bin/env python3
"""
Debug script to see Phoenix spans column structure.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd


def debug_spans_structure():
    """Debug the structure of Phoenix spans."""
    print("Debugging Phoenix Spans Structure")
    print("=" * 50)

    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return

    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) == 0:
            print("⚠️ No spans found. Run your agent first to create spans.")
            return

        print(f"\nAvailable Columns ({len(spans_df.columns)} total):")
        for i, col in enumerate(spans_df.columns):
            print(f"  {i+1:2d}. {col}")

        print(f"\nSample Data (first span):")
        sample_span = spans_df.iloc[0]
        for col in spans_df.columns:
            value = sample_span.get(col)
            if value is not None:
                value_str = str(value)[:100] + "..." if len(str(value)) > 100 else str(value)
                print(f"  {col}: {value_str}")

        # Look for input/output related columns
        input_cols = [col for col in spans_df.columns if 'input' in col.lower()]
        output_cols = [col for col in spans_df.columns if 'output' in col.lower()]

        print(f"\nInput-related columns: {input_cols}")
        print(f"Output-related columns: {output_cols}")

        # Look for span ID columns
        id_cols = [col for col in spans_df.columns if 'id' in col.lower()]
        print(f"ID-related columns: {id_cols}")

        # Look for columns that might contain task IDs
        print(f"\nSearching for task IDs in spans...")
        task_id_sample = "8e867cd7-cff9-4e6c-867a-ff5ddc2550be"

        for col in spans_df.columns:
            if spans_df[col].dtype == 'object':  # String-like columns
                try:
                    matches = spans_df[spans_df[col].astype(str).str.contains(task_id_sample, na=False, case=False)]
                    if len(matches) > 0:
                        print(f"  ✅ Found task ID in column '{col}': {len(matches)} matches")
                except:
                    pass

    except Exception as e:
        print(f"❌ Error debugging spans: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    debug_spans_structure()
phoenix_evaluator.py
DELETED
@@ -1,273 +0,0 @@

import pandas as pd
from typing import Dict, Any, List, Optional
from comparison import AnswerComparator
import phoenix as px
from phoenix.trace import SpanEvaluations


class GAIAPhoenixEvaluator:
    """Phoenix evaluator for GAIA dataset ground truth comparison."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        self.comparator = AnswerComparator(metadata_path)
        self.eval_name = "gaia_ground_truth"

    def evaluate_spans(self, spans_df: pd.DataFrame) -> List[SpanEvaluations]:
        """Evaluate spans and return Phoenix SpanEvaluations."""
        evaluations = []

        for _, span in spans_df.iterrows():
            # Extract task_id and answer from span
            task_id = self._extract_task_id(span)
            predicted_answer = self._extract_predicted_answer(span)
            span_id = span.get("context.span_id")

            if task_id and predicted_answer is not None and span_id:
                evaluation = self.comparator.evaluate_answer(task_id, predicted_answer)

                # Create evaluation record for Phoenix
                eval_record = {
                    "span_id": span_id,
                    "score": 1.0 if evaluation["exact_match"] else evaluation["similarity_score"],
                    "label": "correct" if evaluation["exact_match"] else "incorrect",
                    "explanation": self._create_explanation(evaluation),
                    "task_id": task_id,
                    "predicted_answer": evaluation["predicted_answer"],
                    "ground_truth": evaluation["actual_answer"],
                    "exact_match": evaluation["exact_match"],
                    "similarity_score": evaluation["similarity_score"],
                    "contains_answer": evaluation["contains_answer"]
                }

                evaluations.append(eval_record)

        if evaluations:
            # Create SpanEvaluations object
            eval_df = pd.DataFrame(evaluations)
            return [SpanEvaluations(eval_name=self.eval_name, dataframe=eval_df)]

        return []

    def _extract_task_id(self, span) -> Optional[str]:
        """Extract task_id from span data."""
        # Try span attributes first
        attributes = span.get("attributes", {})
        if isinstance(attributes, dict):
            if "task_id" in attributes:
                return attributes["task_id"]

        # Try input data
        input_data = span.get("input", {})
        if isinstance(input_data, dict):
            if "task_id" in input_data:
                return input_data["task_id"]

        # Try to extract from input value if it's a string
        input_value = span.get("input.value", "")
        if isinstance(input_value, str):
            # Look for UUID pattern in input
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, input_value)
            if match:
                return match.group(0)

        # Try span name
        span_name = span.get("name", "")
        if isinstance(span_name, str):
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, span_name)
            if match:
                return match.group(0)

        return None

    def _extract_predicted_answer(self, span) -> Optional[str]:
        """Extract predicted answer from span output."""
        # Try different output fields
        output_fields = ["output.value", "output", "response", "result"]

        for field in output_fields:
            value = span.get(field)
            if value is not None:
                return str(value)

        return None

    def _create_explanation(self, evaluation: Dict[str, Any]) -> str:
        """Create human-readable explanation of the evaluation."""
        predicted = evaluation["predicted_answer"]
        actual = evaluation["actual_answer"]
        exact_match = evaluation["exact_match"]
        similarity = evaluation["similarity_score"]
        contains = evaluation["contains_answer"]

        if actual is None:
            return "❌ No ground truth available for comparison"

        explanation = f"Predicted: '{predicted}' | Ground Truth: '{actual}' | "

        if exact_match:
            explanation += "✅ Exact match"
        elif contains:
            explanation += f"⚠️ Contains correct answer (similarity: {similarity:.3f})"
        else:
            explanation += f"❌ Incorrect (similarity: {similarity:.3f})"

        return explanation


def add_gaia_evaluations_to_phoenix(spans_df: pd.DataFrame, metadata_path: str = "data/metadata.jsonl") -> List[SpanEvaluations]:
    """Add GAIA evaluation results to Phoenix spans."""
    evaluator = GAIAPhoenixEvaluator(metadata_path)
    return evaluator.evaluate_spans(spans_df)


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame, session_id: Optional[str] = None) -> Optional[pd.DataFrame]:
    """Log evaluation results directly to Phoenix."""
    try:
        client = px.Client()

        # Get current spans to match evaluations to span_ids
        spans_df = client.get_spans_dataframe()

        if spans_df is None or spans_df.empty:
            print("No spans found to attach evaluations to")
            return None

        # Debug: Show available columns
        print(f"Available span columns: {list(spans_df.columns)}")

        # Get possible input/output column names
        input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
        output_columns = [col for col in spans_df.columns if 'output' in col.lower()]
        name_columns = [col for col in spans_df.columns if 'name' in col.lower()]

        print(f"Input columns found: {input_columns}")
        print(f"Output columns found: {output_columns}")
        print(f"Name columns found: {name_columns}")

        # Create evaluation records for Phoenix
        evaluation_records = []
        spans_with_evals = []

        for _, eval_row in evaluations_df.iterrows():
            task_id = eval_row["task_id"]
            matching_spans = pd.DataFrame()

            # Try different strategies to find matching spans

            # Strategy 1: Search in all string columns for task_id
            for col in spans_df.columns:
                if spans_df[col].dtype == 'object':  # String-like columns
                    try:
                        matches = spans_df[
                            spans_df[col].astype(str).str.contains(task_id, na=False, case=False)
                        ]
                        if len(matches) > 0:
                            matching_spans = matches
                            print(f"✅ Found match for {task_id} in column '{col}'")
                            break
                    except Exception as e:
                        continue

            # Strategy 2: If no matches found, try searching in input columns specifically
            if len(matching_spans) == 0 and input_columns:
                for input_col in input_columns:
                    try:
                        matches = spans_df[
                            spans_df[input_col].astype(str).str.contains(task_id, na=False, case=False)
                        ]
                        if len(matches) > 0:
                            matching_spans = matches
                            print(f"✅ Found match for {task_id} in input column '{input_col}'")
                            break
                    except Exception as e:
                        continue

            # Strategy 3: If still no matches, try with partial task_id (last 8 characters)
            if len(matching_spans) == 0:
                short_task_id = task_id[-8:] if len(task_id) > 8 else task_id
                for col in spans_df.columns:
                    if spans_df[col].dtype == 'object':
                        try:
                            matches = spans_df[
                                spans_df[col].astype(str).str.contains(short_task_id, na=False, case=False)
                            ]
                            if len(matches) > 0:
                                matching_spans = matches
                                print(f"✅ Found match for {task_id} using short ID in column '{col}'")
                                break
                        except Exception as e:
                            continue

            if len(matching_spans) > 0:
                span_id = matching_spans.iloc[0].get('context.span_id') or matching_spans.iloc[0].get('span_id')

                if span_id:
                    # Create evaluation record in Phoenix format
                    evaluation_record = {
                        "span_id": span_id,
                        "name": "gaia_ground_truth",
                        "score": eval_row["similarity_score"],
                        "label": "correct" if bool(eval_row["exact_match"]) else "incorrect",
                        "explanation": f"Predicted: '{eval_row['predicted_answer']}' | Ground Truth: '{eval_row['actual_answer']}' | Similarity: {eval_row['similarity_score']:.3f} | Exact Match: {eval_row['exact_match']}",
                        "annotator_kind": "HUMAN",
                        "metadata": {
                            "task_id": task_id,
                            "exact_match": bool(eval_row["exact_match"]),
                            "similarity_score": float(eval_row["similarity_score"]),
                            "contains_answer": bool(eval_row["contains_answer"]),
                            "predicted_answer": str(eval_row["predicted_answer"]),
                            "ground_truth": str(eval_row["actual_answer"])
                        }
                    }

                    evaluation_records.append(evaluation_record)
                    spans_with_evals.append(span_id)
                else:
                    print(f"⚠️ No span_id found for matching span with task {task_id}")
            else:
                print(f"⚠️ No matching span found for task {task_id}")

        if evaluation_records:
            # Convert to DataFrame for Phoenix
            eval_df = pd.DataFrame(evaluation_records)

            # Create SpanEvaluations object
            span_evaluations = SpanEvaluations(
                eval_name="gaia_ground_truth",
                dataframe=eval_df
            )

            # Log evaluations to Phoenix
            try:
                # Try the newer Phoenix API
                px.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using px.log_evaluations")
            except AttributeError:
                try:
                    # Fallback for older Phoenix versions
                    client.log_evaluations(span_evaluations)
                    print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using client.log_evaluations")
                except Exception as e:
                    print(f"⚠️ Could not log evaluations using either method: {e}")
                    # Still return the DataFrame so we know what would have been logged
                    print("Evaluation records created but not logged to Phoenix")

            return eval_df
        else:
            print("⚠️ No matching spans found for any evaluations")
            if spans_df is not None:
                print(f"Available spans: {len(spans_df)}")
                if len(spans_df) > 0:
                    available_cols = [col for col in spans_df.columns if spans_df[col].dtype == 'object'][:5]
                    print(f"Sample searchable columns: {available_cols}")
            return None

    except Exception as e:
        print(f"❌ Could not log evaluations to Phoenix: {e}")
        import traceback
        traceback.print_exc()
        return None
test_comparison.py
DELETED
@@ -1,144 +0,0 @@

#!/usr/bin/env python3
"""
Test script for GAIA comparison functionality.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import pandas as pd


def test_basic_comparison():
    """Test basic comparison functionality."""
    print("Testing basic comparison...")

    # Initialize comparator
    comparator = AnswerComparator()

    # Test with some sample data
    sample_results = [
        {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
        {"task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "submitted_answer": "3"},
        {"task_id": "nonexistent-task", "submitted_answer": "test"}
    ]

    # Evaluate batch
    evaluations_df = comparator.evaluate_batch(sample_results)
    print(f"Evaluated {len(evaluations_df)} answers")

    # Get summary stats
    summary_stats = comparator.get_summary_stats(evaluations_df)
    print("Summary statistics:")
    for key, value in summary_stats.items():
        print(f"  {key}: {value}")

    # Test single evaluation
    print("\nTesting single evaluation...")
    single_eval = comparator.evaluate_answer("8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "3")
    print(f"Single evaluation result: {single_eval}")

    return evaluations_df


def test_results_enhancement():
    """Test results log enhancement."""
    print("\nTesting results log enhancement...")

    comparator = AnswerComparator()

    # Sample results log (like what comes from your agent)
    sample_results_log = [
        {
            "Task ID": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009?",
            "Submitted Answer": "3"
        },
        {
            "Task ID": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "Question": "Test question",
            "Submitted Answer": "wrong answer"
        }
    ]

    # Enhance results
    enhanced_results = comparator.enhance_results_log(sample_results_log)

    print("Enhanced results:")
    for result in enhanced_results:
        print(f"  Task: {result['Task ID']}")
        print(f"  Answer: {result['Submitted Answer']}")
        print(f"  Ground Truth: {result['Ground Truth']}")
        print(f"  Exact Match: {result['Exact Match']}")
        print(f"  Similarity: {result['Similarity']}")
        print()


def test_phoenix_integration():
    """Test Phoenix integration (basic)."""
    print("\nTesting Phoenix integration...")

    # Create sample evaluations
    sample_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "wrong",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(sample_evaluations)
        if result is not None:
            print("✅ Phoenix integration successful")
        else:
            print("⚠️ Phoenix integration failed (likely Phoenix not running)")
    except Exception as e:
        print(f"⚠️ Phoenix integration error: {e}")


def main():
    """Run all tests."""
    print("="*50)
    print("GAIA Comparison Test Suite")
    print("="*50)

    try:
        # Test basic comparison
        evaluations_df = test_basic_comparison()

        # Test results enhancement
        test_results_enhancement()

        # Test Phoenix integration
        test_phoenix_integration()

        print("\n" + "="*50)
        print("All tests completed!")
        print("="*50)

    except Exception as e:
        print(f"❌ Test failed with error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
test_phoenix_logging.py
DELETED
@@ -1,261 +0,0 @@

#!/usr/bin/env python3
"""
Test script to verify Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
from datetime import datetime
import time


def test_phoenix_connection():
    """Test Phoenix connection and basic functionality."""
    print("Testing Phoenix Connection...")

    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Check if Phoenix is actually running
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans in Phoenix")

        return client, spans_df
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running and accessible at http://localhost:6006")
        return None, None


def create_test_evaluations():
    """Create test evaluations for logging."""
    print("\n🧪 Creating test evaluations...")

    test_data = [
        {
            "task_id": "test-exact-match",
            "predicted_answer": "Paris",
            "actual_answer": "Paris",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-partial-match",
            "predicted_answer": "The capital of France is Paris",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-no-match",
            "predicted_answer": "London",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(test_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    return evaluations_df


def create_mock_spans(client):
    """Create mock spans for testing (if no real spans exist)."""
    print("\nCreating mock spans for testing...")

    # Note: This is a simplified mock - in real usage, spans are created by agent runs
    mock_spans = [
        {
            "context.span_id": "mock-span-1",
            "name": "test_agent_run",
            "input.value": "Question about test-exact-match",
            "output.value": "Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-2",
            "name": "test_agent_run",
            "input.value": "Question about test-partial-match",
            "output.value": "The capital of France is Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-3",
            "name": "test_agent_run",
            "input.value": "Question about test-no-match",
            "output.value": "London",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        }
    ]

    print(f"Created {len(mock_spans)} mock spans")
    return pd.DataFrame(mock_spans)


def test_evaluation_logging():
    """Test the actual evaluation logging to Phoenix."""
    print("\nTesting evaluation logging...")

    # Create test evaluations
    evaluations_df = create_test_evaluations()

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Evaluation logging test successful!")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Evaluation logging test failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Evaluation logging test failed with error: {e}")
        import traceback
        traceback.print_exc()
        return False


def verify_logged_evaluations(client):
    """Verify that evaluations were actually logged to Phoenix."""
    print("\nVerifying logged evaluations...")

    try:
        # Give Phoenix a moment to process
        time.sleep(2)

        # Try to retrieve evaluations
        evals_df = client.get_evaluations_dataframe()
        print(f"Found {len(evals_df)} total evaluations in Phoenix")

        # Look for our specific evaluations
        if len(evals_df) > 0:
            gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
            print(f"🎯 Found {len(gaia_evals)} GAIA ground truth evaluations")

            if len(gaia_evals) > 0:
                print("✅ Successfully verified evaluations in Phoenix!")

                # Show sample evaluation
                sample_eval = gaia_evals.iloc[0]
                print(f"Sample evaluation:")
                print(f" - Score: {sample_eval.get('score', 'N/A')}")
                print(f" - Label: {sample_eval.get('label', 'N/A')}")
                print(f" - Explanation: {sample_eval.get('explanation', 'N/A')}")

                return True
            else:
                print("❌ No GAIA evaluations found after logging")
                return False
        else:
            print("❌ No evaluations found in Phoenix")
            return False

    except Exception as e:
        print(f"❌ Error verifying evaluations: {e}")
        return False


def test_with_real_gaia_data():
    """Test with actual GAIA data if available."""
    print("\nTesting with real GAIA data...")

    try:
        # Initialize comparator
        comparator = AnswerComparator()

        if len(comparator.ground_truth) == 0:
            print("⚠️ No GAIA ground truth data available")
            return False

        # Create a real evaluation with GAIA data
        real_task_id = list(comparator.ground_truth.keys())[0]
        real_ground_truth = comparator.ground_truth[real_task_id]

        real_evaluation = comparator.evaluate_answer(real_task_id, "test answer")

        real_eval_df = pd.DataFrame([real_evaluation])

        # Log to Phoenix
        result = log_evaluations_to_phoenix(real_eval_df)

        if result is not None:
            print("✅ Real GAIA data logging successful!")
            print(f"Task ID: {real_task_id}")
            print(f"Ground Truth: {real_ground_truth}")
            print(f"Similarity Score: {real_evaluation['similarity_score']:.3f}")
            return True
        else:
            print("❌ Real GAIA data logging failed")
            return False

    except Exception as e:
        print(f"❌ Error testing with real GAIA data: {e}")
        return False


def main():
    """Main test function."""
    print("Phoenix Evaluations Logging Test")
    print("=" * 50)

    # Test Phoenix connection
    client, spans_df = test_phoenix_connection()
    if not client:
        print("❌ Cannot proceed without Phoenix connection")
        return

    # Run tests
    tests_passed = 0
    total_tests = 3

    print(f"\n🧪 Running {total_tests} tests...")

    # Test 1: Basic evaluation logging
    if test_evaluation_logging():
        tests_passed += 1

    # Test 2: Verify evaluations were logged
    if verify_logged_evaluations(client):
        tests_passed += 1

    # Test 3: Test with real GAIA data
    if test_with_real_gaia_data():
        tests_passed += 1

    # Summary
    print("\n" + "=" * 50)
    print(f"🎯 Test Results: {tests_passed}/{total_tests} tests passed")

    if tests_passed == total_tests:
        print("All tests passed! Phoenix evaluations logging is working correctly.")
        print("You should now see 'gaia_ground_truth' evaluations in the Phoenix UI.")
    else:
        print("⚠️ Some tests failed. Check the output above for details.")

    print(f"\nPhoenix UI: http://localhost:6006")
    print("Look for 'Evaluations' or 'Evals' tab to see the logged evaluations.")


if __name__ == "__main__":
    main()
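The comparator exercised by `test_with_real_gaia_data` above lives in `comparison.py`, which is also deleted in this commit but does not appear in this hunk. For orientation only, a minimal sketch of the interface the tests rely on (a `ground_truth` map keyed by `task_id` and an `evaluate_answer` method returning `exact_match`, `similarity_score`, and `contains_answer`) could look like the code below; the metadata path, the `"Final answer"` field name, and the normalization rules are assumptions, not the deleted file's actual code.

```python
import json
import string
from difflib import SequenceMatcher


class AnswerComparator:
    """Hypothetical stand-in for the deleted comparison.py, shaped to satisfy the tests."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):  # path assumed
        self.ground_truth: dict[str, str] = {}
        try:
            with open(metadata_path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    # "Final answer" is the GAIA metadata field name (assumed here).
                    self.ground_truth[record["task_id"]] = str(record["Final answer"])
        except FileNotFoundError:
            pass  # the tests tolerate an empty ground-truth map

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase, drop punctuation, collapse whitespace before comparing.
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())

    def evaluate_answer(self, task_id: str, predicted: str) -> dict:
        actual = self.ground_truth.get(task_id)
        if actual is None:
            return {"task_id": task_id, "predicted_answer": predicted,
                    "actual_answer": None, "exact_match": False,
                    "similarity_score": 0.0, "contains_answer": False,
                    "error": "no ground truth for task_id"}
        p, a = self._normalize(predicted), self._normalize(actual)
        return {
            "task_id": task_id,
            "predicted_answer": predicted,
            "actual_answer": actual,
            "exact_match": p == a,
            "similarity_score": SequenceMatcher(None, p, a).ratio(),
            "contains_answer": a in p,
            "error": None,
        }
```

Any implementation with this shape would satisfy the checks the deleted tests perform on `ground_truth`, `evaluate_answer`, and the returned fields.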
test_phoenix_simple.py
DELETED
@@ -1,139 +0,0 @@

#!/usr/bin/env python3
"""
Simple test for Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix


def test_phoenix_logging():
    """Test Phoenix evaluations logging with simple data."""
    print("🧪 Testing Phoenix Evaluations Logging")
    print("=" * 50)

    # Step 1: Check Phoenix connection
    print("1. Checking Phoenix connection...")
    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return False

    # Step 2: Create test evaluations
    print("\n2. Creating test evaluations...")
    test_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "5",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])
    print(f"✅ Created {len(test_evaluations)} test evaluations")

    # Step 3: Check existing spans
    print("\n3. Checking existing spans...")
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans")

        if len(spans_df) == 0:
            print("⚠️ No spans found - you need to run your agent first to create spans")
            return False

        # Debug: Show available columns
        print(f"Available span columns: {list(spans_df.columns)}")
        input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
        print(f"Input columns found: {input_columns}")

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return False

    # Step 4: Test logging
    print("\n4. Testing evaluation logging...")
    try:
        result = log_evaluations_to_phoenix(test_evaluations)

        if result is not None:
            print(f"✅ Successfully logged {len(result)} evaluations to Phoenix")
            if len(result) > 0:
                print("Sample evaluation:")
                print(f" - Score: {result.iloc[0]['score']}")
                print(f" - Label: {result.iloc[0]['label']}")
                print(f" - Explanation: {result.iloc[0]['explanation'][:100]}...")

            # Step 5: Verify evaluations were logged
            print("\n5. Verifying evaluations in Phoenix...")
            try:
                import time
                time.sleep(2)  # Give Phoenix time to process

                evals_df = client.get_evaluations_dataframe()
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']

                print(f"Found {len(gaia_evals)} GAIA evaluations in Phoenix")

                if len(gaia_evals) > 0:
                    print("✅ Evaluations successfully verified in Phoenix!")
                    return True
                else:
                    print("⚠️ No GAIA evaluations found in Phoenix")
                    return False

            except Exception as e:
                print(f"⚠️ Could not verify evaluations: {e}")
                print("✅ Logging appeared successful though")
                return True

        else:
            print("❌ Evaluation logging failed")
            return False

    except Exception as e:
        print(f"❌ Error during logging: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    """Main test function."""
    success = test_phoenix_logging()

    print("\n" + "=" * 50)
    if success:
        print("Phoenix evaluations logging test PASSED!")
        print("You should now see 'gaia_ground_truth' evaluations in Phoenix UI")
        print("Visit: http://localhost:6006")
    else:
        print("❌ Phoenix evaluations logging test FAILED!")
        print("Make sure:")
        print(" 1. Your agent app is running (it starts Phoenix)")
        print(" 2. You've run your agent at least once to create spans")
        print(" 3. Phoenix is accessible at http://localhost:6006")
        print(" 4. Run 'python debug_spans.py' to see span column structure")


if __name__ == "__main__":
    main()
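Both deleted test files call `log_evaluations_to_phoenix` from `phoenix_evaluator.py`, which is removed in this commit but does not appear in this hunk. As a rough sketch of how such a helper can attach these results to existing spans via Phoenix's `SpanEvaluations` API, it might look like the code below; only the `gaia_ground_truth` eval name and the `score`/`label`/`explanation` columns come from the tests above, while the span-matching heuristic and column lookup are assumptions rather than the deleted file's actual implementation.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame):
    """Attach GAIA ground-truth results to existing agent spans (assumed approach)."""
    client = px.Client()
    spans_df = client.get_spans_dataframe()
    if spans_df is None or len(spans_df) == 0:
        return None  # no agent runs recorded yet

    # Assumption: the agent span's input contains the GAIA task_id, so match on it.
    input_cols = [c for c in spans_df.columns if "input.value" in c]
    if not input_cols:
        return None
    input_col = input_cols[0]

    records = []
    for _, row in evaluations_df.iterrows():
        hits = spans_df[spans_df[input_col].astype(str).str.contains(
            str(row["task_id"]), na=False, regex=False)]
        if hits.empty:
            continue
        records.append({
            "span_id": hits.index[0],  # get_spans_dataframe() is indexed by span id
            "score": float(row["similarity_score"]),
            "label": "correct" if row["exact_match"] else "incorrect",
            "explanation": f"ground truth: {row['actual_answer']}",
        })

    if not records:
        return None

    evals = pd.DataFrame(records).set_index("span_id")
    client.log_evaluations(SpanEvaluations(eval_name="gaia_ground_truth", dataframe=evals))
    return evals.reset_index()
```

With spans recorded by an agent run, the returned frame matches what `test_phoenix_simple.py` inspects (`score`, `label`, `explanation`), and the logged eval shows up under the `gaia_ground_truth` name that both tests look for in the Phoenix UI.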