Romain Fayoux committed on
Commit
b284752
·
1 Parent(s): 3a7aaed

Cleaned custom evaluations from project (claude slop)

GAIA_COMPARISON.md DELETED
@@ -1,142 +0,0 @@
1
- # GAIA Ground Truth Comparison
2
-
3
- This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.
4
-
5
- ## Features
6
-
7
- - **Ground Truth Comparison**: Automatically compares agent answers to correct answers from `data/metadata.jsonl`
8
- - **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
9
- - **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
10
- - **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface
11
-
12
- ## How It Works
13
-
14
- ### 1. Ground Truth Loading
15
- - Loads correct answers from `data/metadata.jsonl`
16
- - Maps task IDs to ground truth answers
17
- - Currently supports 165 questions from the GAIA dataset
18
-
19
- ### 2. Answer Comparison
20
- For each agent answer, the system calculates:
21
- - **Exact Match**: Boolean indicating if answers match exactly (after normalization)
22
- - **Similarity Score**: 0-1 score using difflib.SequenceMatcher
23
- - **Contains Answer**: Boolean indicating if the correct answer is contained in the agent's response
24
-
25
- ### 3. Answer Normalization
26
- Before comparison, answers are normalized by:
27
- - Converting to lowercase
28
- - Removing punctuation (.,;:!?"')
29
- - Normalizing whitespace
30
- - Trimming leading/trailing spaces
31
-
32
- ### 4. Phoenix Integration
33
- - Evaluations are automatically logged to Phoenix
34
- - Each evaluation includes score, label, explanation, and detailed metrics
35
- - Viewable in Phoenix UI for historical tracking and analysis
36
-
37
- ## Usage
38
-
39
- ### In Your Agent App
40
- The comparison happens automatically when you run your agent:
41
-
42
- 1. **Run your agent** - Process questions as usual
43
- 2. **Automatic comparison** - System compares answers to ground truth
44
- 3. **Enhanced results** - Results table includes comparison columns
45
- 4. **Phoenix logging** - Evaluations are logged for persistent tracking
46
-
47
- ### Results Display
48
- Your results table now includes these additional columns:
49
- - **Ground Truth**: The correct answer from GAIA dataset
50
- - **Exact Match**: True/False for exact matches
51
- - **Similarity**: Similarity score (0-1)
52
- - **Contains Answer**: True/False if correct answer is contained
53
-
54
- ### Status Message
55
- The status message now includes:
56
- ```
57
- Ground Truth Comparison:
58
- Exact matches: 15/50 (30.0%)
59
- Average similarity: 0.654
60
- Contains correct answer: 22/50 (44.0%)
61
- Evaluations logged to Phoenix ✅
62
- ```
63
-
64
- ## Testing
65
-
66
- Run the test suite to verify functionality:
67
-
68
- ```bash
69
- python test_comparison.py
70
- ```
71
-
72
- This will test:
73
- - Basic comparison functionality
74
- - Results enhancement
75
- - Phoenix integration
76
- - Ground truth loading
77
-
78
- ## Files Added
79
-
80
- - `comparison.py`: Main comparison logic and AnswerComparator class
81
- - `phoenix_evaluator.py`: Phoenix integration for logging evaluations
82
- - `test_comparison.py`: Test suite for verification
83
- - `GAIA_COMPARISON.md`: This documentation
84
-
85
- ## Dependencies Added
86
-
87
- - `arize-phoenix`: For observability and evaluation logging
88
- - `pandas`: For data manipulation (if not already present)
89
-
90
- ## Example Evaluation Result
91
-
92
- ```python
93
- {
94
- "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
95
- "predicted_answer": "3",
96
- "actual_answer": "3",
97
- "exact_match": True,
98
- "similarity_score": 1.0,
99
- "contains_answer": True,
100
- "error": None
101
- }
102
- ```
103
-
104
- ## Phoenix UI
105
-
106
- In the Phoenix interface, you can:
107
- - View evaluation results alongside agent traces
108
- - Track accuracy over time
109
- - Filter by correct/incorrect answers
110
- - Analyze which question types your agent struggles with
111
- - Export evaluation data for further analysis
112
-
113
- ## Troubleshooting
114
-
115
- ### No Ground Truth Available
116
- If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.
117
-
118
- ### Phoenix Connection Issues
119
- If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.
120
-
121
- ### Low Similarity Scores
122
- Low similarity scores might indicate:
123
- - Agent is providing verbose answers when short ones are expected
124
- - Answer format doesn't match expected format
125
- - Agent is partially correct but not exact
126
-
127
- ## Customization
128
-
129
- You can adjust the comparison logic in `comparison.py`:
130
- - Modify `normalize_answer()` for different normalization rules
131
- - Adjust similarity thresholds
132
- - Add custom evaluation metrics
133
- - Modify Phoenix logging format
134
-
135
- ## Performance
136
-
137
- The comparison adds minimal overhead:
138
- - Ground truth loading: ~1-2 seconds (one-time)
139
- - Per-answer comparison: ~1-10ms
140
- - Phoenix logging: ~10-50ms per evaluation
141
-
142
- Total additional time: Usually < 5 seconds for 50 questions.
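
For reference, the metrics the removed document describes reduce to a few lines of standard-library Python. A minimal standalone sketch (function names here are illustrative, not part of the project), following the normalization rules and `difflib.SequenceMatcher` scoring listed above:

```python
import re
from difflib import SequenceMatcher


def normalize(answer) -> str:
    # Lowercase, strip the punctuation listed above, and collapse whitespace.
    text = "" if answer is None else str(answer)
    text = re.sub(r'[.,;:!?"\']', "", text.strip().lower())
    return re.sub(r"\s+", " ", text).strip()


def compare(predicted, actual) -> dict:
    p, a = normalize(predicted), normalize(actual)
    return {
        "exact_match": p == a,
        # difflib ratio in [0, 1]; two empty strings count as a perfect match.
        "similarity_score": SequenceMatcher(None, p, a).ratio() if (p or a) else 1.0,
        "contains_answer": bool(a) and a in p,
    }


print(compare("The answer is 3.", "3"))
# exact_match: False, contains_answer: True, similarity_score well below 1.0
```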
 
app.py CHANGED
@@ -7,10 +7,6 @@ from phoenix.otel import register
7
  from openinference.instrumentation.smolagents import SmolagentsInstrumentor
8
  from llm_only_agent import LLMOnlyAgent
9
  from multi_agent import MultiAgent
10
- from comparison import AnswerComparator
11
- from phoenix_evaluator import log_evaluations_to_phoenix
12
- import phoenix as px
13
-
14
 
15
  # (Keep Constants as is)
16
  # --- Constants ---
@@ -139,32 +135,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
139
  print("Agent did not produce any answers to submit.")
140
  return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
141
 
142
- # 3.5. Compare with Ground Truth and Log to Phoenix
143
- print("Comparing answers with ground truth...")
144
- try:
145
- # Initialize comparator
146
- comparator = AnswerComparator()
147
-
148
- # Evaluate answers
149
- evaluations_df = comparator.evaluate_batch(answers_payload)
150
-
151
- # Get summary statistics
152
- summary_stats = comparator.get_summary_stats(evaluations_df)
153
-
154
- # Enhance results log with comparison data
155
- results_log = comparator.enhance_results_log(results_log)
156
-
157
- # Log evaluations to Phoenix
158
- log_evaluations_to_phoenix(evaluations_df)
159
-
160
- print(
161
- f"Ground truth comparison completed: {summary_stats['exact_matches']}/{summary_stats['total_questions']} exact matches"
162
- )
163
-
164
- except Exception as e:
165
- print(f"Error during ground truth comparison: {e}")
166
- summary_stats = {"error": str(e)}
167
-
168
  # 4. Prepare Submission
169
  submission_data = {
170
  "username": username.strip(),
@@ -172,19 +142,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
172
  "answers": answers_payload,
173
  }
174
  status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
175
-
176
- # Add ground truth comparison to status
177
- if "error" not in summary_stats:
178
- status_update += f"\n\nGround Truth Comparison:\n"
179
- status_update += f"Exact matches: {summary_stats['exact_matches']}/{summary_stats['total_questions']} ({summary_stats['exact_match_rate']:.1%})\n"
180
- status_update += (
181
- f"Average similarity: {summary_stats['average_similarity']:.3f}\n"
182
- )
183
- status_update += f"Contains correct answer: {summary_stats['contains_matches']}/{summary_stats['total_questions']} ({summary_stats['contains_match_rate']:.1%})\n"
184
- status_update += f"Evaluations logged to Phoenix ✅"
185
- else:
186
- status_update += f"\n\nGround Truth Comparison Error: {summary_stats['error']}"
187
-
188
  print(status_update)
189
 
190
  # 5. Submit
 
comparison.py DELETED
@@ -1,160 +0,0 @@
1
- import json
2
- import pandas as pd
3
- from typing import Dict, List, Any
4
- from difflib import SequenceMatcher
5
- import re
6
-
7
-
8
- class AnswerComparator:
9
- def __init__(self, metadata_path: str = "data/metadata.jsonl"):
10
- """Initialize the comparator with ground truth data."""
11
- self.ground_truth = self._load_ground_truth(metadata_path)
12
- print(f"Loaded ground truth for {len(self.ground_truth)} questions")
13
-
14
- def _load_ground_truth(self, metadata_path: str) -> Dict[str, str]:
15
- """Load ground truth answers from metadata.jsonl file."""
16
- ground_truth = {}
17
- try:
18
- with open(metadata_path, 'r', encoding='utf-8') as f:
19
- for line in f:
20
- if line.strip():
21
- data = json.loads(line)
22
- task_id = data.get("task_id")
23
- final_answer = data.get("Final answer")
24
- if task_id and final_answer is not None:
25
- ground_truth[task_id] = str(final_answer)
26
- except FileNotFoundError:
27
- print(f"Warning: Ground truth file {metadata_path} not found")
28
- except Exception as e:
29
- print(f"Error loading ground truth: {e}")
30
-
31
- return ground_truth
32
-
33
- def normalize_answer(self, answer: str) -> str:
34
- """Normalize answer for comparison."""
35
- if answer is None:
36
- return ""
37
-
38
- # Convert to string and strip whitespace
39
- answer = str(answer).strip()
40
-
41
- # Convert to lowercase for case-insensitive comparison
42
- answer = answer.lower()
43
-
44
- # Remove common punctuation that might not affect correctness
45
- answer = re.sub(r'[.,;:!?"\']', '', answer)
46
-
47
- # Normalize whitespace
48
- answer = re.sub(r'\s+', ' ', answer)
49
-
50
- return answer
51
-
52
- def exact_match(self, predicted: str, actual: str) -> bool:
53
- """Check if answers match exactly after normalization."""
54
- return self.normalize_answer(predicted) == self.normalize_answer(actual)
55
-
56
- def similarity_score(self, predicted: str, actual: str) -> float:
57
- """Calculate similarity score between predicted and actual answers."""
58
- normalized_pred = self.normalize_answer(predicted)
59
- normalized_actual = self.normalize_answer(actual)
60
-
61
- if not normalized_pred and not normalized_actual:
62
- return 1.0
63
- if not normalized_pred or not normalized_actual:
64
- return 0.0
65
-
66
- return SequenceMatcher(None, normalized_pred, normalized_actual).ratio()
67
-
68
- def contains_answer(self, predicted: str, actual: str) -> bool:
69
- """Check if the actual answer is contained in the predicted answer."""
70
- normalized_pred = self.normalize_answer(predicted)
71
- normalized_actual = self.normalize_answer(actual)
72
-
73
- return normalized_actual in normalized_pred
74
-
75
- def evaluate_answer(self, task_id: str, predicted_answer: str) -> Dict[str, Any]:
76
- """Evaluate a single answer against ground truth."""
77
- actual_answer = self.ground_truth.get(task_id)
78
-
79
- if actual_answer is None:
80
- return {
81
- "task_id": task_id,
82
- "predicted_answer": predicted_answer,
83
- "actual_answer": None,
84
- "exact_match": False,
85
- "similarity_score": 0.0,
86
- "contains_answer": False,
87
- "error": "No ground truth available"
88
- }
89
-
90
- return {
91
- "task_id": task_id,
92
- "predicted_answer": predicted_answer,
93
- "actual_answer": actual_answer,
94
- "exact_match": self.exact_match(predicted_answer, actual_answer),
95
- "similarity_score": self.similarity_score(predicted_answer, actual_answer),
96
- "contains_answer": self.contains_answer(predicted_answer, actual_answer),
97
- "error": None
98
- }
99
-
100
- def evaluate_batch(self, results: List[Dict[str, Any]]) -> pd.DataFrame:
101
- """Evaluate a batch of results."""
102
- evaluations = []
103
-
104
- for result in results:
105
- task_id = result.get("task_id") or result.get("Task ID")
106
- predicted_answer = result.get("submitted_answer") or result.get("Submitted Answer", "")
107
-
108
- if task_id is not None:
109
- evaluation = self.evaluate_answer(task_id, predicted_answer)
110
- evaluations.append(evaluation)
111
-
112
- return pd.DataFrame(evaluations)
113
-
114
- def get_summary_stats(self, evaluations_df: pd.DataFrame) -> Dict[str, Any]:
115
- """Get summary statistics from evaluations."""
116
- if evaluations_df.empty:
117
- return {"error": "No evaluations available"}
118
-
119
- # Filter out entries without ground truth
120
- valid_evaluations = evaluations_df[evaluations_df['error'].isna()]
121
-
122
- if valid_evaluations.empty:
123
- return {"error": "No valid ground truth available"}
124
-
125
- total_questions = len(valid_evaluations)
126
- exact_matches = valid_evaluations['exact_match'].sum()
127
- avg_similarity = valid_evaluations['similarity_score'].mean()
128
- contains_matches = valid_evaluations['contains_answer'].sum()
129
-
130
- return {
131
- "total_questions": total_questions,
132
- "exact_matches": exact_matches,
133
- "exact_match_rate": exact_matches / total_questions,
134
- "average_similarity": avg_similarity,
135
- "contains_matches": contains_matches,
136
- "contains_match_rate": contains_matches / total_questions,
137
- "questions_with_ground_truth": total_questions
138
- }
139
-
140
- def enhance_results_log(self, results_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
141
- """Add comparison columns to results log."""
142
- enhanced_results = []
143
-
144
- for result in results_log:
145
- task_id = result.get("Task ID")
146
- predicted_answer = result.get("Submitted Answer", "")
147
-
148
- if task_id is not None:
149
- evaluation = self.evaluate_answer(task_id, predicted_answer)
150
-
151
- # Add comparison info to result
152
- enhanced_result = result.copy()
153
- enhanced_result["Ground Truth"] = evaluation["actual_answer"] or "N/A"
154
- enhanced_result["Exact Match"] = evaluation["exact_match"]
155
- enhanced_result["Similarity"] = f"{evaluation['similarity_score']:.3f}"
156
- enhanced_result["Contains Answer"] = evaluation["contains_answer"]
157
-
158
- enhanced_results.append(enhanced_result)
159
-
160
- return enhanced_results
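
For reference, a minimal usage sketch of the `AnswerComparator` removed above, assuming `data/metadata.jsonl` is present in GAIA format (one JSON object per line with `task_id` and `Final answer` keys, as the loader expects); the sample task ID is the one used in the project's tests:

```python
from comparison import AnswerComparator  # the class deleted in this commit

comparator = AnswerComparator("data/metadata.jsonl")

# Answers in the submission-payload shape that evaluate_batch() accepts.
answers_payload = [
    {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
    {"task_id": "nonexistent-task", "submitted_answer": "no ground truth here"},
]

evaluations_df = comparator.evaluate_batch(answers_payload)
stats = comparator.get_summary_stats(evaluations_df)

if "error" not in stats:
    print(f"{stats['exact_matches']}/{stats['total_questions']} exact matches, "
          f"avg similarity {stats['average_similarity']:.3f}")
else:
    print(stats["error"])
```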
 
debug_phoenix.py DELETED
@@ -1,285 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Enhanced debug script to check Phoenix status and evaluations.
4
- """
5
-
6
- import phoenix as px
7
- import pandas as pd
8
- from comparison import AnswerComparator
9
- from phoenix_evaluator import log_evaluations_to_phoenix
10
- import time
11
- from datetime import datetime
12
-
13
-
14
- def check_phoenix_connection():
15
- """Check if Phoenix is running and accessible."""
16
- try:
17
- client = px.Client()
18
- print("✅ Phoenix client connected successfully")
19
-
20
- # Try to get basic info
21
- try:
22
- spans_df = client.get_spans_dataframe()
23
- print(f"✅ Phoenix API working - can retrieve spans")
24
- return client
25
- except Exception as e:
26
- print(f"⚠️ Phoenix connected but API might have issues: {e}")
27
- return client
28
-
29
- except Exception as e:
30
- print(f"❌ Phoenix connection failed: {e}")
31
- print("Make sure Phoenix is running. You should see a message like:")
32
- print("🌍 To view the Phoenix app in your browser, visit http://localhost:6006")
33
- return None
34
-
35
-
36
- def check_spans(client):
37
- """Check spans in Phoenix."""
38
- try:
39
- spans_df = client.get_spans_dataframe()
40
- print(f"📊 Found {len(spans_df)} spans in Phoenix")
41
-
42
- if len(spans_df) > 0:
43
- print("Recent spans:")
44
- for i, (_, span) in enumerate(spans_df.head(5).iterrows()):
45
- span_id = span.get('context.span_id', 'no-id')
46
- span_name = span.get('name', 'unnamed')
47
- start_time = span.get('start_time', 'unknown')
48
- print(f" {i+1}. {span_name} ({span_id[:8]}...) - {start_time}")
49
-
50
- # Show input/output samples
51
- print("\nSpan content samples:")
52
- for i, (_, span) in enumerate(spans_df.head(3).iterrows()):
53
- input_val = str(span.get('input.value', ''))[:100]
54
- output_val = str(span.get('output.value', ''))[:100]
55
- print(f" Span {i+1}:")
56
- print(f" Input: {input_val}...")
57
- print(f" Output: {output_val}...")
58
-
59
- else:
60
- print("⚠️ No spans found. Run your agent first to generate traces.")
61
-
62
- return spans_df
63
-
64
- except Exception as e:
65
- print(f"❌ Error getting spans: {e}")
66
- return pd.DataFrame()
67
-
68
-
69
- def check_evaluations(client):
70
- """Check evaluations in Phoenix."""
71
- try:
72
- # Try different methods to get evaluations
73
- print("🔍 Checking evaluations...")
74
-
75
- # Method 1: Direct evaluation dataframe
76
- try:
77
- evals_df = client.get_evaluations_dataframe()
78
- print(f"📊 Found {len(evals_df)} evaluations in Phoenix")
79
-
80
- if len(evals_df) > 0:
81
- print("Evaluation breakdown:")
82
- eval_names = evals_df['name'].value_counts()
83
- for name, count in eval_names.items():
84
- print(f" - {name}: {count} evaluations")
85
-
86
- # Check for GAIA evaluations specifically
87
- gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
88
- if len(gaia_evals) > 0:
89
- print(f"✅ Found {len(gaia_evals)} GAIA ground truth evaluations")
90
-
91
- # Show sample evaluation
92
- sample = gaia_evals.iloc[0]
93
- print("Sample GAIA evaluation:")
94
- print(f" - Score: {sample.get('score', 'N/A')}")
95
- print(f" - Label: {sample.get('label', 'N/A')}")
96
- print(f" - Explanation: {sample.get('explanation', 'N/A')[:100]}...")
97
-
98
- # Show metadata if available
99
- metadata = sample.get('metadata', {})
100
- if metadata:
101
- print(f" - Metadata keys: {list(metadata.keys())}")
102
-
103
- else:
104
- print("❌ No GAIA ground truth evaluations found")
105
- print("Available evaluation types:", list(eval_names.keys()))
106
-
107
- else:
108
- print("⚠️ No evaluations found in Phoenix")
109
-
110
- return evals_df
111
-
112
- except AttributeError as e:
113
- print(f"⚠️ get_evaluations_dataframe not available: {e}")
114
- print("This might be a Phoenix version issue")
115
- return pd.DataFrame()
116
-
117
- except Exception as e:
118
- print(f"❌ Error getting evaluations: {e}")
119
- return pd.DataFrame()
120
-
121
-
122
- def test_evaluation_creation_and_logging():
123
- """Test creating and logging evaluations."""
124
- print("\n🧪 Testing evaluation creation and logging...")
125
-
126
- # Create sample evaluations
127
- sample_data = [
128
- {
129
- "task_id": "debug-test-1",
130
- "predicted_answer": "test answer 1",
131
- "actual_answer": "correct answer 1",
132
- "exact_match": False,
133
- "similarity_score": 0.75,
134
- "contains_answer": True,
135
- "error": None
136
- },
137
- {
138
- "task_id": "debug-test-2",
139
- "predicted_answer": "exact match",
140
- "actual_answer": "exact match",
141
- "exact_match": True,
142
- "similarity_score": 1.0,
143
- "contains_answer": True,
144
- "error": None
145
- }
146
- ]
147
-
148
- evaluations_df = pd.DataFrame(sample_data)
149
- print(f"Created {len(evaluations_df)} test evaluations")
150
-
151
- # Try to log to Phoenix
152
- try:
153
- print("Attempting to log evaluations to Phoenix...")
154
- result = log_evaluations_to_phoenix(evaluations_df)
155
-
156
- if result is not None:
157
- print("✅ Test evaluation logging successful")
158
- print(f"Logged {len(result)} evaluations")
159
- return True
160
- else:
161
- print("❌ Test evaluation logging failed - no result returned")
162
- return False
163
-
164
- except Exception as e:
165
- print(f"❌ Test evaluation logging error: {e}")
166
- import traceback
167
- traceback.print_exc()
168
- return False
169
-
170
-
171
- def check_gaia_data():
172
- """Check GAIA ground truth data availability."""
173
- print("\n📚 Checking GAIA ground truth data...")
174
-
175
- try:
176
- comparator = AnswerComparator()
177
-
178
- print(f"✅ Loaded {len(comparator.ground_truth)} GAIA ground truth answers")
179
-
180
- if len(comparator.ground_truth) > 0:
181
- # Show sample
182
- sample_task_id = list(comparator.ground_truth.keys())[0]
183
- sample_answer = comparator.ground_truth[sample_task_id]
184
- print(f"Sample: {sample_task_id} -> '{sample_answer}'")
185
-
186
- # Test evaluation
187
- test_eval = comparator.evaluate_answer(sample_task_id, "test answer")
188
- print(f"Test evaluation result: {test_eval}")
189
-
190
- return True
191
- else:
192
- print("❌ No GAIA ground truth data found")
193
- return False
194
-
195
- except Exception as e:
196
- print(f"❌ Error checking GAIA data: {e}")
197
- return False
198
-
199
-
200
- def show_phoenix_ui_info():
201
- """Show information about Phoenix UI."""
202
- print("\n🌐 Phoenix UI Information:")
203
- print("-" * 30)
204
- print("Phoenix UI should be available at: http://localhost:6006")
205
- print("")
206
- print("In the Phoenix UI, look for:")
207
- print(" • 'Evaluations' tab or section")
208
- print(" • 'Evals' section")
209
- print(" • 'Annotations' tab")
210
- print(" • In 'Spans' view, look for evaluation badges on spans")
211
- print("")
212
- print("If you see evaluations, they should be named 'gaia_ground_truth'")
213
- print("Each evaluation should show:")
214
- print(" - Score (similarity score 0-1)")
215
- print(" - Label (correct/incorrect)")
216
- print(" - Explanation (predicted vs ground truth)")
217
- print(" - Metadata (task_id, exact_match, etc.)")
218
-
219
-
220
- def main():
221
- """Main debug function."""
222
- print("🔍 Enhanced Phoenix Debug Script")
223
- print("=" * 50)
224
-
225
- # Check Phoenix connection
226
- client = check_phoenix_connection()
227
- if not client:
228
- print("\n❌ Cannot proceed without Phoenix connection")
229
- print("Make sure your agent app is running (it starts Phoenix)")
230
- return
231
-
232
- print("\n📋 Checking Phoenix Data:")
233
- print("-" * 30)
234
-
235
- # Check spans
236
- spans_df = check_spans(client)
237
-
238
- # Check evaluations
239
- evals_df = check_evaluations(client)
240
-
241
- # Test evaluation creation
242
- test_success = test_evaluation_creation_and_logging()
243
-
244
- # Wait a moment and recheck evaluations
245
- if test_success:
246
- print("\n⏳ Waiting for evaluations to be processed...")
247
- time.sleep(3)
248
-
249
- print("🔍 Rechecking evaluations after test logging...")
250
- evals_df_after = check_evaluations(client)
251
-
252
- if len(evals_df_after) > len(evals_df):
253
- print("✅ New evaluations detected after test!")
254
- else:
255
- print("⚠️ No new evaluations detected")
256
-
257
- # Check GAIA data
258
- gaia_available = check_gaia_data()
259
-
260
- # Show Phoenix UI info
261
- show_phoenix_ui_info()
262
-
263
- # Final summary
264
- print("\n" + "=" * 50)
265
- print("📊 Summary:")
266
- print(f" • Phoenix connected: {'✅' if client else '❌'}")
267
- print(f" • Spans available: {len(spans_df)} spans")
268
- print(f" • Evaluations found: {len(evals_df)} evaluations")
269
- print(f" • GAIA data available: {'✅' if gaia_available else '❌'}")
270
- print(f" • Test logging worked: {'✅' if test_success else '❌'}")
271
-
272
- print("\n💡 Next Steps:")
273
- if len(spans_df) == 0:
274
- print(" • Run your agent to generate traces first")
275
- if len(evals_df) == 0:
276
- print(" • Check if evaluations are being logged correctly")
277
- print(" • Verify Phoenix version compatibility")
278
- if not gaia_available:
279
- print(" • Check that data/metadata.jsonl exists and is readable")
280
-
281
- print(f"\n🌐 Phoenix UI: http://localhost:6006")
282
-
283
-
284
- if __name__ == "__main__":
285
- main()
 
debug_spans.py DELETED
@@ -1,77 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Debug script to see Phoenix spans column structure.
4
- """
5
-
6
- import sys
7
- import os
8
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
9
-
10
- import phoenix as px
11
- import pandas as pd
12
-
13
-
14
- def debug_spans_structure():
15
- """Debug the structure of Phoenix spans."""
16
- print("🔍 Debugging Phoenix Spans Structure")
17
- print("=" * 50)
18
-
19
- try:
20
- client = px.Client()
21
- print("✅ Phoenix connected successfully")
22
- except Exception as e:
23
- print(f"❌ Phoenix connection failed: {e}")
24
- return
25
-
26
- try:
27
- spans_df = client.get_spans_dataframe()
28
- print(f"📊 Found {len(spans_df)} spans in Phoenix")
29
-
30
- if len(spans_df) == 0:
31
- print("⚠️ No spans found. Run your agent first to create spans.")
32
- return
33
-
34
- print(f"\n📋 Available Columns ({len(spans_df.columns)} total):")
35
- for i, col in enumerate(spans_df.columns):
36
- print(f" {i+1:2d}. {col}")
37
-
38
- print(f"\n🔍 Sample Data (first span):")
39
- sample_span = spans_df.iloc[0]
40
- for col in spans_df.columns:
41
- value = sample_span.get(col)
42
- if value is not None:
43
- value_str = str(value)[:100] + "..." if len(str(value)) > 100 else str(value)
44
- print(f" {col}: {value_str}")
45
-
46
- # Look for input/output related columns
47
- input_cols = [col for col in spans_df.columns if 'input' in col.lower()]
48
- output_cols = [col for col in spans_df.columns if 'output' in col.lower()]
49
-
50
- print(f"\n🎯 Input-related columns: {input_cols}")
51
- print(f"🎯 Output-related columns: {output_cols}")
52
-
53
- # Look for span ID columns
54
- id_cols = [col for col in spans_df.columns if 'id' in col.lower()]
55
- print(f"🎯 ID-related columns: {id_cols}")
56
-
57
- # Look for columns that might contain task IDs
58
- print(f"\n🔍 Searching for task IDs in spans...")
59
- task_id_sample = "8e867cd7-cff9-4e6c-867a-ff5ddc2550be"
60
-
61
- for col in spans_df.columns:
62
- if spans_df[col].dtype == 'object': # String-like columns
63
- try:
64
- matches = spans_df[spans_df[col].astype(str).str.contains(task_id_sample, na=False, case=False)]
65
- if len(matches) > 0:
66
- print(f" ✅ Found task ID in column '{col}': {len(matches)} matches")
67
- except:
68
- pass
69
-
70
- except Exception as e:
71
- print(f"❌ Error debugging spans: {e}")
72
- import traceback
73
- traceback.print_exc()
74
-
75
-
76
- if __name__ == "__main__":
77
- debug_spans_structure()
 
phoenix_evaluator.py DELETED
@@ -1,273 +0,0 @@
1
- import pandas as pd
2
- from typing import Dict, Any, List, Optional
3
- from comparison import AnswerComparator
4
- import phoenix as px
5
- from phoenix.trace import SpanEvaluations
6
-
7
-
8
- class GAIAPhoenixEvaluator:
9
- """Phoenix evaluator for GAIA dataset ground truth comparison."""
10
-
11
- def __init__(self, metadata_path: str = "data/metadata.jsonl"):
12
- self.comparator = AnswerComparator(metadata_path)
13
- self.eval_name = "gaia_ground_truth"
14
-
15
- def evaluate_spans(self, spans_df: pd.DataFrame) -> List[SpanEvaluations]:
16
- """Evaluate spans and return Phoenix SpanEvaluations."""
17
- evaluations = []
18
-
19
- for _, span in spans_df.iterrows():
20
- # Extract task_id and answer from span
21
- task_id = self._extract_task_id(span)
22
- predicted_answer = self._extract_predicted_answer(span)
23
- span_id = span.get("context.span_id")
24
-
25
- if task_id and predicted_answer is not None and span_id:
26
- evaluation = self.comparator.evaluate_answer(task_id, predicted_answer)
27
-
28
- # Create evaluation record for Phoenix
29
- eval_record = {
30
- "span_id": span_id,
31
- "score": 1.0 if evaluation["exact_match"] else evaluation["similarity_score"],
32
- "label": "correct" if evaluation["exact_match"] else "incorrect",
33
- "explanation": self._create_explanation(evaluation),
34
- "task_id": task_id,
35
- "predicted_answer": evaluation["predicted_answer"],
36
- "ground_truth": evaluation["actual_answer"],
37
- "exact_match": evaluation["exact_match"],
38
- "similarity_score": evaluation["similarity_score"],
39
- "contains_answer": evaluation["contains_answer"]
40
- }
41
-
42
- evaluations.append(eval_record)
43
-
44
- if evaluations:
45
- # Create SpanEvaluations object
46
- eval_df = pd.DataFrame(evaluations)
47
- return [SpanEvaluations(eval_name=self.eval_name, dataframe=eval_df)]
48
-
49
- return []
50
-
51
- def _extract_task_id(self, span) -> Optional[str]:
52
- """Extract task_id from span data."""
53
- # Try span attributes first
54
- attributes = span.get("attributes", {})
55
- if isinstance(attributes, dict):
56
- if "task_id" in attributes:
57
- return attributes["task_id"]
58
-
59
- # Try input data
60
- input_data = span.get("input", {})
61
- if isinstance(input_data, dict):
62
- if "task_id" in input_data:
63
- return input_data["task_id"]
64
-
65
- # Try to extract from input value if it's a string
66
- input_value = span.get("input.value", "")
67
- if isinstance(input_value, str):
68
- # Look for UUID pattern in input
69
- import re
70
- uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
71
- match = re.search(uuid_pattern, input_value)
72
- if match:
73
- return match.group(0)
74
-
75
- # Try span name
76
- span_name = span.get("name", "")
77
- if isinstance(span_name, str):
78
- import re
79
- uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
80
- match = re.search(uuid_pattern, span_name)
81
- if match:
82
- return match.group(0)
83
-
84
- return None
85
-
86
- def _extract_predicted_answer(self, span) -> Optional[str]:
87
- """Extract predicted answer from span output."""
88
- # Try different output fields
89
- output_fields = ["output.value", "output", "response", "result"]
90
-
91
- for field in output_fields:
92
- value = span.get(field)
93
- if value is not None:
94
- return str(value)
95
-
96
- return None
97
-
98
- def _create_explanation(self, evaluation: Dict[str, Any]) -> str:
99
- """Create human-readable explanation of the evaluation."""
100
- predicted = evaluation["predicted_answer"]
101
- actual = evaluation["actual_answer"]
102
- exact_match = evaluation["exact_match"]
103
- similarity = evaluation["similarity_score"]
104
- contains = evaluation["contains_answer"]
105
-
106
- if actual is None:
107
- return "❓ No ground truth available for comparison"
108
-
109
- explanation = f"Predicted: '{predicted}' | Ground Truth: '{actual}' | "
110
-
111
- if exact_match:
112
- explanation += "✅ Exact match"
113
- elif contains:
114
- explanation += f"⚠️ Contains correct answer (similarity: {similarity:.3f})"
115
- else:
116
- explanation += f"❌ Incorrect (similarity: {similarity:.3f})"
117
-
118
- return explanation
119
-
120
-
121
- def add_gaia_evaluations_to_phoenix(spans_df: pd.DataFrame, metadata_path: str = "data/metadata.jsonl") -> List[SpanEvaluations]:
122
- """Add GAIA evaluation results to Phoenix spans."""
123
- evaluator = GAIAPhoenixEvaluator(metadata_path)
124
- return evaluator.evaluate_spans(spans_df)
125
-
126
-
127
- def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame, session_id: Optional[str] = None) -> Optional[pd.DataFrame]:
128
- """Log evaluation results directly to Phoenix."""
129
- try:
130
- client = px.Client()
131
-
132
- # Get current spans to match evaluations to span_ids
133
- spans_df = client.get_spans_dataframe()
134
-
135
- if spans_df is None or spans_df.empty:
136
- print("No spans found to attach evaluations to")
137
- return None
138
-
139
- # Debug: Show available columns
140
- print(f"📊 Available span columns: {list(spans_df.columns)}")
141
-
142
- # Get possible input/output column names
143
- input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
144
- output_columns = [col for col in spans_df.columns if 'output' in col.lower()]
145
- name_columns = [col for col in spans_df.columns if 'name' in col.lower()]
146
-
147
- print(f"📊 Input columns found: {input_columns}")
148
- print(f"📊 Output columns found: {output_columns}")
149
- print(f"📊 Name columns found: {name_columns}")
150
-
151
- # Create evaluation records for Phoenix
152
- evaluation_records = []
153
- spans_with_evals = []
154
-
155
- for _, eval_row in evaluations_df.iterrows():
156
- task_id = eval_row["task_id"]
157
- matching_spans = pd.DataFrame()
158
-
159
- # Try different strategies to find matching spans
160
-
161
- # Strategy 1: Search in all string columns for task_id
162
- for col in spans_df.columns:
163
- if spans_df[col].dtype == 'object': # String-like columns
164
- try:
165
- matches = spans_df[
166
- spans_df[col].astype(str).str.contains(task_id, na=False, case=False)
167
- ]
168
- if len(matches) > 0:
169
- matching_spans = matches
170
- print(f"✅ Found match for {task_id} in column '{col}'")
171
- break
172
- except Exception as e:
173
- continue
174
-
175
- # Strategy 2: If no matches found, try searching in input columns specifically
176
- if len(matching_spans) == 0 and input_columns:
177
- for input_col in input_columns:
178
- try:
179
- matches = spans_df[
180
- spans_df[input_col].astype(str).str.contains(task_id, na=False, case=False)
181
- ]
182
- if len(matches) > 0:
183
- matching_spans = matches
184
- print(f"✅ Found match for {task_id} in input column '{input_col}'")
185
- break
186
- except Exception as e:
187
- continue
188
-
189
- # Strategy 3: If still no matches, try with partial task_id (last 8 characters)
190
- if len(matching_spans) == 0:
191
- short_task_id = task_id[-8:] if len(task_id) > 8 else task_id
192
- for col in spans_df.columns:
193
- if spans_df[col].dtype == 'object':
194
- try:
195
- matches = spans_df[
196
- spans_df[col].astype(str).str.contains(short_task_id, na=False, case=False)
197
- ]
198
- if len(matches) > 0:
199
- matching_spans = matches
200
- print(f"✅ Found match for {task_id} using short ID in column '{col}'")
201
- break
202
- except Exception as e:
203
- continue
204
-
205
- if len(matching_spans) > 0:
206
- span_id = matching_spans.iloc[0].get('context.span_id') or matching_spans.iloc[0].get('span_id')
207
-
208
- if span_id:
209
- # Create evaluation record in Phoenix format
210
- evaluation_record = {
211
- "span_id": span_id,
212
- "name": "gaia_ground_truth",
213
- "score": eval_row["similarity_score"],
214
- "label": "correct" if bool(eval_row["exact_match"]) else "incorrect",
215
- "explanation": f"Predicted: '{eval_row['predicted_answer']}' | Ground Truth: '{eval_row['actual_answer']}' | Similarity: {eval_row['similarity_score']:.3f} | Exact Match: {eval_row['exact_match']}",
216
- "annotator_kind": "HUMAN",
217
- "metadata": {
218
- "task_id": task_id,
219
- "exact_match": bool(eval_row["exact_match"]),
220
- "similarity_score": float(eval_row["similarity_score"]),
221
- "contains_answer": bool(eval_row["contains_answer"]),
222
- "predicted_answer": str(eval_row["predicted_answer"]),
223
- "ground_truth": str(eval_row["actual_answer"])
224
- }
225
- }
226
-
227
- evaluation_records.append(evaluation_record)
228
- spans_with_evals.append(span_id)
229
- else:
230
- print(f"⚠️ No span_id found for matching span with task {task_id}")
231
- else:
232
- print(f"⚠️ No matching span found for task {task_id}")
233
-
234
- if evaluation_records:
235
- # Convert to DataFrame for Phoenix
236
- eval_df = pd.DataFrame(evaluation_records)
237
-
238
- # Create SpanEvaluations object
239
- span_evaluations = SpanEvaluations(
240
- eval_name="gaia_ground_truth",
241
- dataframe=eval_df
242
- )
243
-
244
- # Log evaluations to Phoenix
245
- try:
246
- # Try the newer Phoenix API
247
- px.log_evaluations(span_evaluations)
248
- print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using px.log_evaluations")
249
- except AttributeError:
250
- try:
251
- # Fallback for older Phoenix versions
252
- client.log_evaluations(span_evaluations)
253
- print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using client.log_evaluations")
254
- except Exception as e:
255
- print(f"⚠️ Could not log evaluations using either method: {e}")
256
- # Still return the DataFrame so we know what would have been logged
257
- print("Evaluation records created but not logged to Phoenix")
258
-
259
- return eval_df
260
- else:
261
- print("⚠️ No matching spans found for any evaluations")
262
- if spans_df is not None:
263
- print(f"Available spans: {len(spans_df)}")
264
- if len(spans_df) > 0:
265
- available_cols = [col for col in spans_df.columns if spans_df[col].dtype == 'object'][:5]
266
- print(f"Sample searchable columns: {available_cols}")
267
- return None
268
-
269
- except Exception as e:
270
- print(f"❌ Could not log evaluations to Phoenix: {e}")
271
- import traceback
272
- traceback.print_exc()
273
- return None
 
test_comparison.py DELETED
@@ -1,144 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Test script for GAIA comparison functionality.
4
- """
5
-
6
- import sys
7
- import os
8
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
9
-
10
- from comparison import AnswerComparator
11
- from phoenix_evaluator import log_evaluations_to_phoenix
12
- import pandas as pd
13
-
14
-
15
- def test_basic_comparison():
16
- """Test basic comparison functionality."""
17
- print("Testing basic comparison...")
18
-
19
- # Initialize comparator
20
- comparator = AnswerComparator()
21
-
22
- # Test with some sample data
23
- sample_results = [
24
- {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
25
- {"task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "submitted_answer": "3"},
26
- {"task_id": "nonexistent-task", "submitted_answer": "test"}
27
- ]
28
-
29
- # Evaluate batch
30
- evaluations_df = comparator.evaluate_batch(sample_results)
31
- print(f"Evaluated {len(evaluations_df)} answers")
32
-
33
- # Get summary stats
34
- summary_stats = comparator.get_summary_stats(evaluations_df)
35
- print("Summary statistics:")
36
- for key, value in summary_stats.items():
37
- print(f" {key}: {value}")
38
-
39
- # Test single evaluation
40
- print("\nTesting single evaluation...")
41
- single_eval = comparator.evaluate_answer("8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "3")
42
- print(f"Single evaluation result: {single_eval}")
43
-
44
- return evaluations_df
45
-
46
-
47
- def test_results_enhancement():
48
- """Test results log enhancement."""
49
- print("\nTesting results log enhancement...")
50
-
51
- comparator = AnswerComparator()
52
-
53
- # Sample results log (like what comes from your agent)
54
- sample_results_log = [
55
- {
56
- "Task ID": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
57
- "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009?",
58
- "Submitted Answer": "3"
59
- },
60
- {
61
- "Task ID": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
62
- "Question": "Test question",
63
- "Submitted Answer": "wrong answer"
64
- }
65
- ]
66
-
67
- # Enhance results
68
- enhanced_results = comparator.enhance_results_log(sample_results_log)
69
-
70
- print("Enhanced results:")
71
- for result in enhanced_results:
72
- print(f" Task: {result['Task ID']}")
73
- print(f" Answer: {result['Submitted Answer']}")
74
- print(f" Ground Truth: {result['Ground Truth']}")
75
- print(f" Exact Match: {result['Exact Match']}")
76
- print(f" Similarity: {result['Similarity']}")
77
- print()
78
-
79
-
80
- def test_phoenix_integration():
81
- """Test Phoenix integration (basic)."""
82
- print("\nTesting Phoenix integration...")
83
-
84
- # Create sample evaluations
85
- sample_evaluations = pd.DataFrame([
86
- {
87
- "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
88
- "predicted_answer": "3",
89
- "actual_answer": "3",
90
- "exact_match": True,
91
- "similarity_score": 1.0,
92
- "contains_answer": True,
93
- "error": None
94
- },
95
- {
96
- "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
97
- "predicted_answer": "wrong",
98
- "actual_answer": "3",
99
- "exact_match": False,
100
- "similarity_score": 0.2,
101
- "contains_answer": False,
102
- "error": None
103
- }
104
- ])
105
-
106
- # Try to log to Phoenix
107
- try:
108
- result = log_evaluations_to_phoenix(sample_evaluations)
109
- if result is not None:
110
- print("✅ Phoenix integration successful")
111
- else:
112
- print("⚠️ Phoenix integration failed (likely Phoenix not running)")
113
- except Exception as e:
114
- print(f"⚠️ Phoenix integration error: {e}")
115
-
116
-
117
- def main():
118
- """Run all tests."""
119
- print("="*50)
120
- print("GAIA Comparison Test Suite")
121
- print("="*50)
122
-
123
- try:
124
- # Test basic comparison
125
- evaluations_df = test_basic_comparison()
126
-
127
- # Test results enhancement
128
- test_results_enhancement()
129
-
130
- # Test Phoenix integration
131
- test_phoenix_integration()
132
-
133
- print("\n" + "="*50)
134
- print("All tests completed!")
135
- print("="*50)
136
-
137
- except Exception as e:
138
- print(f"❌ Test failed with error: {e}")
139
- import traceback
140
- traceback.print_exc()
141
-
142
-
143
- if __name__ == "__main__":
144
- main()
 
test_phoenix_logging.py DELETED
@@ -1,261 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Test script to verify Phoenix evaluations logging.
4
- """
5
-
6
- import sys
7
- import os
8
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
9
-
10
- import phoenix as px
11
- import pandas as pd
12
- from comparison import AnswerComparator
13
- from phoenix_evaluator import log_evaluations_to_phoenix
14
- from datetime import datetime
15
- import time
16
-
17
-
18
- def test_phoenix_connection():
19
- """Test Phoenix connection and basic functionality."""
20
- print("🔍 Testing Phoenix Connection...")
21
-
22
- try:
23
- client = px.Client()
24
- print("✅ Phoenix client connected successfully")
25
-
26
- # Check if Phoenix is actually running
27
- spans_df = client.get_spans_dataframe()
28
- print(f"📊 Found {len(spans_df)} existing spans in Phoenix")
29
-
30
- return client, spans_df
31
- except Exception as e:
32
- print(f"❌ Phoenix connection failed: {e}")
33
- print("Make sure Phoenix is running and accessible at http://localhost:6006")
34
- return None, None
35
-
36
-
37
- def create_test_evaluations():
38
- """Create test evaluations for logging."""
39
- print("\n🧪 Creating test evaluations...")
40
-
41
- test_data = [
42
- {
43
- "task_id": "test-exact-match",
44
- "predicted_answer": "Paris",
45
- "actual_answer": "Paris",
46
- "exact_match": True,
47
- "similarity_score": 1.0,
48
- "contains_answer": True,
49
- "error": None
50
- },
51
- {
52
- "task_id": "test-partial-match",
53
- "predicted_answer": "The capital of France is Paris",
54
- "actual_answer": "Paris",
55
- "exact_match": False,
56
- "similarity_score": 0.75,
57
- "contains_answer": True,
58
- "error": None
59
- },
60
- {
61
- "task_id": "test-no-match",
62
- "predicted_answer": "London",
63
- "actual_answer": "Paris",
64
- "exact_match": False,
65
- "similarity_score": 0.2,
66
- "contains_answer": False,
67
- "error": None
68
- }
69
- ]
70
-
71
- evaluations_df = pd.DataFrame(test_data)
72
- print(f"Created {len(evaluations_df)} test evaluations")
73
-
74
- return evaluations_df
75
-
76
-
77
- def create_mock_spans(client):
78
- """Create mock spans for testing (if no real spans exist)."""
79
- print("\n🎭 Creating mock spans for testing...")
80
-
81
- # Note: This is a simplified mock - in real usage, spans are created by agent runs
82
- mock_spans = [
83
- {
84
- "context.span_id": "mock-span-1",
85
- "name": "test_agent_run",
86
- "input.value": "Question about test-exact-match",
87
- "output.value": "Paris",
88
- "start_time": datetime.now(),
89
- "end_time": datetime.now()
90
- },
91
- {
92
- "context.span_id": "mock-span-2",
93
- "name": "test_agent_run",
94
- "input.value": "Question about test-partial-match",
95
- "output.value": "The capital of France is Paris",
96
- "start_time": datetime.now(),
97
- "end_time": datetime.now()
98
- },
99
- {
100
- "context.span_id": "mock-span-3",
101
- "name": "test_agent_run",
102
- "input.value": "Question about test-no-match",
103
- "output.value": "London",
104
- "start_time": datetime.now(),
105
- "end_time": datetime.now()
106
- }
107
- ]
108
-
109
- print(f"Created {len(mock_spans)} mock spans")
110
- return pd.DataFrame(mock_spans)
111
-
112
-
113
- def test_evaluation_logging():
114
- """Test the actual evaluation logging to Phoenix."""
115
- print("\n📝 Testing evaluation logging...")
116
-
117
- # Create test evaluations
118
- evaluations_df = create_test_evaluations()
119
-
120
- # Try to log to Phoenix
121
- try:
122
- result = log_evaluations_to_phoenix(evaluations_df)
123
-
124
- if result is not None:
125
- print("✅ Evaluation logging test successful!")
126
- print(f"Logged {len(result)} evaluations")
127
- return True
128
- else:
129
- print("❌ Evaluation logging test failed - no result returned")
130
- return False
131
-
132
- except Exception as e:
133
- print(f"❌ Evaluation logging test failed with error: {e}")
134
- import traceback
135
- traceback.print_exc()
136
- return False
137
-
138
-
139
- def verify_logged_evaluations(client):
140
- """Verify that evaluations were actually logged to Phoenix."""
141
- print("\n🔍 Verifying logged evaluations...")
142
-
143
- try:
144
- # Give Phoenix a moment to process
145
- time.sleep(2)
146
-
147
- # Try to retrieve evaluations
148
- evals_df = client.get_evaluations_dataframe()
149
- print(f"📊 Found {len(evals_df)} total evaluations in Phoenix")
150
-
151
- # Look for our specific evaluations
152
- if len(evals_df) > 0:
153
- gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
154
- print(f"🎯 Found {len(gaia_evals)} GAIA ground truth evaluations")
155
-
156
- if len(gaia_evals) > 0:
157
- print("✅ Successfully verified evaluations in Phoenix!")
158
-
159
- # Show sample evaluation
160
- sample_eval = gaia_evals.iloc[0]
161
- print(f"Sample evaluation:")
162
- print(f" - Score: {sample_eval.get('score', 'N/A')}")
163
- print(f" - Label: {sample_eval.get('label', 'N/A')}")
164
- print(f" - Explanation: {sample_eval.get('explanation', 'N/A')}")
165
-
166
- return True
167
- else:
168
- print("❌ No GAIA evaluations found after logging")
169
- return False
170
- else:
171
- print("❌ No evaluations found in Phoenix")
172
- return False
173
-
174
- except Exception as e:
175
- print(f"❌ Error verifying evaluations: {e}")
176
- return False
177
-
178
-
179
- def test_with_real_gaia_data():
180
- """Test with actual GAIA data if available."""
181
- print("\n📚 Testing with real GAIA data...")
182
-
183
- try:
184
- # Initialize comparator
185
- comparator = AnswerComparator()
186
-
187
- if len(comparator.ground_truth) == 0:
188
- print("⚠️ No GAIA ground truth data available")
189
- return False
190
-
191
- # Create a real evaluation with GAIA data
192
- real_task_id = list(comparator.ground_truth.keys())[0]
193
- real_ground_truth = comparator.ground_truth[real_task_id]
194
-
195
- real_evaluation = comparator.evaluate_answer(real_task_id, "test answer")
196
-
197
- real_eval_df = pd.DataFrame([real_evaluation])
198
-
199
- # Log to Phoenix
200
- result = log_evaluations_to_phoenix(real_eval_df)
201
-
202
- if result is not None:
203
- print("✅ Real GAIA data logging successful!")
204
- print(f"Task ID: {real_task_id}")
205
- print(f"Ground Truth: {real_ground_truth}")
206
- print(f"Similarity Score: {real_evaluation['similarity_score']:.3f}")
207
- return True
208
- else:
209
- print("❌ Real GAIA data logging failed")
210
- return False
211
-
212
- except Exception as e:
213
- print(f"❌ Error testing with real GAIA data: {e}")
214
- return False
215
-
216
-
217
- def main():
218
- """Main test function."""
219
- print("🚀 Phoenix Evaluations Logging Test")
220
- print("=" * 50)
221
-
222
- # Test Phoenix connection
223
- client, spans_df = test_phoenix_connection()
224
- if not client:
225
- print("❌ Cannot proceed without Phoenix connection")
226
- return
227
-
228
- # Run tests
229
- tests_passed = 0
230
- total_tests = 3
231
-
232
- print(f"\n🧪 Running {total_tests} tests...")
233
-
234
- # Test 1: Basic evaluation logging
235
- if test_evaluation_logging():
236
- tests_passed += 1
237
-
238
- # Test 2: Verify evaluations were logged
239
- if verify_logged_evaluations(client):
240
- tests_passed += 1
241
-
242
- # Test 3: Test with real GAIA data
243
- if test_with_real_gaia_data():
244
- tests_passed += 1
245
-
246
- # Summary
247
- print("\n" + "=" * 50)
248
- print(f"🎯 Test Results: {tests_passed}/{total_tests} tests passed")
249
-
250
- if tests_passed == total_tests:
251
- print("πŸŽ‰ All tests passed! Phoenix evaluations logging is working correctly.")
252
- print("You should now see 'gaia_ground_truth' evaluations in the Phoenix UI.")
253
- else:
254
- print("⚠️ Some tests failed. Check the output above for details.")
255
-
256
- print(f"\n🌐 Phoenix UI: http://localhost:6006")
257
- print("Look for 'Evaluations' or 'Evals' tab to see the logged evaluations.")
258
-
259
-
260
- if __name__ == "__main__":
261
- main()
 
test_phoenix_simple.py DELETED
@@ -1,139 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Simple test for Phoenix evaluations logging.
4
- """
5
-
6
- import sys
7
- import os
8
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
9
-
10
- import phoenix as px
11
- import pandas as pd
12
- from comparison import AnswerComparator
13
- from phoenix_evaluator import log_evaluations_to_phoenix
14
-
15
-
16
- def test_phoenix_logging():
17
- """Test Phoenix evaluations logging with simple data."""
18
- print("🧪 Testing Phoenix Evaluations Logging")
19
- print("=" * 50)
20
-
21
- # Step 1: Check Phoenix connection
22
- print("1. Checking Phoenix connection...")
23
- try:
24
- client = px.Client()
25
- print("✅ Phoenix connected successfully")
26
- except Exception as e:
27
- print(f"❌ Phoenix connection failed: {e}")
28
- return False
29
-
30
- # Step 2: Create test evaluations
31
- print("\n2. Creating test evaluations...")
32
- test_evaluations = pd.DataFrame([
33
- {
34
- "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
35
- "predicted_answer": "3",
36
- "actual_answer": "3",
37
- "exact_match": True,
38
- "similarity_score": 1.0,
39
- "contains_answer": True,
40
- "error": None
41
- },
42
- {
43
- "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
44
- "predicted_answer": "5",
45
- "actual_answer": "3",
46
- "exact_match": False,
47
- "similarity_score": 0.2,
48
- "contains_answer": False,
49
- "error": None
50
- }
51
- ])
52
- print(f"✅ Created {len(test_evaluations)} test evaluations")
53
-
54
- # Step 3: Check existing spans
55
- print("\n3. Checking existing spans...")
56
- try:
57
- spans_df = client.get_spans_dataframe()
58
- print(f"📊 Found {len(spans_df)} existing spans")
59
-
60
- if len(spans_df) == 0:
61
- print("⚠️ No spans found - you need to run your agent first to create spans")
62
- return False
63
-
64
- # Debug: Show available columns
65
- print(f"📊 Available span columns: {list(spans_df.columns)}")
66
- input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
67
- print(f"📊 Input columns found: {input_columns}")
68
-
69
- except Exception as e:
70
- print(f"❌ Error getting spans: {e}")
71
- return False
72
-
73
- # Step 4: Test logging
74
- print("\n4. Testing evaluation logging...")
75
- try:
76
- result = log_evaluations_to_phoenix(test_evaluations)
77
-
78
- if result is not None:
79
- print(f"✅ Successfully logged {len(result)} evaluations to Phoenix")
80
- if len(result) > 0:
81
- print("Sample evaluation:")
82
- print(f" - Score: {result.iloc[0]['score']}")
83
- print(f" - Label: {result.iloc[0]['label']}")
84
- print(f" - Explanation: {result.iloc[0]['explanation'][:100]}...")
85
-
86
- # Step 5: Verify evaluations were logged
87
- print("\n5. Verifying evaluations in Phoenix...")
88
- try:
89
- import time
90
- time.sleep(2) # Give Phoenix time to process
91
-
92
- evals_df = client.get_evaluations_dataframe()
93
- gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
94
-
95
- print(f"📊 Found {len(gaia_evals)} GAIA evaluations in Phoenix")
96
-
97
- if len(gaia_evals) > 0:
98
- print("✅ Evaluations successfully verified in Phoenix!")
99
- return True
100
- else:
101
- print("⚠️ No GAIA evaluations found in Phoenix")
102
- return False
103
-
104
- except Exception as e:
105
- print(f"⚠️ Could not verify evaluations: {e}")
106
- print("✅ Logging appeared successful though")
107
- return True
108
-
109
- else:
110
- print("❌ Evaluation logging failed")
111
- return False
112
-
113
- except Exception as e:
114
- print(f"❌ Error during logging: {e}")
115
- import traceback
116
- traceback.print_exc()
117
- return False
118
-
119
-
120
- def main():
121
- """Main test function."""
122
- success = test_phoenix_logging()
123
-
124
- print("\n" + "=" * 50)
125
- if success:
126
- print("🎉 Phoenix evaluations logging test PASSED!")
127
- print("You should now see 'gaia_ground_truth' evaluations in Phoenix UI")
128
- print("🌐 Visit: http://localhost:6006")
129
- else:
130
- print("❌ Phoenix evaluations logging test FAILED!")
131
- print("Make sure:")
132
- print(" 1. Your agent app is running (it starts Phoenix)")
133
- print(" 2. You've run your agent at least once to create spans")
134
- print(" 3. Phoenix is accessible at http://localhost:6006")
135
- print(" 4. Run 'python debug_spans.py' to see span column structure")
136
-
137
-
138
- if __name__ == "__main__":
139
- main()