Romain Fayoux committed
Commit b284752 · 1 Parent(s): 3a7aaed
Cleaned custom evaluations from project (claude slop)

Browse files:
- GAIA_COMPARISON.md +0 -142
- app.py +0 -43
- comparison.py +0 -160
- debug_phoenix.py +0 -285
- debug_spans.py +0 -77
- phoenix_evaluator.py +0 -273
- test_comparison.py +0 -144
- test_phoenix_logging.py +0 -261
- test_phoenix_simple.py +0 -139
GAIA_COMPARISON.md
DELETED
@@ -1,142 +0,0 @@

# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading
- Loads correct answers from `data/metadata.jsonl`
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset

### 2. Answer Comparison
For each agent answer, the system calculates:
- **Exact Match**: Boolean indicating if answers match exactly (after normalization)
- **Similarity Score**: 0-1 score using difflib.SequenceMatcher
- **Contains Answer**: Boolean indicating if the correct answer is contained in the agent's response

### 3. Answer Normalization
Before comparison, answers are normalized by:
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces

### 4. Phoenix Integration
- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in Phoenix UI for historical tracking and analysis

## Usage

### In Your Agent App
The comparison happens automatically when you run your agent:

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - System compares answers to ground truth
3. **Enhanced results** - Results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking

### Results Display
Your results table now includes these additional columns:
- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False if the correct answer is contained

### Status Message
The status message now includes:
```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and AnswerComparator class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.

### Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores
Low similarity scores might indicate:
- Agent is providing verbose answers when short ones are expected
- Answer format doesn't match the expected format
- Agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py`:
- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify Phoenix logging format

## Performance

The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10ms
- Phoenix logging: ~10-50ms per evaluation

Total additional time: Usually < 5 seconds for 50 questions.
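For reference, a minimal sketch of the three metrics the deleted documentation describes (exact match after normalization, difflib.SequenceMatcher similarity, and substring containment). The helper names here are illustrative and not part of the removed code.

```python
import re
from difflib import SequenceMatcher


def normalize(answer: str) -> str:
    # Lowercase, strip common punctuation, and collapse whitespace, as described above.
    answer = str(answer).strip().lower()
    answer = re.sub(r'[.,;:!?"\']', '', answer)
    return re.sub(r'\s+', ' ', answer).strip()


def compare(predicted: str, actual: str) -> dict:
    # Compute the three metrics on normalized strings.
    p, a = normalize(predicted), normalize(actual)
    return {
        "exact_match": p == a,
        "similarity_score": SequenceMatcher(None, p, a).ratio(),
        "contains_answer": a in p,
    }


print(compare("The answer is 3.", "3"))
# -> {'exact_match': False, 'similarity_score': 0.125, 'contains_answer': True}
```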
app.py
CHANGED
@@ -7,10 +7,6 @@ from phoenix.otel import register
 from openinference.instrumentation.smolagents import SmolagentsInstrumentor
 from llm_only_agent import LLMOnlyAgent
 from multi_agent import MultiAgent
-from comparison import AnswerComparator
-from phoenix_evaluator import log_evaluations_to_phoenix
-import phoenix as px
-
 
 # (Keep Constants as is)
 # --- Constants ---
@@ -139,32 +135,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
         print("Agent did not produce any answers to submit.")
         return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
 
-    # 3.5. Compare with Ground Truth and Log to Phoenix
-    print("Comparing answers with ground truth...")
-    try:
-        # Initialize comparator
-        comparator = AnswerComparator()
-
-        # Evaluate answers
-        evaluations_df = comparator.evaluate_batch(answers_payload)
-
-        # Get summary statistics
-        summary_stats = comparator.get_summary_stats(evaluations_df)
-
-        # Enhance results log with comparison data
-        results_log = comparator.enhance_results_log(results_log)
-
-        # Log evaluations to Phoenix
-        log_evaluations_to_phoenix(evaluations_df)
-
-        print(
-            f"Ground truth comparison completed: {summary_stats['exact_matches']}/{summary_stats['total_questions']} exact matches"
-        )
-
-    except Exception as e:
-        print(f"Error during ground truth comparison: {e}")
-        summary_stats = {"error": str(e)}
-
     # 4. Prepare Submission
     submission_data = {
         "username": username.strip(),
@@ -172,19 +142,6 @@ def run_and_submit_all(profile: gr.OAuthProfile | None, limit: int | None):
         "answers": answers_payload,
     }
     status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
-
-    # Add ground truth comparison to status
-    if "error" not in summary_stats:
-        status_update += f"\n\nGround Truth Comparison:\n"
-        status_update += f"Exact matches: {summary_stats['exact_matches']}/{summary_stats['total_questions']} ({summary_stats['exact_match_rate']:.1%})\n"
-        status_update += (
-            f"Average similarity: {summary_stats['average_similarity']:.3f}\n"
-        )
-        status_update += f"Contains correct answer: {summary_stats['contains_matches']}/{summary_stats['total_questions']} ({summary_stats['contains_match_rate']:.1%})\n"
-        status_update += f"Evaluations logged to Phoenix ✅"
-    else:
-        status_update += f"\n\nGround Truth Comparison Error: {summary_stats['error']}"
-
     print(status_update)
 
     # 5. Submit
comparison.py
DELETED
@@ -1,160 +0,0 @@

import json
import pandas as pd
from typing import Dict, List, Any
from difflib import SequenceMatcher
import re


class AnswerComparator:
    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        """Initialize the comparator with ground truth data."""
        self.ground_truth = self._load_ground_truth(metadata_path)
        print(f"Loaded ground truth for {len(self.ground_truth)} questions")

    def _load_ground_truth(self, metadata_path: str) -> Dict[str, str]:
        """Load ground truth answers from metadata.jsonl file."""
        ground_truth = {}
        try:
            with open(metadata_path, 'r', encoding='utf-8') as f:
                for line in f:
                    if line.strip():
                        data = json.loads(line)
                        task_id = data.get("task_id")
                        final_answer = data.get("Final answer")
                        if task_id and final_answer is not None:
                            ground_truth[task_id] = str(final_answer)
        except FileNotFoundError:
            print(f"Warning: Ground truth file {metadata_path} not found")
        except Exception as e:
            print(f"Error loading ground truth: {e}")

        return ground_truth

    def normalize_answer(self, answer: str) -> str:
        """Normalize answer for comparison."""
        if answer is None:
            return ""

        # Convert to string and strip whitespace
        answer = str(answer).strip()

        # Convert to lowercase for case-insensitive comparison
        answer = answer.lower()

        # Remove common punctuation that might not affect correctness
        answer = re.sub(r'[.,;:!?"\']', '', answer)

        # Normalize whitespace
        answer = re.sub(r'\s+', ' ', answer)

        return answer

    def exact_match(self, predicted: str, actual: str) -> bool:
        """Check if answers match exactly after normalization."""
        return self.normalize_answer(predicted) == self.normalize_answer(actual)

    def similarity_score(self, predicted: str, actual: str) -> float:
        """Calculate similarity score between predicted and actual answers."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        if not normalized_pred and not normalized_actual:
            return 1.0
        if not normalized_pred or not normalized_actual:
            return 0.0

        return SequenceMatcher(None, normalized_pred, normalized_actual).ratio()

    def contains_answer(self, predicted: str, actual: str) -> bool:
        """Check if the actual answer is contained in the predicted answer."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        return normalized_actual in normalized_pred

    def evaluate_answer(self, task_id: str, predicted_answer: str) -> Dict[str, Any]:
        """Evaluate a single answer against ground truth."""
        actual_answer = self.ground_truth.get(task_id)

        if actual_answer is None:
            return {
                "task_id": task_id,
                "predicted_answer": predicted_answer,
                "actual_answer": None,
                "exact_match": False,
                "similarity_score": 0.0,
                "contains_answer": False,
                "error": "No ground truth available"
            }

        return {
            "task_id": task_id,
            "predicted_answer": predicted_answer,
            "actual_answer": actual_answer,
            "exact_match": self.exact_match(predicted_answer, actual_answer),
            "similarity_score": self.similarity_score(predicted_answer, actual_answer),
            "contains_answer": self.contains_answer(predicted_answer, actual_answer),
            "error": None
        }

    def evaluate_batch(self, results: List[Dict[str, Any]]) -> pd.DataFrame:
        """Evaluate a batch of results."""
        evaluations = []

        for result in results:
            task_id = result.get("task_id") or result.get("Task ID")
            predicted_answer = result.get("submitted_answer") or result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)
                evaluations.append(evaluation)

        return pd.DataFrame(evaluations)

    def get_summary_stats(self, evaluations_df: pd.DataFrame) -> Dict[str, Any]:
        """Get summary statistics from evaluations."""
        if evaluations_df.empty:
            return {"error": "No evaluations available"}

        # Filter out entries without ground truth
        valid_evaluations = evaluations_df[evaluations_df['error'].isna()]

        if valid_evaluations.empty:
            return {"error": "No valid ground truth available"}

        total_questions = len(valid_evaluations)
        exact_matches = valid_evaluations['exact_match'].sum()
        avg_similarity = valid_evaluations['similarity_score'].mean()
        contains_matches = valid_evaluations['contains_answer'].sum()

        return {
            "total_questions": total_questions,
            "exact_matches": exact_matches,
            "exact_match_rate": exact_matches / total_questions,
            "average_similarity": avg_similarity,
            "contains_matches": contains_matches,
            "contains_match_rate": contains_matches / total_questions,
            "questions_with_ground_truth": total_questions
        }

    def enhance_results_log(self, results_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Add comparison columns to results log."""
        enhanced_results = []

        for result in results_log:
            task_id = result.get("Task ID")
            predicted_answer = result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)

                # Add comparison info to result
                enhanced_result = result.copy()
                enhanced_result["Ground Truth"] = evaluation["actual_answer"] or "N/A"
                enhanced_result["Exact Match"] = evaluation["exact_match"]
                enhanced_result["Similarity"] = f"{evaluation['similarity_score']:.3f}"
                enhanced_result["Contains Answer"] = evaluation["contains_answer"]

                enhanced_results.append(enhanced_result)

        return enhanced_results
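A short usage sketch of the removed AnswerComparator class as it was wired up before this commit, assuming `data/metadata.jsonl` is present; the sample task_id is taken from the examples in this repository.

```python
from comparison import AnswerComparator

# Loads ground truth from data/metadata.jsonl on construction
comparator = AnswerComparator()

# Batch evaluation over agent results keyed by task_id / submitted_answer
results = [{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"}]
evaluations_df = comparator.evaluate_batch(results)

# Aggregate metrics: exact_match_rate, average_similarity, contains_match_rate, ...
print(comparator.get_summary_stats(evaluations_df))
```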
debug_phoenix.py
DELETED
@@ -1,285 +0,0 @@

#!/usr/bin/env python3
"""
Enhanced debug script to check Phoenix status and evaluations.
"""

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import time
from datetime import datetime


def check_phoenix_connection():
    """Check if Phoenix is running and accessible."""
    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Try to get basic info
        try:
            spans_df = client.get_spans_dataframe()
            print(f"✅ Phoenix API working - can retrieve spans")
            return client
        except Exception as e:
            print(f"⚠️ Phoenix connected but API might have issues: {e}")
            return client

    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running. You should see a message like:")
        print("To view the Phoenix app in your browser, visit http://localhost:6006")
        return None


def check_spans(client):
    """Check spans in Phoenix."""
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) > 0:
            print("Recent spans:")
            for i, (_, span) in enumerate(spans_df.head(5).iterrows()):
                span_id = span.get('context.span_id', 'no-id')
                span_name = span.get('name', 'unnamed')
                start_time = span.get('start_time', 'unknown')
                print(f"  {i+1}. {span_name} ({span_id[:8]}...) - {start_time}")

            # Show input/output samples
            print("\nSpan content samples:")
            for i, (_, span) in enumerate(spans_df.head(3).iterrows()):
                input_val = str(span.get('input.value', ''))[:100]
                output_val = str(span.get('output.value', ''))[:100]
                print(f"  Span {i+1}:")
                print(f"    Input: {input_val}...")
                print(f"    Output: {output_val}...")

        else:
            print("⚠️ No spans found. Run your agent first to generate traces.")

        return spans_df

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return pd.DataFrame()


def check_evaluations(client):
    """Check evaluations in Phoenix."""
    try:
        # Try different methods to get evaluations
        print("Checking evaluations...")

        # Method 1: Direct evaluation dataframe
        try:
            evals_df = client.get_evaluations_dataframe()
            print(f"Found {len(evals_df)} evaluations in Phoenix")

            if len(evals_df) > 0:
                print("Evaluation breakdown:")
                eval_names = evals_df['name'].value_counts()
                for name, count in eval_names.items():
                    print(f"  - {name}: {count} evaluations")

                # Check for GAIA evaluations specifically
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
                if len(gaia_evals) > 0:
                    print(f"✅ Found {len(gaia_evals)} GAIA ground truth evaluations")

                    # Show sample evaluation
                    sample = gaia_evals.iloc[0]
                    print("Sample GAIA evaluation:")
                    print(f"  - Score: {sample.get('score', 'N/A')}")
                    print(f"  - Label: {sample.get('label', 'N/A')}")
                    print(f"  - Explanation: {sample.get('explanation', 'N/A')[:100]}...")

                    # Show metadata if available
                    metadata = sample.get('metadata', {})
                    if metadata:
                        print(f"  - Metadata keys: {list(metadata.keys())}")

                else:
                    print("❌ No GAIA ground truth evaluations found")
                    print("Available evaluation types:", list(eval_names.keys()))

            else:
                print("⚠️ No evaluations found in Phoenix")

            return evals_df

        except AttributeError as e:
            print(f"⚠️ get_evaluations_dataframe not available: {e}")
            print("This might be a Phoenix version issue")
            return pd.DataFrame()

    except Exception as e:
        print(f"❌ Error getting evaluations: {e}")
        return pd.DataFrame()


def test_evaluation_creation_and_logging():
    """Test creating and logging evaluations."""
    print("\nTesting evaluation creation and logging...")

    # Create sample evaluations
    sample_data = [
        {
            "task_id": "debug-test-1",
            "predicted_answer": "test answer 1",
            "actual_answer": "correct answer 1",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "debug-test-2",
            "predicted_answer": "exact match",
            "actual_answer": "exact match",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(sample_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    # Try to log to Phoenix
    try:
        print("Attempting to log evaluations to Phoenix...")
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Test evaluation logging successful")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Test evaluation logging failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Test evaluation logging error: {e}")
        import traceback
        traceback.print_exc()
        return False


def check_gaia_data():
    """Check GAIA ground truth data availability."""
    print("\nChecking GAIA ground truth data...")

    try:
        comparator = AnswerComparator()

        print(f"✅ Loaded {len(comparator.ground_truth)} GAIA ground truth answers")

        if len(comparator.ground_truth) > 0:
            # Show sample
            sample_task_id = list(comparator.ground_truth.keys())[0]
            sample_answer = comparator.ground_truth[sample_task_id]
            print(f"Sample: {sample_task_id} -> '{sample_answer}'")

            # Test evaluation
            test_eval = comparator.evaluate_answer(sample_task_id, "test answer")
            print(f"Test evaluation result: {test_eval}")

            return True
        else:
            print("❌ No GAIA ground truth data found")
            return False

    except Exception as e:
        print(f"❌ Error checking GAIA data: {e}")
        return False


def show_phoenix_ui_info():
    """Show information about Phoenix UI."""
    print("\nPhoenix UI Information:")
    print("-" * 30)
    print("Phoenix UI should be available at: http://localhost:6006")
    print("")
    print("In the Phoenix UI, look for:")
    print("  • 'Evaluations' tab or section")
    print("  • 'Evals' section")
    print("  • 'Annotations' tab")
    print("  • In 'Spans' view, look for evaluation badges on spans")
    print("")
    print("If you see evaluations, they should be named 'gaia_ground_truth'")
    print("Each evaluation should show:")
    print("  - Score (similarity score 0-1)")
    print("  - Label (correct/incorrect)")
    print("  - Explanation (predicted vs ground truth)")
    print("  - Metadata (task_id, exact_match, etc.)")


def main():
    """Main debug function."""
    print("Enhanced Phoenix Debug Script")
    print("=" * 50)

    # Check Phoenix connection
    client = check_phoenix_connection()
    if not client:
        print("\n❌ Cannot proceed without Phoenix connection")
        print("Make sure your agent app is running (it starts Phoenix)")
        return

    print("\nChecking Phoenix Data:")
    print("-" * 30)

    # Check spans
    spans_df = check_spans(client)

    # Check evaluations
    evals_df = check_evaluations(client)

    # Test evaluation creation
    test_success = test_evaluation_creation_and_logging()

    # Wait a moment and recheck evaluations
    if test_success:
        print("\nWaiting for evaluations to be processed...")
        time.sleep(3)

        print("Rechecking evaluations after test logging...")
        evals_df_after = check_evaluations(client)

        if len(evals_df_after) > len(evals_df):
            print("✅ New evaluations detected after test!")
        else:
            print("⚠️ No new evaluations detected")

    # Check GAIA data
    gaia_available = check_gaia_data()

    # Show Phoenix UI info
    show_phoenix_ui_info()

    # Final summary
    print("\n" + "=" * 50)
    print("Summary:")
    print(f"  • Phoenix connected: {'✅' if client else '❌'}")
    print(f"  • Spans available: {len(spans_df)} spans")
    print(f"  • Evaluations found: {len(evals_df)} evaluations")
    print(f"  • GAIA data available: {'✅' if gaia_available else '❌'}")
    print(f"  • Test logging worked: {'✅' if test_success else '❌'}")

    print("\nNext Steps:")
    if len(spans_df) == 0:
        print("  • Run your agent to generate traces first")
    if len(evals_df) == 0:
        print("  • Check if evaluations are being logged correctly")
        print("  • Verify Phoenix version compatibility")
    if not gaia_available:
        print("  • Check that data/metadata.jsonl exists and is readable")

    print(f"\nPhoenix UI: http://localhost:6006")


if __name__ == "__main__":
    main()
debug_spans.py
DELETED
@@ -1,77 +0,0 @@

#!/usr/bin/env python3
"""
Debug script to see Phoenix spans column structure.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd


def debug_spans_structure():
    """Debug the structure of Phoenix spans."""
    print("Debugging Phoenix Spans Structure")
    print("=" * 50)

    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return

    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) == 0:
            print("⚠️ No spans found. Run your agent first to create spans.")
            return

        print(f"\nAvailable Columns ({len(spans_df.columns)} total):")
        for i, col in enumerate(spans_df.columns):
            print(f"  {i+1:2d}. {col}")

        print(f"\nSample Data (first span):")
        sample_span = spans_df.iloc[0]
        for col in spans_df.columns:
            value = sample_span.get(col)
            if value is not None:
                value_str = str(value)[:100] + "..." if len(str(value)) > 100 else str(value)
                print(f"  {col}: {value_str}")

        # Look for input/output related columns
        input_cols = [col for col in spans_df.columns if 'input' in col.lower()]
        output_cols = [col for col in spans_df.columns if 'output' in col.lower()]

        print(f"\nInput-related columns: {input_cols}")
        print(f"Output-related columns: {output_cols}")

        # Look for span ID columns
        id_cols = [col for col in spans_df.columns if 'id' in col.lower()]
        print(f"ID-related columns: {id_cols}")

        # Look for columns that might contain task IDs
        print(f"\nSearching for task IDs in spans...")
        task_id_sample = "8e867cd7-cff9-4e6c-867a-ff5ddc2550be"

        for col in spans_df.columns:
            if spans_df[col].dtype == 'object':  # String-like columns
                try:
                    matches = spans_df[spans_df[col].astype(str).str.contains(task_id_sample, na=False, case=False)]
                    if len(matches) > 0:
                        print(f"  ✅ Found task ID in column '{col}': {len(matches)} matches")
                except:
                    pass

    except Exception as e:
        print(f"❌ Error debugging spans: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    debug_spans_structure()
phoenix_evaluator.py
DELETED
@@ -1,273 +0,0 @@

import pandas as pd
from typing import Dict, Any, List, Optional
from comparison import AnswerComparator
import phoenix as px
from phoenix.trace import SpanEvaluations


class GAIAPhoenixEvaluator:
    """Phoenix evaluator for GAIA dataset ground truth comparison."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        self.comparator = AnswerComparator(metadata_path)
        self.eval_name = "gaia_ground_truth"

    def evaluate_spans(self, spans_df: pd.DataFrame) -> List[SpanEvaluations]:
        """Evaluate spans and return Phoenix SpanEvaluations."""
        evaluations = []

        for _, span in spans_df.iterrows():
            # Extract task_id and answer from span
            task_id = self._extract_task_id(span)
            predicted_answer = self._extract_predicted_answer(span)
            span_id = span.get("context.span_id")

            if task_id and predicted_answer is not None and span_id:
                evaluation = self.comparator.evaluate_answer(task_id, predicted_answer)

                # Create evaluation record for Phoenix
                eval_record = {
                    "span_id": span_id,
                    "score": 1.0 if evaluation["exact_match"] else evaluation["similarity_score"],
                    "label": "correct" if evaluation["exact_match"] else "incorrect",
                    "explanation": self._create_explanation(evaluation),
                    "task_id": task_id,
                    "predicted_answer": evaluation["predicted_answer"],
                    "ground_truth": evaluation["actual_answer"],
                    "exact_match": evaluation["exact_match"],
                    "similarity_score": evaluation["similarity_score"],
                    "contains_answer": evaluation["contains_answer"]
                }

                evaluations.append(eval_record)

        if evaluations:
            # Create SpanEvaluations object
            eval_df = pd.DataFrame(evaluations)
            return [SpanEvaluations(eval_name=self.eval_name, dataframe=eval_df)]

        return []

    def _extract_task_id(self, span) -> Optional[str]:
        """Extract task_id from span data."""
        # Try span attributes first
        attributes = span.get("attributes", {})
        if isinstance(attributes, dict):
            if "task_id" in attributes:
                return attributes["task_id"]

        # Try input data
        input_data = span.get("input", {})
        if isinstance(input_data, dict):
            if "task_id" in input_data:
                return input_data["task_id"]

        # Try to extract from input value if it's a string
        input_value = span.get("input.value", "")
        if isinstance(input_value, str):
            # Look for UUID pattern in input
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, input_value)
            if match:
                return match.group(0)

        # Try span name
        span_name = span.get("name", "")
        if isinstance(span_name, str):
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, span_name)
            if match:
                return match.group(0)

        return None

    def _extract_predicted_answer(self, span) -> Optional[str]:
        """Extract predicted answer from span output."""
        # Try different output fields
        output_fields = ["output.value", "output", "response", "result"]

        for field in output_fields:
            value = span.get(field)
            if value is not None:
                return str(value)

        return None

    def _create_explanation(self, evaluation: Dict[str, Any]) -> str:
        """Create human-readable explanation of the evaluation."""
        predicted = evaluation["predicted_answer"]
        actual = evaluation["actual_answer"]
        exact_match = evaluation["exact_match"]
        similarity = evaluation["similarity_score"]
        contains = evaluation["contains_answer"]

        if actual is None:
            return "❌ No ground truth available for comparison"

        explanation = f"Predicted: '{predicted}' | Ground Truth: '{actual}' | "

        if exact_match:
            explanation += "✅ Exact match"
        elif contains:
            explanation += f"⚠️ Contains correct answer (similarity: {similarity:.3f})"
        else:
            explanation += f"❌ Incorrect (similarity: {similarity:.3f})"

        return explanation


def add_gaia_evaluations_to_phoenix(spans_df: pd.DataFrame, metadata_path: str = "data/metadata.jsonl") -> List[SpanEvaluations]:
    """Add GAIA evaluation results to Phoenix spans."""
    evaluator = GAIAPhoenixEvaluator(metadata_path)
    return evaluator.evaluate_spans(spans_df)


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame, session_id: Optional[str] = None) -> Optional[pd.DataFrame]:
    """Log evaluation results directly to Phoenix."""
    try:
        client = px.Client()

        # Get current spans to match evaluations to span_ids
        spans_df = client.get_spans_dataframe()

        if spans_df is None or spans_df.empty:
            print("No spans found to attach evaluations to")
            return None

        # Debug: Show available columns
        print(f"Available span columns: {list(spans_df.columns)}")

        # Get possible input/output column names
        input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
        output_columns = [col for col in spans_df.columns if 'output' in col.lower()]
        name_columns = [col for col in spans_df.columns if 'name' in col.lower()]

        print(f"Input columns found: {input_columns}")
        print(f"Output columns found: {output_columns}")
        print(f"Name columns found: {name_columns}")

        # Create evaluation records for Phoenix
        evaluation_records = []
        spans_with_evals = []

        for _, eval_row in evaluations_df.iterrows():
            task_id = eval_row["task_id"]
            matching_spans = pd.DataFrame()

            # Try different strategies to find matching spans

            # Strategy 1: Search in all string columns for task_id
            for col in spans_df.columns:
                if spans_df[col].dtype == 'object':  # String-like columns
                    try:
                        matches = spans_df[
                            spans_df[col].astype(str).str.contains(task_id, na=False, case=False)
                        ]
                        if len(matches) > 0:
                            matching_spans = matches
                            print(f"✅ Found match for {task_id} in column '{col}'")
                            break
                    except Exception as e:
                        continue

            # Strategy 2: If no matches found, try searching in input columns specifically
            if len(matching_spans) == 0 and input_columns:
                for input_col in input_columns:
                    try:
                        matches = spans_df[
                            spans_df[input_col].astype(str).str.contains(task_id, na=False, case=False)
                        ]
                        if len(matches) > 0:
                            matching_spans = matches
                            print(f"✅ Found match for {task_id} in input column '{input_col}'")
                            break
                    except Exception as e:
                        continue

            # Strategy 3: If still no matches, try with partial task_id (last 8 characters)
            if len(matching_spans) == 0:
                short_task_id = task_id[-8:] if len(task_id) > 8 else task_id
                for col in spans_df.columns:
                    if spans_df[col].dtype == 'object':
                        try:
                            matches = spans_df[
                                spans_df[col].astype(str).str.contains(short_task_id, na=False, case=False)
                            ]
                            if len(matches) > 0:
                                matching_spans = matches
                                print(f"✅ Found match for {task_id} using short ID in column '{col}'")
                                break
                        except Exception as e:
                            continue

            if len(matching_spans) > 0:
                span_id = matching_spans.iloc[0].get('context.span_id') or matching_spans.iloc[0].get('span_id')

                if span_id:
                    # Create evaluation record in Phoenix format
                    evaluation_record = {
                        "span_id": span_id,
                        "name": "gaia_ground_truth",
                        "score": eval_row["similarity_score"],
                        "label": "correct" if bool(eval_row["exact_match"]) else "incorrect",
                        "explanation": f"Predicted: '{eval_row['predicted_answer']}' | Ground Truth: '{eval_row['actual_answer']}' | Similarity: {eval_row['similarity_score']:.3f} | Exact Match: {eval_row['exact_match']}",
                        "annotator_kind": "HUMAN",
                        "metadata": {
                            "task_id": task_id,
                            "exact_match": bool(eval_row["exact_match"]),
                            "similarity_score": float(eval_row["similarity_score"]),
                            "contains_answer": bool(eval_row["contains_answer"]),
                            "predicted_answer": str(eval_row["predicted_answer"]),
                            "ground_truth": str(eval_row["actual_answer"])
                        }
                    }

                    evaluation_records.append(evaluation_record)
                    spans_with_evals.append(span_id)
                else:
                    print(f"⚠️ No span_id found for matching span with task {task_id}")
            else:
                print(f"⚠️ No matching span found for task {task_id}")

        if evaluation_records:
            # Convert to DataFrame for Phoenix
            eval_df = pd.DataFrame(evaluation_records)

            # Create SpanEvaluations object
            span_evaluations = SpanEvaluations(
                eval_name="gaia_ground_truth",
                dataframe=eval_df
            )

            # Log evaluations to Phoenix
            try:
                # Try the newer Phoenix API
                px.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using px.log_evaluations")
            except AttributeError:
                try:
                    # Fallback for older Phoenix versions
                    client.log_evaluations(span_evaluations)
                    print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix using client.log_evaluations")
                except Exception as e:
                    print(f"⚠️ Could not log evaluations using either method: {e}")
                    # Still return the DataFrame so we know what would have been logged
                    print("Evaluation records created but not logged to Phoenix")

            return eval_df
        else:
            print("⚠️ No matching spans found for any evaluations")
            if spans_df is not None:
                print(f"Available spans: {len(spans_df)}")
                if len(spans_df) > 0:
                    available_cols = [col for col in spans_df.columns if spans_df[col].dtype == 'object'][:5]
                    print(f"Sample searchable columns: {available_cols}")
            return None

    except Exception as e:
        print(f"❌ Could not log evaluations to Phoenix: {e}")
        import traceback
        traceback.print_exc()
        return None
test_comparison.py
DELETED
@@ -1,144 +0,0 @@

#!/usr/bin/env python3
"""
Test script for GAIA comparison functionality.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import pandas as pd


def test_basic_comparison():
    """Test basic comparison functionality."""
    print("Testing basic comparison...")

    # Initialize comparator
    comparator = AnswerComparator()

    # Test with some sample data
    sample_results = [
        {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
        {"task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "submitted_answer": "3"},
        {"task_id": "nonexistent-task", "submitted_answer": "test"}
    ]

    # Evaluate batch
    evaluations_df = comparator.evaluate_batch(sample_results)
    print(f"Evaluated {len(evaluations_df)} answers")

    # Get summary stats
    summary_stats = comparator.get_summary_stats(evaluations_df)
    print("Summary statistics:")
    for key, value in summary_stats.items():
        print(f"  {key}: {value}")

    # Test single evaluation
    print("\nTesting single evaluation...")
    single_eval = comparator.evaluate_answer("8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "3")
    print(f"Single evaluation result: {single_eval}")

    return evaluations_df


def test_results_enhancement():
    """Test results log enhancement."""
    print("\nTesting results log enhancement...")

    comparator = AnswerComparator()

    # Sample results log (like what comes from your agent)
    sample_results_log = [
        {
            "Task ID": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009?",
            "Submitted Answer": "3"
        },
        {
            "Task ID": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "Question": "Test question",
            "Submitted Answer": "wrong answer"
        }
    ]

    # Enhance results
    enhanced_results = comparator.enhance_results_log(sample_results_log)

    print("Enhanced results:")
    for result in enhanced_results:
        print(f"  Task: {result['Task ID']}")
        print(f"  Answer: {result['Submitted Answer']}")
        print(f"  Ground Truth: {result['Ground Truth']}")
        print(f"  Exact Match: {result['Exact Match']}")
        print(f"  Similarity: {result['Similarity']}")
        print()


def test_phoenix_integration():
    """Test Phoenix integration (basic)."""
    print("\nTesting Phoenix integration...")

    # Create sample evaluations
    sample_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "wrong",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(sample_evaluations)
        if result is not None:
            print("✅ Phoenix integration successful")
        else:
            print("⚠️ Phoenix integration failed (likely Phoenix not running)")
    except Exception as e:
        print(f"⚠️ Phoenix integration error: {e}")


def main():
    """Run all tests."""
    print("="*50)
    print("GAIA Comparison Test Suite")
    print("="*50)

    try:
        # Test basic comparison
        evaluations_df = test_basic_comparison()

        # Test results enhancement
        test_results_enhancement()

        # Test Phoenix integration
        test_phoenix_integration()

        print("\n" + "="*50)
        print("All tests completed!")
        print("="*50)

    except Exception as e:
        print(f"❌ Test failed with error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
test_phoenix_logging.py
DELETED
@@ -1,261 +0,0 @@

#!/usr/bin/env python3
"""
Test script to verify Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
from datetime import datetime
import time


def test_phoenix_connection():
    """Test Phoenix connection and basic functionality."""
    print("Testing Phoenix Connection...")

    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Check if Phoenix is actually running
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans in Phoenix")

        return client, spans_df
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running and accessible at http://localhost:6006")
        return None, None


def create_test_evaluations():
    """Create test evaluations for logging."""
    print("\n🧪 Creating test evaluations...")

    test_data = [
        {
            "task_id": "test-exact-match",
            "predicted_answer": "Paris",
            "actual_answer": "Paris",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-partial-match",
            "predicted_answer": "The capital of France is Paris",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-no-match",
            "predicted_answer": "London",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(test_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    return evaluations_df


def create_mock_spans(client):
    """Create mock spans for testing (if no real spans exist)."""
    print("\nCreating mock spans for testing...")

    # Note: This is a simplified mock - in real usage, spans are created by agent runs
    mock_spans = [
        {
            "context.span_id": "mock-span-1",
            "name": "test_agent_run",
            "input.value": "Question about test-exact-match",
            "output.value": "Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-2",
            "name": "test_agent_run",
            "input.value": "Question about test-partial-match",
            "output.value": "The capital of France is Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-3",
            "name": "test_agent_run",
            "input.value": "Question about test-no-match",
            "output.value": "London",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        }
    ]

    print(f"Created {len(mock_spans)} mock spans")
    return pd.DataFrame(mock_spans)


def test_evaluation_logging():
    """Test the actual evaluation logging to Phoenix."""
    print("\nTesting evaluation logging...")

    # Create test evaluations
    evaluations_df = create_test_evaluations()

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Evaluation logging test successful!")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Evaluation logging test failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Evaluation logging test failed with error: {e}")
        import traceback
        traceback.print_exc()
        return False


def verify_logged_evaluations(client):
    """Verify that evaluations were actually logged to Phoenix."""
    print("\nVerifying logged evaluations...")

    try:
        # Give Phoenix a moment to process
        time.sleep(2)

        # Try to retrieve evaluations
        evals_df = client.get_evaluations_dataframe()
        print(f"Found {len(evals_df)} total evaluations in Phoenix")

        # Look for our specific evaluations
        if len(evals_df) > 0:
            gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
            print(f"🎯 Found {len(gaia_evals)} GAIA ground truth evaluations")

            if len(gaia_evals) > 0:
                print("✅ Successfully verified evaluations in Phoenix!")

                # Show sample evaluation
                sample_eval = gaia_evals.iloc[0]
                print(f"Sample evaluation:")
                print(f" - Score: {sample_eval.get('score', 'N/A')}")
                print(f" - Label: {sample_eval.get('label', 'N/A')}")
                print(f" - Explanation: {sample_eval.get('explanation', 'N/A')}")

                return True
            else:
                print("❌ No GAIA evaluations found after logging")
                return False
        else:
            print("❌ No evaluations found in Phoenix")
            return False

    except Exception as e:
        print(f"❌ Error verifying evaluations: {e}")
        return False


def test_with_real_gaia_data():
    """Test with actual GAIA data if available."""
    print("\nTesting with real GAIA data...")

    try:
        # Initialize comparator
        comparator = AnswerComparator()

        if len(comparator.ground_truth) == 0:
            print("⚠️ No GAIA ground truth data available")
            return False

        # Create a real evaluation with GAIA data
        real_task_id = list(comparator.ground_truth.keys())[0]
        real_ground_truth = comparator.ground_truth[real_task_id]

        real_evaluation = comparator.evaluate_answer(real_task_id, "test answer")

        real_eval_df = pd.DataFrame([real_evaluation])

        # Log to Phoenix
        result = log_evaluations_to_phoenix(real_eval_df)

        if result is not None:
            print("✅ Real GAIA data logging successful!")
            print(f"Task ID: {real_task_id}")
            print(f"Ground Truth: {real_ground_truth}")
            print(f"Similarity Score: {real_evaluation['similarity_score']:.3f}")
            return True
        else:
            print("❌ Real GAIA data logging failed")
            return False

    except Exception as e:
        print(f"❌ Error testing with real GAIA data: {e}")
        return False


def main():
    """Main test function."""
    print("Phoenix Evaluations Logging Test")
    print("=" * 50)

    # Test Phoenix connection
    client, spans_df = test_phoenix_connection()
    if not client:
        print("❌ Cannot proceed without Phoenix connection")
        return

    # Run tests
    tests_passed = 0
    total_tests = 3

    print(f"\n🧪 Running {total_tests} tests...")

    # Test 1: Basic evaluation logging
    if test_evaluation_logging():
        tests_passed += 1

    # Test 2: Verify evaluations were logged
    if verify_logged_evaluations(client):
        tests_passed += 1

    # Test 3: Test with real GAIA data
    if test_with_real_gaia_data():
        tests_passed += 1

    # Summary
    print("\n" + "=" * 50)
    print(f"🎯 Test Results: {tests_passed}/{total_tests} tests passed")

    if tests_passed == total_tests:
        print("All tests passed! Phoenix evaluations logging is working correctly.")
        print("You should now see 'gaia_ground_truth' evaluations in the Phoenix UI.")
    else:
        print("⚠️ Some tests failed. Check the output above for details.")

    print(f"\nPhoenix UI: http://localhost:6006")
    print("Look for 'Evaluations' or 'Evals' tab to see the logged evaluations.")


if __name__ == "__main__":
    main()
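The comparator exercised by `test_with_real_gaia_data` above lives in `comparison.py`, which is also deleted in this commit but does not appear in this hunk. For orientation only, a minimal sketch of the interface the tests rely on (a `ground_truth` map keyed by `task_id` and an `evaluate_answer` method returning `exact_match`, `similarity_score`, and `contains_answer`) could look like the code below; the metadata path, the `"Final answer"` field name, and the normalization rules are assumptions, not the deleted file's actual code.

```python
import json
import string
from difflib import SequenceMatcher


class AnswerComparator:
    """Hypothetical stand-in for the deleted comparison.py, shaped to satisfy the tests."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):  # path assumed
        self.ground_truth: dict[str, str] = {}
        try:
            with open(metadata_path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    # "Final answer" is the GAIA metadata field name (assumed here).
                    self.ground_truth[record["task_id"]] = str(record["Final answer"])
        except FileNotFoundError:
            pass  # the tests tolerate an empty ground-truth map

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase, drop punctuation, collapse whitespace before comparing.
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())

    def evaluate_answer(self, task_id: str, predicted: str) -> dict:
        actual = self.ground_truth.get(task_id)
        if actual is None:
            return {"task_id": task_id, "predicted_answer": predicted,
                    "actual_answer": None, "exact_match": False,
                    "similarity_score": 0.0, "contains_answer": False,
                    "error": "no ground truth for task_id"}
        p, a = self._normalize(predicted), self._normalize(actual)
        return {
            "task_id": task_id,
            "predicted_answer": predicted,
            "actual_answer": actual,
            "exact_match": p == a,
            "similarity_score": SequenceMatcher(None, p, a).ratio(),
            "contains_answer": a in p,
            "error": None,
        }
```

Any implementation with this shape would satisfy the checks the deleted tests perform on `ground_truth`, `evaluate_answer`, and the returned fields.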
test_phoenix_simple.py
DELETED
@@ -1,139 +0,0 @@

#!/usr/bin/env python3
"""
Simple test for Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix


def test_phoenix_logging():
    """Test Phoenix evaluations logging with simple data."""
    print("🧪 Testing Phoenix Evaluations Logging")
    print("=" * 50)

    # Step 1: Check Phoenix connection
    print("1. Checking Phoenix connection...")
    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return False

    # Step 2: Create test evaluations
    print("\n2. Creating test evaluations...")
    test_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "5",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])
    print(f"✅ Created {len(test_evaluations)} test evaluations")

    # Step 3: Check existing spans
    print("\n3. Checking existing spans...")
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans")

        if len(spans_df) == 0:
            print("⚠️ No spans found - you need to run your agent first to create spans")
            return False

        # Debug: Show available columns
        print(f"Available span columns: {list(spans_df.columns)}")
        input_columns = [col for col in spans_df.columns if 'input' in col.lower()]
        print(f"Input columns found: {input_columns}")

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return False

    # Step 4: Test logging
    print("\n4. Testing evaluation logging...")
    try:
        result = log_evaluations_to_phoenix(test_evaluations)

        if result is not None:
            print(f"✅ Successfully logged {len(result)} evaluations to Phoenix")
            if len(result) > 0:
                print("Sample evaluation:")
                print(f" - Score: {result.iloc[0]['score']}")
                print(f" - Label: {result.iloc[0]['label']}")
                print(f" - Explanation: {result.iloc[0]['explanation'][:100]}...")

            # Step 5: Verify evaluations were logged
            print("\n5. Verifying evaluations in Phoenix...")
            try:
                import time
                time.sleep(2)  # Give Phoenix time to process

                evals_df = client.get_evaluations_dataframe()
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']

                print(f"Found {len(gaia_evals)} GAIA evaluations in Phoenix")

                if len(gaia_evals) > 0:
                    print("✅ Evaluations successfully verified in Phoenix!")
                    return True
                else:
                    print("⚠️ No GAIA evaluations found in Phoenix")
                    return False

            except Exception as e:
                print(f"⚠️ Could not verify evaluations: {e}")
                print("✅ Logging appeared successful though")
                return True

        else:
            print("❌ Evaluation logging failed")
            return False

    except Exception as e:
        print(f"❌ Error during logging: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    """Main test function."""
    success = test_phoenix_logging()

    print("\n" + "=" * 50)
    if success:
        print("Phoenix evaluations logging test PASSED!")
        print("You should now see 'gaia_ground_truth' evaluations in Phoenix UI")
        print("Visit: http://localhost:6006")
    else:
        print("❌ Phoenix evaluations logging test FAILED!")
        print("Make sure:")
        print(" 1. Your agent app is running (it starts Phoenix)")
        print(" 2. You've run your agent at least once to create spans")
        print(" 3. Phoenix is accessible at http://localhost:6006")
        print(" 4. Run 'python debug_spans.py' to see span column structure")


if __name__ == "__main__":
    main()
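Both deleted test files call `log_evaluations_to_phoenix` from `phoenix_evaluator.py`, which is removed in this commit but does not appear in this hunk. As a rough sketch of how such a helper can attach these results to existing spans via Phoenix's `SpanEvaluations` API, it might look like the code below; only the `gaia_ground_truth` eval name and the `score`/`label`/`explanation` columns come from the tests above, while the span-matching heuristic and column lookup are assumptions rather than the deleted file's actual implementation.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame):
    """Attach GAIA ground-truth results to existing agent spans (assumed approach)."""
    client = px.Client()
    spans_df = client.get_spans_dataframe()
    if spans_df is None or len(spans_df) == 0:
        return None  # no agent runs recorded yet

    # Assumption: the agent span's input contains the GAIA task_id, so match on it.
    input_cols = [c for c in spans_df.columns if "input.value" in c]
    if not input_cols:
        return None
    input_col = input_cols[0]

    records = []
    for _, row in evaluations_df.iterrows():
        hits = spans_df[spans_df[input_col].astype(str).str.contains(
            str(row["task_id"]), na=False, regex=False)]
        if hits.empty:
            continue
        records.append({
            "span_id": hits.index[0],  # get_spans_dataframe() is indexed by span id
            "score": float(row["similarity_score"]),
            "label": "correct" if row["exact_match"] else "incorrect",
            "explanation": f"ground truth: {row['actual_answer']}",
        })

    if not records:
        return None

    evals = pd.DataFrame(records).set_index("span_id")
    client.log_evaluations(SpanEvaluations(eval_name="gaia_ground_truth", dataframe=evals))
    return evals.reset_index()
```

With spans recorded by an agent run, the returned frame matches what `test_phoenix_simple.py` inspects (`score`, `label`, `explanation`), and the logged eval shows up under the `gaia_ground_truth` name that both tests look for in the Phoenix UI.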