
Distance-to-Goal Metrics: Complete Reference Table

Priority-Ordered Metric Comparison

| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Value | Potential Risks |
|---|---|---|---|---|---|---|---|
| 1 | Cardinality Matching | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| 2 | Value Overlap (Set-based) | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| 3 | Numeric Range Proximity | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude accuracy | Only useful for numeric questions (~40% of dataset) |
| 4 | Row-wise Best Match | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving of column mismatch | Computationally expensive for large result sets (100+ rows) |
| 5 | Schema Coverage | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| 6 | Column Alignment | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| 7 | Rank Correlation | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| 8 | SQL Structure | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: reward hacking; overfitting; ignores semantic equivalence |
| 9 | Execution Plan | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: too complex; DB-specific; doesn't guarantee correctness |

Detailed Conceptual Explanations

🥇 Tier 1: Must-Have Metrics

1. Cardinality Matching

How It Works:

Input:  agent_result = [row1, row2, row3, row4, row5]  (5 rows)
        gold_result  = [row1, row2, row3]              (3 rows)

Calculation:
  difference = |5 - 3| = 2
  relative_error = 2 / 3 = 0.667
  reward = 1 - min(1.0, 0.667) = 0.333

Conceptual Meaning: "Getting the right number of things is the first step to getting the right things."

Value Provided:

  • Early signal: Before content is right, size can be right
  • Universal: Works for integers (1 value), lists (N values), tables (M rows)
  • Catches common errors:
    • Missing GROUP BY → too few rows
    • Cartesian join → too many rows
    • Wrong aggregation → count off by orders of magnitude

Risk Analysis: ✅ Very low risk

  • No reward hacking opportunities (agent can't game cardinality without semantic progress)
  • Monotonic with correctness (better results = better cardinality)
  • Fast to compute (O(1))
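
A minimal sketch of this metric, assuming results arrive as lists of row tuples (the function name cardinality_reward matches the combination examples later in this document):

def cardinality_reward(agent_result, gold_result):
    """Score how close the agent's row count is to the gold row count."""
    agent_rows = len(agent_result)
    gold_rows = len(gold_result)
    if gold_rows == 0:
        # Empty gold result: full credit only if the agent also returned nothing
        return 1.0 if agent_rows == 0 else 0.0
    relative_error = abs(agent_rows - gold_rows) / gold_rows
    return 1.0 - min(1.0, relative_error)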

2. Value Overlap (Set-based)

How It Works:

Question: "Which departments have >50 employees?"
Gold:  {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal}  (2 correct, 2 wrong)

Calculation:
  intersection = {Engineering, Sales} = 2 values
  union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
  Jaccard = 2/5 = 0.4

Conceptual Meaning: "You found some of the right answers; find the rest."

Value Provided:

  • Partial credit: Agent gets 0.4 reward even though final answer isn't perfect
  • Format agnostic: Flattens everything to atomic values
    • [("Engineering", 42)] โ†’ {Engineering, 42}
    • [(42, "Engineering")] โ†’ {Engineering, 42} (same!)
  • Compositional: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0

Risk Analysis: ⚠️ Low-medium risk

  • Coincidental matches: If gold answer is {42} and agent returns entire employees table with 100 rows, might contain 42 by chance
  • Mitigation: Combine with cardinality (penalizes returning too many values)
  • False precision: {42} vs {42.0} vs {42.001}: depending on how values are normalized to strings, numerically equivalent values may fail to match, or nearly-equal values may be conflated
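
A minimal sketch, assuming rows are tuples/lists and that plain str() normalization is acceptable (see the false-precision caveat above):

def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity between the sets of atomic values in the two results."""
    def flatten(result):
        # Stringify every cell so the comparison is hashable and format-agnostic
        return {str(cell) for row in result for cell in row}

    agent_values = flatten(agent_result)
    gold_values = flatten(gold_result)
    if not agent_values and not gold_values:
        return 1.0  # Both empty counts as a match
    if not agent_values or not gold_values:
        return 0.0
    return len(agent_values & gold_values) / len(agent_values | gold_values)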

3. Numeric Range Proximity

How It Works:

Question: "Average salary in Engineering?"
Gold: 95000

Agent attempt 1: 87000 (off by 8.4%)
  relative_error = 8000 / 95000 = 0.084
  reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965

Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
  relative_error = 85500 / 95000 = 0.9
  reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721

Agent attempt 3: 950000 (10x too high)
  relative_error = 855000 / 95000 = 9.0
  reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0

Conceptual Meaning: "Being off by 10% is very different from being off by 10x."

Value Provided:

  • Order-of-magnitude thinking: Rewards agent for "ballpark correct" before exact
  • Logarithmic scale:
    • 95k → 100k (5% error) = high reward (0.98)
    • 95k → 190k (100% error) = medium reward (0.70)
    • 95k → 950k (900% error) = no reward (0.0)
  • Natural for SQL: COUNT, SUM, AVG queries often close but not exact on first try

Risk Analysis: ⚠️ Medium risk

  • Only useful for ~40% of questions: Text/categorical answers get no benefit
  • Multiple numbers: If result has [42, 100, 5] and gold is [42], which number to compare?
    • Solution: Use closest match for each gold number
  • Zero-handling: gold=0, agent=1 is infinitely far; needs special case
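
A minimal sketch that uses the closest-match strategy and a zero-handling special case as described above (both choices are assumptions about how you want to resolve those edge cases):

import math

def numeric_range_reward(agent_result, gold_result):
    """Log-scaled proximity between numeric values in the agent and gold results."""
    def numbers(result):
        return [cell for row in result for cell in row
                if isinstance(cell, (int, float)) and not isinstance(cell, bool)]

    gold_nums = numbers(gold_result)
    agent_nums = numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0  # Not applicable; give this metric zero weight upstream

    per_value = []
    for g in gold_nums:
        a = min(agent_nums, key=lambda x: abs(x - g))  # Closest agent number
        if g == 0:
            relative_error = abs(a)  # Zero gold value: fall back to absolute error
        else:
            relative_error = abs(a - g) / abs(g)
        per_value.append(max(0.0, 1.0 - math.log10(1.0 + relative_error)))
    return sum(per_value) / len(per_value)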

4. Row-wise Best Match

How It Works:

Question: "Top 3 departments by size"
Gold:   [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent:  [(Marketing, 52), (Engineering, 65), (Sales, 58)]  # Wrong order!

Process:
  For each gold row, find best matching agent row:
    Gold row 1: (Engineering, 65)
      vs Agent row 1: (Marketing, 52)     โ†’ 0/2 match = 0.0
      vs Agent row 2: (Engineering, 65)   โ†’ 2/2 match = 1.0 โœ“
      vs Agent row 3: (Sales, 58)         โ†’ 0/2 match = 0.0
      Best match: 1.0
    
    Gold row 2: (Sales, 58)
      Best match: 1.0 (agent row 3)
    
    Gold row 3: (Marketing, 52)
      Best match: 1.0 (agent row 1)
  
  Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0

Conceptual Meaning: "You got the right rows, just in the wrong orderโ€”that's still mostly correct."

Value Provided:

  • Order-invariant: Catches ORDER BY mistakes without penalizing heavily
  • Extra columns forgiven: If agent returns (Engineering, 65, 95000) and gold is (Engineering, 65), first 2 columns match → 0.67 reward
  • Partial row matches: Agent got department name right but count wrong → 0.5 reward per row

Risk Analysis: ⚠️ Medium risk

  • Computationally expensive: O(M×N) comparisons for M gold rows and N agent rows
    • For 100-row results: 10,000 comparisons per reward calculation
    • Mitigation: Limit to first 20 rows
  • Ambiguous matching: If gold has duplicate rows, which agent row should match which?
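
A minimal sketch; dividing by the longer row is one way to reproduce the 0.67 "extra column" behavior above, and the max_rows cap is the mitigation from the risk analysis:

def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Average, over gold rows, of the best cell overlap with any agent row."""
    if not gold_result:
        return 1.0 if not agent_result else 0.0
    if not agent_result:
        return 0.0

    # Cap both sides to keep the O(M*N) comparison affordable
    gold_rows = gold_result[:max_rows]
    agent_rows = agent_result[:max_rows]

    def row_similarity(gold_row, agent_row):
        gold_cells = [str(c) for c in gold_row]
        agent_cells = [str(c) for c in agent_row]
        matches = sum(1 for c in gold_cells if c in agent_cells)
        # Divide by the longer row so extra columns cost a little (2/3 = 0.67 above)
        return matches / max(len(gold_cells), len(agent_cells))

    total = sum(max(row_similarity(g, a) for a in agent_rows) for g in gold_rows)
    return total / len(gold_rows)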

🥈 Tier 2: Nice-to-Have Metrics

5. Schema Coverage

How It Works:

Question: "How many employees in Engineering?"
Gold query:   SELECT COUNT(*) FROM employees WHERE department='Engineering'
              Tables used: {employees}

Agent query:  SELECT COUNT(*) FROM employees e 
              JOIN departments d ON e.dept_id = d.id 
              WHERE d.name='Engineering'
              Tables used: {employees, departments}

Calculation:
  intersection = {employees} = 1
  union = {employees, departments} = 2
  Jaccard = 1/2 = 0.5
  
  Penalty for extra table: 0.1 * 1 = 0.1
  Final: 0.5 - 0.1 = 0.4

Conceptual Meaning: "You're exploring the right part of the database."

Value Provided:

  • Exploration guidance: Early signal before query results are correct
  • Helps with multi-hop: "You found employees, now look at departments"
  • Penalizes shotgun approach: Agent that queries every table gets low reward

Risk Analysis: ⚠️ Medium-high risk

  • Multiple valid paths: Simple query might use 1 table, complex query uses 3; both correct
  • Irrelevant table penalty: What if agent explores departments first before finding employees? Gets penalized for valid exploration
  • Requires SQL parsing: Dependency on sqlparse library; edge cases in parsing
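
A minimal sketch; to keep it dependency-free it extracts table names with a naive regex over FROM/JOIN clauses instead of a real SQL parser, so subqueries, CTEs, and quoted identifiers will slip through:

import re

def extract_tables(sql):
    """Rough table extraction: bare identifiers that follow FROM or JOIN."""
    pattern = r'\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)'
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}

def schema_coverage_reward(agent_query, gold_query, extra_table_penalty=0.1):
    """Jaccard over referenced tables, minus a small penalty per extra table."""
    agent_tables = extract_tables(agent_query)
    gold_tables = extract_tables(gold_query)
    if not gold_tables or not agent_tables:
        return 0.0
    jaccard = len(agent_tables & gold_tables) / len(agent_tables | gold_tables)
    extra_tables = len(agent_tables - gold_tables)
    return max(0.0, jaccard - extra_table_penalty * extra_tables)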

6. Column Alignment

How It Works:

Question: "List departments and their average salaries"
Gold columns:    [department_name, avg_salary]
Agent columns:   [dept, average_compensation]

Fuzzy matching:
  "department_name" vs "dept" 
    → SequenceMatcher = 0.65 (partial match) ✓
  
  "avg_salary" vs "average_compensation"
    → SequenceMatcher = 0.45 (weak match) ✗

Reward: 1/2 columns matched = 0.5

Conceptual Meaning: "Your columns have the right semantic meaning."

Value Provided:

  • Disambiguates multi-column results: If result has [42, 100, 5], which column is the answer?
  • Catches projection errors: Agent did SELECT * when it should have done SELECT department, COUNT(*)
  • Fuzzy matching helps: "dept" matches "department", "emp_id" matches "employee_id"

Risk Analysis: ⚠️ High risk

  • Requires metadata: Need to track column names from query results (not always available in raw SQLite)
  • Fuzzy matching noise: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
  • Aliasing issues: SELECT COUNT(*) AS total vs SELECT COUNT(*) AS num_employees; both mean the same thing
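
A minimal sketch using difflib from the standard library; the 0.6 match threshold is an assumption, and the column names must come from query metadata (e.g., cursor.description), which is exactly the tracking requirement flagged above:

from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns with a fuzzy name match among the agent's columns."""
    if not gold_columns or not agent_columns:
        return 0.0
    matched = 0
    for gold_col in gold_columns:
        best = max(SequenceMatcher(None, gold_col.lower(), agent_col.lower()).ratio()
                   for agent_col in agent_columns)
        if best >= threshold:
            matched += 1
    return matched / len(gold_columns)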

7. Rank Correlation

How It Works:

Question: "Top 5 products by revenue"
Gold:   [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
        Ranks: [1, 2, 3, 4, 5]

Agent:  [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
        Ranks: [1, 3, 2, 5, 4]

Spearman correlation:
  Rank differences: [0, -1, +1, -1, +1]
  Sum of squared differences: 4 (n = 5)
  Correlation coefficient: 1 - (6 × 4) / (5 × (5² - 1)) = 0.8

  Reward: (0.8 + 1) / 2 = 0.9

Conceptual Meaning: "You got the relative ordering mostly right."

Value Provided:

  • Specific to TOP-K queries: ~20% of Spider questions involve ranking
  • Robust to ties: Handles "tied for 2nd place" correctly
  • Partial credit for ordering: Top 3 correct but bottom 2 swapped → still high reward

Risk Analysis: ⚠️ Medium risk

  • Limited applicability: Only works for ordered results
  • Requires scipy: Heavy dependency just for one metric
  • Rank vs. value confusion: Agent might get ranking right but values wrong (or vice versa)
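
A minimal sketch using scipy (the dependency noted above); treating the first column of each row as the ranked entity is an assumption, and items missing from either result are dropped before correlating:

from scipy.stats import spearmanr

def rank_correlation_reward(agent_result, gold_result):
    """Spearman correlation between gold and agent ordering, mapped onto [0, 1]."""
    gold_keys = [str(row[0]) for row in gold_result]
    agent_keys = [str(row[0]) for row in agent_result]

    common = [k for k in gold_keys if k in agent_keys]
    if len(common) < 2:
        return 0.0

    gold_ranks = [gold_keys.index(k) for k in common]
    agent_ranks = [agent_keys.index(k) for k in common]
    rho, _ = spearmanr(gold_ranks, agent_ranks)
    if rho != rho:  # NaN guard (e.g., constant ranks)
        return 0.0
    return (rho + 1.0) / 2.0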

🥉 Tier 3: Avoid (High Risk, Low Value)

8. SQL Structure Similarity

How It Works:

Gold query:   SELECT d.name, COUNT(*) 
              FROM employees e 
              JOIN departments d ON e.dept_id = d.id 
              GROUP BY d.name

Agent query:  SELECT department, COUNT(*) 
              FROM employees 
              GROUP BY department

Structural comparison:
  Tables: {employees} vs {employees, departments} → 0.5
  Joins: {(employees, departments)} vs {} → 0.0
  Aggregates: {COUNT} vs {COUNT} → 1.0
  Group By: {d.name} vs {department} → 0.5
  
  Weighted average: 0.5

Conceptual Meaning: "Your query looks syntactically similar to the gold query."

Risk Analysis: 🛑 VERY HIGH RISK - DO NOT IMPLEMENT

  • Reward hacking: Agent learns to copy SQL structure without understanding semantics
  • Multiple valid solutions: Gold uses JOIN, agent uses subquery; both correct, but a structure-based reward penalizes the agent anyway
  • Overfitting: Agent optimizes for "looking like gold query" instead of "getting right answer"
  • Research evidence: SQL-TRAIL paper found structure-based rewards hurt generalization

9. Execution Plan Similarity

How It Works:

Gold query execution plan:
  1. Scan departments (10 rows)
  2. Scan employees (1000 rows)  
  3. Hash join (O(N))
  4. Aggregate (O(N))

Agent query execution plan:
  1. Scan employees (1000 rows)
  2. Nested loop with departments (O(N²))
  3. Aggregate (O(N))

Similarity: 2/4 steps similar = 0.5

Conceptual Meaning: "Your query executes in a similar way."

Risk Analysis: 🛑 VERY HIGH RISK - DO NOT IMPLEMENT

  • Database-specific: SQLite plans differ from PostgreSQL plans
  • Doesn't guarantee correctness: Two queries with similar plans can have different results
  • Computationally expensive: Running EXPLAIN on every query doubles execution time
  • Complexity: Comparing tree structures is non-trivial
  • No research evidence: No prior work shows this helps RL training

🧮 Combining Multiple Metrics into Final Reward

The Challenge

You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?

Current state:
  cardinality_score = 0.8
  value_overlap_score = 0.6
  numeric_range_score = 0.9
  row_match_score = 0.7

Need: single_reward = ???

🎯 Method 1: Weighted Average (RECOMMENDED for MVP)

Formula:

def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    # Weight only the metrics that were actually computed for this question
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight

Example Implementation:

def compute_progress_reward(agent_result, gold_result, agent_query=None):
    # Compute all metrics
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    
    # Fixed weights (tune these!)
    weights = {
        'cardinality': 0.25,
        'value_overlap': 0.40,  # Highest weight (most universal)
        'numeric_range': 0.15,
        'row_match': 0.20,
    }
    
    return weighted_average_reward(scores, weights)

Pros:

  • ✅ Simple to implement and understand
  • ✅ Easy to tune (adjust weights based on training performance)
  • ✅ Monotonic (if individual metrics improve, final reward improves)
  • ✅ Bounded [0, 1]

Cons:

  • ⚠️ Fixed weights might not be optimal for all question types
  • ⚠️ Applies the same weights regardless of question context

When to use: Default choice for MVP. Start here.


🎯 Method 2: Adaptive Weighting by Question Type

Formula:

def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics."""

    # Detect question type (metadata carries the question text, gold query, and gold result)
    gold_result = question_metadata['gold_result']
    question = question_metadata['question']
    gold_query = question_metadata['gold_query']

    is_numeric = has_numeric_answer(gold_result)  # Helper: does the gold result contain numbers?
    is_multirow = len(gold_result) > 1
    is_ordered = "TOP" in question.upper() or "ORDER BY" in gold_query.upper()  # For a future rank metric

    # Adaptive weights
    weights = {
        'cardinality': 0.25,  # Always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # Only for numeric answers
        'row_match': 0.20 if is_multirow else 0.10,
    }

    # Normalize so the weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

    return sum(scores[k] * weights[k] for k in scores)

Example:

Question: "Average salary in Engineering?" (numeric, single-value)
  → weights after normalization: cardinality≈0.26, value_overlap≈0.32, numeric_range≈0.32, row_match≈0.11

Question: "List all departments with >50 employees" (text, multi-row)
  → weights after normalization: cardinality≈0.26, value_overlap≈0.42, numeric_range=0.0, row_match≈0.32

Pros:

  • ✅ More accurate reward signal for different question types
  • ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
  • ✅ Can tune weights per question type independently

Cons:

  • ⚠️ More complex to implement (needs question type detection)
  • ⚠️ More hyperparameters to tune
  • ⚠️ Risk of over-engineering

When to use: If fixed weighting shows poor performance on specific question types.


🎯 Method 3: Max Pooling (Optimistic)

Formula:

def max_pooling_reward(scores):
    """Take the best metric (optimistic reward)."""
    return max(scores.values())

Example:

scores = {
    'cardinality': 0.3,  # Wrong row count
    'value_overlap': 0.8,  # Found most values!
    'numeric_range': 0.4,  # Numbers off
    'row_match': 0.5,  # Some rows match
}

reward = max(0.3, 0.8, 0.4, 0.5) = 0.8

Conceptual Meaning: "Give credit for whatever the agent did best."

Pros:

  • ✅ Very forgiving (agent gets credit for any progress)
  • ✅ Encourages diverse exploration strategies
  • ✅ Simple to implement

Cons:

  • 🛑 Too lenient: Agent might game the easiest metric
  • 🛑 Non-compositional: Doesn't reward improving multiple aspects simultaneously
  • 🛑 Unstable gradients: Reward can jump dramatically between steps

When to use: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.


🎯 Method 4: Minimum Threshold + Average (Strict)

Formula:

def threshold_average_reward(scores, thresholds):
    """All metrics must meet threshold; then take average."""
    # Check all thresholds
    for metric, score in scores.items():
        if score < thresholds.get(metric, 0.0):
            return 0.0  # Fail if any metric below threshold
    
    # All thresholds met → return average
    return sum(scores.values()) / len(scores)

Example:

scores = {
    'cardinality': 0.9,
    'value_overlap': 0.7,
    'numeric_range': 0.3,  # Below threshold!
}

thresholds = {
    'cardinality': 0.5,
    'value_overlap': 0.5,
    'numeric_range': 0.5,
}

# numeric_range (0.3) < threshold (0.5) → return 0.0

Conceptual Meaning: "You must do reasonably well on all aspects to get any reward."

Pros:

  • ✅ Prevents over-optimization of single metric
  • ✅ Encourages balanced progress

Cons:

  • 🛑 Too strict: Might be too hard for early training
  • 🛑 Cliff dynamics: Slight improvement might not change reward at all
  • 🛑 Threshold tuning: Requires careful calibration

When to use: If agent is gaming one metric while ignoring others. Use as penalty mechanism.


🎯 Method 5: Hierarchical (Coarse-to-Fine)

Formula:

def hierarchical_reward(scores):
    """First get cardinality right, then content, then structure."""
    
    # Layer 1: Cardinality (must be >0.5 to proceed)
    if scores['cardinality'] < 0.5:
        return scores['cardinality'] * 0.3  # Low reward, stuck at layer 1
    
    # Layer 2: Content (must be >0.5 to proceed)
    content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
    if content_score < 0.5:
        return 0.3 + content_score * 0.4  # Medium reward, stuck at layer 2
    
    # Layer 3: Structure (all metrics combined)
    final_score = (
        0.2 * scores['cardinality'] +
        0.4 * content_score +
        0.4 * scores['row_match']
    )
    return 0.5 + final_score * 0.5  # High reward, layer 3

Conceptual Meaning: "Learn to get the count right first, then the values, then the structure."

Visualization:

Reward progression:
0.0 ─────── 0.3 ─────── 0.7 ─────── 1.0
     ↑            ↑            ↑
 Cardinality  Content    Structure
   correct     correct     correct

Pros:

  • ✅ Natural curriculum (easier tasks first)
  • ✅ Clear progression signal
  • ✅ Prevents agent from over-optimizing structure before content

Cons:

  • ⚠️ More complex logic
  • ⚠️ Requires careful threshold tuning
  • ⚠️ Might slow down learning if thresholds too strict

When to use: If agent learns poorly with flat reward. Provides curriculum learning.


🎯 Method 6: Product (Multiplicative)

Formula:

def product_reward(scores):
    """Multiply all metrics (all must be good)."""
    product = 1.0
    for score in scores.values():
        product *= score
    return product

Example:

scores = {
    'cardinality': 0.9,
    'value_overlap': 0.8,
    'numeric_range': 0.7,
}

reward = 0.9 × 0.8 × 0.7 = 0.504

Conceptual Meaning: "All aspects must be good; weak performance on any metric drags down total."

Pros:

  • ✅ Encourages balanced improvement
  • ✅ Penalizes weak performance on any dimension

Cons:

  • 🛑 Too strict: Single low score (0.1) makes entire reward near zero
  • 🛑 Vanishing gradients: Product of small numbers becomes very small
  • 🛑 Compressed scale: Even solid progress on every metric multiplies out to a small reward (e.g., 0.7 × 0.7 × 0.7 ≈ 0.34)

When to use: Rarely. Only if you need extremely strict "all-or-nothing" reward.


🎯 Method 7: Percentile Aggregation (Robust)

Formula:

def percentile_reward(scores, percentile=50):
    """Use median (or other percentile) of all metrics."""
    import numpy as np
    return np.percentile(list(scores.values()), percentile)

Example:

scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]

percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6

Conceptual Meaning: "Reward based on typical performance, ignoring outliers."

Pros:

  • ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
  • ✅ Simple to implement
  • ✅ Tunable (change percentile to be more/less strict)

Cons:

  • ⚠️ Less interpretable than weighted average
  • ⚠️ Ignores some information (throws away best and worst scores)

When to use: If one metric is noisy or unreliable, use median to ignore it.


📊 Comparison Table: Combination Methods

| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|---|---|---|---|---|---|
| Weighted Average | Low | High | Medium | High | MVP default |
| Adaptive Weighting | Medium | Medium | High | High | Different question types need different signals |
| Max Pooling | Low | Medium | Low | Low | Agent struggling to learn anything |
| Threshold + Average | Medium | High | Medium | Medium | Agent gaming one metric |
| Hierarchical | High | High | High | Medium | Want curriculum learning |
| Product | Low | Low | Low | Low | All aspects must be perfect (rare) |
| Percentile | Low | Low | High | High | One metric is noisy/unreliable |

🎬 Recommended Implementation Strategy

Phase 1: Start Simple (Week 1)

def compute_progress_reward(agent_result, gold_result):
    """Initial implementation: weighted average of 3 metrics."""
    
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
    }
    
    weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
    
    return weighted_average_reward(scores, weights)

Why: Simple, interpretable, easy to debug.


Phase 2: Add Context (Week 2, if needed)

def compute_progress_reward(agent_result, gold_result, question_type):
    """Adaptive weighting based on question type."""
    
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    
    # Adapt weights
    if question_type == 'numeric':
        weights = {'cardinality': 0.2, 'value_overlap': 0.3, 
                   'numeric_range': 0.4, 'row_match': 0.1}
    elif question_type == 'multirow':
        weights = {'cardinality': 0.25, 'value_overlap': 0.4, 
                   'numeric_range': 0.05, 'row_match': 0.3}
    else:  # default
        weights = {'cardinality': 0.25, 'value_overlap': 0.5, 
                   'numeric_range': 0.15, 'row_match': 0.1}
    
    return weighted_average_reward(scores, weights)

Why: Improves signal quality without major complexity.


Phase 3: Add Safeguards (Week 3, if agent is gaming)

def compute_progress_reward(agent_result, gold_result, question_type):
    """Weighted average with anti-gaming measures."""
    
    scores = compute_all_scores(agent_result, gold_result)
    
    # Anti-gaming: if cardinality is way off, cap other rewards
    if scores['cardinality'] < 0.3:
        # Agent is nowhere close on size → limit credit for content
        return scores['cardinality'] * 0.5
    
    # Anti-gaming: if value overlap is low, cap row match
    if scores['value_overlap'] < 0.4:
        scores['row_match'] *= 0.5  # Penalize structure if content is wrong
    
    # Standard weighted average
    return weighted_average_reward(scores, get_weights(question_type))

Why: Prevents reward hacking while keeping interpretability.


🧪 How to Validate Your Combination Method

Create a test suite (pseudocode: gold, random_result, and the result_v* helpers are placeholder fixtures, and compute_reward stands in for compute_progress_reward):

# Test 1: Perfect match
assert compute_reward(gold, gold) == 1.0

# Test 2: Completely wrong
assert compute_reward(random_result, gold) < 0.2

# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result()  # 30% right
result_v2 = more_correct_result()       # 60% right
result_v3 = mostly_correct_result()     # 90% right
assert compute_reward(result_v1, gold) < compute_reward(result_v2, gold) < compute_reward(result_v3, gold)

# Test 4: Bounded [0, 1]
for _ in range(100):
    random_result = generate_random_result()
    reward = compute_reward(random_result, gold)
    assert 0.0 <= reward <= 1.0

# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_reward(result_format_a, gold) - 
           compute_reward(result_format_b, gold)) < 0.1  # Allow small difference

💡 Final Recommendation

For MVP (Phase 1-3):

  • Use Method 1: Weighted Average with metrics #1-3 (cardinality, value overlap, numeric range)
  • Fixed weights: {0.25, 0.50, 0.25}

If training shows issues:

  • Add Method 2: Adaptive Weighting based on question type
  • Add metric #4 (row-wise match)

If agent games rewards:

  • Add threshold checks from Method 4
  • Add anti-gaming logic from Phase 3 example

Never use:

  • Method 3 (Max Pooling) - too gameable
  • Method 6 (Product) - too strict, vanishing gradients
  • Metrics #8-9 (SQL structure, execution plans) - research shows they hurt

Bottom line: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.