
Probability Distribution Analysis: Theory vs. Practice

Executive Summary

This document analyzes the actual behavior of the crossword word selection system, complementing the theoretical framework described in composite_scoring_algorithm.md. While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.

Key Findings

  • Similarity dominates: Difficulty-based frequency preferences are too weak to create distinct selection patterns
  • Exponential distributions: Actual probability distributions follow exponential decay, not normal distributions
  • Statistical misconceptions: Using normal distribution concepts (μ ± σ) on exponentially decaying data is misleading
  • Mode-mean divergence: Statistical measures don't represent where selections actually occur

Observed Probability Distributions

Data Source: Technology Topic Analysis

Using the debug visualization with ENABLE_DEBUG_TAB=true, we analyzed the actual probability distributions for different difficulties:

Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
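
For reference, the probabilities analyzed below come from a temperature-scaled softmax over the composite scores. A minimal sketch of that conversion, assuming a hypothetical softmax_probabilities helper rather than the exact production function:

import numpy as np

def softmax_probabilities(composite_scores, temperature=0.2):
    # Convert ranked composite scores into selection probabilities (sketch)
    scores = np.asarray(composite_scores, dtype=float)
    # Subtract the max before exponentiating for numerical stability
    weights = np.exp((scores - scores.max()) / temperature)
    return weights / weights.sum()

# 150 ranked composite scores -> one selection probability per candidate word
probs = softmax_probabilities(np.linspace(1.0, 0.3, 150))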

Empirical Results

Easy Difficulty

Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%

Medium Difficulty

Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass  
σ Range: Words #17-#103
Top Probability: 1.5%

Hard Difficulty

Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77  
Top Probability: 4.1%

Critical Observation

All three difficulty levels show similar exponential decay patterns, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.

Statistical Misconceptions in Current Approach

The Mode-Mean Divergence Problem

The visualization draws the red mean line (μ) at positions 37-60, yet the tallest probability bars sit at positions 0-5. This is the expected behavior of a heavily skewed distribution, whose summary statistics pull apart:

Distribution Type: Exponentially Decaying (Highly Skewed)

Mode (Peak):     Position 0-3     (2-4% probability)
Median:          Position ~15     (Where 50% of probability mass is reached)  
Mean (μ):        Position 37-60   (Weighted average position)

Why μ is "Wrong" for Understanding Selection

In an exponential distribution with long tail:

  1. Mode (0-3): Where individual words have highest probability
  2. Practical sampling zone: First 10-20 words contain ~60-80% of probability mass
  3. Mean (37-60): Pulled far right by 100+ words with tiny probabilities

The mean doesn't represent where sampling actually occurs—it's mathematically correct but practically misleading.
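
The divergence is easy to reproduce on a synthetic exponentially decaying probability vector (shape only, not the real scores):

import numpy as np

# Synthetic exponential-decay probabilities over 150 ranked candidates
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()

mode = int(np.argmax(probs))                           # single most likely position
median = int(np.searchsorted(np.cumsum(probs), 0.5))   # position where 50% of mass is reached
mean = float(np.sum(np.arange(150) * probs))           # probability-weighted average position

print(mode, median, round(mean, 1))  # mode stays at 0 while median and mean sit well to the right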

Standard Deviation Misapplication

The σ visualization assumes a normal distribution where:

  • Normal assumption: μ ± σ contains ~68% of probability mass
  • Our reality: an exponential distribution where the μ ± σ band can exclude the highest-probability words at the head of the ranking (e.g., Easy's σ range starts at word #9)

For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
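
A quick check of how little the 68% intuition carries over, using the same kind of synthetic exponential-decay vector:

import numpy as np

probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
positions = np.arange(150)

mu = np.sum(positions * probs)
sigma = np.sqrt(np.sum(((positions - mu) ** 2) * probs))

# Probability mass actually contained in the mu +/- sigma band
in_band = (positions >= mu - sigma) & (positions <= mu + sigma)
print(f"mu ± sigma covers {probs[in_band].sum():.0%} of the mass (a normal distribution would give ~68%)")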

Actual vs. Expected Behavior Analysis

What Should Happen (Theory)

According to the composite scoring algorithm, each difficulty should center its frequency preference on a different percentile (a sketch of this targeting follows the list):

  • Easy: Gaussian peak at 90th percentile → common words dominate
  • Medium: Gaussian peak at 50th percentile → balanced selection
  • Hard: Gaussian peak at 20th percentile → rare words favored
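
A minimal sketch of that targeting, assuming the frequency score is a Gaussian bump over a word's frequency percentile; the exact curve and parameters live in composite_scoring_algorithm.md, so the width used here is an illustrative assumption:

import numpy as np

TARGETS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}  # intended percentile peaks
WIDTH = 0.15                                         # assumed Gaussian width, not the production value

def frequency_score(percentile, difficulty):
    # Gaussian preference centered on the difficulty's target percentile
    return float(np.exp(-((percentile - TARGETS[difficulty]) ** 2) / (2 * WIDTH ** 2)))

# A very common word (95th percentile) scores high for easy and near zero for hard
print(frequency_score(0.95, "easy"), frequency_score(0.95, "hard"))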

What Actually Happens (Empirical)

Easy:   MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard:   TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)

All difficulties select similar high-similarity technology words, regardless of their frequency percentiles.

Root Cause Analysis

The problem isn't in the Gaussian curves—they work correctly. The issue is in the composite formula:

# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score

# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
# → composite = 0.5*0.9 + 0.5*0.1 = 0.50

# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0  
# → composite = 0.5*0.7 + 0.5*1.0 = 0.85

On paper, this arithmetic actually favors the well-aligned medium-similarity word (0.85 vs. 0.50). Empirically, though, the same cluster of very high-similarity words wins at every difficulty (see the quantitative evidence below), so a 50/50 additive split is evidently not enough to pull selections toward the intended frequency band.

Sampling Mechanics Deep Dive

np.random.choice Behavior

The selection uses np.random.choice with:

  • Without replacement: Each word can only be selected once
  • Probability weighting: Based on computed probabilities
  • Sample size: 10 words from 150 candidates

Where Selections Actually Occur

Despite μ being at position 37-60, most actual selections come from positions 0-30 because:

  1. High probabilities concentrate early: First 20 words often have 60%+ of total probability
  2. Without replacement effect: Once high-probability words are chosen, selection moves to next-highest
  3. Exponential decay: Probability drops rapidly, making later positions unlikely

This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
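
This concentration is easy to confirm by simulating the selection step directly, assuming only the np.random.choice behavior described above and a synthetic exponential-decay probability vector:

import numpy as np

rng = np.random.default_rng(0)
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()

# Repeat the "draw 10 of 150 without replacement" selection many times
trials = 1000
hits_in_first_30 = 0
for _ in range(trials):
    picks = rng.choice(150, size=10, replace=False, p=probs)
    hits_in_first_30 += np.sum(picks < 30)

print(f"{hits_in_first_30 / (10 * trials):.0%} of selected words came from positions 0-29")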

Better Visualization Approaches

Current Problems

  • μ ± σ assumes normality: Not applicable to exponential distributions
  • Mean position misleading: Doesn't show where selection actually occurs
  • Standard deviation meaningless: For highly skewed distributions

Recommended Alternatives

1. Cumulative Probability Visualization

First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass  
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass

2. Percentile Markers Instead of μ ± σ

P50 (Median):  Position where 50% of probability mass is reached
P75:           Position where 75% of probability mass is reached  
P90:           Position where 90% of probability mass is reached
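
Both alternatives can be read directly off the cumulative distribution of the ranked probabilities. A small helper sketch (function names are illustrative):

import numpy as np

def percentile_positions(probs, levels=(0.5, 0.75, 0.9)):
    # Ranked positions at which the given fractions of probability mass are reached
    cdf = np.cumsum(probs)
    return {f"P{int(lvl * 100)}": int(np.searchsorted(cdf, lvl)) for lvl in levels}

def cumulative_mass(probs, first_n=(10, 20, 30, 50)):
    # Fraction of total probability mass contained in the first N ranked words
    cdf = np.cumsum(probs)
    return {n: float(cdf[n - 1]) for n in first_n}

probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
print(percentile_positions(probs))
print(cumulative_mass(probs))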

3. Mode Annotation

  • Show the actual peak (mode) position
  • Mark the top-5 highest probability words
  • Distinguish between statistical mean and practical selection zone

4. Selection Concentration Metric

Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
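
Both metrics are straightforward to compute from the ranked probability vector; a sketch, with the Gini coefficient taken over the probability values themselves:

import numpy as np

def effective_selection_range(probs, mass=0.8):
    # Number of top-ranked words needed to cover the given share of probability mass
    return int(np.searchsorted(np.cumsum(probs), mass)) + 1

def selection_gini(probs):
    # Gini coefficient of the probabilities (0 = perfectly uniform, near 1 = highly concentrated)
    p = np.sort(np.asarray(probs, dtype=float))   # ascending order
    n = p.size
    return float((2 * np.arange(1, n + 1) - n - 1).dot(p) / (n * p.sum()))

probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
print(effective_selection_range(probs), round(selection_gini(probs), 2))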

Difficulty Differentiation Failure

Expected Pattern

Different difficulty levels should show visually distinct probability distribution patterns:

  • Easy: Steep peak at common words, rapid falloff
  • Medium: Moderate peak, balanced distribution
  • Hard: Peak shifted toward rare words

Observed Pattern

All difficulties show similar exponential decay curves with:

  • Similar-shaped distributions
  • Similar high-probability words (TECH, TECHNOLOGY, etc.)
  • Only minor differences in peak height and position

Quantitative Evidence

Similarity scores of top words (all difficulties):
TECHNOLOGY:     0.95+ similarity to "technology" 
TECH:           0.90+ similarity to "technology"
MULTIMEDIA:     0.85+ similarity to "technology"

These high semantic matches dominate regardless of their frequency percentiles.

Recommended Fixes

1. Multiplicative Scoring (Immediate Fix)

Replace additive formula with multiplicative gates:

# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# Proposed (multiplicative)  
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier

# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0

Effect: Frequency acts as a gate rather than just another score component.
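
get_frequency_modifier is not defined here; one possible sketch, reusing Gaussian-style percentile targeting and mapping it into the proposed 0.1-1.2 range (all constants are illustrative assumptions):

import numpy as np

TARGETS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}  # intended percentile peaks

def get_frequency_modifier(percentile, difficulty, width=0.2, floor=0.1, ceiling=1.2):
    # Map frequency alignment into a multiplicative gate in [floor, ceiling]
    closeness = np.exp(-((percentile - TARGETS[difficulty]) ** 2) / (2 * width ** 2))
    return floor + (ceiling - floor) * float(closeness)

# A rare word (10th percentile) is boosted in hard mode and suppressed in easy mode
print(get_frequency_modifier(0.1, "hard"), get_frequency_modifier(0.1, "easy"))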

2. Two-Stage Filtering (Structural Fix)

# Stage 1: Partition candidates by frequency percentile (boundaries inclusive so no word is dropped)
easy_candidates = [w for w in candidates if w.percentile >= 0.7]           # Common words
medium_candidates = [w for w in candidates if 0.3 <= w.percentile < 0.7]   # Medium words
hard_candidates = [w for w in candidates if w.percentile < 0.3]            # Rare words

# Stage 2: Rank only the pool for the requested difficulty by similarity
pools = {"easy": easy_candidates, "medium": medium_candidates, "hard": hard_candidates}
selected = softmax_selection(pools[difficulty], similarity_only=True)

Effect: Guarantees different frequency pools for each difficulty, then optimizes within each pool.

3. Exponential Temperature Scaling (Parameter Fix)

Use different temperature values by difficulty to create more distinct distributions:

easy_temperature = 0.1    # Very deterministic (sharp peak)
medium_temperature = 0.3  # Moderate randomness
hard_temperature = 0.2    # Deterministic but different peak
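
As a sanity check of how much temperature alone reshapes the distribution, the softmax sketch from earlier can be evaluated at these values on synthetic, evenly spaced composite scores (illustrative only):

import numpy as np

def softmax_probabilities(composite_scores, temperature):
    scores = np.asarray(composite_scores, dtype=float)
    weights = np.exp((scores - scores.max()) / temperature)
    return weights / weights.sum()

scores = np.linspace(1.0, 0.3, 150)  # synthetic composite scores, best to worst
for t in (0.1, 0.2, 0.3):
    p = softmax_probabilities(scores, t)
    print(f"T={t}: top word {p[0]:.1%}, first 10 words {p[:10].sum():.1%}")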

4. Adaptive Frequency Weights (Dynamic Fix)

# Calculate frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity  # e.g., 0.95 - 0.6 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff)  # e.g., 0.35/0.65 ≈ 0.54

# Use higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
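
Wrapped into a small helper for experimentation (the heuristic above is unchanged; the names and example similarities are illustrative):

def adaptive_frequency_weight(similarities, cap=0.8):
    # Heuristic: raise the frequency weight when the similarity spread is wide (spread assumed < 1)
    spread = max(similarities) - min(similarities)   # e.g., 0.95 - 0.6 = 0.35
    return min(cap, spread / (1 - spread))           # e.g., 0.35 / 0.65 ≈ 0.54

freq_weight = adaptive_frequency_weight([0.95, 0.82, 0.74, 0.60])
composite_weights = (1 - freq_weight, freq_weight)   # (similarity weight, frequency weight)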

Empirical Data Summary

Word Selection Patterns (Technology Topic)

Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high) 
- TECHNOLOGICAL (percentile: ?, similarity: high)

Hard Mode Top Selections:  
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)

Statistical Summary

  • σ Width Variation: Easy (33.4) vs Medium (42.9) vs Hard (40.2) - only 28% difference
  • Peak Variation: 1.5% to 4.1% - moderate difference
  • Mean Position Variation: Position 37 to 60 - 62% range but all in middle zone
  • Selection Concentration: Most selections from first 30 words in all difficulties

Conclusions

The Core Problem

The difficulty-aware word selection system is theoretically sound but practically ineffective because:

  1. Semantic similarity signals are too strong compared to frequency signals
  2. Additive scoring allows high-similarity words to dominate regardless of frequency appropriateness
  3. Statistical visualization assumes normal distributions but data is exponentially skewed

Success Metrics for Fixes

A working system should show:

  1. Visually distinct probability distributions for each difficulty
  2. Different word frequency profiles in actual selections
  3. Mode and mean alignment with intended difficulty targets
  4. Meaningful σ ranges that represent actual selection zones

Next Steps

  1. Implement multiplicative scoring or two-stage filtering
  2. Update visualization to use percentiles instead of μ ± σ
  3. Collect empirical data on word frequency percentiles in actual selections
  4. Validate fixes show distinct patterns across difficulties

This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.