Probability Distribution Analysis: Theory vs. Practice
Executive Summary
This document analyzes the actual behavior of the crossword word selection system, complementing the theoretical framework described in composite_scoring_algorithm.md. While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.
Key Findings
- Similarity dominates: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- Exponential distributions: Actual probability distributions follow exponential decay, not normal distributions
- Statistical misconceptions: Using normal distribution concepts (μ ± σ) on exponentially decaying data is misleading
- Mode-mean divergence: Statistical measures don't represent where selections actually occur
Observed Probability Distributions
Data Source: Technology Topic Analysis
Using the debug visualization with ENABLE_DEBUG_TAB=true, we analyzed the actual probability distributions for different difficulties:
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
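The selection method named above can be sketched as a temperature-scaled softmax. The function name and the illustrative score spread below are assumptions; only the 0.2 temperature comes from the run parameters:

```python
import numpy as np

def softmax_probs(scores, temperature=0.2):
    """Turn composite scores into selection probabilities.

    Lower temperature sharpens the distribution around the top scores;
    higher temperature flattens it toward uniform.
    """
    z = (np.asarray(scores) - np.max(scores)) / temperature  # max-shift for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Illustrative composite scores for 150 candidates, sorted best-first
scores = np.linspace(0.9, 0.3, 150)
probs = softmax_probs(scores, temperature=0.2)
# At T=0.2, a 0.6 score spread makes the best candidate about e^3 ≈ 20x
# more likely than the worst one.
```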
Empirical Results
Easy Difficulty
Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%
Medium Difficulty
Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass
σ Range: Words #17-#103
Top Probability: 1.5%
Hard Difficulty
Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77
Top Probability: 4.1%

Critical Observation
All three difficulty levels show similar exponential decay patterns, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.
Statistical Misconceptions in Current Approach
The Mode-Mean Divergence Problem
The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This discrepancy reflects a basic property of skewed distributions:
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median: Position ~15 (where 50% of probability mass is reached)
Mean (μ): Position 37-60 (Weighted average position)
Why μ is "Wrong" for Understanding Selection
In an exponential distribution with long tail:
- Mode (0-3): Where individual words have highest probability
- Practical sampling zone: First 10-20 words contain ~60-80% of probability mass
- Mean (37-60): Pulled far right by 100+ words with tiny probabilities
The mean doesn't represent where sampling actually occurs—it's mathematically correct but practically misleading.
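The mode-median-mean split above is easy to reproduce on synthetic data. The decay constant and tail floor below are illustrative choices, not values fitted to the Technology run:

```python
import numpy as np

positions = np.arange(150)
# Exponentially decaying head plus a small floor that creates the long tail
weights = np.exp(-positions / 5.0) + 0.02
probs = weights / weights.sum()

mode = int(np.argmax(probs))                          # where the peak sits
median = int(np.searchsorted(np.cumsum(probs), 0.5))  # position reaching 50% mass
mean = float(np.sum(positions * probs))               # probability-weighted average

# The mode stays at the head of the list while the thin tail drags the
# mean far to the right -- the same pattern as in the debug visualization.
```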
Standard Deviation Misapplication
The σ visualization assumes a normal distribution where:
- Normal assumption: μ ± σ contains ~68% of probability mass
- Our reality: Exponential distribution with μ ± σ often missing the high-probability words entirely
For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
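A quick numerical check on the same kind of synthetic skewed distribution shows that μ ± σ carries no 68% guarantee once normality is gone (here the interval over-covers; with other shapes it can just as easily miss the high-probability head):

```python
import numpy as np

positions = np.arange(150)
weights = np.exp(-positions / 5.0) + 0.02   # illustrative exponential decay + tail floor
probs = weights / weights.sum()

mu = float(np.sum(positions * probs))
sigma = float(np.sqrt(np.sum(probs * (positions - mu) ** 2)))
lo, hi = max(0, int(mu - sigma)), min(len(probs) - 1, int(mu + sigma))
mass_in_interval = probs[lo:hi + 1].sum()
# For a normal distribution this interval would hold ~68% of the mass;
# for this skewed distribution it does not.
```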
Actual vs. Expected Behavior Analysis
What Should Happen (Theory)
According to the composite scoring algorithm:
- Easy: Gaussian peak at 90th percentile → common words dominate
- Medium: Gaussian peak at 50th percentile → balanced selection
- Hard: Gaussian peak at 20th percentile → rare words favored
What Actually Happens (Empirical)
Easy: MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard: TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)
All difficulties select similar high-similarity technology words, regardless of their frequency percentiles.
Root Cause Analysis
The problem isn't in the Gaussian curves—they work correctly. The issue is in the composite formula:
# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score
# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
# → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
# → composite = 0.5*0.7 + 0.5*1.0 = 0.85
Taken at face value, the numbers above show that a perfectly frequency-aligned word can outrank a high-similarity mismatch (0.85 vs 0.50). In practice, though, the Gaussian frequency curves assign moderate scores to most candidates rather than extremes like 0.1 or 1.0, so the frequency term rarely opens a gap large enough to overturn the similarity ordering: a word still effectively needs very high similarity to compete.
Sampling Mechanics Deep Dive
np.random.choice Behavior
The selection uses np.random.choice with:
- Without replacement: Each word can only be selected once
- Probability weighting: Based on computed probabilities
- Sample size: 10 words from 150 candidates
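A minimal reproduction of that sampling step (candidate count and sample size match the run above; the probability vector itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.exp(-np.arange(150) / 5.0) + 0.02   # illustrative decayed probabilities
probs = weights / weights.sum()

# Draw 10 distinct words from 150 candidates, weighted by probability.
# With replace=False, each pick renormalizes over the remaining words,
# so selection cascades down from the high-probability head.
picks = rng.choice(150, size=10, replace=False, p=probs)
```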
Where Selections Actually Occur
Despite μ being at position 37-60, most actual selections come from positions 0-30 because:
- High probabilities concentrate early: First 20 words often have 60%+ of total probability
- Without replacement effect: Once high-probability words are chosen, selection moves to next-highest
- Exponential decay: Probability drops rapidly, making later positions unlikely
This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
Better Visualization Approaches
Current Problems
- μ ± σ assumes normality: not applicable to exponential distributions
- Mean position misleading: doesn't show where selection actually occurs
- Standard deviation meaningless: for highly skewed distributions, σ carries no fixed coverage interpretation
Recommended Alternatives
1. Cumulative Probability Visualization
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
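This cumulative view takes only a couple of lines to compute. The probability vector below is synthetic; the figures above come from the debug run:

```python
import numpy as np

weights = np.exp(-np.arange(150) / 5.0) + 0.02   # illustrative decayed probabilities
probs = weights / weights.sum()
cum = np.cumsum(probs)

for n in (10, 20, 30, 50):
    print(f"First {n:2d} words: {cum[n - 1]:5.1%} of total probability mass")
```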
2. Percentile Markers Instead of μ ± σ
P50 (Median): Position where 50% of probability mass is reached
P75: Position where 75% of probability mass is reached
P90: Position where 90% of probability mass is reached
3. Mode Annotation
- Show the actual peak (mode) position
- Mark the top-5 highest probability words
- Distinguish between statistical mean and practical selection zone
4. Selection Concentration Metric
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
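Both proposed metrics are straightforward to compute. The Gini formula below is the standard sorted-cumulative form, and the probability vector is again synthetic:

```python
import numpy as np

def gini(probs):
    """Gini coefficient of a probability vector: 0 = uniform, ->1 = fully concentrated."""
    p = np.sort(np.asarray(probs, dtype=float))
    n = p.size
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * p) / (n * p.sum()) - (n + 1) / n)

def effective_selection_range(probs, mass=0.8):
    """Number of leading positions needed to cover `mass` of the probability."""
    return int(np.searchsorted(np.cumsum(probs), mass)) + 1

weights = np.exp(-np.arange(150) / 5.0) + 0.02   # illustrative decayed probabilities
probs = weights / weights.sum()
```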
Difficulty Differentiation Failure
Expected Pattern
Different difficulty levels should show visually distinct probability distribution patterns:
- Easy: Steep peak at common words, rapid falloff
- Medium: Moderate peak, balanced distribution
- Hard: Peak shifted toward rare words
Observed Pattern
All difficulties show similar exponential decay curves with:
- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position
Quantitative Evidence
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH: 0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"
These high semantic matches dominate regardless of their frequency percentiles.
Recommended Fixes
1. Multiplicative Scoring (Immediate Fix)
Replace additive formula with multiplicative gates:
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score
# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
Effect: Frequency acts as a gate rather than just another score component.
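`get_frequency_modifier` is not defined in this document; one hypothetical implementation, keeping the Gaussian-percentile idea (peaks at the 90th/50th/20th percentiles, as in the theory section) and the proposed 0.1-1.2 range:

```python
import math

# Hypothetical targets: where each difficulty's preferred frequency band peaks
DIFFICULTY_PEAKS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}

def get_frequency_modifier(percentile, difficulty, width=0.25,
                           floor=0.1, ceiling=1.2):
    """Map a word's frequency percentile to a multiplicative gate in [floor, ceiling].

    A Gaussian bump centred on the difficulty's target percentile, rescaled so
    on-target words get a small boost (>1) while off-target words are
    suppressed rather than merely under-scored.
    """
    peak = DIFFICULTY_PEAKS[difficulty]
    gaussian = math.exp(-((percentile - peak) ** 2) / (2 * width ** 2))  # in (0, 1]
    return floor + (ceiling - floor) * gaussian

# With a multiplicative gate, frequency alignment can now flip the ranking:
on_target = 0.7 * get_frequency_modifier(0.5, "medium")    # mid similarity, right band
off_target = 0.9 * get_frequency_modifier(0.1, "medium")   # high similarity, wrong band
```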
2. Two-Stage Filtering (Structural Fix)
# Stage 1: Filter by frequency percentile range for the requested difficulty
# (bounds are inclusive at 0.3 and 0.7 so no candidate falls between pools)
easy_candidates = [w for w in candidates if w.percentile >= 0.7]          # Common words
medium_candidates = [w for w in candidates if 0.3 <= w.percentile < 0.7]  # Medium words
hard_candidates = [w for w in candidates if w.percentile < 0.3]           # Rare words
filtered_candidates = {"easy": easy_candidates,
                       "medium": medium_candidates,
                       "hard": hard_candidates}[difficulty]
# Stage 2: Rank the filtered candidates by similarity alone
selected = softmax_selection(filtered_candidates, similarity_only=True)
Effect: Guarantees different frequency pools for each difficulty, then optimizes within each pool.
3. Exponential Temperature Scaling (Parameter Fix)
Use different temperature values by difficulty to create more distinct distributions:
easy_temperature = 0.1 # Very deterministic (sharp peak)
medium_temperature = 0.3 # Moderate randomness
hard_temperature = 0.2 # Deterministic but different peak
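A sanity check on what the different temperatures do (the scores are illustrative; the softmax form matches the selection method described earlier in this document):

```python
import numpy as np

def softmax(scores, temperature):
    z = (scores - scores.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

scores = np.linspace(0.9, 0.3, 150)     # illustrative composite scores, best-first

p_sharp = softmax(scores, 0.1)          # easy: near-deterministic
p_loose = softmax(scores, 0.3)          # medium: more exploratory
# Lower temperature piles more mass onto the top candidates, making the
# distributions visually distinguishable across difficulties.
```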
4. Adaptive Frequency Weights (Dynamic Fix)
# Calculate the frequency weight needed to overcome similarity differences.
# The frequency term (range 1.0) must be able to outweigh the similarity gap:
#   w * 1.0 >= (1 - w) * similarity_diff  =>  w >= diff / (1 + diff)
max_similarity_diff = max_similarity - min_similarity  # e.g., 0.95 - 0.6 = 0.35
required_freq_weight = max_similarity_diff / (1 + max_similarity_diff)  # e.g., 0.35/1.35 ≈ 0.26
# Use a higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
Empirical Data Summary
Word Selection Patterns (Technology Topic)
Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)
Hard Mode Top Selections:
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)
Statistical Summary
- σ Width Variation: Easy (33.4) vs Medium (42.9) vs Hard (40.2) - only 28% difference
- Peak Variation: 1.5% to 4.1% - moderate difference
- Mean Position Variation: Position 37 to 60 - 62% range but all in middle zone
- Selection Concentration: Most selections from first 30 words in all difficulties
Conclusions
The Core Problem
The difficulty-aware word selection system is theoretically sound but practically ineffective because:
- Semantic similarity signals are too strong compared to frequency signals
- Additive scoring allows high-similarity words to dominate regardless of frequency appropriateness
- Statistical visualization assumes normal distributions but data is exponentially skewed
Success Metrics for Fixes
A working system should show:
- Visually distinct probability distributions for each difficulty
- Different word frequency profiles in actual selections
- Mode and mean alignment with intended difficulty targets
- Meaningful σ ranges that represent actual selection zones
Next Steps
- Implement multiplicative scoring or two-stage filtering
- Update visualization to use percentiles instead of μ ± σ
- Collect empirical data on word frequency percentiles in actual selections
- Validate fixes show distinct patterns across difficulties
This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.