Spaces:

vimalk78
/

abc123

Sleeping

App Files Files

xet

Community

vimalk78 commited on Aug 29

Commit

425eda1

1 Parent(s): d5df3cd

fix: Improve word selection and clue generation for crosswords

Browse files

- Remove quality clue filtering to fix hard mode selecting common words
- Reduce thematic pool from 400 to 150 words for better performance
- Add semantic neighbor-based clue generation using embeddings
- Update temperature (0.7→0.2) and difficulty_weight (0.3→0.5) for better selection
- Add comprehensive documentation on embedding limitations

Signed-off-by: Vimal Kumar <vimal78@gmail.com>

Files changed (2) hide show

crossword-app/backend-py/docs/embedding_limitations_and_clue_generation.md +238 -0
crossword-app/backend-py/src/services/thematic_word_service.py +131 -75

crossword-app/backend-py/docs/embedding_limitations_and_clue_generation.md ADDED Viewed

	@@ -0,0 +1,238 @@

+# Embedding Limitations and Clue Generation Analysis
+## Executive Summary
+This document analyzes why our current semantic neighbor approach for crossword clue generation produces suboptimal results, explores the fundamental limitations of sentence transformers for entity relationships, and proposes practical solutions for better crossword clues.
+## The Problem: Poor Quality Clues from Semantic Neighbors
+### Current Clue Examples
+```
+PANESAR    → "Associated with pandya, parmar and pankaj"
+RAJOURI    → "Associated with raji, rajini and rajni"
+RAJPUTANA  → "Related to rajput (a member of the dominant hindu military caste...)"
+DRAVIDA    → "Related to dravidian (a member of one of the aboriginal races...)"
+TENDULKAR  → "Associated with ganguly, sachin and dravid"
+```
+### Why These Are Poor Crossword Clues
+1. **PANESAR**: Semantic neighbors are just phonetically similar Indian names
+2. **RAJPUTANA**: The clue contains "rajput" which is part of the answer
+3. **Generic formatting**: "Associated with X, Y, Z" is not crossword-style
+4. **Missing entity context**: No indication that PANESAR is a cricketer, RAJOURI is a place
+## Root Cause Analysis: Sentence Transformer Limitations
+### The PANESAR Case Study
+**Expected neighbors for a crossword:**
+- cricket, england, spinner, bowler
+**Actual neighbors from embeddings:**
+```
+PANESAR similarities:
+  cricket   : 0.526 (moderate)
+  england   : 0.264 (very low!)
+  spinner   : 0.361 (low)
+  bowler    : 0.476 (moderate)
+  pandya    : 0.788 (very high!)
+  parmar    : 0.731 (very high!)
+  pankaj    : 0.702 (very high!)
+  panaji    : 0.696 (very high!)
+```
+### Why This Happens: What Embeddings Actually Encode
+Sentence transformers like `all-mpnet-base-v2` are trained to encode **sentence-level semantics**, not **entity relationships**. When extracting single-word embeddings, they capture:
+#### ✅ What They Capture Well:
+1. **Morphological similarity**: Words with similar spelling/phonetics
+2. **Syntactic patterns**: How words are used grammatically
+3. **Distributional similarity**: Words appearing in similar sentence contexts
+#### ❌ What They Miss:
+1. **Encyclopedic knowledge**: "Panesar is a cricketer"
+2. **Entity relationships**: "Panesar played for England"
+3. **Factual attributes**: "Rajouri is in Kashmir"
+### The 768-Dimensional Problem
+For PANESAR, the embedding dimensions are encoding:
+- **High weight**: "Sounds like an Indian surname" (pan- prefix pattern)
+- **High weight**: "Appears with other Indian names in text"
+- **Medium weight**: "Sometimes mentioned with cricket terms"
+- **Low weight**: "Played for England team"
+The model learned **surface patterns** rather than **semantic facts**.
+## Training Data Distribution Effects
+### Why Phonetic Similarity Dominates
+The training corpus likely contained:
+```
+"Indian names like Pandya, Parmar, and Patel..." (frequent)
+"Panesar and Pankaj are common surnames..." (frequent)
+vs.
+"Panesar bowled for England in the 2007 series..." (infrequent)
+```
+**Result**: Phonetic/cultural patterns get higher weight than factual relationships.
+## Fundamental Issue: Wrong Type of Similarity
+### What We Need vs What We Get
+**For crosswords, we need:**
+- PANESAR → cricketer, spinner, England-born
+- RAJOURI → district, Kashmir, border region
+- TENDULKAR → batsman, records, Mumbai
+**What embeddings give us:**
+- PANESAR → pandya, parmar (phonetic similarity)
+- RAJOURI → raji, rajini (name pattern similarity)
+- TENDULKAR → ganguly, dravid (co-occurrence similarity)
+## Knowledge-Augmented Embedding Solutions
+### Available Models with Entity Knowledge
+#### 1. Wikipedia2Vec
+- **Pros**: Trained on Wikipedia with entity linking, knows factual relationships
+- **Cons**: Complex setup, requires Wikipedia dump download
+- **Example**: Would know "Monty Panesar" → "English cricketer"
+#### 2. BERT-Entity / LUKE
+- **Pros**: Specifically designed for entity understanding
+- **Cons**: Heavier model, requires entity recognition pipeline
+- **Example**: Understands entity types and relationships
+#### 3. ConceptNet Numberbatch
+- **Pros**: Combines word embeddings with knowledge graph
+- **Cons**: Large download (several GB), complex integration
+- **Example**: Knows factual relationships like "cricket player from England"
+#### 4. ERNIE (Enhanced Representation through kNowledge IntEgration)
+- **Pros**: Integrates knowledge graphs during training
+- **Cons**: Primarily Chinese focus, complex setup
+- **Example**: Better entity-relationship understanding
+#### 5. KnowBERT
+- **Pros**: BERT + Knowledge bases (WordNet, Wikipedia)
+- **Cons**: Multiple components, heavy setup
+- **Example**: Combines language understanding with encyclopedic knowledge
+## Practical Solutions for Our System
+### Option 1: Hybrid Approach (Recommended)
+Keep current embeddings but augment with lightweight knowledge base:
+```python
+# Small knowledge file
+entity_facts = {
+    "panesar": {
+        "type": "person",
+        "domain": "cricket",
+        "attributes": ["spinner", "england", "monty"],
+        "clue_template": "English {domain} player known as {nickname}"
+    },
+    "rajouri": {
+        "type": "place",
+        "domain": "geography",
+        "attributes": ["district", "kashmir", "border"],
+        "clue_template": "{domain} district in disputed region"
+    }
+}
+def generate_hybrid_clue(word):
+    if word in entity_facts:
+        return generate_factual_clue(word, entity_facts[word])
+    else:
+        return generate_semantic_neighbor_clue(word)
+```
+### Option 2: Entity Type Classification
+Use embedding clusters to identify entity types:
+```python
+# Pre-compute clusters
+person_cluster = words_near(["gandhi", "nehru", "shakespeare"])
+place_cluster = words_near(["delhi", "mumbai", "london"])
+sport_cluster = words_near(["cricket", "football", "tennis"])
+# Classify and generate appropriate clues
+if word in person_cluster and word in sport_cluster:
+    return f"Sports personality"
+elif word in place_cluster:
+    return f"Geographic location"
+```
+### Option 3: Knowledge Graph from Co-occurrences
+Build relationships from the training corpus:
+```python
+# Extract from embedding neighborhoods
+def build_knowledge_graph():
+    knowledge = {}
+    for word in vocabulary:
+        neighbors = get_semantic_neighbors(word)
+        # Identify patterns
+        if any(n in cricket_terms for n in neighbors):
+            knowledge[word]["domain"] = "cricket"
+        if any(n in place_names for n in neighbors):
+            knowledge[word]["type"] = "place"
+```
+## Implementation Recommendations
+### Phase 1: Immediate Improvement
+1. **Add entity knowledge file** for top 1000 words in vocabulary
+2. **Implement hybrid clue generation** (facts first, then neighbors)
+3. **Better clue formatting** (proper crossword style)
+### Phase 2: Enhanced System
+1. **Entity type classification** using embedding clustering
+2. **Automated knowledge extraction** from neighbor patterns
+3. **Domain-specific clue templates**
+### Phase 3: Advanced Solutions
+1. **Evaluate Wikipedia2Vec** for full factual embeddings
+2. **Build comprehensive knowledge base** for crossword entities
+3. **Train custom embeddings** on crossword-specific data
+## Current System Status
+### What Works
+- ✅ Proper difficulty-based word selection (rare words for hard mode)
+- ✅ Fast performance using existing embeddings
+- ✅ Better than generic templates (slight improvement)
+### What Needs Improvement
+- ❌ Clue quality still poor for domain-specific entities
+- ❌ Phonetic similarity dominates factual relationships
+- ❌ No understanding of entity types or attributes
+## Conclusion
+The semantic neighbor approach revealed fundamental limitations of sentence transformers for entity-relationship understanding. While it's better than generic templates, it's insufficient for quality crossword clues.
+The recommended path forward is a **hybrid approach** that augments current embeddings with a lightweight knowledge base, providing factual context for common crossword entities while maintaining system performance and simplicity.
+## Technical Notes
+- **Current model**: `sentence-transformers/all-mpnet-base-v2` (768 dimensions)
+- **Vocabulary size**: ~30,000 words
+- **Performance impact**: Semantic neighbor lookup adds ~50ms per word
+- **Storage requirements**: Current approach uses existing embeddings (~500MB)
+---
+*This analysis was conducted during the crossword generation optimization project, August 2025*

crossword-app/backend-py/src/services/thematic_word_service.py CHANGED Viewed

@@ -286,6 +286,7 @@ class ThematicWordService:
         self.similarity_temperature = float(os.getenv("SIMILARITY_TEMPERATURE", "0.2"))
         self.use_softmax_selection = os.getenv("USE_SOFTMAX_SELECTION", "true").lower() == "true"
         self.difficulty_weight = float(os.getenv("DIFFICULTY_WEIGHT", "0.5"))
         # Core components
         self.vocab_manager = VocabularyManager(str(self.cache_dir), self.vocab_size_limit)
@@ -1083,11 +1084,15 @@ class ThematicWordService:
         if custom_sentence:
             input_list.append(custom_sentence)  # Now: ["Art", "i will always love you"]
-        # Get thematic words (get extra for filtering)
         # a result is a tuple of  (word, similarity, word_tier)
         raw_results = self.generate_thematic_words(
             input_list,
-            num_words=400,  # Larger pool for composite scoring to work with
             min_similarity=min_similarity,
             multi_theme=multi_theme,
             difficulty=difficulty
@@ -1126,9 +1131,10 @@ class ThematicWordService:
                     # Sort words within tier alphabetically
                     tier_words = sorted(tier_groups[tier], key=lambda x: x[0])
                     for word, similarity in tier_words:
-                        log_lines.append(f"    {word:<15} (similarity: {similarity:.3f})")
-            # uncomment this log line if want to print all words returned
             logger.info("\n".join(log_lines))
         else:
             logger.info("📊 No thematic words generated")
@@ -1137,7 +1143,7 @@ class ThematicWordService:
         # Let softmax with composite scoring handle difficulty selection
         candidate_words = []
-        logger.info(f"📊 Generating clues for all {len(raw_results)} thematically relevant words")
         for word, similarity, tier in raw_results:
             word_data = {
                 "word": word.upper(),
@@ -1148,64 +1154,32 @@ class ThematicWordService:
             }
             candidate_words.append(word_data)
-        # Step 5: Filter candidates by clue quality and select best words
-        logger.info(f"📊 Generated {len(candidate_words)} candidate words, filtering for clue quality")
-        # Separate words by clue quality
-        quality_words = []  # Words with proper WordNet-based clues
-        fallback_words = []  # Words with generic fallback clues
-        fallback_patterns = ["Related to", "Crossword answer"]
-        for word_data in candidate_words:
-            clue = word_data["clue"]
-            has_fallback = any(pattern in clue for pattern in fallback_patterns)
-            if has_fallback:
-                fallback_words.append(word_data)
-            else:
-                quality_words.append(word_data)
-        # Prioritize quality words, use fallback only if needed
         final_words = []
         # Select words using either softmax weighted selection or traditional random selection
         if self.use_softmax_selection:
-            logger.info(f"🎲 Using softmax weighted selection (temperature: {self.similarity_temperature})")
-            # First, try to get enough words from quality words using softmax
-            if quality_words and len(quality_words) > requested_words:
-                selected_quality = self._softmax_weighted_selection(quality_words, requested_words, difficulty=difficulty)
-                final_words.extend(selected_quality)
-            elif quality_words:
-                final_words.extend(quality_words)  # Take all quality words if not enough
-            # If we don't have enough, supplement with softmax-selected fallback words
-            if len(final_words) < requested_words and fallback_words:
-                needed = requested_words - len(final_words)
-                if len(fallback_words) > needed:
-                    selected_fallback = self._softmax_weighted_selection(fallback_words, needed, difficulty=difficulty)
-                    final_words.extend(selected_fallback)
-                else:
-                    final_words.extend(fallback_words)  # Take all fallback words if not enough
         else:
-            logger.info("📊 Using traditional random selection")
-            # Original random selection logic
-            if quality_words:
-                random.shuffle(quality_words)  # Randomize selection
-                final_words.extend(quality_words[:requested_words])
-            # If we don't have enough quality words, add some fallback words
-            if len(final_words) < requested_words and fallback_words:
-                needed = requested_words - len(final_words)
-                random.shuffle(fallback_words)
-                final_words.extend(fallback_words[:needed])
-        # Final shuffle to avoid quality-based ordering (always done for output consistency)
         random.shuffle(final_words)
-        logger.info(f"✅ Selected {len(final_words)} words ({len([w for w in final_words if not any(p in w['clue'] for p in fallback_patterns)])} quality, {len([w for w in final_words if any(p in w['clue'] for p in fallback_patterns)])} fallback)")
         logger.info(f"📝 Final words: {[w['word'] for w in final_words]}")
         return final_words
@@ -1220,8 +1194,103 @@ class ThematicWordService:
         criteria = difficulty_criteria.get(difficulty, difficulty_criteria["medium"])
         return criteria["min_len"] <= len(word) <= criteria["max_len"]
     def _generate_crossword_clue(self, word: str, topics: List[str]) -> str:
-        """Generate a crossword clue for the word using WordNet."""
         # Initialize WordNet clue generator if not already done
         if not hasattr(self, '_wordnet_generator') or self._wordnet_generator is None:
             try:
@@ -1235,38 +1304,25 @@ class ThematicWordService:
                 logger.warning(f"⚠️ Failed to initialize WordNet clue generator: {e}")
                 self._wordnet_generator = None
-        # Use WordNet generator if available
         if self._wordnet_generator:
             try:
                 primary_topic = topics[0] if topics else "general"
                 clue = self._wordnet_generator.generate_clue(word, primary_topic)
-                if clue and len(clue.strip()) > 0:
                     return clue
             except Exception as e:
                 logger.warning(f"⚠️ WordNet clue generation failed for '{word}': {e}")
-        # Fallback to simple templates if WordNet fails
         word_lower = word.lower()
         primary_topic = topics[0] if topics else "general"
-        topic_lower = primary_topic.lower()
-        # Topic-specific clue templates as fallback
-        if any(keyword in topic_lower for keyword in ["animal", "pet", "wildlife"]):
-            return f"{word_lower} (animal)"
-        elif any(keyword in topic_lower for keyword in ["tech", "computer", "software", "digital"]):
-            return f"{word_lower} (technology)"
-        elif any(keyword in topic_lower for keyword in ["science", "biology", "chemistry", "physics"]):
-            return f"{word_lower} (science)"
-        elif any(keyword in topic_lower for keyword in ["geo", "place", "city", "country", "location"]):
-            return f"{word_lower} (geography)"
-        elif any(keyword in topic_lower for keyword in ["food", "cooking", "cuisine", "recipe"]):
-            return f"{word_lower} (food)"
-        elif any(keyword in topic_lower for keyword in ["music", "song", "instrument", "audio"]):
-            return f"{word_lower} (music)"
-        elif any(keyword in topic_lower for keyword in ["sport", "game", "athletic", "exercise"]):
-            return f"{word_lower} (sports)"
-        else:
-            return f"{word_lower} (related to {topic_lower})"
 # Backwards compatibility aliases

         self.similarity_temperature = float(os.getenv("SIMILARITY_TEMPERATURE", "0.2"))
         self.use_softmax_selection = os.getenv("USE_SOFTMAX_SELECTION", "true").lower() == "true"
         self.difficulty_weight = float(os.getenv("DIFFICULTY_WEIGHT", "0.5"))
+        self.thematic_pool_size = int(os.getenv("THEMATIC_POOL_SIZE", "150"))
         # Core components
         self.vocab_manager = VocabularyManager(str(self.cache_dir), self.vocab_size_limit)
         if custom_sentence:
             input_list.append(custom_sentence)  # Now: ["Art", "i will always love you"]
+        # Get thematic words (optimized pool size for performance)
+        # Dynamic scaling: scale pool size with request size, but cap at configured max
+        thematic_pool = min(self.thematic_pool_size, max(generation_target * 5, 50))
+        logger.info(f"🚀 Optimized thematic pool size: {thematic_pool} (was 400) - {((400-thematic_pool)/400*100):.1f}% reduction")
         # a result is a tuple of  (word, similarity, word_tier)
         raw_results = self.generate_thematic_words(
             input_list,
+            num_words=thematic_pool,  # Optimized pool size (default 150, was 400)
             min_similarity=min_similarity,
             multi_theme=multi_theme,
             difficulty=difficulty
                     # Sort words within tier alphabetically
                     tier_words = sorted(tier_groups[tier], key=lambda x: x[0])
                     for word, similarity in tier_words:
+                        percentile = self.word_percentiles.get(word.lower(), 0.0)
+                        log_lines.append(f"    {word:<15} (similarity: {similarity:.3f}, percentile: {percentile:.3f})")
+            # Log all thematic words grouped by tiers (with similarity and percentile)
             logger.info("\n".join(log_lines))
         else:
             logger.info("📊 No thematic words generated")
         # Let softmax with composite scoring handle difficulty selection
         candidate_words = []
+        logger.info(f"📊 Generating clues for {len(raw_results)} thematically relevant words (optimized from 400)")
         for word, similarity, tier in raw_results:
             word_data = {
                 "word": word.upper(),
             }
             candidate_words.append(word_data)
+        # Step 5: Select best words using softmax on ALL candidates (ignore clue quality)
+        logger.info(f"📊 Generated {len(candidate_words)} candidate words, applying softmax selection on ALL words")
         final_words = []
         # Select words using either softmax weighted selection or traditional random selection
         if self.use_softmax_selection:
+            logger.info(f"🎲 Using softmax weighted selection on all {len(candidate_words)} candidates (temperature: {self.similarity_temperature})")
+            # Apply softmax selection to ALL candidate words regardless of clue quality
+            if len(candidate_words) > requested_words:
+                selected_words = self._softmax_weighted_selection(candidate_words, requested_words, difficulty=difficulty)
+                final_words.extend(selected_words)
+            else:
+                final_words.extend(candidate_words)  # Take all words if not enough
         else:
+            logger.info("📊 Using traditional random selection on all candidates")
+            # Original random selection logic - use ALL candidates
+            random.shuffle(candidate_words)  # Randomize selection
+            final_words.extend(candidate_words[:requested_words])
+        # Final shuffle for output consistency
         random.shuffle(final_words)
+        logger.info(f"✅ Selected {len(final_words)} words from {len(candidate_words)} total candidates")
         logger.info(f"📝 Final words: {[w['word'] for w in final_words]}")
         return final_words
         criteria = difficulty_criteria.get(difficulty, difficulty_criteria["medium"])
         return criteria["min_len"] <= len(word) <= criteria["max_len"]
+    def _get_semantic_neighbors(self, word: str, n: int = 6) -> List[str]:
+        """Get semantic neighbors of a word using embeddings.
+        Args:
+            word: Word to find neighbors for
+            n: Number of neighbors to return (excluding the word itself)
+        Returns:
+            List of neighbor words, ordered by similarity
+        """
+        if not self.is_initialized or not hasattr(self, 'vocab_embeddings'):
+            return []
+        word_lower = word.lower()
+        if word_lower not in self.vocabulary:
+            return []
+        try:
+            # Get word embedding
+            word_idx = self.vocabulary.index(word_lower)
+            word_embedding = self.vocab_embeddings[word_idx]
+            # Compute similarities with all vocabulary
+            similarities = np.dot(self.vocab_embeddings, word_embedding)
+            # Get top similar words (excluding self)
+            top_indices = np.argsort(similarities)[-(n+1):-1][::-1]  # Get n+1, then exclude self
+            neighbors = []
+            for idx in top_indices:
+                neighbor = self.vocabulary[idx]
+                if neighbor != word_lower:  # Skip the word itself
+                    neighbors.append(neighbor)
+                if len(neighbors) >= n:
+                    break
+            return neighbors
+        except Exception as e:
+            logger.warning(f"⚠️ Failed to get semantic neighbors for '{word}': {e}")
+            return []
+    def _generate_semantic_neighbor_clue(self, word: str, topics: List[str]) -> str:
+        """Generate a clue using semantic neighbors.
+        Args:
+            word: Word to generate clue for
+            topics: Context topics for clue generation
+        Returns:
+            Generated clue based on semantic neighbors
+        """
+        neighbors = self._get_semantic_neighbors(word, n=5)
+        if not neighbors:
+            return None
+        # Try to get WordNet definitions for neighbors
+        neighbor_descriptions = []
+        usable_neighbors = []
+        for neighbor in neighbors:
+            # Try WordNet on neighbor if generator available
+            if hasattr(self, '_wordnet_generator') and self._wordnet_generator:
+                try:
+                    desc = self._wordnet_generator.generate_clue(neighbor, topics[0] if topics else "general")
+                    if desc and len(desc.strip()) > 5 and not any(pattern in desc for pattern in ["Related to", "Crossword answer"]):
+                        neighbor_descriptions.append((neighbor, desc))
+                        continue
+                except:
+                    pass
+            # Keep neighbor for direct use
+            usable_neighbors.append(neighbor)
+        # Generate clue based on available information
+        if neighbor_descriptions:
+            # Use WordNet description of neighbors
+            neighbor, desc = neighbor_descriptions[0]
+            if len(neighbor_descriptions) > 1:
+                neighbor2, desc2 = neighbor_descriptions[1]
+                return f"Like {neighbor} ({desc.split('.')[0].lower()}), related to {neighbor2}"
+            else:
+                return f"Related to {neighbor} ({desc.split('.')[0].lower()})"
+        elif len(usable_neighbors) >= 2:
+            # Use neighbor words directly
+            if len(usable_neighbors) >= 3:
+                return f"Associated with {usable_neighbors[0]}, {usable_neighbors[1]} and {usable_neighbors[2]}"
+            else:
+                return f"Related to {usable_neighbors[0]} and {usable_neighbors[1]}"
+        elif len(usable_neighbors) == 1:
+            return f"Connected to {usable_neighbors[0]}"
+        else:
+            return None
     def _generate_crossword_clue(self, word: str, topics: List[str]) -> str:
+        """Generate a crossword clue for the word using multiple strategies."""
         # Initialize WordNet clue generator if not already done
         if not hasattr(self, '_wordnet_generator') or self._wordnet_generator is None:
             try:
                 logger.warning(f"⚠️ Failed to initialize WordNet clue generator: {e}")
                 self._wordnet_generator = None
+        # Strategy 1: Try WordNet on the main word
         if self._wordnet_generator:
             try:
                 primary_topic = topics[0] if topics else "general"
                 clue = self._wordnet_generator.generate_clue(word, primary_topic)
+                if clue and len(clue.strip()) > 0 and not any(pattern in clue for pattern in ["Related to", "Crossword answer"]):
                     return clue
             except Exception as e:
                 logger.warning(f"⚠️ WordNet clue generation failed for '{word}': {e}")
+        # Strategy 2: Try semantic neighbor-based clues
+        semantic_clue = self._generate_semantic_neighbor_clue(word, topics)
+        if semantic_clue:
+            return semantic_clue
+        # Strategy 3: Simple fallback
         word_lower = word.lower()
         primary_topic = topics[0] if topics else "general"
+        return f"Crossword answer: {word_lower}"
 # Backwards compatibility aliases