vimalk78 committed
Commit 425eda1 · 1 Parent(s): d5df3cd

fix: Improve word selection and clue generation for crosswords

- Remove quality clue filtering to fix hard mode selecting common words
- Reduce thematic pool from 400 to 150 words for better performance
- Add semantic neighbor-based clue generation using embeddings
- Update temperature (0.7→0.2) and difficulty_weight (0.3→0.5) for better selection
- Add comprehensive documentation on embedding limitations

Signed-off-by: Vimal Kumar <vimal78@gmail.com>

crossword-app/backend-py/docs/embedding_limitations_and_clue_generation.md ADDED
@@ -0,0 +1,238 @@
+ # Embedding Limitations and Clue Generation Analysis
+
+ ## Executive Summary
+
+ This document analyzes why our current semantic-neighbor approach to crossword clue generation produces suboptimal results, explores the fundamental limitations of sentence transformers for capturing entity relationships, and proposes practical solutions for better crossword clues.
+
+ ## The Problem: Poor Quality Clues from Semantic Neighbors
+
+ ### Current Clue Examples
+ ```
+ PANESAR → "Associated with pandya, parmar and pankaj"
+ RAJOURI → "Associated with raji, rajini and rajni"
+ RAJPUTANA → "Related to rajput (a member of the dominant hindu military caste...)"
+ DRAVIDA → "Related to dravidian (a member of one of the aboriginal races...)"
+ TENDULKAR → "Associated with ganguly, sachin and dravid"
+ ```
+
+ ### Why These Are Poor Crossword Clues
+
+ 1. **PANESAR**: The semantic neighbors are just phonetically similar Indian names
+ 2. **RAJPUTANA**: The clue contains "rajput", which is part of the answer
+ 3. **Generic formatting**: "Associated with X, Y, Z" is not crossword style
+ 4. **Missing entity context**: No indication that PANESAR is a cricketer or that RAJOURI is a place
+
+ ## Root Cause Analysis: Sentence Transformer Limitations
+
+ ### The PANESAR Case Study
+
+ **Expected neighbors for a crossword:**
+ - cricket, england, spinner, bowler
+
+ **Actual neighbors from embeddings:**
+ ```
+ PANESAR similarities:
+   cricket : 0.526 (moderate)
+   england : 0.264 (very low!)
+   spinner : 0.361 (low)
+   bowler  : 0.476 (moderate)
+
+   pandya  : 0.788 (very high!)
+   parmar  : 0.731 (very high!)
+   pankaj  : 0.702 (very high!)
+   panaji  : 0.696 (very high!)
+ ```
+
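+ Similarities like these can be measured directly (a minimal sketch, assuming `sentence-transformers` is installed; the probe word list is illustrative):
+
+ ```python
+ # Sketch: cosine similarity between single-word embeddings,
+ # using the same model as the clue pipeline.
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
+
+ target = "panesar"
+ probes = ["cricket", "england", "spinner", "bowler",
+           "pandya", "parmar", "pankaj", "panaji"]
+
+ target_emb = model.encode(target, normalize_embeddings=True)
+ probe_embs = model.encode(probes, normalize_embeddings=True)
+
+ for probe, score in zip(probes, util.cos_sim(target_emb, probe_embs)[0]):
+     print(f"{probe:<8}: {float(score):.3f}")
+ ```
+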
+ ### Why This Happens: What Embeddings Actually Encode
+
+ Sentence transformers like `all-mpnet-base-v2` are trained to encode **sentence-level semantics**, not **entity relationships**. When single-word embeddings are extracted from them, they capture:
+
+ #### ✅ What They Capture Well:
+ 1. **Morphological similarity**: Words with similar spelling/phonetics
+ 2. **Syntactic patterns**: How words are used grammatically
+ 3. **Distributional similarity**: Words appearing in similar sentence contexts
+
+ #### ❌ What They Miss:
+ 1. **Encyclopedic knowledge**: "Panesar is a cricketer"
+ 2. **Entity relationships**: "Panesar played for England"
+ 3. **Factual attributes**: "Rajouri is in Kashmir"
+
+ ### The 768-Dimensional Problem
+
+ For PANESAR, the embedding dimensions effectively encode:
+ - **High weight**: "Sounds like an Indian surname" (pan- prefix pattern)
+ - **High weight**: "Appears with other Indian names in text"
+ - **Medium weight**: "Sometimes mentioned with cricket terms"
+ - **Low weight**: "Played for the England team"
+
+ The model learned **surface patterns** rather than **semantic facts**.
+
+ ## Training Data Distribution Effects
+
+ ### Why Phonetic Similarity Dominates
+
+ The training corpus likely contained:
+ ```
+ "Indian names like Pandya, Parmar, and Patel..."   (frequent)
+ "Panesar and Pankaj are common surnames..."        (frequent)
+
+ vs.
+
+ "Panesar bowled for England in the 2007 series..." (infrequent)
+ ```
+
+ **Result**: Phonetic/cultural patterns get higher weight than factual relationships.
+
+ ## Fundamental Issue: Wrong Type of Similarity
+
+ ### What We Need vs. What We Get
+
+ **For crosswords, we need:**
+ - PANESAR → cricketer, spinner, England-born
+ - RAJOURI → district, Kashmir, border region
+ - TENDULKAR → batsman, records, Mumbai
+
+ **What embeddings give us:**
+ - PANESAR → pandya, parmar (phonetic similarity)
+ - RAJOURI → raji, rajini (name-pattern similarity)
+ - TENDULKAR → ganguly, dravid (co-occurrence similarity)
+
+ ## Knowledge-Augmented Embedding Solutions
+
+ ### Available Models with Entity Knowledge
+
+ #### 1. Wikipedia2Vec
+ - **Pros**: Trained on Wikipedia with entity linking, knows factual relationships
+ - **Cons**: Complex setup, requires a Wikipedia dump download
+ - **Example**: Would know "Monty Panesar" → "English cricketer"
+
+ #### 2. BERT-Entity / LUKE
+ - **Pros**: Specifically designed for entity understanding
+ - **Cons**: Heavier model, requires an entity-recognition pipeline
+ - **Example**: Understands entity types and relationships
+
+ #### 3. ConceptNet Numberbatch
+ - **Pros**: Combines word embeddings with a knowledge graph (see the loading sketch after this list)
+ - **Cons**: Large download (several GB), complex integration
+ - **Example**: Knows factual relationships like "cricket player from England"
+
+ #### 4. ERNIE (Enhanced Representation through kNowledge IntEgration)
+ - **Pros**: Integrates knowledge graphs during training
+ - **Cons**: Primarily focused on Chinese-language corpora, complex setup
+ - **Example**: Better entity-relationship understanding
+
+ #### 5. KnowBERT
+ - **Pros**: BERT + knowledge bases (WordNet, Wikipedia)
+ - **Cons**: Multiple components, heavy setup
+ - **Example**: Combines language understanding with encyclopedic knowledge
+
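+ Of these, Numberbatch is the easiest to probe locally, since it is distributed as word2vec-format text vectors (a minimal sketch, assuming the English-only `numberbatch-en` file has been downloaded and `gensim` is installed):
+
+ ```python
+ # Sketch: load ConceptNet Numberbatch vectors and inspect neighbors.
+ from gensim.models import KeyedVectors
+
+ vectors = KeyedVectors.load_word2vec_format("numberbatch-en.txt.gz", binary=False)
+
+ # Neighbors here reflect knowledge-graph relations, not just surface form.
+ print(vectors.most_similar("cricket", topn=5))
+ ```
+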
+ ## Practical Solutions for Our System
+
+ ### Option 1: Hybrid Approach (Recommended)
+
+ Keep the current embeddings but augment them with a lightweight knowledge base:
+
+ ```python
+ # Small knowledge file
+ entity_facts = {
+     "panesar": {
+         "type": "person",
+         "domain": "cricket",
+         "attributes": ["spinner", "england", "monty"],
+         "nickname": "monty",  # added so the template below can be filled
+         "clue_template": "English {domain} player known as {nickname}"
+     },
+     "rajouri": {
+         "type": "place",
+         "domain": "geography",
+         "attributes": ["district", "kashmir", "border"],
+         "clue_template": "{domain} district in disputed region"
+     }
+ }
+
+ def generate_hybrid_clue(word):
+     if word in entity_facts:
+         return generate_factual_clue(word, entity_facts[word])
+     else:
+         return generate_semantic_neighbor_clue(word)
+ ```
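+
+ `generate_factual_clue` is left undefined above; a minimal sketch (hypothetical, assuming the field names used in `entity_facts`) might be:
+
+ ```python
+ def generate_factual_clue(word: str, facts: dict) -> str:
+     """Fill the entity's clue template from its stored fields (sketch)."""
+     try:
+         return facts["clue_template"].format(**facts)
+     except KeyError:
+         # Template references a field the entry lacks; fall back to attributes.
+         attrs = facts.get("attributes", [])
+         return f"Associated with {', '.join(attrs[:2])}" if attrs else word
+ ```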
+
+ ### Option 2: Entity Type Classification
+
+ Use embedding clusters to identify entity types:
+
+ ```python
+ # Pre-compute clusters
+ person_cluster = words_near(["gandhi", "nehru", "shakespeare"])
+ place_cluster = words_near(["delhi", "mumbai", "london"])
+ sport_cluster = words_near(["cricket", "football", "tennis"])
+
+ # Classify and generate appropriate clues
+ if word in person_cluster and word in sport_cluster:
+     return "Sports personality"
+ elif word in place_cluster:
+     return "Geographic location"
+ ```
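+
+ Here `words_near` is a hypothetical helper, not an existing API. One way to realize it is to treat each seed list as a centroid and collect vocabulary words above a similarity threshold (a sketch, assuming module-level `vocabulary` (list of words) and `vocab_embeddings` (unit-normalized NumPy matrix, one row per word)):
+
+ ```python
+ import numpy as np
+
+ def words_near(seed_words, threshold=0.45):
+     """Return the set of vocabulary words close to the seed centroid (sketch)."""
+     idxs = [vocabulary.index(w) for w in seed_words if w in vocabulary]
+     if not idxs:
+         return set()
+     centroid = vocab_embeddings[idxs].mean(axis=0)
+     centroid /= np.linalg.norm(centroid)  # re-normalize after averaging
+     sims = vocab_embeddings @ centroid    # cosine similarity for unit rows
+     return {w for w, s in zip(vocabulary, sims) if s >= threshold}
+ ```
+
+ The threshold is a tunable assumption: too low and the clusters bleed into each other, too high and rare entities fall outside every cluster.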
+
+ ### Option 3: Knowledge Graph from Co-occurrences
+
+ Build relationships from the training corpus:
+
+ ```python
+ # Extract from embedding neighborhoods; `vocabulary`, `get_semantic_neighbors`,
+ # `cricket_terms` and `place_names` are assumed to exist in the service.
+ def build_knowledge_graph():
+     knowledge = {}
+     for word in vocabulary:
+         neighbors = get_semantic_neighbors(word)
+
+         # Identify patterns (setdefault avoids a KeyError on first write)
+         if any(n in cricket_terms for n in neighbors):
+             knowledge.setdefault(word, {})["domain"] = "cricket"
+         if any(n in place_names for n in neighbors):
+             knowledge.setdefault(word, {})["type"] = "place"
+     return knowledge
+ ```
+
+ ## Implementation Recommendations
+
+ ### Phase 1: Immediate Improvement
+ 1. **Add entity knowledge file** for top 1000 words in vocabulary
+ 2. **Implement hybrid clue generation** (facts first, then neighbors)
+ 3. **Better clue formatting** (proper crossword style)
+
+ ### Phase 2: Enhanced System
+ 1. **Entity type classification** using embedding clustering
+ 2. **Automated knowledge extraction** from neighbor patterns
+ 3. **Domain-specific clue templates**
+
+ ### Phase 3: Advanced Solutions
+ 1. **Evaluate Wikipedia2Vec** for full factual embeddings
+ 2. **Build comprehensive knowledge base** for crossword entities
+ 3. **Train custom embeddings** on crossword-specific data
+
+ ## Current System Status
+
+ ### What Works
+ - ✅ Proper difficulty-based word selection (rare words for hard mode)
+ - ✅ Fast performance using existing embeddings
+ - ✅ Better than generic templates (slight improvement)
+
+ ### What Needs Improvement
+ - ❌ Clue quality still poor for domain-specific entities
+ - ❌ Phonetic similarity dominates factual relationships
+ - ❌ No understanding of entity types or attributes
+
+ ## Conclusion
+
+ The semantic neighbor approach revealed fundamental limitations of sentence transformers for entity-relationship understanding. While it's better than generic templates, it's insufficient for quality crossword clues.
+
+ The recommended path forward is a **hybrid approach** that augments the current embeddings with a lightweight knowledge base, providing factual context for common crossword entities while maintaining system performance and simplicity.
+
+ ## Technical Notes
+
+ - **Current model**: `sentence-transformers/all-mpnet-base-v2` (768 dimensions)
+ - **Vocabulary size**: ~30,000 words
+ - **Performance impact**: semantic neighbor lookup adds ~50 ms per word
+ - **Storage requirements**: the current approach reuses existing embeddings (~500 MB)
+
+ ---
+
+ *This analysis was conducted during the crossword generation optimization project, August 2025.*
crossword-app/backend-py/src/services/thematic_word_service.py CHANGED
@@ -286,6 +286,7 @@ class ThematicWordService:
         self.similarity_temperature = float(os.getenv("SIMILARITY_TEMPERATURE", "0.2"))
         self.use_softmax_selection = os.getenv("USE_SOFTMAX_SELECTION", "true").lower() == "true"
         self.difficulty_weight = float(os.getenv("DIFFICULTY_WEIGHT", "0.5"))
+        self.thematic_pool_size = int(os.getenv("THEMATIC_POOL_SIZE", "150"))
 
         # Core components
         self.vocab_manager = VocabularyManager(str(self.cache_dir), self.vocab_size_limit)
@@ -1083,11 +1084,15 @@
         if custom_sentence:
             input_list.append(custom_sentence)  # Now: ["Art", "i will always love you"]
 
-        # Get thematic words (get extra for filtering)
+        # Get thematic words (optimized pool size for performance)
+        # Dynamic scaling: scale pool size with request size, but cap at configured max
+        thematic_pool = min(self.thematic_pool_size, max(generation_target * 5, 50))
+        logger.info(f"🚀 Optimized thematic pool size: {thematic_pool} (was 400) - {((400-thematic_pool)/400*100):.1f}% reduction")
+
         # a result is a tuple of (word, similarity, word_tier)
         raw_results = self.generate_thematic_words(
             input_list,
-            num_words=400,  # Larger pool for composite scoring to work with
+            num_words=thematic_pool,  # Optimized pool size (default 150, was 400)
            min_similarity=min_similarity,
             multi_theme=multi_theme,
             difficulty=difficulty
@@ -1126,9 +1131,10 @@
             # Sort words within tier alphabetically
             tier_words = sorted(tier_groups[tier], key=lambda x: x[0])
             for word, similarity in tier_words:
-                log_lines.append(f"  {word:<15} (similarity: {similarity:.3f})")
+                percentile = self.word_percentiles.get(word.lower(), 0.0)
+                log_lines.append(f"  {word:<15} (similarity: {similarity:.3f}, percentile: {percentile:.3f})")
 
-            # uncomment this log line if want to print all words returned
+            # Log all thematic words grouped by tiers (with similarity and percentile)
             logger.info("\n".join(log_lines))
         else:
             logger.info("📊 No thematic words generated")
@@ -1137,7 +1143,7 @@
         # Let softmax with composite scoring handle difficulty selection
         candidate_words = []
 
-        logger.info(f"📊 Generating clues for all {len(raw_results)} thematically relevant words")
+        logger.info(f"📊 Generating clues for {len(raw_results)} thematically relevant words (optimized from 400)")
         for word, similarity, tier in raw_results:
             word_data = {
                 "word": word.upper(),
@@ -1148,64 +1154,32 @@
             }
             candidate_words.append(word_data)
 
-        # Step 5: Filter candidates by clue quality and select best words
-        logger.info(f"📊 Generated {len(candidate_words)} candidate words, filtering for clue quality")
-
-        # Separate words by clue quality
-        quality_words = []   # Words with proper WordNet-based clues
-        fallback_words = []  # Words with generic fallback clues
-
-        fallback_patterns = ["Related to", "Crossword answer"]
-
-        for word_data in candidate_words:
-            clue = word_data["clue"]
-            has_fallback = any(pattern in clue for pattern in fallback_patterns)
-
-            if has_fallback:
-                fallback_words.append(word_data)
-            else:
-                quality_words.append(word_data)
-
-        # Prioritize quality words, use fallback only if needed
+        # Step 5: Select best words using softmax on ALL candidates (ignore clue quality)
+        logger.info(f"📊 Generated {len(candidate_words)} candidate words, applying softmax selection on ALL words")
         final_words = []
 
         # Select words using either softmax weighted selection or traditional random selection
         if self.use_softmax_selection:
-            logger.info(f"🎲 Using softmax weighted selection (temperature: {self.similarity_temperature})")
-
-            # First, try to get enough words from quality words using softmax
-            if quality_words and len(quality_words) > requested_words:
-                selected_quality = self._softmax_weighted_selection(quality_words, requested_words, difficulty=difficulty)
-                final_words.extend(selected_quality)
-            elif quality_words:
-                final_words.extend(quality_words)  # Take all quality words if not enough
-
-            # If we don't have enough, supplement with softmax-selected fallback words
-            if len(final_words) < requested_words and fallback_words:
-                needed = requested_words - len(final_words)
-                if len(fallback_words) > needed:
-                    selected_fallback = self._softmax_weighted_selection(fallback_words, needed, difficulty=difficulty)
-                    final_words.extend(selected_fallback)
-                else:
-                    final_words.extend(fallback_words)  # Take all fallback words if not enough
+            logger.info(f"🎲 Using softmax weighted selection on all {len(candidate_words)} candidates (temperature: {self.similarity_temperature})")
+
+            # Apply softmax selection to ALL candidate words regardless of clue quality
+            if len(candidate_words) > requested_words:
+                selected_words = self._softmax_weighted_selection(candidate_words, requested_words, difficulty=difficulty)
+                final_words.extend(selected_words)
+            else:
+                final_words.extend(candidate_words)  # Take all words if not enough
         else:
-            logger.info("📊 Using traditional random selection")
-
-            # Original random selection logic
-            if quality_words:
-                random.shuffle(quality_words)  # Randomize selection
-                final_words.extend(quality_words[:requested_words])
-
-            # If we don't have enough quality words, add some fallback words
-            if len(final_words) < requested_words and fallback_words:
-                needed = requested_words - len(final_words)
-                random.shuffle(fallback_words)
-                final_words.extend(fallback_words[:needed])
+            logger.info("📊 Using traditional random selection on all candidates")
+
+            # Original random selection logic - use ALL candidates
+            random.shuffle(candidate_words)  # Randomize selection
+            final_words.extend(candidate_words[:requested_words])
 
-        # Final shuffle to avoid quality-based ordering (always done for output consistency)
+        # Final shuffle for output consistency
         random.shuffle(final_words)
 
-        logger.info(f"✅ Selected {len(final_words)} words ({len([w for w in final_words if not any(p in w['clue'] for p in fallback_patterns)])} quality, {len([w for w in final_words if any(p in w['clue'] for p in fallback_patterns)])} fallback)")
+        logger.info(f"✅ Selected {len(final_words)} words from {len(candidate_words)} total candidates")
         logger.info(f"📝 Final words: {[w['word'] for w in final_words]}")
         return final_words
@@ -1220,8 +1194,103 @@
         criteria = difficulty_criteria.get(difficulty, difficulty_criteria["medium"])
         return criteria["min_len"] <= len(word) <= criteria["max_len"]
 
+    def _get_semantic_neighbors(self, word: str, n: int = 6) -> List[str]:
+        """Get semantic neighbors of a word using embeddings.
+
+        Args:
+            word: Word to find neighbors for
+            n: Number of neighbors to return (excluding the word itself)
+
+        Returns:
+            List of neighbor words, ordered by similarity
+        """
+        if not self.is_initialized or not hasattr(self, 'vocab_embeddings'):
+            return []
+
+        word_lower = word.lower()
+        if word_lower not in self.vocabulary:
+            return []
+
+        try:
+            # Get word embedding
+            word_idx = self.vocabulary.index(word_lower)
+            word_embedding = self.vocab_embeddings[word_idx]
+
+            # Compute similarities with all vocabulary
+            similarities = np.dot(self.vocab_embeddings, word_embedding)
+
+            # Get top similar words (excluding self)
+            top_indices = np.argsort(similarities)[-(n+1):-1][::-1]  # Get n+1, then exclude self
+
+            neighbors = []
+            for idx in top_indices:
+                neighbor = self.vocabulary[idx]
+                if neighbor != word_lower:  # Skip the word itself
+                    neighbors.append(neighbor)
+                if len(neighbors) >= n:
+                    break
+
+            return neighbors
+
+        except Exception as e:
+            logger.warning(f"⚠️ Failed to get semantic neighbors for '{word}': {e}")
+            return []
+
+    def _generate_semantic_neighbor_clue(self, word: str, topics: List[str]) -> str:
+        """Generate a clue using semantic neighbors.
+
+        Args:
+            word: Word to generate clue for
+            topics: Context topics for clue generation
+
+        Returns:
+            Generated clue based on semantic neighbors
+        """
+        neighbors = self._get_semantic_neighbors(word, n=5)
+        if not neighbors:
+            return None
+
+        # Try to get WordNet definitions for neighbors
+        neighbor_descriptions = []
+        usable_neighbors = []
+
+        for neighbor in neighbors:
+            # Try WordNet on neighbor if generator available
+            if hasattr(self, '_wordnet_generator') and self._wordnet_generator:
+                try:
+                    desc = self._wordnet_generator.generate_clue(neighbor, topics[0] if topics else "general")
+                    if desc and len(desc.strip()) > 5 and not any(pattern in desc for pattern in ["Related to", "Crossword answer"]):
+                        neighbor_descriptions.append((neighbor, desc))
+                        continue
+                except:
+                    pass
+
+            # Keep neighbor for direct use
+            usable_neighbors.append(neighbor)
+
+        # Generate clue based on available information
+        if neighbor_descriptions:
+            # Use WordNet description of neighbors
+            neighbor, desc = neighbor_descriptions[0]
+            if len(neighbor_descriptions) > 1:
+                neighbor2, desc2 = neighbor_descriptions[1]
+                return f"Like {neighbor} ({desc.split('.')[0].lower()}), related to {neighbor2}"
+            else:
+                return f"Related to {neighbor} ({desc.split('.')[0].lower()})"
+
+        elif len(usable_neighbors) >= 2:
+            # Use neighbor words directly
+            if len(usable_neighbors) >= 3:
+                return f"Associated with {usable_neighbors[0]}, {usable_neighbors[1]} and {usable_neighbors[2]}"
+            else:
+                return f"Related to {usable_neighbors[0]} and {usable_neighbors[1]}"
+        elif len(usable_neighbors) == 1:
+            return f"Connected to {usable_neighbors[0]}"
+        else:
+            return None
+
     def _generate_crossword_clue(self, word: str, topics: List[str]) -> str:
-        """Generate a crossword clue for the word using WordNet."""
+        """Generate a crossword clue for the word using multiple strategies."""
         # Initialize WordNet clue generator if not already done
         if not hasattr(self, '_wordnet_generator') or self._wordnet_generator is None:
             try:
@@ -1235,38 +1304,25 @@
                 logger.warning(f"⚠️ Failed to initialize WordNet clue generator: {e}")
                 self._wordnet_generator = None
 
-        # Use WordNet generator if available
+        # Strategy 1: Try WordNet on the main word
        if self._wordnet_generator:
             try:
                 primary_topic = topics[0] if topics else "general"
                 clue = self._wordnet_generator.generate_clue(word, primary_topic)
-                if clue and len(clue.strip()) > 0:
+                if clue and len(clue.strip()) > 0 and not any(pattern in clue for pattern in ["Related to", "Crossword answer"]):
                     return clue
             except Exception as e:
                 logger.warning(f"⚠️ WordNet clue generation failed for '{word}': {e}")
 
-        # Fallback to simple templates if WordNet fails
+        # Strategy 2: Try semantic neighbor-based clues
+        semantic_clue = self._generate_semantic_neighbor_clue(word, topics)
+        if semantic_clue:
+            return semantic_clue
+
+        # Strategy 3: Simple fallback
         word_lower = word.lower()
         primary_topic = topics[0] if topics else "general"
-        topic_lower = primary_topic.lower()
-
-        # Topic-specific clue templates as fallback
-        if any(keyword in topic_lower for keyword in ["animal", "pet", "wildlife"]):
-            return f"{word_lower} (animal)"
-        elif any(keyword in topic_lower for keyword in ["tech", "computer", "software", "digital"]):
-            return f"{word_lower} (technology)"
-        elif any(keyword in topic_lower for keyword in ["science", "biology", "chemistry", "physics"]):
-            return f"{word_lower} (science)"
-        elif any(keyword in topic_lower for keyword in ["geo", "place", "city", "country", "location"]):
-            return f"{word_lower} (geography)"
-        elif any(keyword in topic_lower for keyword in ["food", "cooking", "cuisine", "recipe"]):
-            return f"{word_lower} (food)"
-        elif any(keyword in topic_lower for keyword in ["music", "song", "instrument", "audio"]):
-            return f"{word_lower} (music)"
-        elif any(keyword in topic_lower for keyword in ["sport", "game", "athletic", "exercise"]):
-            return f"{word_lower} (sports)"
-        else:
-            return f"{word_lower} (related to {topic_lower})"
+        return f"Crossword answer: {word_lower}"
 
 
 # Backwards compatibility aliases