ctrltokyo committed on
Commit 06d103f · verified · 1 Parent(s): b9a34e5

Update: cite LateOn-Code, remove first-ColBERT claim, clarify reasoning approach

Files changed (1): README.md (+4 -9)

README.md:
@@ -25,18 +25,13 @@ pipeline_tag: sentence-similarity
 
 # Reason-Code-ModernColBERT
 
-The **first ColBERT (late-interaction) model specifically designed for code search and retrieval**.
-
-Combines the token-granular matching advantages of ColBERT with reasoning-enhanced queries, extending the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain.
-
-## Why Late-Interaction for Code?
-
-All existing SOTA code search models (CodeXEmbed, Nomic Embed Code, Voyage Code) use bi-encoder / single-vector architectures. ColBERT's late-interaction approach computes token-level similarity (MaxSim), which is particularly well-suited for code because:
-
-- Code has rich token-level structure (identifiers, operators, keywords, types)
-- A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
-- MaxSim naturally captures partial matches between NL query tokens and code tokens
-- On reasoning tasks, [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (150M) outperformed 7B dense models
 
+A **ColBERT (late-interaction) model for code search and retrieval**, trained with reasoning-enhanced synthetic queries.
+
+Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. See also [LateOn-Code](https://huggingface.co/lightonai/LateOn-Code) by LightOn AI for another ColBERT approach to code search using direct fine-tuning on CoIR data.
+
+## Why Reasoning-Enhanced Training for Code?
+
+Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
 
 ## Model Details
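For context on the MaxSim scoring the diff discusses: ColBERT-style late interaction scores a query against a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, then summing. This is a minimal illustrative NumPy sketch of that scoring rule, not the model's actual implementation (which operates on learned, normalized token embeddings):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take the
    maximum dot-product similarity over all document token embeddings,
    then sum those maxima over the query tokens."""
    # sim[i, j] = similarity between query token i and document token j
    sim = query_vecs @ doc_vecs.T
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens and 3 document tokens in a 4-d space.
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.1, 0.0, 0.0],   # close match for query token 0
              [0.0, 0.2, 0.8, 0.0],
              [0.1, 0.7, 0.0, 0.2]])  # close match for query token 1
score = maxsim_score(q, d)  # 0.9 + 0.7 = 1.6
```

Because each query token is matched independently against its best document token, a partial overlap (e.g. one query token matching `reverse=True` while another matches `.sort()`) still contributes to the score, which is the "partial matches" property the removed section describes.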