ctrltokyo committed on
Commit 06d103f · verified · 1 Parent(s): b9a34e5

Update: cite LateOn-Code, remove first-ColBERT claim, clarify reasoning approach

Files changed (1): README.md (+4 -9)

README.md:
@@ -25,18 +25,13 @@ pipeline_tag: sentence-similarity
 
 # Reason-Code-ModernColBERT
 
-The **first ColBERT (late-interaction) model specifically designed for code search and retrieval**.
-
-Combines the token-granular matching advantages of ColBERT with reasoning-enhanced queries, extending the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain.
-
-## Why Late-Interaction for Code?
-
-All existing SOTA code search models (CodeXEmbed, Nomic Embed Code, Voyage Code) use bi-encoder / single-vector architectures. ColBERT's late-interaction approach computes token-level similarity (MaxSim), which is particularly well-suited for code because:
-
-- Code has rich token-level structure (identifiers, operators, keywords, types)
-- A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
-- MaxSim naturally captures partial matches between NL query tokens and code tokens
-- On reasoning tasks, [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (150M) outperformed 7B dense models
 
+A **ColBERT (late-interaction) model for code search and retrieval**, trained with reasoning-enhanced synthetic queries.
+
+Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. See also [LateOn-Code](https://huggingface.co/lightonai/LateOn-Code) by LightOn AI for another ColBERT approach to code search using direct fine-tuning on CoIR data.
+
+## Why Reasoning-Enhanced Training for Code?
+
+Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
 
 ## Model Details
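For context on the MaxSim scoring the diff discusses: ColBERT-style late interaction scores a query against a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, then summing. This is a minimal illustrative NumPy sketch of that scoring rule, not the model's actual implementation (which operates on learned, normalized token embeddings):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take the
    maximum dot-product similarity over all document token embeddings,
    then sum those maxima over the query tokens."""
    # sim[i, j] = similarity between query token i and document token j
    sim = query_vecs @ doc_vecs.T
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens and 3 document tokens in a 4-d space.
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.1, 0.0, 0.0],   # close match for query token 0
              [0.0, 0.2, 0.8, 0.0],
              [0.1, 0.7, 0.0, 0.2]])  # close match for query token 1
score = maxsim_score(q, d)  # 0.9 + 0.7 = 1.6
```

Because each query token is matched independently against its best document token, a partial overlap (e.g. one query token matching `reverse=True` while another matches `.sort()`) still contributes to the score, which is the "partial matches" property the removed section describes.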