Update: cite LateOn-Code, remove first-ColBERT claim, clarify reasoning approach
README.md CHANGED

```diff
@@ -25,18 +25,13 @@ pipeline_tag: sentence-similarity
 
 # Reason-Code-ModernColBERT
 
-
-
-
-
-## Why
-
-
-
-- Code has rich token-level structure (identifiers, operators, keywords, types)
-- A query like "sort array in reverse order" needs to match specific code tokens (`.sort()`, `reverse=True`)
-- MaxSim naturally captures partial matches between NL query tokens and code tokens
-- On reasoning tasks, [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (150M) outperformed 7B dense models
+A **ColBERT (late-interaction) model for code search and retrieval**, trained with reasoning-enhanced synthetic queries.
+
+Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. See also [LateOn-Code](https://huggingface.co/lightonai/LateOn-Code) by LightOn AI for another ColBERT approach to code search using direct fine-tuning on CoIR data.
+
+## Why Reasoning-Enhanced Training for Code?
+
+Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
 
 ## Model Details
 
```
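For context on the MaxSim operator the diff refers to, here is a minimal NumPy sketch of ColBERT-style late-interaction scoring. The toy random embeddings are purely illustrative (this is not the model's actual encoder or implementation): each query token keeps only its best-matching document token, so a natural-language token can latch onto a single code token such as `reverse=True`.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token embedding,
    take the maximum cosine similarity over all document token embeddings,
    then sum over query tokens."""
    # L2-normalize rows so dot products become cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query

# Toy example: 3 query tokens vs. 5 code tokens in a 4-dim embedding space.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(3, 4))
code_tokens = rng.normal(size=(5, 4))
print(maxsim_score(query_tokens, code_tokens))
```

Because each query token is matched independently, partial overlaps score well even when most document tokens are irrelevant, which is the property the removed bullet list attributed to MaxSim.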