Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval

README.md CHANGED

@@ -48,7 +48,7 @@ model-index:
 
 # CodeCompass-Embed
 
-**CodeCompass-Embed** is a
+**CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens total. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, achieving state-of-the-art results on the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir).
 
 ## Model Highlights
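The retrieval setup the added paragraph describes (unit-length embeddings compared by cosine similarity, evaluated with NDCG@10) can be sketched with stand-in vectors. The 896-dimension figure follows the card; the random corpus, the gold-document choice, and the helper names are purely illustrative, not the model's actual outputs:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def ndcg_at_k(relevance: np.ndarray, scores: np.ndarray, k: int = 10) -> float:
    """NDCG@k for a single query: DCG of the score-ranked list over ideal DCG."""
    order = np.argsort(-scores)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float(np.sum(relevance[order] * discounts))
    ideal = np.sort(relevance)[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0

rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=896))           # stand-in query embedding
corpus = l2_normalize(rng.normal(size=(100, 896)))   # stand-in code embeddings
corpus[0] = l2_normalize(query + 0.01 * rng.normal(size=896))  # gold snippet near query
relevance = np.zeros(100)
relevance[0] = 1.0

scores = corpus @ query                              # cosine similarity (unit vectors)
print(f"NDCG@10 = {ndcg_at_k(relevance, scores):.3f}")
```

Because the gold snippet's embedding is far closer to the query than any random vector in 896 dimensions, it ranks first and the score comes out as 1.000 here; real evaluation averages this per-query value over the whole query set.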
@@ -186,7 +186,7 @@ For optimal performance, use these instruction prefixes for queries:
 
 ## Training Details
 
-- **Base Model**: Qwen2.5-Coder-0.5B
+- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) — a causal language model converted to a bidirectional embedding model with full attention across all 24 layers
 - **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, CodeSearchNet + hard negatives
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
 - **Loss**: InfoNCE with temperature τ=0.05
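The pooling step from the Architecture bullet (mean pooling over token states, then L2 normalization) can be sketched in NumPy. The toy shapes and the padding mask here are illustrative; the card does not specify the exact masking convention, so masked mean pooling is an assumption:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average final-layer token states over non-padding positions, then L2-normalize.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    pooled = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1e-9)
    # L2-normalize so cosine similarity reduces to a plain dot product.
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Toy batch: sequence 0 has two padding positions that must not dilute its mean.
h = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
m = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
emb = mean_pool(h, m)
print(emb.shape)                    # (2, 3)
print(np.linalg.norm(emb, axis=1))  # each row has unit norm
```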
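The Loss bullet (InfoNCE with τ=0.05) corresponds to a contrastive objective over query/positive pairs. A minimal NumPy sketch under the usual in-batch-negatives formulation follows; the batch construction and how hard negatives are appended are assumptions, not the card's exact recipe:

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    """In-batch InfoNCE: d[i] is the positive for q[i]; every other row of d
    (other positives plus any appended hard negatives) acts as a negative.

    Inputs are L2-normalized; q: (B, dim), d: (B + n_neg, dim).
    """
    logits = (q @ d.T) / tau                      # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return float(-log_probs[idx, idx].mean())     # cross-entropy on the diagonal

# Identical query/document pairs: each positive dominates, so loss is near zero.
e = np.eye(4, 8)
print(round(info_nce(e, e), 6))  # → 0.0
```

The low temperature (τ=0.05) sharpens the softmax, so a positive only a little more similar than the negatives already receives most of the probability mass.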