faisalmumtaz committed on
Commit 09dd21f · verified · 1 Parent(s): 7ba274d

Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -48,7 +48,7 @@ model-index:

 # CodeCompass-Embed

- **CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks.

 ## Model Highlights

@@ -186,7 +186,7 @@ For optimal performance, use these instruction prefixes for queries:

 ## Training Details

- - **Base Model**: Qwen2.5-Coder-0.5B (continued fine-tuning from previous CodeCompass checkpoint)
 - **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, CodeSearchNet + hard negatives
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
 - **Loss**: InfoNCE with temperature τ=0.05
 
 # CodeCompass-Embed

+ **CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens total. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, achieving state-of-the-art results on the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir).

 ## Model Highlights

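The new description says the model emits 896-dimensional embeddings for query-to-code matching. As a minimal numpy sketch (not the model's actual API — the toy vectors below stand in for real model outputs), this is how retrieval over L2-normalized embeddings works: cosine similarity collapses to a dot product, so ranking a corpus is one matrix-vector multiply:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm, guarding against division by zero."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Toy stand-ins for one query embedding and five code embeddings
# (the real model outputs 896-dim vectors, per the model card).
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=896))
corpus = l2_normalize(rng.normal(size=(5, 896)))

# With L2-normalized embeddings, cosine similarity is a plain dot product,
# so scoring the whole corpus is a single matrix-vector multiply.
scores = corpus @ query
ranking = np.argsort(-scores)  # indices of code snippets, best match first
```

In practice the same dot-product scoring is what a vector index (e.g. FAISS inner-product search) computes at scale.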
 
 ## Training Details

+ - **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B), a causal language model converted to a bidirectional embedding model with full attention across all 24 layers
 - **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, CodeSearchNet + hard negatives
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
 - **Loss**: InfoNCE with temperature τ=0.05
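The training bullets above name mean pooling and an InfoNCE loss with τ=0.05. A minimal numpy sketch of both, assuming standard in-batch negatives (the model card does not spell out the exact batching scheme):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Average token embeddings over the sequence, ignoring padding positions.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1e-9)

def info_nce(query_emb, code_emb, temperature=0.05):
    # In-batch InfoNCE: row i of `query_emb` pairs with row i of `code_emb`;
    # every other row in the batch serves as a negative.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = (q @ c.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # cross-entropy on the diagonal
```

The low temperature (0.05) sharpens the softmax, so the loss punishes hard negatives — code snippets nearly as similar to the query as the true match — much more than easy ones, which is why the training data pairs gold samples with mined hard negatives.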