faisalmumtaz committed on
Commit f44098a · verified · 1 Parent(s): 09dd21f

Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval

Files changed (1): README.md (+9 −3)
README.md CHANGED
@@ -186,11 +186,17 @@ For optimal performance, use these instruction prefixes for queries:
 
 ## Training Details
 
-- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) — a causal language model converted to a bidirectional embedding model with full attention across all 24 layers
-- **Training Data**: 100K GPT-filtered gold-standard samples from CoRNStack, StackOverflow, CodeSearchNet + hard negatives
+Training followed a two-stage approach:
+
+**Stage 1 — Embedding Conversion** (8.8M samples):
+Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic SQL data with mined hard negatives.
+
+**Stage 2 — Hard Negative Refinement** (100K samples):
+Continued fine-tuning on a curated 100K-sample subset with up to 8 hard negatives per sample.
+
+- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
 - **Loss**: InfoNCE with temperature τ=0.05
-- **Hard Negatives**: Up to 8 per sample (GPT-validated)
 - **Effective Batch Size**: 1024 (via GradCache)
 - **Hardware**: NVIDIA H100 (95GB)
 
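The recipe in the updated README combines three pieces: mean pooling over token embeddings, L2 normalization, and an InfoNCE loss with in-batch negatives at temperature τ=0.05. A minimal NumPy sketch of those pieces — function names are illustrative, not the model's actual training code:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask[..., None]            # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)                   # tokens per sequence
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: the positive for query i is doc i."""
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))
    # numerically stable log-softmax cross-entropy on the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()
```

Because the embeddings are unit-normalized, dividing cosine similarities by τ=0.05 sharpens the softmax, so each query is pushed strongly toward its paired document and away from the other in-batch (and hard) negatives.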