Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval
README.md
@@ -186,11 +186,17 @@ For optimal performance, use these instruction prefixes for queries:
 
 ## Training Details
 
-
-
+Training followed a two-stage approach:
+
+**Stage 1 — Embedding Conversion** (8.8M samples):
+Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic SQL data with mined hard negatives.
+
+**Stage 2 — Hard Negative Refinement** (100K samples):
+Continued fine-tuning on a curated 100K-sample subset with up to 8 hard negatives per sample.
+
+- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
 - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
 - **Loss**: InfoNCE with temperature τ=0.05
-- **Hard Negatives**: Up to 8 per sample (GPT-validated)
 - **Effective Batch Size**: 1024 (via GradCache)
 - **Hardware**: NVIDIA H100 (95GB)
 
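For readers unfamiliar with the architecture bullet, "mean pooling, L2 normalization" can be illustrated with a minimal NumPy sketch. This is not the model's code — the function name, toy shapes, and values are invented for illustration; it only shows the masked-mean-then-normalize recipe the card describes.

```python
import numpy as np

def mean_pool_normalize(token_embeddings, attention_mask):
    """Masked mean pooling over token embeddings, then L2 normalization.

    token_embeddings: (seq_len, hidden) array of per-token vectors
    attention_mask:   (seq_len,) array of 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)     # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)   # sum over real tokens only
    pooled = summed / mask.sum()                     # mean over real tokens
    return pooled / np.linalg.norm(pooled)           # unit length

# Toy check: two real tokens plus one padding token that must be ignored.
toks = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])
vec = mean_pool_normalize(toks, mask)
```

With unit-normalized outputs like this, a dot product between two embeddings is directly their cosine similarity, which is what the retrieval scores in the benchmark assume.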
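The loss bullet (InfoNCE with temperature τ=0.05) and the hard-negative setup can likewise be sketched. The following is an assumption-laden toy, not the training loop: `info_nce`, the random toy embeddings, and the dimension 8 are all made up; it only shows how the positive is contrasted against negatives under a softmax sharpened by the temperature.

```python
import numpy as np

def info_nce(q, pos, negs, tau=0.05):
    """InfoNCE loss for one query, assuming unit-norm embeddings.

    q:    (d,) query embedding
    pos:  (d,) positive document embedding
    negs: (n, d) negatives (in-batch and/or mined hard negatives)
    tau:  temperature; the card reports tau = 0.05
    """
    # Dot products of unit vectors are cosine similarities.
    logits = np.concatenate(([q @ pos], negs @ q)) / tau
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # -log softmax prob of the positive

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
q = unit(rng.normal(size=8))
pos = unit(q + 0.1 * rng.normal(size=8))      # positive: close to the query
negs = np.stack([unit(rng.normal(size=8))     # up to 8 negatives per sample,
                 for _ in range(8)])          # as in Stage 2
loss = info_nce(q, pos, negs)
```

A small τ like 0.05 sharpens the softmax, so the loss concentrates on the hardest negatives — which is why mined hard negatives matter in a setup like the one described above.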