---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---
## SageLite-s
### Model Description
SageLite is a family of open, encoder-based embedding models that supports a wide range of tasks on both code and text. It was trained in three stages:
1. **MLM Pretraining**: Standard masked language model (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large amount of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data (a loss sketch follows below).
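
The card does not spell out the contrastive objective; as a rough illustration only, stages 2–3 typically optimize an in-batch InfoNCE-style loss over (query, positive) pairs. The temperature and pairing scheme below are assumptions, not the exact SageLite recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05):
    """In-batch contrastive (InfoNCE) loss over (query, positive) embedding pairs.

    query_emb, pos_emb: [batch_size, dim]. Row i of pos_emb is the positive for
    row i of query_emb; every other row in the batch serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                       # [batch, batch] cosine-similarity logits
    labels = torch.arange(q.size(0), device=q.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```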
---
### **Training Data**
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages include English and the following programming languages: C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
---
### **How to Use**
This checkpoint consists of an 80M-parameter encoder that produces 768-dimensional embeddings for code and text. It can be loaded with the Hugging Face Transformers library and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```python
from transformers import AutoModel, AutoTokenizer
# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
device = "cuda" # Use "cpu" if GPU is unavailable
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0] # Extract the embedding
```
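
As a quick sanity check, the returned embedding can be used to score code snippets against a natural-language query with cosine similarity, reusing `tokenizer`, `model`, and `device` from the snippet above. The pooling step is an assumption about the output shape of `model(inputs)[0]`, and the query/snippets are purely illustrative.

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(ids)[0]
    # If the checkpoint returns per-token states ([1, seq_len, dim]),
    # mean-pool them into one vector; if it already returns a pooled
    # vector ([1, dim]), use it as-is.
    return out.mean(dim=1) if out.dim() == 3 else out

query = "print a greeting to the console"  # illustrative query
snippets = [
    "def print_hello_world():\tprint('Hello World!')",
    "def add(a, b):\treturn a + b",
]

q = embed(query)
scores = [F.cosine_similarity(q, embed(s)).item() for s in snippets]
print(scores)  # higher = more similar; the first snippet should score higher
```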
### **Code Retrieval Performance**
#### 1. Code2Code Search
| Model Name | # Params | Embed Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |
#### 2. NL2Code Search
| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |
---
### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**
| Dataset (nDCG@10) | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| Average | 46.49 | 49.98 |
---