---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---

## SageLite-s

### Model Description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
1. **MLM Pretraining**: Standard masked language model (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large amount of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data.
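The two contrastive stages learn from positive pairs by pulling each pair together while pushing apart other examples in the batch. The exact objective is not specified above, so the sketch below is a generic in-batch InfoNCE-style loss for illustration; the `temperature` value and function name are assumptions, not SageLite's actual training configuration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of `pos_emb` is the positive
    for row i of `query_emb`; all other rows act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature           # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))         # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy batch of 4 positive pairs with 8-dimensional embeddings.
loss = info_nce_loss(torch.randn(4, 8), torch.randn(4, 8))
```

With mined web/GitHub pairs in stage 2 and synthetic pairs in stage 3, the same objective applies; only the data source changes.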

---

### **Training Data**
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.

---


### **How to Use**
This checkpoint consists of an 80M-parameter encoder that extracts 768-dimensional code embeddings. It can be loaded with the Hugging Face Transformers library and uses the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model (the custom architecture requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
with torch.no_grad():
    embedding = model(inputs)[0]  # Extract the embedding
```
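For retrieval, embeddings are typically compared with cosine similarity: embed the query and all candidates, then rank candidates by similarity. A minimal ranking sketch (using small toy vectors in place of real model outputs, so it runs without the checkpoint):

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor):
    """Rank candidates by cosine similarity to the query (best first)."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(candidate_embs, dim=-1)
    scores = c @ q                                    # cosine similarity per candidate
    return torch.argsort(scores, descending=True), scores

# Toy example: the query points mostly along the second axis,
# so candidate 1 should rank first.
candidates = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = torch.tensor([0.1, 1.0])
order, scores = rank_candidates(query, candidates)    # order: [1, 2, 0]
```

In practice, `query_emb` and `candidate_embs` would come from the encoder above; batching the candidate pass is advisable for large corpora.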

### **Code Retrieval Performance**

#### 1. Code2Code Search

| Model Name          | # Params | Embd Dim | Python | Java  | JS    | TS     | C#     | C      | Ruby   | PHP    | Go     | Avg    |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01      | NA       | 3072     | 21.92  | 8.90  | 4.90  | 5.70   | 3.15   | 11.58  | 26.25  | 16.60  | 9.40   | 12.04  |
| OpenAI-Text-3-Small | NA       | 1536     | 25.18  | 12.61 | 8.00  | 9.44   | 5.46   | 15.86  | 30.70  | 23.33  | 11.20  | 15.57  |
| OpenAI-Text-3-Large | NA       | 3072     | 40.57  | 25.33 | 20.09 | 22.00  | 11.84  | 31.90  | 42.54  | 41.84  | 21.75  | 28.65  |
| CodeSage-v2-Small   | 130M     | 1024     | 45.60  | 33.65 | 39.96 | 47.78  | 19.19  | 30.55  | 40.12  | 55.39  | 30.96  | 38.13  |
| CodeSage-v2-Base    | 356M     | 1024     | 55.86  | 42.89 | 45.29 | 54.58  | 23.90  | 38.52  | 56.02  | 64.56  | 42.88  | 47.17  |
| CodeSage-v2-Large   | 1.3B     | 2048     | 61.11  | 47.09 | 51.18 | 60.67  | 28.04  | 43.40  | 60.74  | 67.87  | 43.86  | 51.55  |
| SageLite-s          | 80M      | 768      | 47.93  | 30.83 | 35.15 | 37.64  | 18.14  | 30.53  | 42.89  | 50.70  | 21.69  | 35.06  |
| SageLite-l          | 850M     | 1536     | 64.46  | 45.53 | 50.80 | 54.71  | 30.66  | 47.46  | 61.01  | 68.68  | 39.25  | 51.40  |

#### 2. NL2Code Search

| Model Name          | # Params | CoSQA | AdvTest | Python | Java  | JS    | PHP    | Go     | Ruby   | Avg    |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01      | NA       | 52.20 | 36.03   | 63.13  | 67.85 | 62.30 | 57.47  | 85.22  | 69.28  | 61.69  |
| OpenAI-Text-3-Small | NA       | 52.48 | 34.10   | 62.62  | 65.87 | 60.28 | 54.85  | 81.96  | 67.57  | 59.97  |
| OpenAI-Text-3-Large | NA       | 55.21 | 46.83   | 70.81  | 72.89 | 68.12 | 59.58  | 87.60  | 75.22  | 67.03  |
| CodeSage-v2-Small   | 130M     | 52.39 | 47.28   | 68.79  | 68.13 | 65.77 | 60.20  | 80.26  | 72.46  | 64.41  |
| CodeSage-v2-Base    | 356M     | 50.74 | 52.00   | 70.46  | 70.89 | 69.61 | 62.81  | 82.37  | 73.71  | 66.57  |
| CodeSage-v2-Large   | 1.3B     | 53.18 | 56.31   | 74.18  | 72.33 | 72.49 | 65.26  | 84.67  | 76.61  | 69.38  |
| SageLite-s          | 80M      | 56.49 | 42.32   | 67.59  | 66.62 | 62.32 | 58.87  | 79.36  | 70.75  | 63.04  |
| SageLite-l          | 850M     | 59.76 | 55.55   | 74.25  | 71.76 | 69.35 | 61.62  | 84.09  | 77.14  | 69.19  |

---

### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**

| Metric                        | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna                       | 57.75      | 60.71      |
| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
| FiQA2018                      | 34.85      | 46.73      |
| NFCorpus                      | 29.97      | 33.70      |
| QuoraRetrieval                | 85.35      | 87.50      |
| SCIDOCS                       | 18.99      | 21.38      |
| SciFact                       | 68.43      | 69.05      |
| Touche2020                    | 24.41      | 21.43      |
| TRECCOVID                     | 70.88      | 76.08      |
| FEVER                         | 71.72      | 73.64      |
| HotpotQA                      | 58.81      | 62.96      |
| NQ                            | 48.26      | 54.48      |
| DBPedia                       | 34.83      | 40.69      |
| ClimateFEVER                  | 25.69      | 26.20      |
| MSMARCO                       | 35.01      | 36.55      |
| Average                       | 46.49      | 49.98      |

---