root committed on
Commit
55d3639
1 Parent(s): b024045

Revert "Add new SentenceTransformer model."


This reverts commit b024045f7012157b9e8a62c7a2b82938fa48b059.

Files changed (2)
  1. README.md +145 -100
  2. config_sentence_transformers.json +1 -1
README.md CHANGED
@@ -6,135 +6,180 @@ tags:
  - sentence-similarity
  - feature-extraction
  ---

- # SentenceTransformer
-
- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 8192 tokens
- - **Output Dimensionality:** 2048 tokens
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
-   (1): Pooling({'word_embedding_dimension': 2048, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
- )
- ```
-
- ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:

- ```bash
- pip install -U sentence-transformers
- ```

- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer

- # Download from the 🤗 Hub
- model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")
- # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 2048]

- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```

- <!--
- ### Direct Usage (Transformers)

- <details><summary>Click to see the direct usage in Transformers</summary>

- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)

- You can finetune this model on your own dataset.

- <details><summary>Click to expand</summary>

- </details>
- -->

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- <!--
- ## Bias, Risks and Limitations

- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

- ## Training Details

- ### Framework Versions
- - Python: 3.9.20
- - Sentence Transformers: 3.1.1
- - Transformers: 4.45.2
- - PyTorch: 2.4.1+cu121
- - Accelerate: 1.0.0
- - Datasets: 3.0.1
- - Tokenizers: 0.20.1

- ## Citation

  ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->

  - sentence-similarity
  - feature-extraction
  ---
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
+ </div>
+ <hr>

+ # Kwaipilot OASIS-1.3B

  ## Model Details
+ **Model Name**: OASIS (Optimized Augmentation Strategy for Improved code Search)

+ **Introduction**

+ OASIS is a state-of-the-art code embedding model developed by Kwaipilot. It incorporates unique, proprietary methods, including **repository-level program analysis**, the **OASIS-instruct data synthesis** algorithm, and a **specialized fusion loss function**, setting new benchmarks in code search efficiency and accuracy.

+ **Intended Use**

+ This model is ideal for developers and researchers engaged in enhancing **code retrieval systems**. OASIS excels in scenarios requiring semantic understanding and retrieval of code snippets within varied programming contexts.

+ **Training and Performance**

+ OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring broad understanding across different coding styles and languages. It has demonstrated state-of-the-art performance on the latest code search benchmarks.

+ ## Future Directions
+ Kwaipilot's upcoming initiatives include:

+ - Open-sourcing improved models.
+ - Releasing technical reports.
+ - Releasing natural language processing models.
+ - ...

+ ## Performance

+ |                              | Size    | CoSQA  | AdvTest | CSN-Py | CSN-Ja | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby |
+ |------------------------------|:-------:|:------:|:-------:|:------:|:------:|:------:|:-------:|:------:|:--------:|
+ | Openai-Embedding-Ada-002     | Unknown | 0.4423 | 0.3808  | 0.6802 | 0.7149 | 0.6750 | 0.6062  | 0.8563 | 0.7472   |
+ | jina-embeddings-v2-base-code | 161M    | 0.6837 | 0.3850  | 0.6634 | 0.6803 | 0.6304 | 0.5701  | 0.8595 | 0.7095   |
+ | CodeSage-large               | 1.3B    | 0.4753 | 0.5267  | 0.7077 | 0.7021 | 0.6950 | 0.6133  | 0.8371 | 0.7192   |
+ | CodeFuse-CGE-Small           | 3.8B    | 0.5619 | 0.4639  | 0.6958 | 0.6863 | 0.6564 | 0.6133  | 0.8637 | 0.7341   |
+ | OASIS-1.3B                   | 1.3B    | 0.5532 | 0.4861  | 0.7010 | 0.7199 | 0.6727 | 0.6217  | 0.8732 | 0.7333   |

+ ## Usage

+ ### Direct Usage

+ ```bash
+ pip install -U torch
+ pip install -U transformers
+ ```

+ Avoid using torch 2.5.0 when loading the model with `torch_dtype=torch.bfloat16`. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.

+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ from torch import Tensor
+ from transformers import AutoModel, AutoTokenizer
+
+ # Pool the hidden state of the last non-padding token (handles both left- and right-padded batches).
+ def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
+     if left_padding:
+         return last_hidden_states[:, -1]
+     else:
+         sequence_lengths = attention_mask.sum(dim=1) - 1
+         batch_size = last_hidden_states.shape[0]
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+
+ # Add query prompt
+ def get_query_prompt(query: str):
+     query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
+     prompt = f'Instruct: {query_description}\nQuery: {query}'
+     return prompt
+
+ query = "How to do quicksort in python?"
+
+ code1 = """def bubble_sort(arr):
+     n = len(arr)
+     for i in range(n):
+         swapped = False
+         for j in range(1, n - i):
+             if arr[j - 1] > arr[j]:
+                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
+                 swapped = True
+         if not swapped:
+             break
+     return arr"""
+
+ code2 = """def quick_sort(arr):
+     if len(arr) <= 1:
+         return arr
+     else:
+         pivot = arr[0]
+         less = [x for x in arr[1:] if x <= pivot]
+         greater = [x for x in arr[1:] if x > pivot]
+         return quick_sort(less) + [pivot] + quick_sort(greater)"""
+
+ model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
+ tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")
+
+ # Tokenize and inference
+ inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
+ outputs = model(**inputs)
+
+ # Last token pooling
+ embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
+ print(embeddings.shape)
+ # torch.Size([3, 2048])
+
+ embeddings = F.normalize(embeddings, dim=1, p=2)
+ similarity = embeddings @ embeddings.T
+ print(similarity[0, 1:])
+ # tensor([0.6495, 0.8036])
+ ```
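
The cosine scores above can be used directly for retrieval. A minimal follow-up sketch (illustrative only, not part of the model card itself), reusing `torch` and the `similarity` tensor from the snippet above to pick the best-matching snippet for the query:

```python
# Rank the candidate snippets against the query (row 0 of the similarity matrix).
scores = similarity[0, 1:]                      # query vs. code1, code2
order = torch.argsort(scores, descending=True)  # best match first
best = int(order[0])
print("best match:", ["code1", "code2"][best])
# best match: code2  (quick_sort answers the quicksort query)
```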

+ ### Sentence Transformers

+ First install the Sentence Transformers library:

+ ```bash
+ pip install -U sentence-transformers
+ ```

+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")  # optionally: model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`)
+
+ query = "How to do quicksort in python?"
+
+ code1 = """def bubble_sort(arr):
+     n = len(arr)
+     for i in range(n):
+         swapped = False
+         for j in range(1, n - i):
+             if arr[j - 1] > arr[j]:
+                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
+                 swapped = True
+         if not swapped:
+             break
+     return arr"""
+
+ code2 = """def quick_sort(arr):
+     if len(arr) <= 1:
+         return arr
+     else:
+         pivot = arr[0]
+         less = [x for x in arr[1:] if x <= pivot]
+         greater = [x for x in arr[1:] if x > pivot]
+         return quick_sort(less) + [pivot] + quick_sort(greater)"""
+
+ # Run inference
+ query_embedding = model.encode([query], prompt_name="query")
+ code_embeddings = model.encode([code1, code2])
+
+ print(code_embeddings.shape)
+ # (2, 2048)
+
+ # Get the similarity scores for the embeddings
+ print(model.similarity(query_embedding[0], code_embeddings[0]))
+ print(model.similarity(query_embedding[0], code_embeddings[1]))
+ # tensor([[0.6495]])
+ # tensor([[0.8036]])
+ ```

  ### BibTeX
+ ```bibtex
+ @misc{kwaipilotoasis,
+     title  = {Optimized Augmentation Strategy for Improved code Search},
+     author = {Kwaipilot team},
+     year   = {2024},
+ }
+ ```

config_sentence_transformers.json CHANGED
@@ -8,5 +8,5 @@
  "query": "Instruct: Given a code search query, retrieve relevant code snippet that answer the query\nQuery: "
  },
  "default_prompt_name": null,
- "similarity_fn_name": null
+ "similarity_fn_name": "cosine"
  }
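
With `"similarity_fn_name": "cosine"`, `SentenceTransformer.similarity()` computes cosine similarity, i.e. the dot product of L2-normalized embeddings, which is the same quantity the Transformers snippet computes by hand with `F.normalize`. A minimal sketch of the equivalence (illustrative only, assuming the framework versions listed above):

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")

emb = model.encode(["How to do quicksort in python?", "def quick_sort(arr): ..."],
                   convert_to_tensor=True)

# Manual cosine similarity: dot product of L2-normalized embeddings.
manual = F.normalize(emb, p=2, dim=1) @ F.normalize(emb, p=2, dim=1).T
# Built-in similarity: uses the configured "cosine" similarity function.
built_in = model.similarity(emb, emb)

print(torch.allclose(manual, built_in, atol=1e-6))
# expected: True
```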