SeanLee97 committed on
Commit 1700f8f
1 Parent(s): 98a78d2

update usage

Files changed (1)
  1. README.md +151 -60
README.md CHANGED
@@ -1,92 +1,183 @@
  ---
  library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
- - transformers
-
  ---

  # WhereIsAI/UAE-Code-Large-V1

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.

- <!--- Describe your model here -->

- ## Usage (Sentence-Transformers)

- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

- ```
- pip install -U sentence-transformers
- ```

- Then you can use the model like this:

- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('WhereIsAI/UAE-Code-Large-V1')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```

- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
-
- def cls_pooling(model_output, attention_mask):
-     return model_output[0][:,0]
-

- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('WhereIsAI/UAE-Code-Large-V1')
- model = AutoModel.from_pretrained('WhereIsAI/UAE-Code-Large-V1')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, cls pooling.
- sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
  ```

- ## Evaluation Results

- <!--- Describe how your model was evaluated -->

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=WhereIsAI/UAE-Code-Large-V1)

- ## Full Model Architecture
  ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
  ```

- ## Citing & Authors

- <!--- Describe where people can find more information -->
  ---
+ license: mit
+ datasets:
+ - WhereIsAI/github-issue-similarity
+ language:
+ - en
  library_name: sentence-transformers
+ pipeline_tag: feature-extraction
  ---

  # WhereIsAI/UAE-Code-Large-V1

+ This model was trained on the [GIS: GitHub Issue Similarity](https://huggingface.co/datasets/WhereIsAI/github-issue-similarity) dataset with the [AnglE](https://github.com/SeanLee97/AnglE) loss (https://arxiv.org/abs/2309.12871).
+ It can be used to measure **code/issue similarity**.

+ Results (test set):

+ - Spearman correlation: 71.19
+ - Accuracy: 84.37
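
Numbers of this kind could be reproduced roughly as sketched below. The sketch is illustrative only: the `test` split and the `text1`/`text2`/`label` column names are assumptions about the GIS dataset rather than its documented schema, and the 0.5 cosine cutoff used for accuracy is an arbitrary choice.

```python
# Illustrative evaluation sketch, not the official script.
# Assumed dataset schema: columns 'text1', 'text2', 'label' (0/1) in a 'test' split.
import numpy as np
from datasets import load_dataset
from scipy import spatial, stats
from angle_emb import AnglE

model = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()
test = load_dataset('WhereIsAI/github-issue-similarity', split='test')

sims, labels = [], []
for row in test:
    a, b = model.encode([row['text1'], row['text2']])
    sims.append(1 - spatial.distance.cosine(a, b))
    labels.append(row['label'])

print('Spearman correlation:', stats.spearmanr(sims, labels).correlation)

# Accuracy with a fixed 0.5 similarity threshold (arbitrary choice).
preds = [int(s >= 0.5) for s in sims]
print('Accuracy:', float(np.mean([p == l for p, l in zip(preds, labels)])))
```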

+ ## Usage

+ ### 1. angle-emb

+ You can use it via `angle-emb` as follows:

+ install:

+ ```
+ python -m pip install -U angle-emb
+ ```

+ example:

  ```python
+ from scipy import spatial
+ from angle_emb import AnglE
+
+ model = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()
+
+ quick_sort = '''# Approach 2: Quicksort using list comprehension
+
+ def quicksort(arr):
+     if len(arr) <= 1:
+         return arr
+     else:
+         pivot = arr[0]
+         left = [x for x in arr[1:] if x < pivot]
+         right = [x for x in arr[1:] if x >= pivot]
+         return quicksort(left) + [pivot] + quicksort(right)
+
+ # Example usage
+ arr = [1, 7, 4, 1, 10, 9, -2]
+ sorted_arr = quicksort(arr)
+ print("Sorted Array in Ascending Order:")
+ print(sorted_arr)'''
+
+
+ bubble_sort = '''def bubblesort(elements):
+     # Looping from size of array from last index[-1] to index [0]
+     for n in range(len(elements)-1, 0, -1):
+         swapped = False
+         for i in range(n):
+             if elements[i] > elements[i + 1]:
+                 swapped = True
+                 # swapping data if the element is less than next element in the array
+                 elements[i], elements[i + 1] = elements[i + 1], elements[i]
+         if not swapped:
+             # exiting the function if we didn't make a single swap
+             # meaning that the array is already sorted.
+             return
+
+ elements = [39, 12, 18, 85, 72, 10, 2, 18]
+
+ print("Unsorted list is,")
+ print(elements)
+ bubblesort(elements)
+ print("Sorted Array is, ")
+ print(elements)'''
+
+ vecs = model.encode([
+     'def echo(): print("hello world")',
+     quick_sort,
+     bubble_sort
+ ])
+
+ print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
+ print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
+ print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))
  ```

+ output:

+ ```
+ cos sim (0, 1): 0.34329649806022644
+ cos sim (0, 2): 0.3627094626426697
+ cos sim (1, 2): 0.6972219347953796
+ ```
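
As the output shows, the two sorting implementations score far higher against each other (about 0.70) than either does against the unrelated `echo` one-liner (about 0.34-0.36). The snippet below is a minimal, illustrative follow-on that reuses the `vecs` array computed above to find the closest snippet to a query; the `cosine_matrix` helper is hypothetical, not part of `angle-emb`.

```python
# Continues the example above: reuses `vecs`, the three embeddings already computed.
import numpy as np

def cosine_matrix(a, b):
    # Row-normalize both sides; a matrix product then yields all pairwise cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

corpus = np.asarray(vecs)        # the three snippets act as a tiny search corpus
query = corpus[1:2]              # query with the quicksort snippet
scores = cosine_matrix(query, corpus)[0]
scores[1] = -1.0                 # mask out the query itself
print('closest snippet to quicksort:', int(np.argmax(scores)))  # 2, i.e. bubble sort
```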

+ ### 2. sentence-transformers

+ You can also use it via `sentence-transformers`:

+ ```python
+ from scipy import spatial
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer('WhereIsAI/UAE-Code-Large-V1').cuda()
+
+ quick_sort = '''# Approach 2: Quicksort using list comprehension
+
+ def quicksort(arr):
+     if len(arr) <= 1:
+         return arr
+     else:
+         pivot = arr[0]
+         left = [x for x in arr[1:] if x < pivot]
+         right = [x for x in arr[1:] if x >= pivot]
+         return quicksort(left) + [pivot] + quicksort(right)
+
+ # Example usage
+ arr = [1, 7, 4, 1, 10, 9, -2]
+ sorted_arr = quicksort(arr)
+ print("Sorted Array in Ascending Order:")
+ print(sorted_arr)'''
+
+
+ bubble_sort = '''def bubblesort(elements):
+     # Looping from size of array from last index[-1] to index [0]
+     for n in range(len(elements)-1, 0, -1):
+         swapped = False
+         for i in range(n):
+             if elements[i] > elements[i + 1]:
+                 swapped = True
+                 # swapping data if the element is less than next element in the array
+                 elements[i], elements[i + 1] = elements[i + 1], elements[i]
+         if not swapped:
+             # exiting the function if we didn't make a single swap
+             # meaning that the array is already sorted.
+             return
+
+ elements = [39, 12, 18, 85, 72, 10, 2, 18]
+
+ print("Unsorted list is,")
+ print(elements)
+ bubblesort(elements)
+ print("Sorted Array is, ")
+ print(elements)'''
+
+ vecs = model.encode([
+     'def echo(): print("hello world")',
+     quick_sort,
+     bubble_sort
+ ])
+
+ print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
+ print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
+ print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))
+ ```

+ output:

  ```
+ cos sim (0, 1): 0.34329649806022644
+ cos sim (0, 2): 0.3627094626426697
+ cos sim (1, 2): 0.6972219347953796
  ```
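
Note that the similarities are identical to the `angle-emb` output above: both interfaces load the same weights and produce the same embeddings, so either can be used interchangeably.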

+ ## Citation

+ ```bibtex
+ @article{li2023angle,
+   title={AnglE-optimized Text Embeddings},
+   author={Li, Xianming and Li, Jing},
+   journal={arXiv preprint arXiv:2309.12871},
+   year={2023}
+ }
+ ```