diwank committed on
Commit
fa37eec
1 Parent(s): 5b57e86

Push model using huggingface_hub.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
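
Only `pooling_mode_cls_token` is enabled in this pooling config, so the sentence embedding is presumably the 768-dimensional hidden state of the first token. The sketch below contrasts that with masked mean pooling on plain Python lists; `cls_pool` and `mean_pool` are hypothetical helper names, not functions from sentence-transformers.

```python
from typing import List


def cls_pool(token_embeddings: List[List[float]]) -> List[float]:
    """CLS pooling: use the first token's vector as the sentence embedding."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Mean pooling, shown only for contrast: average the non-padding token vectors."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy sequence of three 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(cls_pool(tokens))
print(mean_pool(tokens, mask))
```

With CLS pooling the padding mask does not affect the result, since only position 0 is read.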
README.md ADDED
@@ -0,0 +1,763 @@
1
+ ---
2
+ base_model: Alibaba-NLP/gte-base-en-v1.5
3
+ datasets:
4
+ - diwank/hn-upvote-data
5
+ library_name: setfit
6
+ metrics:
7
+ - accuracy
8
+ pipeline_tag: text-classification
9
+ tags:
10
+ - setfit
11
+ - sentence-transformers
12
+ - text-classification
13
+ - generated_from_setfit_trainer
14
+ widget:
15
+ - text: 'Title: Pixar’s Rules of Storytelling
16
+
17
+ Source: b''aerogrammestudio.com'''
18
+ - text: 'Title: What I''ve learned about Open Source community over 30 years
19
+
20
+ Source: b'''''
21
+ - text: 'Title: My Python code is a neural network
22
+
23
+ Source: b'''''
24
+ - text: 'Title: The telltale words that could identify generative AI text
25
+
26
+ Source: b'''''
27
+ - text: 'Title: What I''ve learned about Open Source community over 30 years
28
+
29
+ Source: b'''''
30
+ inference: true
31
+ ---
32
+
33
+ # SetFit with Alibaba-NLP/gte-base-en-v1.5
34
+
35
+ This is a [SetFit](https://github.com/huggingface/setfit) model trained on the [diwank/hn-upvote-data](https://huggingface.co/datasets/diwank/hn-upvote-data) dataset that can be used for Text Classification. This SetFit model uses [Alibaba-NLP/gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
36
+
37
+ The model has been trained using an efficient few-shot learning technique that involves:
38
+
39
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
40
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
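
As a rough illustration of step 1, the sketch below builds the kind of labeled sentence pairs a cosine-similarity contrastive objective consumes: pairs with the same label get target 1.0, pairs with different labels get 0.0. `make_pairs` is a hypothetical helper that enumerates every pair; SetFit itself samples pairs according to `sampling_strategy`, so treat this as a simplification and not as the library's implementation.

```python
from itertools import combinations
from typing import List, Tuple


def make_pairs(texts: List[str], labels: List[int]) -> List[Tuple[str, str, float]]:
    """Enumerate all text pairs with a similarity target of 1.0 (same label) or 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Toy input; the titles are placeholders, not rows from the dataset.
examples = ["title a", "title b", "title c"]
example_labels = [0, 0, 1]
for pair in make_pairs(examples, example_labels):
    print(pair)
```

Step 2 then embeds each training text with the fine-tuned body and fits the classification head on those embeddings.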
41
+
42
+ ## Model Details
43
+
44
+ ### Model Description
45
+ - **Model Type:** SetFit
46
+ - **Sentence Transformer body:** [Alibaba-NLP/gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5)
47
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
48
+ - **Maximum Sequence Length:** 8192 tokens
49
+ - **Number of Classes:** 2 classes
50
+ - **Training Dataset:** [diwank/hn-upvote-data](https://huggingface.co/datasets/diwank/hn-upvote-data)
51
+ <!-- - **Language:** Unknown -->
52
+ <!-- - **License:** Unknown -->
53
+
54
+ ### Model Sources
55
+
56
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
57
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
58
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
59
+
60
+ ### Model Labels
61
+ | Label | Examples |
62
+ |:------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
63
+ | 0 | <ul><li>"Title: The telltale words that could identify generative AI text\nSource: b''"</li><li>"Title: What I've learned about Open Source community over 30 years\nSource: b''"</li><li>"Title: My Python code is a neural network\nSource: b''"</li></ul> |
64
+ | 1 | <ul><li>"Title: Rat Park Experiment: A New Theory of Addiction\nSource: b'sub.garrytan.com'"</li><li>"Title: Thinking the unthinkable\nSource: b'anarchistsoccermom.blogspot.com'"</li><li>"Title: Realtime Analysis of the Oroville Dam Disaster\nSource: b'github.com'"</li></ul> |
65
+
66
+ ## Uses
67
+
68
+ ### Direct Use for Inference
69
+
70
+ First install the SetFit library:
71
+
72
+ ```bash
73
+ pip install setfit
74
+ ```
75
+
76
+ Then you can load this model and run inference.
77
+
78
+ ```python
79
+ from setfit import SetFitModel
80
+
81
+ # Download from the 🤗 Hub
82
+ model = SetFitModel.from_pretrained("diwank/hn-upvote-classifier")
83
+ # Run inference
84
+ preds = model("Title: My Python code is a neural network\nSource: b''")
86
+ ```
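
The example inputs in this card follow a `Title: ...` line and a `Source: b'...'` line, where the source looks like the `repr` of a Python bytes object. Assuming that is how the training text was built (the card does not say), a small hypothetical helper such as `format_example` could produce matching inputs:

```python
def format_example(title: str, source: str = "") -> str:
    """Build an input string in the 'Title/Source' layout shown in this card.

    Assumption: the source domain was rendered as a bytes repr (b'...').
    """
    return f"Title: {title}\nSource: {source.encode('utf-8')!r}"


print(format_example("My Python code is a neural network"))
print(format_example("Realtime Analysis of the Oroville Dam Disaster", "github.com"))
```

The resulting strings can be passed to the loaded model in the same way as the literal string above.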
87
+
88
+ <!--
89
+ ### Downstream Use
90
+
91
+ *List how someone could finetune this model on their own dataset.*
92
+ -->
93
+
94
+ <!--
95
+ ### Out-of-Scope Use
96
+
97
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
98
+ -->
99
+
100
+ <!--
101
+ ## Bias, Risks and Limitations
102
+
103
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
104
+ -->
105
+
106
+ <!--
107
+ ### Recommendations
108
+
109
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
110
+ -->
111
+
112
+ ## Training Details
113
+
114
+ ### Training Set Metrics
115
+ | Training set | Min | Median | Max |
116
+ |:-------------|:----|:--------|:----|
117
+ | Word count | 3 | 10.2389 | 20 |
118
+
119
+ | Label | Training Sample Count |
120
+ |:------|:----------------------|
121
+ | 0 | 3302 |
122
+ | 1 | 1114 |
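
The label counts above are imbalanced, which is worth keeping in mind when reading an accuracy figure. The arithmetic below uses only the two counts from the table; the majority-class baseline applies to the training distribution and is an assumption for any other split.

```python
# Training sample counts per label, copied from the table above.
counts = {0: 3302, 1: 1114}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = max(counts.values()) / total

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(majority_baseline, 3))
```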
123
+
124
+ ### Training Hyperparameters
125
+ - batch_size: (256, 16)
126
+ - num_epochs: (1, 16)
127
+ - max_steps: -1
128
+ - sampling_strategy: undersampling
129
+ - body_learning_rate: (4e-05, 2e-05)
130
+ - head_learning_rate: 0.01
131
+ - loss: CosineSimilarityLoss
132
+ - distance_metric: cosine_distance
133
+ - margin: 0.25
134
+ - end_to_end: True
135
+ - use_amp: True
136
+ - warmup_proportion: 0.05
137
+ - l2_weight: 0.2
138
+ - seed: 42
139
+ - eval_max_steps: -1
140
+ - load_best_model_at_end: True
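
Several of the values above are tuples. In SetFit's training arguments a tuple is read as (embedding fine-tuning phase, classifier training phase), while a single value applies to both. The sketch below only illustrates that convention with a plain dictionary; `per_phase` is a hypothetical helper, not part of SetFit.

```python
from typing import Any, Dict

# A subset of the hyperparameters listed above.
hyperparameters = {
    "batch_size": (256, 16),
    "num_epochs": (1, 16),
    "body_learning_rate": (4e-05, 2e-05),
    "head_learning_rate": 0.01,
}


def per_phase(value: Any) -> Dict[str, Any]:
    """Split a (embedding, classifier) tuple; repeat a scalar for both phases."""
    if isinstance(value, tuple):
        embedding, classifier = value
        return {"embedding": embedding, "classifier": classifier}
    return {"embedding": value, "classifier": value}


for name, value in hyperparameters.items():
    print(name, per_phase(value))
```

Read this way, the contrastive phase ran with batch size 256 for 1 epoch.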
141
+
142
+ ### Training Results
143
+ | Epoch | Step | Training Loss | Validation Loss |
144
+ |:----------:|:---------:|:-------------:|:---------------:|
145
+ | 0.0000 | 1 | 0.1861 | - |
146
+ | 0.0017 | 50 | 0.1334 | - |
147
+ | 0.0035 | 100 | 0.0344 | - |
148
+ | 0.0052 | 150 | 0.0048 | - |
149
+ | 0.0070 | 200 | 0.0027 | - |
150
+ | 0.0087 | 250 | 0.002 | - |
151
+ | 0.0104 | 300 | 0.0016 | - |
152
+ | 0.0122 | 350 | 0.0011 | - |
153
+ | 0.0139 | 400 | 0.001 | - |
154
+ | 0.0157 | 450 | 0.0009 | - |
155
+ | 0.0174 | 500 | 0.0008 | - |
156
+ | 0.0191 | 550 | 0.0006 | - |
157
+ | 0.0209 | 600 | 0.0006 | - |
158
+ | 0.0226 | 650 | 0.0006 | - |
159
+ | 0.0244 | 700 | 0.0005 | - |
160
+ | 0.0261 | 750 | 0.0005 | - |
161
+ | 0.0278 | 800 | 0.0004 | - |
162
+ | 0.0296 | 850 | 0.0004 | - |
163
+ | 0.0313 | 900 | 0.0004 | - |
164
+ | 0.0331 | 950 | 0.0003 | - |
165
+ | 0.0348 | 1000 | 0.0003 | - |
166
+ | 0.0365 | 1050 | 0.0003 | - |
167
+ | 0.0383 | 1100 | 0.0002 | - |
168
+ | 0.0400 | 1150 | 0.0002 | - |
169
+ | 0.0418 | 1200 | 0.0002 | - |
170
+ | 0.0435 | 1250 | 0.0002 | - |
171
+ | 0.0452 | 1300 | 0.0002 | - |
172
+ | 0.0470 | 1350 | 0.0002 | - |
173
+ | 0.0487 | 1400 | 0.0002 | - |
174
+ | 0.0505 | 1450 | 0.0001 | - |
175
+ | 0.0522 | 1500 | 0.0001 | - |
176
+ | 0.0539 | 1550 | 0.0001 | - |
177
+ | 0.0557 | 1600 | 0.0001 | - |
178
+ | 0.0574 | 1650 | 0.0001 | - |
179
+ | 0.0592 | 1700 | 0.0001 | - |
180
+ | 0.0609 | 1750 | 0.0001 | - |
181
+ | 0.0626 | 1800 | 0.0001 | - |
182
+ | 0.0644 | 1850 | 0.0001 | - |
183
+ | 0.0661 | 1900 | 0.0001 | - |
184
+ | 0.0679 | 1950 | 0.0001 | - |
185
+ | 0.0696 | 2000 | 0.0001 | - |
186
+ | 0.0713 | 2050 | 0.0001 | - |
187
+ | 0.0731 | 2100 | 0.0001 | - |
188
+ | 0.0748 | 2150 | 0.0001 | - |
189
+ | 0.0766 | 2200 | 0.0001 | - |
190
+ | 0.0783 | 2250 | 0.0001 | - |
191
+ | 0.0800 | 2300 | 0.0001 | - |
192
+ | 0.0818 | 2350 | 0.0001 | - |
193
+ | 0.0835 | 2400 | 0.0001 | - |
194
+ | 0.0853 | 2450 | 0.0001 | - |
195
+ | 0.0870 | 2500 | 0.0001 | - |
196
+ | 0.0887 | 2550 | 0.0001 | - |
197
+ | 0.0905 | 2600 | 0.0001 | - |
198
+ | 0.0922 | 2650 | 0.0001 | - |
199
+ | 0.0940 | 2700 | 0.0 | - |
200
+ | 0.0957 | 2750 | 0.0001 | - |
201
+ | 0.0974 | 2800 | 0.0001 | - |
202
+ | 0.0992 | 2850 | 0.0001 | - |
203
+ | 0.1009 | 2900 | 0.0001 | - |
204
+ | 0.1027 | 2950 | 0.0001 | - |
205
+ | 0.1044 | 3000 | 0.0001 | 0.0 |
206
+ | 0.1061 | 3050 | 0.0001 | - |
207
+ | 0.1079 | 3100 | 0.0001 | - |
208
+ | 0.1096 | 3150 | 0.0001 | - |
209
+ | 0.1114 | 3200 | 0.0001 | - |
210
+ | 0.1131 | 3250 | 0.0 | - |
211
+ | 0.1148 | 3300 | 0.0 | - |
212
+ | 0.1166 | 3350 | 0.0 | - |
213
+ | 0.1183 | 3400 | 0.0 | - |
214
+ | 0.1201 | 3450 | 0.0 | - |
215
+ | 0.1218 | 3500 | 0.0 | - |
216
+ | 0.1235 | 3550 | 0.0 | - |
217
+ | 0.1253 | 3600 | 0.0 | - |
218
+ | 0.1270 | 3650 | 0.0 | - |
219
+ | 0.1287 | 3700 | 0.0 | - |
220
+ | 0.1305 | 3750 | 0.0 | - |
221
+ | 0.1322 | 3800 | 0.0 | - |
222
+ | 0.1340 | 3850 | 0.0 | - |
223
+ | 0.1357 | 3900 | 0.0 | - |
224
+ | 0.1374 | 3950 | 0.0 | - |
225
+ | 0.1392 | 4000 | 0.0 | - |
226
+ | 0.1409 | 4050 | 0.0 | - |
227
+ | 0.1427 | 4100 | 0.0 | - |
228
+ | 0.1444 | 4150 | 0.0 | - |
229
+ | 0.1461 | 4200 | 0.0 | - |
230
+ | 0.1479 | 4250 | 0.0 | - |
231
+ | 0.1496 | 4300 | 0.0 | - |
232
+ | 0.1514 | 4350 | 0.0 | - |
233
+ | 0.1531 | 4400 | 0.0 | - |
234
+ | 0.1548 | 4450 | 0.0 | - |
235
+ | 0.1566 | 4500 | 0.0 | - |
236
+ | 0.1583 | 4550 | 0.0 | - |
237
+ | 0.1601 | 4600 | 0.0 | - |
238
+ | 0.1618 | 4650 | 0.0 | - |
239
+ | 0.1635 | 4700 | 0.0 | - |
240
+ | 0.1653 | 4750 | 0.0 | - |
241
+ | 0.1670 | 4800 | 0.0 | - |
242
+ | 0.1688 | 4850 | 0.0 | - |
243
+ | 0.1705 | 4900 | 0.0 | - |
244
+ | 0.1722 | 4950 | 0.0 | - |
245
+ | 0.1740 | 5000 | 0.0 | - |
246
+ | 0.1757 | 5050 | 0.0 | - |
247
+ | 0.1775 | 5100 | 0.0 | - |
248
+ | 0.1792 | 5150 | 0.0 | - |
249
+ | 0.1809 | 5200 | 0.0 | - |
250
+ | 0.1827 | 5250 | 0.0 | - |
251
+ | 0.1844 | 5300 | 0.0 | - |
252
+ | 0.1862 | 5350 | 0.0 | - |
253
+ | 0.1879 | 5400 | 0.0 | - |
254
+ | 0.1896 | 5450 | 0.0 | - |
255
+ | 0.1914 | 5500 | 0.0 | - |
256
+ | 0.1931 | 5550 | 0.0 | - |
257
+ | 0.1949 | 5600 | 0.0 | - |
258
+ | 0.1966 | 5650 | 0.0 | - |
259
+ | 0.1983 | 5700 | 0.0 | - |
260
+ | 0.2001 | 5750 | 0.0 | - |
261
+ | 0.2018 | 5800 | 0.0 | - |
262
+ | 0.2036 | 5850 | 0.0 | - |
263
+ | 0.2053 | 5900 | 0.0 | - |
264
+ | 0.2070 | 5950 | 0.0 | - |
265
+ | 0.2088 | 6000 | 0.0 | 0.0 |
266
+ | 0.2105 | 6050 | 0.0 | - |
267
+ | 0.2123 | 6100 | 0.0 | - |
268
+ | 0.2140 | 6150 | 0.0 | - |
269
+ | 0.2157 | 6200 | 0.0 | - |
270
+ | 0.2175 | 6250 | 0.0 | - |
271
+ | 0.2192 | 6300 | 0.0 | - |
272
+ | 0.2210 | 6350 | 0.0 | - |
273
+ | 0.2227 | 6400 | 0.0 | - |
274
+ | 0.2244 | 6450 | 0.0 | - |
275
+ | 0.2262 | 6500 | 0.0 | - |
276
+ | 0.2279 | 6550 | 0.0 | - |
277
+ | 0.2297 | 6600 | 0.0 | - |
278
+ | 0.2314 | 6650 | 0.0 | - |
279
+ | 0.2331 | 6700 | 0.0 | - |
280
+ | 0.2349 | 6750 | 0.0 | - |
281
+ | 0.2366 | 6800 | 0.0 | - |
282
+ | 0.2384 | 6850 | 0.0 | - |
283
+ | 0.2401 | 6900 | 0.0 | - |
284
+ | 0.2418 | 6950 | 0.0 | - |
285
+ | 0.2436 | 7000 | 0.0 | - |
286
+ | 0.2453 | 7050 | 0.0 | - |
287
+ | 0.2471 | 7100 | 0.0 | - |
288
+ | 0.2488 | 7150 | 0.0 | - |
289
+ | 0.2505 | 7200 | 0.0 | - |
290
+ | 0.2523 | 7250 | 0.0 | - |
291
+ | 0.2540 | 7300 | 0.0 | - |
292
+ | 0.2558 | 7350 | 0.0 | - |
293
+ | 0.2575 | 7400 | 0.0 | - |
294
+ | 0.2592 | 7450 | 0.0 | - |
295
+ | 0.2610 | 7500 | 0.0 | - |
296
+ | 0.2627 | 7550 | 0.0 | - |
297
+ | 0.2645 | 7600 | 0.0 | - |
298
+ | 0.2662 | 7650 | 0.0 | - |
299
+ | 0.2679 | 7700 | 0.0 | - |
300
+ | 0.2697 | 7750 | 0.0 | - |
301
+ | 0.2714 | 7800 | 0.0 | - |
302
+ | 0.2732 | 7850 | 0.0 | - |
303
+ | 0.2749 | 7900 | 0.0 | - |
304
+ | 0.2766 | 7950 | 0.0 | - |
305
+ | 0.2784 | 8000 | 0.0 | - |
306
+ | 0.2801 | 8050 | 0.0 | - |
307
+ | 0.2819 | 8100 | 0.0 | - |
308
+ | 0.2836 | 8150 | 0.0 | - |
309
+ | 0.2853 | 8200 | 0.0 | - |
310
+ | 0.2871 | 8250 | 0.0 | - |
311
+ | 0.2888 | 8300 | 0.0 | - |
312
+ | 0.2906 | 8350 | 0.0 | - |
313
+ | 0.2923 | 8400 | 0.0 | - |
314
+ | 0.2940 | 8450 | 0.0 | - |
315
+ | 0.2958 | 8500 | 0.0 | - |
316
+ | 0.2975 | 8550 | 0.0 | - |
317
+ | 0.2993 | 8600 | 0.0 | - |
318
+ | 0.3010 | 8650 | 0.0 | - |
319
+ | 0.3027 | 8700 | 0.0 | - |
320
+ | 0.3045 | 8750 | 0.0 | - |
321
+ | 0.3062 | 8800 | 0.0 | - |
322
+ | 0.3080 | 8850 | 0.0 | - |
323
+ | 0.3097 | 8900 | 0.0 | - |
324
+ | 0.3114 | 8950 | 0.0 | - |
325
+ | 0.3132 | 9000 | 0.0 | 0.0 |
326
+ | 0.3149 | 9050 | 0.0 | - |
327
+ | 0.3167 | 9100 | 0.0 | - |
328
+ | 0.3184 | 9150 | 0.0 | - |
329
+ | 0.3201 | 9200 | 0.0 | - |
330
+ | 0.3219 | 9250 | 0.0 | - |
331
+ | 0.3236 | 9300 | 0.0 | - |
332
+ | 0.3254 | 9350 | 0.0 | - |
333
+ | 0.3271 | 9400 | 0.0 | - |
334
+ | 0.3288 | 9450 | 0.0 | - |
335
+ | 0.3306 | 9500 | 0.0 | - |
336
+ | 0.3323 | 9550 | 0.0 | - |
337
+ | 0.3341 | 9600 | 0.0 | - |
338
+ | 0.3358 | 9650 | 0.0 | - |
339
+ | 0.3375 | 9700 | 0.0 | - |
340
+ | 0.3393 | 9750 | 0.0 | - |
341
+ | 0.3410 | 9800 | 0.0 | - |
342
+ | 0.3428 | 9850 | 0.0 | - |
343
+ | 0.3445 | 9900 | 0.0 | - |
344
+ | 0.3462 | 9950 | 0.0 | - |
345
+ | 0.3480 | 10000 | 0.0 | - |
346
+ | 0.3497 | 10050 | 0.0 | - |
347
+ | 0.3515 | 10100 | 0.0 | - |
348
+ | 0.3532 | 10150 | 0.0 | - |
349
+ | 0.3549 | 10200 | 0.0 | - |
350
+ | 0.3567 | 10250 | 0.0 | - |
351
+ | 0.3584 | 10300 | 0.0 | - |
352
+ | 0.3602 | 10350 | 0.0 | - |
353
+ | 0.3619 | 10400 | 0.0 | - |
354
+ | 0.3636 | 10450 | 0.0 | - |
355
+ | 0.3654 | 10500 | 0.0 | - |
356
+ | 0.3671 | 10550 | 0.0 | - |
357
+ | 0.3688 | 10600 | 0.0 | - |
358
+ | 0.3706 | 10650 | 0.0 | - |
359
+ | 0.3723 | 10700 | 0.0 | - |
360
+ | 0.3741 | 10750 | 0.0 | - |
361
+ | 0.3758 | 10800 | 0.0 | - |
362
+ | 0.3775 | 10850 | 0.0 | - |
363
+ | 0.3793 | 10900 | 0.0 | - |
364
+ | 0.3810 | 10950 | 0.0 | - |
365
+ | 0.3828 | 11000 | 0.0 | - |
366
+ | 0.3845 | 11050 | 0.0 | - |
367
+ | 0.3862 | 11100 | 0.0 | - |
368
+ | 0.3880 | 11150 | 0.0 | - |
369
+ | 0.3897 | 11200 | 0.0 | - |
370
+ | 0.3915 | 11250 | 0.0 | - |
371
+ | 0.3932 | 11300 | 0.0 | - |
372
+ | 0.3949 | 11350 | 0.0 | - |
373
+ | 0.3967 | 11400 | 0.0 | - |
374
+ | 0.3984 | 11450 | 0.0 | - |
375
+ | 0.4002 | 11500 | 0.0 | - |
376
+ | 0.4019 | 11550 | 0.0 | - |
377
+ | 0.4036 | 11600 | 0.0 | - |
378
+ | 0.4054 | 11650 | 0.0 | - |
379
+ | 0.4071 | 11700 | 0.0 | - |
380
+ | 0.4089 | 11750 | 0.0 | - |
381
+ | 0.4106 | 11800 | 0.0 | - |
382
+ | 0.4123 | 11850 | 0.0 | - |
383
+ | 0.4141 | 11900 | 0.0 | - |
384
+ | 0.4158 | 11950 | 0.0 | - |
385
+ | 0.4176 | 12000 | 0.0 | 0.0 |
386
+ | 0.4193 | 12050 | 0.0 | - |
387
+ | 0.4210 | 12100 | 0.0 | - |
388
+ | 0.4228 | 12150 | 0.0 | - |
389
+ | 0.4245 | 12200 | 0.0 | - |
390
+ | 0.4263 | 12250 | 0.0 | - |
391
+ | 0.4280 | 12300 | 0.0 | - |
392
+ | 0.4297 | 12350 | 0.0 | - |
393
+ | 0.4315 | 12400 | 0.0 | - |
394
+ | 0.4332 | 12450 | 0.0 | - |
395
+ | 0.4350 | 12500 | 0.0 | - |
396
+ | 0.4367 | 12550 | 0.0 | - |
397
+ | 0.4384 | 12600 | 0.0 | - |
398
+ | 0.4402 | 12650 | 0.0 | - |
399
+ | 0.4419 | 12700 | 0.0 | - |
400
+ | 0.4437 | 12750 | 0.0 | - |
401
+ | 0.4454 | 12800 | 0.0 | - |
402
+ | 0.4471 | 12850 | 0.0 | - |
403
+ | 0.4489 | 12900 | 0.0 | - |
404
+ | 0.4506 | 12950 | 0.0 | - |
405
+ | 0.4524 | 13000 | 0.0 | - |
406
+ | 0.4541 | 13050 | 0.0 | - |
407
+ | 0.4558 | 13100 | 0.0 | - |
408
+ | 0.4576 | 13150 | 0.0 | - |
409
+ | 0.4593 | 13200 | 0.0 | - |
410
+ | 0.4611 | 13250 | 0.0 | - |
411
+ | 0.4628 | 13300 | 0.0 | - |
412
+ | 0.4645 | 13350 | 0.0 | - |
413
+ | 0.4663 | 13400 | 0.0 | - |
414
+ | 0.4680 | 13450 | 0.0 | - |
415
+ | 0.4698 | 13500 | 0.0 | - |
416
+ | 0.4715 | 13550 | 0.0 | - |
417
+ | 0.4732 | 13600 | 0.0 | - |
418
+ | 0.4750 | 13650 | 0.0 | - |
419
+ | 0.4767 | 13700 | 0.0 | - |
420
+ | 0.4785 | 13750 | 0.0 | - |
421
+ | 0.4802 | 13800 | 0.0 | - |
422
+ | 0.4819 | 13850 | 0.0 | - |
423
+ | 0.4837 | 13900 | 0.0 | - |
424
+ | 0.4854 | 13950 | 0.0 | - |
425
+ | 0.4872 | 14000 | 0.0 | - |
426
+ | 0.4889 | 14050 | 0.0 | - |
427
+ | 0.4906 | 14100 | 0.0 | - |
428
+ | 0.4924 | 14150 | 0.0 | - |
429
+ | 0.4941 | 14200 | 0.0 | - |
430
+ | 0.4959 | 14250 | 0.0 | - |
431
+ | 0.4976 | 14300 | 0.0 | - |
432
+ | 0.4993 | 14350 | 0.0 | - |
433
+ | 0.5011 | 14400 | 0.0 | - |
434
+ | 0.5028 | 14450 | 0.0 | - |
435
+ | 0.5046 | 14500 | 0.0 | - |
436
+ | 0.5063 | 14550 | 0.0 | - |
437
+ | 0.5080 | 14600 | 0.0 | - |
438
+ | 0.5098 | 14650 | 0.0 | - |
439
+ | 0.5115 | 14700 | 0.0 | - |
440
+ | 0.5133 | 14750 | 0.0 | - |
441
+ | 0.5150 | 14800 | 0.0 | - |
442
+ | 0.5167 | 14850 | 0.0 | - |
443
+ | 0.5185 | 14900 | 0.0 | - |
444
+ | 0.5202 | 14950 | 0.0 | - |
445
+ | 0.5220 | 15000 | 0.0 | 0.0 |
446
+ | 0.5237 | 15050 | 0.0 | - |
447
+ | 0.5254 | 15100 | 0.0 | - |
448
+ | 0.5272 | 15150 | 0.0 | - |
449
+ | 0.5289 | 15200 | 0.0 | - |
450
+ | 0.5307 | 15250 | 0.0 | - |
451
+ | 0.5324 | 15300 | 0.0 | - |
452
+ | 0.5341 | 15350 | 0.0 | - |
453
+ | 0.5359 | 15400 | 0.0 | - |
454
+ | 0.5376 | 15450 | 0.0 | - |
455
+ | 0.5394 | 15500 | 0.0 | - |
456
+ | 0.5411 | 15550 | 0.0 | - |
457
+ | 0.5428 | 15600 | 0.0 | - |
458
+ | 0.5446 | 15650 | 0.0 | - |
459
+ | 0.5463 | 15700 | 0.0 | - |
460
+ | 0.5481 | 15750 | 0.0 | - |
461
+ | 0.5498 | 15800 | 0.0 | - |
462
+ | 0.5515 | 15850 | 0.0 | - |
463
+ | 0.5533 | 15900 | 0.0 | - |
464
+ | 0.5550 | 15950 | 0.0 | - |
465
+ | 0.5568 | 16000 | 0.0 | - |
466
+ | 0.5585 | 16050 | 0.0 | - |
467
+ | 0.5602 | 16100 | 0.0 | - |
468
+ | 0.5620 | 16150 | 0.0 | - |
469
+ | 0.5637 | 16200 | 0.0 | - |
470
+ | 0.5655 | 16250 | 0.0 | - |
471
+ | 0.5672 | 16300 | 0.0 | - |
472
+ | 0.5689 | 16350 | 0.0 | - |
473
+ | 0.5707 | 16400 | 0.0 | - |
474
+ | 0.5724 | 16450 | 0.0 | - |
475
+ | 0.5742 | 16500 | 0.0 | - |
476
+ | 0.5759 | 16550 | 0.0 | - |
477
+ | 0.5776 | 16600 | 0.0 | - |
478
+ | 0.5794 | 16650 | 0.0 | - |
479
+ | 0.5811 | 16700 | 0.0 | - |
480
+ | 0.5829 | 16750 | 0.0 | - |
481
+ | 0.5846 | 16800 | 0.0 | - |
482
+ | 0.5863 | 16850 | 0.0 | - |
483
+ | 0.5881 | 16900 | 0.0 | - |
484
+ | 0.5898 | 16950 | 0.0 | - |
485
+ | 0.5916 | 17000 | 0.0 | - |
486
+ | 0.5933 | 17050 | 0.0 | - |
487
+ | 0.5950 | 17100 | 0.0 | - |
488
+ | 0.5968 | 17150 | 0.0 | - |
489
+ | 0.5985 | 17200 | 0.0 | - |
490
+ | 0.6003 | 17250 | 0.0 | - |
491
+ | 0.6020 | 17300 | 0.0 | - |
492
+ | 0.6037 | 17350 | 0.0 | - |
493
+ | 0.6055 | 17400 | 0.0 | - |
494
+ | 0.6072 | 17450 | 0.0 | - |
495
+ | 0.6089 | 17500 | 0.0 | - |
496
+ | 0.6107 | 17550 | 0.0 | - |
497
+ | 0.6124 | 17600 | 0.0 | - |
498
+ | 0.6142 | 17650 | 0.0 | - |
499
+ | 0.6159 | 17700 | 0.0 | - |
500
+ | 0.6176 | 17750 | 0.0 | - |
501
+ | 0.6194 | 17800 | 0.0 | - |
502
+ | 0.6211 | 17850 | 0.0 | - |
503
+ | 0.6229 | 17900 | 0.0 | - |
504
+ | 0.6246 | 17950 | 0.0 | - |
505
+ | 0.6263 | 18000 | 0.0 | 0.0 |
506
+ | 0.6281 | 18050 | 0.0 | - |
507
+ | 0.6298 | 18100 | 0.0 | - |
508
+ | 0.6316 | 18150 | 0.0 | - |
509
+ | 0.6333 | 18200 | 0.0 | - |
510
+ | 0.6350 | 18250 | 0.0 | - |
511
+ | 0.6368 | 18300 | 0.0 | - |
512
+ | 0.6385 | 18350 | 0.0 | - |
513
+ | 0.6403 | 18400 | 0.0 | - |
514
+ | 0.6420 | 18450 | 0.0 | - |
515
+ | 0.6437 | 18500 | 0.0 | - |
516
+ | 0.6455 | 18550 | 0.0 | - |
517
+ | 0.6472 | 18600 | 0.0 | - |
518
+ | 0.6490 | 18650 | 0.0 | - |
519
+ | 0.6507 | 18700 | 0.0 | - |
520
+ | 0.6524 | 18750 | 0.0 | - |
521
+ | 0.6542 | 18800 | 0.0 | - |
522
+ | 0.6559 | 18850 | 0.0 | - |
523
+ | 0.6577 | 18900 | 0.0 | - |
524
+ | 0.6594 | 18950 | 0.0 | - |
525
+ | 0.6611 | 19000 | 0.0 | - |
526
+ | 0.6629 | 19050 | 0.0 | - |
527
+ | 0.6646 | 19100 | 0.0 | - |
528
+ | 0.6664 | 19150 | 0.0 | - |
529
+ | 0.6681 | 19200 | 0.0 | - |
530
+ | 0.6698 | 19250 | 0.0 | - |
531
+ | 0.6716 | 19300 | 0.0 | - |
532
+ | 0.6733 | 19350 | 0.0 | - |
533
+ | 0.6751 | 19400 | 0.0 | - |
534
+ | 0.6768 | 19450 | 0.0 | - |
535
+ | 0.6785 | 19500 | 0.0 | - |
536
+ | 0.6803 | 19550 | 0.0 | - |
537
+ | 0.6820 | 19600 | 0.0 | - |
538
+ | 0.6838 | 19650 | 0.0 | - |
539
+ | 0.6855 | 19700 | 0.0 | - |
540
+ | 0.6872 | 19750 | 0.0 | - |
541
+ | 0.6890 | 19800 | 0.0 | - |
542
+ | 0.6907 | 19850 | 0.0 | - |
543
+ | 0.6925 | 19900 | 0.0 | - |
544
+ | 0.6942 | 19950 | 0.0 | - |
545
+ | 0.6959 | 20000 | 0.0 | - |
546
+ | 0.6977 | 20050 | 0.0 | - |
547
+ | 0.6994 | 20100 | 0.0 | - |
548
+ | 0.7012 | 20150 | 0.0 | - |
549
+ | 0.7029 | 20200 | 0.0 | - |
550
+ | 0.7046 | 20250 | 0.0 | - |
551
+ | 0.7064 | 20300 | 0.0 | - |
552
+ | 0.7081 | 20350 | 0.0 | - |
553
+ | 0.7099 | 20400 | 0.0 | - |
554
+ | 0.7116 | 20450 | 0.0 | - |
555
+ | 0.7133 | 20500 | 0.0 | - |
556
+ | 0.7151 | 20550 | 0.0 | - |
557
+ | 0.7168 | 20600 | 0.0 | - |
558
+ | 0.7186 | 20650 | 0.0 | - |
559
+ | 0.7203 | 20700 | 0.0 | - |
560
+ | 0.7220 | 20750 | 0.0 | - |
561
+ | 0.7238 | 20800 | 0.0 | - |
562
+ | 0.7255 | 20850 | 0.0 | - |
563
+ | 0.7273 | 20900 | 0.0 | - |
564
+ | 0.7290 | 20950 | 0.0 | - |
565
+ | **0.7307** | **21000** | **0.0** | **0.0** |
566
+ | 0.7325 | 21050 | 0.0 | - |
567
+ | 0.7342 | 21100 | 0.0 | - |
568
+ | 0.7360 | 21150 | 0.0 | - |
569
+ | 0.7377 | 21200 | 0.0 | - |
570
+ | 0.7394 | 21250 | 0.0 | - |
571
+ | 0.7412 | 21300 | 0.0 | - |
572
+ | 0.7429 | 21350 | 0.0 | - |
573
+ | 0.7447 | 21400 | 0.0 | - |
574
+ | 0.7464 | 21450 | 0.0 | - |
575
+ | 0.7481 | 21500 | 0.0 | - |
576
+ | 0.7499 | 21550 | 0.0 | - |
577
+ | 0.7516 | 21600 | 0.0 | - |
578
+ | 0.7534 | 21650 | 0.0 | - |
579
+ | 0.7551 | 21700 | 0.0 | - |
580
+ | 0.7568 | 21750 | 0.0 | - |
581
+ | 0.7586 | 21800 | 0.0 | - |
582
+ | 0.7603 | 21850 | 0.0 | - |
583
+ | 0.7621 | 21900 | 0.0 | - |
584
+ | 0.7638 | 21950 | 0.0 | - |
585
+ | 0.7655 | 22000 | 0.0 | - |
586
+ | 0.7673 | 22050 | 0.0 | - |
587
+ | 0.7690 | 22100 | 0.0 | - |
588
+ | 0.7708 | 22150 | 0.0 | - |
589
+ | 0.7725 | 22200 | 0.0 | - |
590
+ | 0.7742 | 22250 | 0.0 | - |
591
+ | 0.7760 | 22300 | 0.0 | - |
592
+ | 0.7777 | 22350 | 0.0 | - |
593
+ | 0.7795 | 22400 | 0.0 | - |
594
+ | 0.7812 | 22450 | 0.0 | - |
595
+ | 0.7829 | 22500 | 0.0 | - |
596
+ | 0.7847 | 22550 | 0.0 | - |
597
+ | 0.7864 | 22600 | 0.0 | - |
598
+ | 0.7882 | 22650 | 0.0 | - |
599
+ | 0.7899 | 22700 | 0.0 | - |
600
+ | 0.7916 | 22750 | 0.0 | - |
601
+ | 0.7934 | 22800 | 0.0 | - |
602
+ | 0.7951 | 22850 | 0.0 | - |
603
+ | 0.7969 | 22900 | 0.0 | - |
604
+ | 0.7986 | 22950 | 0.0 | - |
605
+ | 0.8003 | 23000 | 0.0 | - |
606
+ | 0.8021 | 23050 | 0.0 | - |
607
+ | 0.8038 | 23100 | 0.0 | - |
608
+ | 0.8056 | 23150 | 0.0 | - |
609
+ | 0.8073 | 23200 | 0.0 | - |
610
+ | 0.8090 | 23250 | 0.0 | - |
611
+ | 0.8108 | 23300 | 0.0 | - |
612
+ | 0.8125 | 23350 | 0.0 | - |
613
+ | 0.8143 | 23400 | 0.0 | - |
614
+ | 0.8160 | 23450 | 0.0 | - |
615
+ | 0.8177 | 23500 | 0.0 | - |
616
+ | 0.8195 | 23550 | 0.0 | - |
617
+ | 0.8212 | 23600 | 0.0 | - |
618
+ | 0.8230 | 23650 | 0.0 | - |
619
+ | 0.8247 | 23700 | 0.0 | - |
620
+ | 0.8264 | 23750 | 0.0 | - |
621
+ | 0.8282 | 23800 | 0.0 | - |
622
+ | 0.8299 | 23850 | 0.0 | - |
623
+ | 0.8317 | 23900 | 0.0 | - |
624
+ | 0.8334 | 23950 | 0.0 | - |
625
+ | 0.8351 | 24000 | 0.0 | 0.0 |
626
+ | 0.8369 | 24050 | 0.0 | - |
627
+ | 0.8386 | 24100 | 0.0 | - |
628
+ | 0.8404 | 24150 | 0.0 | - |
629
+ | 0.8421 | 24200 | 0.0 | - |
630
+ | 0.8438 | 24250 | 0.0 | - |
631
+ | 0.8456 | 24300 | 0.0 | - |
632
+ | 0.8473 | 24350 | 0.0 | - |
633
+ | 0.8491 | 24400 | 0.0 | - |
634
+ | 0.8508 | 24450 | 0.0 | - |
635
+ | 0.8525 | 24500 | 0.0 | - |
636
+ | 0.8543 | 24550 | 0.0 | - |
637
+ | 0.8560 | 24600 | 0.0 | - |
638
+ | 0.8577 | 24650 | 0.0 | - |
639
+ | 0.8595 | 24700 | 0.0 | - |
640
+ | 0.8612 | 24750 | 0.0 | - |
641
+ | 0.8630 | 24800 | 0.0 | - |
642
+ | 0.8647 | 24850 | 0.0 | - |
643
+ | 0.8664 | 24900 | 0.0 | - |
644
+ | 0.8682 | 24950 | 0.0 | - |
645
+ | 0.8699 | 25000 | 0.0 | - |
646
+ | 0.8717 | 25050 | 0.0 | - |
647
+ | 0.8734 | 25100 | 0.0 | - |
648
+ | 0.8751 | 25150 | 0.0 | - |
649
+ | 0.8769 | 25200 | 0.0 | - |
650
+ | 0.8786 | 25250 | 0.0 | - |
651
+ | 0.8804 | 25300 | 0.0 | - |
652
+ | 0.8821 | 25350 | 0.0 | - |
653
+ | 0.8838 | 25400 | 0.0 | - |
654
+ | 0.8856 | 25450 | 0.0 | - |
655
+ | 0.8873 | 25500 | 0.0 | - |
656
+ | 0.8891 | 25550 | 0.0 | - |
657
+ | 0.8908 | 25600 | 0.0 | - |
658
+ | 0.8925 | 25650 | 0.0 | - |
659
+ | 0.8943 | 25700 | 0.0 | - |
660
+ | 0.8960 | 25750 | 0.0 | - |
661
+ | 0.8978 | 25800 | 0.0 | - |
662
+ | 0.8995 | 25850 | 0.0 | - |
663
+ | 0.9012 | 25900 | 0.0 | - |
664
+ | 0.9030 | 25950 | 0.0 | - |
665
+ | 0.9047 | 26000 | 0.0 | - |
666
+ | 0.9065 | 26050 | 0.0 | - |
667
+ | 0.9082 | 26100 | 0.0 | - |
668
+ | 0.9099 | 26150 | 0.0 | - |
669
+ | 0.9117 | 26200 | 0.0 | - |
670
+ | 0.9134 | 26250 | 0.0 | - |
671
+ | 0.9152 | 26300 | 0.0 | - |
672
+ | 0.9169 | 26350 | 0.0 | - |
673
+ | 0.9186 | 26400 | 0.0 | - |
674
+ | 0.9204 | 26450 | 0.0 | - |
675
+ | 0.9221 | 26500 | 0.0 | - |
676
+ | 0.9239 | 26550 | 0.0 | - |
677
+ | 0.9256 | 26600 | 0.0 | - |
678
+ | 0.9273 | 26650 | 0.0 | - |
679
+ | 0.9291 | 26700 | 0.0 | - |
680
+ | 0.9308 | 26750 | 0.0 | - |
681
+ | 0.9326 | 26800 | 0.0 | - |
682
+ | 0.9343 | 26850 | 0.0 | - |
683
+ | 0.9360 | 26900 | 0.0 | - |
684
+ | 0.9378 | 26950 | 0.0 | - |
685
+ | 0.9395 | 27000 | 0.0 | 0.0 |
686
+ | 0.9413 | 27050 | 0.0 | - |
687
+ | 0.9430 | 27100 | 0.0 | - |
688
+ | 0.9447 | 27150 | 0.0 | - |
689
+ | 0.9465 | 27200 | 0.0 | - |
690
+ | 0.9482 | 27250 | 0.0 | - |
691
+ | 0.9500 | 27300 | 0.0 | - |
692
+ | 0.9517 | 27350 | 0.0 | - |
693
+ | 0.9534 | 27400 | 0.0 | - |
694
+ | 0.9552 | 27450 | 0.0 | - |
695
+ | 0.9569 | 27500 | 0.0 | - |
696
+ | 0.9587 | 27550 | 0.0 | - |
697
+ | 0.9604 | 27600 | 0.0 | - |
698
+ | 0.9621 | 27650 | 0.0 | - |
699
+ | 0.9639 | 27700 | 0.0 | - |
700
+ | 0.9656 | 27750 | 0.0 | - |
701
+ | 0.9674 | 27800 | 0.0 | - |
702
+ | 0.9691 | 27850 | 0.0 | - |
703
+ | 0.9708 | 27900 | 0.0 | - |
704
+ | 0.9726 | 27950 | 0.0 | - |
705
+ | 0.9743 | 28000 | 0.0 | - |
706
+ | 0.9761 | 28050 | 0.0 | - |
707
+ | 0.9778 | 28100 | 0.0 | - |
708
+ | 0.9795 | 28150 | 0.0 | - |
709
+ | 0.9813 | 28200 | 0.0 | - |
710
+ | 0.9830 | 28250 | 0.0 | - |
711
+ | 0.9848 | 28300 | 0.0 | - |
712
+ | 0.9865 | 28350 | 0.0 | - |
713
+ | 0.9882 | 28400 | 0.0 | - |
714
+ | 0.9900 | 28450 | 0.0 | - |
715
+ | 0.9917 | 28500 | 0.0 | - |
716
+ | 0.9935 | 28550 | 0.0 | - |
717
+ | 0.9952 | 28600 | 0.0 | - |
718
+ | 0.9969 | 28650 | 0.0 | - |
719
+ | 0.9987 | 28700 | 0.0 | - |
720
+
721
+ * The bold row denotes the saved checkpoint.
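
The Epoch and Step columns can be used to estimate how long one epoch of contrastive training was. The sketch below divides the step by the epoch fraction for the bold row; because the epoch value is rounded to four decimals the result is approximate, and the pair count assumes each step consumed one batch of 256 pairs (the embedding-phase batch size listed under Training Hyperparameters).

```python
def estimate_steps_per_epoch(step: int, epoch_fraction: float) -> float:
    """Estimate steps in one full epoch from a single logged (epoch, step) row."""
    return step / epoch_fraction


# Bold (saved checkpoint) row from the table above.
steps_per_epoch = estimate_steps_per_epoch(21000, 0.7307)

# Assumption: one step processes one batch of 256 sentence pairs.
pairs_per_epoch = steps_per_epoch * 256

print(round(steps_per_epoch))
print(round(pairs_per_epoch / 1e6, 2), "million pairs (approximate)")
```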
722
+ ### Framework Versions
723
+ - Python: 3.10.14
724
+ - SetFit: 1.0.3
725
+ - Sentence Transformers: 3.0.1
726
+ - Transformers: 4.41.2
727
+ - PyTorch: 2.3.1+cu121
728
+ - Datasets: 2.20.0
729
+ - Tokenizers: 0.19.1
730
+
731
+ ## Citation
732
+
733
+ ### BibTeX
734
+ ```bibtex
735
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
736
+ doi = {10.48550/ARXIV.2209.11055},
737
+ url = {https://arxiv.org/abs/2209.11055},
738
+ author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
739
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
740
+ title = {Efficient Few-Shot Learning Without Prompts},
741
+ publisher = {arXiv},
742
+ year = {2022},
743
+ copyright = {Creative Commons Attribution 4.0 International}
744
+ }
745
+ ```
746
+
747
+ <!--
748
+ ## Glossary
749
+
750
+ *Clearly define terms in order to be accessible across audiences.*
751
+ -->
752
+
753
+ <!--
754
+ ## Model Card Authors
755
+
756
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
757
+ -->
758
+
759
+ <!--
760
+ ## Model Card Contact
761
+
762
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
763
+ -->
config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "_name_or_path": "checkpoints/step_21000",
+   "architectures": [
+     "NewModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration.NewConfig",
+     "AutoModel": "modeling.NewModel",
+     "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
+     "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
+     "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
+     "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
+     "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
+   },
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "layer_norm_type": "layer_norm",
+   "logn_attention_clip1": false,
+   "logn_attention_scale": false,
+   "max_position_embeddings": 8192,
+   "model_type": "new",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pack_qkv": true,
+   "pad_token_id": 0,
+   "position_embedding_type": "rope",
+   "rope_scaling": {
+     "factor": 2.0,
+     "type": "ntk"
+   },
+   "rope_theta": 500000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 0,
+   "unpad_inputs": false,
+   "use_memory_efficient_attention": false,
+   "vocab_size": 30528
+ }
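
A few values in this config can be cross-checked against the rest of the commit: the hidden size should match the pooling layer's `word_embedding_dimension`, and `_name_or_path` appears to point at the step-21000 checkpoint. The sketch below runs those checks on small JSON fragments copied from the files in this commit; `check_consistency` is a hypothetical helper.

```python
import json

# Fragments copied from config.json and 1_Pooling/config.json in this commit.
model_config = json.loads(
    '{"_name_or_path": "checkpoints/step_21000", "hidden_size": 768, '
    '"num_attention_heads": 12, "max_position_embeddings": 8192}'
)
pooling_config = json.loads('{"word_embedding_dimension": 768}')


def check_consistency(model: dict, pooling: dict) -> dict:
    """Derive the per-head dimension and compare body and pooling widths."""
    if model["hidden_size"] % model["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "head_dim": model["hidden_size"] // model["num_attention_heads"],
        "pooling_matches": model["hidden_size"] == pooling["word_embedding_dimension"],
        "checkpoint_step": int(model["_name_or_path"].rsplit("_", 1)[-1]),
    }


print(check_consistency(model_config, pooling_config))
```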
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.3.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
config_setfit.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "labels": null,
+   "normalize_embeddings": false
+ }
configuration.py ADDED
@@ -0,0 +1,145 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ NEW model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class NewConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30528):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `input_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 1):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. The accompanying modeling code
68
+ accepts only the `"ntk"` strategy (with an optional `mixed_b` entry). The scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "new"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a093ed1fdf2f06fd4c1710af50ca339b90b560b2fe9c03090e2803f03de4bfb3
3
+ size 547119128
model_head.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4fb6dc8a4acf9cb24d86211c38c8eed73137ded8478a23a55994d0bcf7d96e3a
3
+ size 7007
modeling.py ADDED
@@ -0,0 +1,1387 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from typing import List, Optional, Tuple, Union
20
+
21
+ import torch
22
+ import torch.utils.checkpoint
23
+ from torch import nn
24
+
25
+ from transformers.activations import ACT2FN
26
+ from transformers.modeling_outputs import (
27
+ BaseModelOutput,
28
+ BaseModelOutputWithPooling,
29
+ MaskedLMOutput,
30
+ MultipleChoiceModelOutput,
31
+ QuestionAnsweringModelOutput,
32
+ SequenceClassifierOutput,
33
+ TokenClassifierOutput,
34
+ )
35
+ from transformers.modeling_utils import PreTrainedModel
36
+ from transformers.utils import logging
37
+
38
+ try:
39
+ import xformers.ops as xops
40
+ except ImportError:
41
+ xops = None
42
+
43
+ from .configuration import NewConfig
44
+
45
+
46
+ logger = logging.get_logger(__name__)
47
+
48
+
49
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
50
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
51
+ class IndexFirstAxis(torch.autograd.Function):
52
+ @staticmethod
53
+ def forward(ctx, input, indices):
54
+ ctx.save_for_backward(indices)
55
+ assert input.ndim >= 2
56
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
57
+ second_dim = other_shape.numel()
58
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
59
+ # return input[indices]
60
+ # return torch.gather(
61
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
62
+ # ).reshape(-1, *other_shape)
63
+ return torch.gather(
64
+ input.view(ctx.first_axis_dim, second_dim),
65
+ 0,
66
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
67
+ ).reshape(-1, *other_shape)
68
+
69
+ @staticmethod
70
+ def backward(ctx, grad_output):
71
+ (indices,) = ctx.saved_tensors
72
+ assert grad_output.ndim >= 2
73
+ other_shape = grad_output.shape[1:]
74
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
75
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
76
+ grad_input = torch.zeros(
77
+ [ctx.first_axis_dim, grad_output.shape[1]],
78
+ device=grad_output.device,
79
+ dtype=grad_output.dtype,
80
+ )
81
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
82
+ # grad_input[indices] = grad_output
83
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
84
+ grad_input.scatter_(
85
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
86
+ )
87
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
88
+
89
+
90
+ index_first_axis = IndexFirstAxis.apply
91
+
92
+
93
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
94
+ """
95
+ Arguments:
96
+ hidden_states: (batch, seqlen, ...)
97
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
98
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
99
+ Return:
100
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
101
+ """
102
+ if indices is None:
103
+ assert attention_mask is not None
104
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
105
+
106
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
107
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
108
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
109
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
110
+ # so we write custom forward and backward to make it a bit faster.
111
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
112
+ return index_first_axis(hidden_states, indices)
113
+
114
+
115
+ class IndexPutFirstAxis(torch.autograd.Function):
116
+ @staticmethod
117
+ def forward(
118
+ ctx,
119
+ values: torch.Tensor,
120
+ indices: torch.Tensor,
121
+ first_axis_dim
122
+ ) -> torch.Tensor:
123
+ ctx.save_for_backward(indices)
124
+ assert indices.ndim == 1
125
+ assert values.ndim >= 2
126
+ output = torch.zeros(
127
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
128
+ )
129
+ output[indices] = values
130
+ return output
131
+
132
+ @staticmethod
133
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
134
+ indices, = ctx.saved_tensors
135
+ grad_values = grad_output[indices]
136
+ return grad_values, None, None
137
+
138
+
139
+ index_put_first_axis = IndexPutFirstAxis.apply
140
+
141
+
142
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
143
+ """Add padding to sequences.
144
+
145
+ Arguments:
146
+ inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
147
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
148
+ batch: int batch_size
149
+ seqlen: int max sequence length
150
+
151
+ Returns:
152
+ inputs: (batch, seqlen, ...)
153
+ """
154
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
155
+ return output.view(batch, seqlen, *inputs.shape[1:])
156
+
157
+
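The unpad/pad pair above can be illustrated without tensors. The following is a minimal pure-Python sketch, assuming nested lists in place of `(batch, seqlen)` tensors. The helper names `unpad` and `pad` are hypothetical stand-ins for `unpad_input` and `pad_input`, and the custom autograd handling is left out entirely.

```python
def unpad(rows, mask):
    """Flatten (batch, seqlen) rows and keep only positions where mask is 1."""
    flat = [tok for seq in rows for tok in seq]
    flat_mask = [m for seq in mask for m in seq]
    # Same role as torch.nonzero(attention_mask.flatten()) in the code above.
    indices = [i for i, m in enumerate(flat_mask) if m]
    return [flat[i] for i in indices], indices


def pad(values, indices, batch, seqlen, fill=0):
    """Scatter packed values back into a zero-filled (batch, seqlen) layout."""
    flat = [fill] * (batch * seqlen)
    for value, i in zip(values, indices):
        flat[i] = value
    return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch)]


tokens = [[11, 12, 13, 0], [21, 22, 0, 0]]
mask = [[1, 1, 1, 0], [1, 1, 0, 0]]

packed, indices = unpad(tokens, mask)
restored = pad(packed, indices, batch=2, seqlen=4)
print(packed, indices, restored)
```

Because padded positions are refilled with zeros, the round trip reproduces the original batch only where padding was already zero, as it is in this example.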
158
+ def rotate_half(x):
159
+ """Rotates half the hidden dims of the input."""
160
+ x1 = x[..., : x.shape[-1] // 2]
161
+ x2 = x[..., x.shape[-1] // 2 :]
162
+ return torch.cat((-x2, x1), dim=-1)
163
+
164
+
165
+ def apply_rotary_pos_emb(q, k, cos, sin):
166
+ """Applies Rotary Position Embedding to the query and key tensors.
167
+
168
+ Args:
169
+ q (`torch.Tensor`): The query tensor.
170
+ k (`torch.Tensor`): The key tensor.
171
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
172
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
173
+ Returns:
174
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
175
+ """
176
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
177
+ q_embed = (q * cos) + (rotate_half(q) * sin)
178
+ k_embed = (k * cos) + (rotate_half(k) * sin)
179
+ return q_embed, k_embed
180
+
181
+
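To see why `q * cos + rotate_half(q) * sin` encodes relative position, the sketch below applies the same formula to plain Python lists. It assumes a single head and a single token per call. `rope_angles` and `apply_rope` are illustrative names, not functions from this file, and the `(freqs, freqs)` layout is copied from the cache construction in `RotaryEmbedding`.

```python
import math


def rotate_half(x):
    """Same permutation as above: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_angles(position, dim, base=10000.0):
    inv_freq = [1.0 / base ** (i / dim) for i in range(0, dim, 2)]
    freqs = [position * f for f in inv_freq]
    return freqs + freqs  # concatenated (freqs, freqs) layout


def apply_rope(x, position, base=10000.0):
    angles = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are 2 apart, so the attention scores should agree.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))
print(score_a, score_b)
```

The rotation also leaves vector norms unchanged, so it alters only how queries and keys align, not their magnitudes.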
182
+ class RotaryEmbedding(torch.nn.Module):
183
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
184
+ super().__init__()
185
+
186
+ self.dim = dim
187
+ self.max_position_embeddings = max_position_embeddings
188
+ self.base = base
189
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
190
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
191
+
192
+ # Build here to make `torch.jit.trace` work.
193
+ self._set_cos_sin_cache(
194
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
195
+ )
196
+
197
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
198
+ self.max_seq_len_cached = seq_len
199
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
200
+
201
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
202
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
203
+ emb = torch.cat((freqs, freqs), dim=-1)
204
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
205
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
206
+
207
+ def forward(self, x, seq_len=None):
208
+ # x: [bs, num_attention_heads, seq_len, head_size]
209
+ if seq_len > self.max_seq_len_cached:
210
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
211
+
212
+ return (
213
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
214
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ )
216
+
217
+
218
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
219
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
220
+
221
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
222
+ self.scaling_factor = scaling_factor
223
+ self.mixed_b = mixed_b
224
+ super().__init__(dim, max_position_embeddings, base, device)
225
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
226
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
227
+
228
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
229
+ self.max_seq_len_cached = seq_len
230
+
231
+ if seq_len > self.max_position_embeddings:
232
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
233
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
234
+
235
+ if self.mixed_b is None:
236
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
237
+ else:
238
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
239
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
240
+ inv_freq = inv_freq / lambda_1_m # (10)
241
+
242
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
243
+
244
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
245
+
246
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
247
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
248
+ emb = torch.cat((freqs, freqs), dim=-1)
249
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
250
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
251
+
252
+
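The fixed NTK branch (`mixed_b is None`) can be checked numerically. The sketch below is a pure-Python reading of the two steps marked `(6)` above, with hypothetical helper names. It assumes the rescaling is active, which in the class above only happens once `seq_len` exceeds `max_position_embeddings`.

```python
def plain_inv_freq(dim, base):
    """Unscaled RoPE inverse frequencies."""
    return [1.0 / base ** (i / dim) for i in range(0, dim, 2)]


def ntk_inv_freq(dim, base, scaling_factor):
    """Fixed NTK scaling: enlarge the base, then divide by factor ** (2 / dim)."""
    scaled_base = base * scaling_factor
    inv_freq = [1.0 / scaled_base ** (i / dim) for i in range(0, dim, 2)]
    return [f / scaling_factor ** (2 / dim) for f in inv_freq]


dim, base, factor = 64, 500000.0, 2.0
plain = plain_inv_freq(dim, base)
scaled = ntk_inv_freq(dim, base, factor)

# Ratio of scaled to unscaled frequency for each channel pair.
ratios = [s / p for s, p in zip(scaled, plain)]
print(ratios[0], ratios[-1])
```

Under these assumptions, pair `j` is slowed by `factor ** (-(2j + 2) / dim)`. The fastest-rotating pair is nearly unchanged, while the slowest pair is slowed by exactly `1 / factor`.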
253
+ class RMSNorm(nn.Module):
254
+ def __init__(self, hidden_size, eps=1e-6):
255
+ """
256
+ RMSNorm is equivalent to T5LayerNorm
257
+ """
258
+ super().__init__()
259
+ self.weight = nn.Parameter(torch.ones(hidden_size))
260
+ self.variance_epsilon = eps
261
+
262
+ def forward(self, hidden_states):
263
+ input_dtype = hidden_states.dtype
264
+ hidden_states = hidden_states.to(torch.float32)
265
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
266
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
267
+ return self.weight * hidden_states.to(input_dtype)
268
+
269
+
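As a quick illustration of what `RMSNorm.forward` computes, here is a minimal pure-Python sketch on a single vector. The function name `rms_norm` and the sample values are assumptions made for the example, and the float32 upcast in the module is not modeled.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-channel weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares, no mean subtraction
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


x = [3.0, -4.0, 0.0, 12.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights the output has a root mean square of (almost exactly) 1.
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)
```

Unlike `nn.LayerNorm`, this does not subtract the mean and has no bias term.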
270
+ LAYER_NORM = {
271
+ 'layer_norm': nn.LayerNorm,
272
+ 'rms_norm': RMSNorm
273
+ }
274
+
275
+
276
+ class NewEmbeddings(nn.Module):
277
+ """
278
+ Embedding and Unpadding.
279
+ """
280
+
281
+ def __init__(self, config: NewConfig):
282
+ super().__init__()
283
+ self.padding_idx = config.pad_token_id
284
+ self.word_embeddings = nn.Embedding(
285
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
286
+ )
287
+
288
+ self.position_embedding_type = config.position_embedding_type
289
+ if self.position_embedding_type == 'absolute':
290
+ self.position_embeddings = nn.Embedding(
291
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
292
+ )
293
+ elif self.position_embedding_type == 'rope':
294
+ self._init_rope(config)
295
+ else:
296
+ raise ValueError(f"Unknown position embedding type {self.position_embedding_type}")
297
+
298
+ self.type_vocab_size = config.type_vocab_size
299
+ if self.type_vocab_size > 0:
300
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
301
+
302
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
303
+ # any TensorFlow checkpoint file
304
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
305
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
306
+ # position_ids is contiguous in memory and excluded when serialized
307
+ self.register_buffer(
308
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
309
+ )
310
+
311
+ def _init_rope(self, config):
312
+ kwargs = dict(
313
+ dim=int(config.hidden_size / config.num_attention_heads),
314
+ max_position_embeddings=config.max_position_embeddings,
315
+ base=config.rope_theta
316
+ )
317
+ if config.rope_scaling is None:
318
+ self.rotary_emb = RotaryEmbedding(**kwargs)
319
+ else:
320
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
321
+ scaling_type = config.rope_scaling["type"]
322
+ if scaling_type == 'ntk':
323
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
324
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
325
+ # elif scaling_type == "linear":
326
+ # self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
327
+ # elif scaling_type == "dynamic":
328
+ # self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
329
+ else:
330
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
331
+
332
+ def forward(
333
+ self,
334
+ unpad_inputs: bool,
335
+ input_ids: Optional[torch.Tensor] = None,
336
+ attention_mask: Optional[torch.Tensor] = None,
337
+ length: Optional[List[int]] = None,
338
+ token_type_ids: Optional[torch.Tensor] = None,
339
+ position_ids: Optional[torch.Tensor] = None,
340
+ inputs_embeds: Optional[torch.Tensor] = None,
341
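+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
342
+ """Embed the inputs (optionally unpadded) and return `(embeddings, attention_mask, rope_embeds, length)`.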
343
+ """
344
+ if inputs_embeds is None:
345
+ device, input_shape = input_ids.device, input_ids.shape
346
+ else:
347
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
348
+ batch_size, seq_length = input_shape
349
+
350
+ # Set attention_mask if it's None
351
+ if attention_mask is None:
352
+ attention_mask = torch.ones(input_shape, device=device)
353
+ if length is not None:
354
+ for i, l in enumerate(length):
355
+ attention_mask[i, l:] = 0
356
+
357
+ # Set attention_mask_bool for unpadding
358
+ if unpad_inputs:
359
+ attention_mask_bool = attention_mask.bool()
360
+ if length is None:
361
+ length = attention_mask.sum(-1).tolist()
362
+
363
+ # Get word embeddings
364
+ if inputs_embeds is None:
365
+ if unpad_inputs:
366
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
367
+ inputs_embeds = self.word_embeddings(input_ids)
368
+ else:
369
+ if unpad_inputs:
370
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
371
+ embeddings = inputs_embeds
372
+
373
+ # Set and unpad position_ids
374
+ if position_ids is None:
375
+ if seq_length > self.position_ids.size(0):
376
+ self.register_buffer(
377
+ "position_ids", torch.arange(seq_length), persistent=False
378
+ )
379
+ if unpad_inputs:
380
+ # [1, cumsum_seq_len]
381
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
382
+ else:
383
+ # [bs, seq_len]
384
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
385
+ elif unpad_inputs:
386
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0) # [1, cumsum_seq_len]
387
+
388
+ # Compute rotary embedding
389
+ if self.position_embedding_type == 'rope':
390
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
391
+ rope_cos = rope_cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
392
+ rope_sin = rope_sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
393
+ rope_embeds = rope_cos, rope_sin
394
+ else:
395
+ rope_embeds = None
396
+
397
+ if self.type_vocab_size > 0:
398
+ if token_type_ids is None:
399
+ token_type_ids = position_ids.mul(0)
400
+ elif unpad_inputs:
401
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
402
+
403
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
404
+ embeddings += token_type_embeddings
405
+
406
+ # BERT position
407
+ if self.position_embedding_type == "absolute":
408
+ position_embeddings = self.position_embeddings(position_ids)
409
+ embeddings += position_embeddings
410
+
411
+ embeddings = self.LayerNorm(embeddings)
412
+ embeddings = self.dropout(embeddings)
413
+
414
+ return embeddings, attention_mask, rope_embeds, length
415
+
416
+
417
+ class NewAttention(nn.Module):
418
+ def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
419
+ super().__init__()
420
+ self.config = config
421
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
422
+ raise ValueError(
423
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
424
+ f"heads ({config.num_attention_heads})"
425
+ )
426
+
427
+ self.hidden_size = config.hidden_size
428
+ self.num_attention_heads = config.num_attention_heads
429
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
430
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
431
+
432
+ if pack_qkv is None:
433
+ pack_qkv = config.pack_qkv
434
+ self.pack_qkv = pack_qkv
435
+
436
+ if self.pack_qkv:
437
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
438
+ else:
439
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
440
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
441
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
442
+
443
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
444
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
445
+
446
+ if use_memory_efficient_attention is None:
447
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
448
+ self.use_memory_efficient_attention = use_memory_efficient_attention
449
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
450
+ if self.use_memory_efficient_attention:
451
+ assert self.memory_efficient_attention is not None, 'please install xformers'
452
+ if self.config.unpad_inputs:
453
+ assert self.config.use_memory_efficient_attention, 'unpad only with xformers'
454
+
455
+ def forward(
456
+ self,
457
+ hidden_states: torch.Tensor,
458
+ attention_bias: torch.FloatTensor,
459
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
460
+ attention_scale: Optional[torch.FloatTensor] = None,
461
+ head_mask: Optional[torch.FloatTensor] = None,
462
+ output_attentions: Optional[bool] = False,
463
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
464
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
465
+ ) -> Tuple[torch.Tensor, ...]:
466
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
467
+ # qkv
468
+ if self.pack_qkv and qkv_inputs is None:
469
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
470
+ else:
471
+ if qkv_inputs is None:
472
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
473
+ qkv_pack = [
474
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
475
+ ]
476
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
477
+
478
+ if self.config.position_embedding_type == 'rope':
479
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
480
+
481
+ dtype = query_states.dtype
482
+
483
+ if self.config.logn_attention_scale and attention_scale is not None:
484
+ # https://kexue.fm/archives/8823
485
+ query_states = query_states * attention_scale.to(dtype)
486
+
487
+ if padding_inputs is not None:
488
+ query_states = pad_input(query_states.squeeze(), *padding_inputs)
489
+ key_states = pad_input(key_states.squeeze(), *padding_inputs)
490
+ value_states = pad_input(value_states.squeeze(), *padding_inputs)
491
+
492
+ if self.use_memory_efficient_attention:
493
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
494
+ assert output_attentions is False, "memory_efficient_attention does not output attentions"
495
+ assert head_mask is None, "head_mask is not supported yet"
496
+ attention_probs = None
497
+ if torch.is_tensor(attention_bias):
498
+ attention_bias = attention_bias.to(dtype)
499
+ context_layer = self.memory_efficient_attention(
500
+ query_states,
501
+ key_states,
502
+ value_states,
503
+ attn_bias=attention_bias,
504
+ p=self.dropout.p
505
+ )
506
+ else:
507
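+ attention_probs = None  # this path does not return attention weights; defining it avoids a NameError when output_attentions=True
+ context_layer = self._attention(query_states, key_states, value_states, attention_bias, head_mask)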
508
+
509
+ if padding_inputs is not None:
510
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
511
+
512
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
513
+ context_layer = context_layer.view(new_context_layer_shape)
514
+
515
+ # output proj
516
+ attn_output = self.o_proj(context_layer)
517
+
518
+ # add attentions if we output them
519
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
520
+ return outputs
521
+
522
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
523
+ """
524
+ Args:
525
+ q/k/v: (B, L, n_head, head_dim),
526
+ Returns:
527
+ attn_output: (B, L, n_head, head_dim)
528
+ """
529
+ query_states = query_states.transpose(1, 2)
530
+ key_states = key_states.transpose(1, 2)
531
+ value_states = value_states.transpose(1, 2)
532
+ # Take the dot product between "query" and "key" to get the raw attention scores.
533
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
534
+
535
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
536
+ if attention_bias is not None:
537
+ # Apply the attention mask (precomputed for all layers in the model's forward() function)
538
+ attention_scores = attention_scores + attention_bias
539
+
540
+ # Normalize the attention scores to probabilities.
541
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
542
+
543
+ # This is actually dropping out entire tokens to attend to, which might
544
+ # seem a bit unusual, but is taken from the original Transformer paper.
545
+ attention_probs = self.dropout(attention_probs)
546
+
547
+ # Mask heads if we want to
548
+ if head_mask is not None:
549
+ attention_probs = attention_probs * head_mask
550
+
551
+ context_layer = torch.matmul(attention_probs, value_states)
552
+
553
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
554
+ return context_layer
555
+
556
+
557
+ class NewSdpaAttention(NewAttention):
558
+ """
559
+ New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
560
+ `NewAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to the
561
+ SDPA API.
562
+ """
563
+ def __init__(self, config: NewConfig, **kwargs):
564
+ super().__init__(config, **kwargs)
565
+ torch.backends.cuda.enable_mem_efficient_sdp(False)
566
+ logger.warning(
567
+ "Disabling the memory-efficient attention kernel for `NewSdpaAttention`; you can set "
568
+ "`use_memory_efficient_attention=True` if you intend to use it."
569
+ )
570
+
571
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
572
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
573
+ query_states.transpose(1, 2),
574
+ key_states.transpose(1, 2),
575
+ value_states.transpose(1, 2),
576
+ attn_mask=attention_bias,
577
+ dropout_p=self.dropout.p if self.training else 0.0,
578
+ )
579
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
580
+ return attn_output
581
+
582
+
583
+ NEW_ATTENTION_CLASSES = {
584
+ "eager": NewAttention,
585
+ # "flash_attention_2": , # TODO: xformers will dispatch to flash_attn
586
+ "sdpa": NewSdpaAttention,
587
+ }
588
+
589
+
590
+ class NewGatedMLP(nn.Module):
591
+ """
592
+ GLU Variants Improve Transformer.
593
+ """
594
+
595
+ def __init__(self, config: NewConfig):
596
+ super().__init__()
597
+ self.intermediate_size = config.intermediate_size
598
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
599
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
600
+ self.act_fn = ACT2FN[config.hidden_act]
601
+ if config.hidden_dropout_prob > 0:
602
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
603
+ else:
604
+ self.hidden_dropout = None
605
+
606
+ def forward(self, hidden_states):
607
+ up_gate = self.up_gate_proj(hidden_states)
608
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
609
+ gate = self.act_fn(gate)
610
+ gated_states = gate * up_states
611
+ if self.hidden_dropout is not None:
612
+ gated_states = self.hidden_dropout(gated_states)
613
+ down_states = self.down_proj(gated_states)
614
+ return down_states
615
+
616
+
617
+ class NewLayer(nn.Module):
618
+ def __init__(
619
+ self,
620
+ config: NewConfig,
621
+ pack_qkv=None,
622
+ use_memory_efficient_attention=None,
623
+ attn_implementation=None
624
+ ):
625
+ super().__init__()
626
+ if attn_implementation is None:
627
+ attn_implementation = config._attn_implementation
628
+ if attn_implementation != 'eager':
629
+ use_memory_efficient_attention = False
630
+ self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
631
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
632
+ )
633
+ self.mlp = NewGatedMLP(config)
634
+
635
+ ln_class = LAYER_NORM[config.layer_norm_type]
636
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
637
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
638
+
639
+ if config.hidden_dropout_prob > 0:
640
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
641
+ else:
642
+ self.hidden_dropout = None
643
+
644
+ def forward(
645
+ self,
646
+ hidden_states: torch.Tensor,
647
+ attention_bias: torch.FloatTensor,
648
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
649
+ attention_scale: Optional[torch.FloatTensor] = None,
650
+ subset_indices: Optional[torch.LongTensor] = None,
651
+ head_mask: Optional[torch.FloatTensor] = None,
652
+ output_attentions: Optional[bool] = False,
653
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
654
+ padding_inputs: Optional[Tuple] = None,
655
+ ) -> Tuple[torch.Tensor, ...]:
656
+ # Multi head self attention
657
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
658
+ attention_outputs = self.attention(
659
+ hidden_states,
660
+ attention_bias,
661
+ rope_embeds,
662
+ attention_scale,
663
+ head_mask,
664
+ output_attentions=output_attentions,
665
+ qkv_inputs=qkv_inputs,
666
+ padding_inputs=padding_inputs,
667
+ )
668
+ hidden_states = attention_outputs[0]
669
+ if self.hidden_dropout is not None:
670
+ hidden_states = self.hidden_dropout(hidden_states)
671
+ hidden_states = residual + hidden_states
672
+
673
+ # In pretraining, after the attention of the last layer, we only need the masked tokens.
674
+ if subset_indices is not None:
675
+ hidden_states = hidden_states[subset_indices]
676
+
677
+ hidden_states = self.attn_ln(hidden_states)
678
+
679
+ # Fully Connected
680
+ residual = hidden_states
681
+ hidden_states = self.mlp(hidden_states)
682
+ if self.hidden_dropout is not None:
683
+ hidden_states = self.hidden_dropout(hidden_states)
684
+ hidden_states = residual + hidden_states
685
+ hidden_states = self.mlp_ln(hidden_states)
686
+
687
+ # add self attentions if we output attention weights
688
+ outputs = (hidden_states,) + attention_outputs[1:]
689
+ return outputs
690
+
691
+
692
+ class NewEncoder(nn.Module):
693
+ def __init__(self, config):
694
+ super().__init__()
695
+ self.config = config
696
+ self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
697
+ self.gradient_checkpointing = False
698
+
699
+ def forward(
700
+ self,
701
+ hidden_states: torch.Tensor,
702
+ attention_bias: Optional[torch.FloatTensor] = None,
703
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
704
+ attention_scale: Optional[torch.FloatTensor] = None,
705
+ subset_indices: Optional[torch.LongTensor] = None,
706
+ head_mask: Optional[torch.FloatTensor] = None,
707
+ output_attentions: Optional[bool] = False,
708
+ output_hidden_states: Optional[bool] = False,
709
+ return_dict: Optional[bool] = True,
710
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
711
+ all_hidden_states = () if output_hidden_states else None
712
+ all_self_attentions = () if output_attentions else None
713
+
714
+ for i, layer_module in enumerate(self.layer):
715
+ if output_hidden_states:
716
+ all_hidden_states = all_hidden_states + (hidden_states,)
717
+
718
+ if i >= len(self.layer) - 1:
719
+ layer_subset_indices = subset_indices
720
+ else:
721
+ layer_subset_indices = None
722
+
723
+ layer_head_mask = head_mask[i] if head_mask is not None else None
724
+
725
+ if self.gradient_checkpointing and self.training:
726
+ layer_outputs = self._gradient_checkpointing_func(
727
+ layer_module.__call__,
728
+ hidden_states,
729
+ attention_bias,
730
+ rope_embeds,
731
+ attention_scale,
732
+ layer_subset_indices,
733
+ layer_head_mask,
734
+ )
735
+ else:
736
+ layer_outputs = layer_module(
737
+ hidden_states,
738
+ attention_bias,
739
+ rope_embeds,
740
+ attention_scale,
741
+ layer_subset_indices,
742
+ layer_head_mask,
743
+ output_attentions,
744
+ )
745
+
746
+ hidden_states = layer_outputs[0]
747
+ if output_attentions:
748
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
749
+
750
+ if output_hidden_states:
751
+ all_hidden_states = all_hidden_states + (hidden_states,)
752
+
753
+ if not return_dict:
754
+ return tuple(
755
+ v
756
+ for v in [
757
+ hidden_states,
758
+ all_hidden_states,
759
+ all_self_attentions,
760
+ ]
761
+ if v is not None
762
+ )
763
+ return BaseModelOutput(
764
+ last_hidden_state=hidden_states,
765
+ hidden_states=all_hidden_states,
766
+ attentions=all_self_attentions,
767
+ )
768
+
769
+
770
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
771
+ class NewPooler(nn.Module):
772
+ def __init__(self, config):
773
+ super().__init__()
774
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
775
+ self.activation = nn.Tanh()
776
+
777
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
778
+ # We "pool" the model by simply taking the hidden state corresponding
779
+ # to the first token.
780
+ first_token_tensor = hidden_states[:, 0]
781
+ pooled_output = self.dense(first_token_tensor)
782
+ pooled_output = self.activation(pooled_output)
783
+ return pooled_output
784
+
785
+
786
+ class NewPreTrainedModel(PreTrainedModel):
787
+ """
788
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
789
+ models.
790
+ """
791
+
792
+ config_class = NewConfig
793
+ base_model_prefix = "new"
794
+ supports_gradient_checkpointing = True
795
+
796
+ def _init_weights(self, module):
797
+ """Initialize the weights"""
798
+ if isinstance(module, nn.Linear):
799
+ # Slightly different from the TF version which uses truncated_normal for initialization
800
+ # cf https://github.com/pytorch/pytorch/pull/5617
801
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
802
+ if module.bias is not None:
803
+ module.bias.data.zero_()
804
+ elif isinstance(module, nn.Embedding):
805
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
806
+ if module.padding_idx is not None:
807
+ module.weight.data[module.padding_idx].zero_()
808
+ elif isinstance(module, nn.LayerNorm):
809
+ module.bias.data.zero_()
810
+ module.weight.data.fill_(1.0)
811
+
812
+
813
+ class NewModel(NewPreTrainedModel):
814
+ """
815
+ The bare New Model transformer outputting raw hidden-states without any specific head on top.
816
+ """
817
+
818
+ def __init__(self, config: NewConfig, add_pooling_layer=False):
819
+ super().__init__(config)
820
+ self.config = config
821
+
822
+ self.embeddings = NewEmbeddings(config)
823
+ self.encoder = NewEncoder(config)
824
+
825
+ self.pooler = NewPooler(config) if add_pooling_layer else None
826
+
827
+ # Initialize weights and apply final processing
828
+ self.post_init()
829
+
830
+ def get_input_embeddings(self):
831
+ return self.embeddings.word_embeddings
832
+
833
+ def set_input_embeddings(self, value):
834
+ self.embeddings.word_embeddings = value
835
+
836
+ def forward(
837
+ self,
838
+ input_ids: Optional[torch.Tensor] = None,
839
+ attention_mask: Optional[torch.Tensor] = None,
840
+ length: Optional[List[int]] = None,
841
+ subset_indices: Optional[torch.LongTensor] = None,
842
+ token_type_ids: Optional[torch.Tensor] = None,
843
+ position_ids: Optional[torch.Tensor] = None,
844
+ head_mask: Optional[torch.Tensor] = None,
845
+ inputs_embeds: Optional[torch.Tensor] = None,
846
+ output_attentions: Optional[bool] = None,
847
+ output_hidden_states: Optional[bool] = None,
848
+ return_dict: Optional[bool] = None,
849
+ unpad_inputs: Optional[bool] = None,
850
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
851
+ r"""
852
+ length (`list` of length `batch_size`, *optional*):
853
+ If `None`, return the padded `last_hidden_state`.
854
+ subset_indices (`torch.LongTensor`, *optional*):
855
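+ If given, only these token positions are kept after the attention of the last layer (used in pretraining to keep the masked tokens).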
856
+ unpad_inputs (`bool`, *optional*):
857
+ Whether to remove padding tokens before the encoder. Defaults to `config.unpad_inputs`.
858
+ """
859
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
860
+ output_hidden_states = (
861
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
862
+ )
863
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
864
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
865
+ output_padded = length is None
866
+
867
+ if input_ids is not None and inputs_embeds is not None:
868
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
869
+ elif input_ids is not None:
870
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
871
+ input_shape = input_ids.size()
872
+ elif inputs_embeds is not None:
873
+ input_shape = inputs_embeds.size()[:-1]
874
+ else:
875
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
876
+
877
+ # TODO: not used
878
+ # # Prepare head mask if needed
879
+ # # 1.0 in head_mask indicate we keep the head
880
+ # # attention_probs has shape bsz x n_heads x N x N
881
+ # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
882
+ # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
883
+ # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
884
+
885
+ # Get embeddings, may unpad them
886
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
887
+ unpad_inputs,
888
+ input_ids=input_ids,
889
+ attention_mask=attention_mask,
890
+ length=length,
891
+ token_type_ids=token_type_ids,
892
+ position_ids=position_ids,
893
+ inputs_embeds=inputs_embeds
894
+ )
895
+
896
+ batch_size, seq_length = input_shape
897
+
898
+ if unpad_inputs:
899
+ assert self.config.use_memory_efficient_attention
900
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
901
+ else:
902
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
903
+ # ourselves in which case we just need to make it broadcastable to all heads.
904
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
905
+ if self.config.use_memory_efficient_attention:
906
+ # Expand explicitly; a broadcastable bias fails with "Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))"
907
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
908
+
909
+ if self.config.logn_attention_scale:
910
+ # attention scale log_512(input_len)
911
+ attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
912
+ # at inference time, the log-n scale must be clipped to a minimum of 1
913
+ if self.config.logn_attention_clip1:
914
+ attention_scale.clip_(1)
915
+ attention_scale = attention_scale[:, None, None, None]
916
+ else:
917
+ attention_scale = None
918
+
919
+ encoder_outputs = self.encoder(
920
+ embedding_output,
921
+ attention_bias=attention_bias,
922
+ rope_embeds=rope_embeds,
923
+ attention_scale=attention_scale,
924
+ subset_indices=subset_indices,
925
+ head_mask=head_mask,
926
+ output_attentions=output_attentions,
927
+ output_hidden_states=output_hidden_states,
928
+ return_dict=return_dict,
929
+ )
930
+ sequence_output = encoder_outputs[0]
931
+ if unpad_inputs and output_padded:
932
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
933
+ sequence_output = pad_input(
934
+ sequence_output.squeeze(), indices, batch_size, seq_length
935
+ )
936
+
937
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
938
+
939
+ if not return_dict:
940
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
941
+
942
+ return BaseModelOutputWithPooling(
943
+ last_hidden_state=sequence_output,
944
+ pooler_output=pooled_output,
945
+ hidden_states=encoder_outputs.hidden_states,
946
+ attentions=encoder_outputs.attentions,
947
+ )
948
+
949
+
950
+ class NewLMPredictionHead(nn.Module):
951
+ def __init__(self, config):
952
+ super().__init__()
953
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
954
+ self.transform_act_fn = ACT2FN[config.hidden_act]
955
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
956
+
957
+ # The output weights are the same as the input embeddings, but there is
958
+ # an output-only bias for each token.
959
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
960
+
961
+ def forward(self, hidden_states):
962
+ hidden_states = self.dense(hidden_states)
963
+ hidden_states = self.transform_act_fn(hidden_states)
964
+ hidden_states = self.norm(hidden_states)
965
+ hidden_states = self.decoder(hidden_states)
966
+ return hidden_states
967
+
968
+
969
+ class NewForMaskedLM(NewPreTrainedModel):
970
+ _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
971
+
972
+ def __init__(self, config: NewConfig):
973
+ super().__init__(config)
974
+ self.new = NewModel(config, add_pooling_layer=False)
975
+ self.lm_head = NewLMPredictionHead(config)
976
+ self.loss_fct = nn.CrossEntropyLoss()
977
+
978
+ # Initialize weights and apply final processing
979
+ self.post_init()
980
+
981
+ def get_output_embeddings(self):
982
+ return self.lm_head.decoder
983
+
984
+ def set_output_embeddings(self, new_embeddings):
985
+ self.lm_head.decoder = new_embeddings
986
+
987
+ def forward(
988
+ self,
989
+ input_ids: Optional[torch.Tensor] = None,
990
+ attention_mask: Optional[torch.Tensor] = None,
991
+ token_type_ids: Optional[torch.Tensor] = None,
992
+ position_ids: Optional[torch.Tensor] = None,
993
+ head_mask: Optional[torch.Tensor] = None,
994
+ inputs_embeds: Optional[torch.Tensor] = None,
995
+ labels: Optional[torch.Tensor] = None,
996
+ output_attentions: Optional[bool] = None,
997
+ output_hidden_states: Optional[bool] = None,
998
+ return_dict: Optional[bool] = None,
999
+ unpad_inputs: Optional[bool] = None,
1000
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
1001
+ r"""
1002
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1003
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1004
+ config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
1005
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1006
+ """
1007
+
1008
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1009
+
1010
+ if labels is None or not self.new.config.unpad_inputs:
1011
+ length = None
1012
+ subset_indices = None
1013
+ else:
1014
+ length = attention_mask.sum(-1).tolist()
1015
+ labels = labels[attention_mask.bool()].unsqueeze(0)
1016
+ subset_indices = labels > -100
1017
+
1018
+ outputs = self.new(
1019
+ input_ids,
1020
+ attention_mask=attention_mask,
1021
+ length=length,
1022
+ subset_indices=subset_indices,
1023
+ token_type_ids=token_type_ids,
1024
+ position_ids=position_ids,
1025
+ head_mask=head_mask,
1026
+ inputs_embeds=inputs_embeds,
1027
+ output_attentions=output_attentions,
1028
+ output_hidden_states=output_hidden_states,
1029
+ return_dict=return_dict,
1030
+ unpad_inputs=unpad_inputs,
1031
+ )
1032
+
1033
+ sequence_output = outputs[0]
1034
+ prediction_scores = self.lm_head(sequence_output)
1035
+
1036
+ masked_lm_loss = None
1037
+ if labels is not None:
1038
+ if subset_indices is None:
1039
+ mask = attention_mask.bool()
1040
+ prediction_scores = prediction_scores[mask]
1041
+ labels = labels[mask]
1042
+ else:
1043
+ labels = labels[subset_indices]
1044
+ masked_lm_loss = self.loss_fct(prediction_scores, labels)
1045
+
1046
+ if not return_dict:
1047
+ output = (prediction_scores,) + outputs[2:]
1048
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1049
+
1050
+ return MaskedLMOutput(
1051
+ loss=masked_lm_loss,
1052
+ logits=prediction_scores,
1053
+ hidden_states=outputs.hidden_states,
1054
+ attentions=outputs.attentions,
1055
+ )
1056
+
1057
+
1058
+ class NewForSequenceClassification(NewPreTrainedModel):
1059
+ def __init__(self, config):
1060
+ super().__init__(config)
1061
+ self.num_labels = config.num_labels
1062
+ self.config = config
1063
+
1064
+ self.new = NewModel(config, add_pooling_layer=True)
1065
+ classifier_dropout = (
1066
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1067
+ )
1068
+ self.dropout = nn.Dropout(classifier_dropout)
1069
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1070
+
1071
+ # Initialize weights and apply final processing
1072
+ self.post_init()
1073
+
1074
+ def forward(
1075
+ self,
1076
+ input_ids: Optional[torch.Tensor] = None,
1077
+ attention_mask: Optional[torch.Tensor] = None,
1078
+ token_type_ids: Optional[torch.Tensor] = None,
1079
+ position_ids: Optional[torch.Tensor] = None,
1080
+ head_mask: Optional[torch.Tensor] = None,
1081
+ inputs_embeds: Optional[torch.Tensor] = None,
1082
+ labels: Optional[torch.Tensor] = None,
1083
+ output_attentions: Optional[bool] = None,
1084
+ output_hidden_states: Optional[bool] = None,
1085
+ return_dict: Optional[bool] = None,
1086
+ unpad_inputs: Optional[bool] = None,
1087
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1088
+ r"""
1089
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1090
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1091
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1092
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1093
+ """
1094
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1095
+
1096
+ outputs = self.new(
1097
+ input_ids,
1098
+ attention_mask=attention_mask,
1099
+ token_type_ids=token_type_ids,
1100
+ position_ids=position_ids,
1101
+ head_mask=head_mask,
1102
+ inputs_embeds=inputs_embeds,
1103
+ output_attentions=output_attentions,
1104
+ output_hidden_states=output_hidden_states,
1105
+ return_dict=return_dict,
1106
+ unpad_inputs=unpad_inputs,
1107
+ )
1108
+
1109
+ pooled_output = outputs[1]
1110
+
1111
+ pooled_output = self.dropout(pooled_output)
1112
+ logits = self.classifier(pooled_output)
1113
+
1114
+ loss = None
1115
+ if labels is not None:
1116
+ if self.config.problem_type is None:
1117
+ if self.num_labels == 1:
1118
+ self.config.problem_type = "regression"
1119
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1120
+ self.config.problem_type = "single_label_classification"
1121
+ else:
1122
+ self.config.problem_type = "multi_label_classification"
1123
+
1124
+ if self.config.problem_type == "regression":
1125
+ loss_fct = nn.MSELoss()
1126
+ if self.num_labels == 1:
1127
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1128
+ else:
1129
+ loss = loss_fct(logits, labels)
1130
+ elif self.config.problem_type == "single_label_classification":
1131
+ loss_fct = nn.CrossEntropyLoss()
1132
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1133
+ elif self.config.problem_type == "multi_label_classification":
1134
+ loss_fct = nn.BCEWithLogitsLoss()
1135
+ loss = loss_fct(logits, labels)
1136
+
1137
+ if not return_dict:
1138
+ output = (logits,) + outputs[2:]
1139
+ return ((loss,) + output) if loss is not None else output
1140
+
1141
+ return SequenceClassifierOutput(
1142
+ loss=loss,
1143
+ logits=logits,
1144
+ hidden_states=outputs.hidden_states,
1145
+ attentions=outputs.attentions,
1146
+ )
1147
+
1148
+
1149
+ class NewForMultipleChoice(NewPreTrainedModel):
1150
+ def __init__(self, config):
1151
+ super().__init__(config)
1152
+
1153
+ self.new = NewModel(config, add_pooling_layer=True)
1154
+ classifier_dropout = (
1155
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1156
+ )
1157
+ self.dropout = nn.Dropout(classifier_dropout)
1158
+ self.classifier = nn.Linear(config.hidden_size, 1)
1159
+
1160
+ # Initialize weights and apply final processing
1161
+ self.post_init()
1162
+
1163
+ def forward(
1164
+ self,
1165
+ input_ids: Optional[torch.Tensor] = None,
1166
+ attention_mask: Optional[torch.Tensor] = None,
1167
+ token_type_ids: Optional[torch.Tensor] = None,
1168
+ position_ids: Optional[torch.Tensor] = None,
1169
+ head_mask: Optional[torch.Tensor] = None,
1170
+ inputs_embeds: Optional[torch.Tensor] = None,
1171
+ labels: Optional[torch.Tensor] = None,
1172
+ output_attentions: Optional[bool] = None,
1173
+ output_hidden_states: Optional[bool] = None,
1174
+ return_dict: Optional[bool] = None,
1175
+ unpad_inputs: Optional[bool] = None,
1176
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
1177
+ r"""
1178
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1179
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1180
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1181
+ `input_ids` above)
1182
+ """
1183
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1184
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1185
+
1186
+ input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1187
+ attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1188
+ token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1189
+ position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1190
+ inputs_embeds = (
1191
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1192
+ if inputs_embeds is not None
1193
+ else None
1194
+ )
1195
+
1196
+ outputs = self.new(
1197
+ input_ids,
1198
+ attention_mask=attention_mask,
1199
+ token_type_ids=token_type_ids,
1200
+ position_ids=position_ids,
1201
+ head_mask=head_mask,
1202
+ inputs_embeds=inputs_embeds,
1203
+ output_attentions=output_attentions,
1204
+ output_hidden_states=output_hidden_states,
1205
+ return_dict=return_dict,
1206
+ unpad_inputs=unpad_inputs,
1207
+ )
1208
+
1209
+ pooled_output = outputs[1]
1210
+
1211
+ pooled_output = self.dropout(pooled_output)
1212
+ logits = self.classifier(pooled_output)
1213
+ reshaped_logits = logits.view(-1, num_choices)
1214
+
1215
+ loss = None
1216
+ if labels is not None:
1217
+ loss_fct = nn.CrossEntropyLoss()
1218
+ loss = loss_fct(reshaped_logits, labels)
1219
+
1220
+ if not return_dict:
1221
+ output = (reshaped_logits,) + outputs[2:]
1222
+ return ((loss,) + output) if loss is not None else output
1223
+
1224
+ return MultipleChoiceModelOutput(
1225
+ loss=loss,
1226
+ logits=reshaped_logits,
1227
+ hidden_states=outputs.hidden_states,
1228
+ attentions=outputs.attentions,
1229
+ )
1230
+
1231
+
1232
+ class NewForTokenClassification(NewPreTrainedModel):
1233
+ def __init__(self, config):
1234
+ super().__init__(config)
1235
+ self.num_labels = config.num_labels
1236
+
1237
+ self.new = NewModel(config, add_pooling_layer=False)
1238
+ classifier_dropout = (
1239
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1240
+ )
1241
+ self.dropout = nn.Dropout(classifier_dropout)
1242
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1243
+
1244
+ # Initialize weights and apply final processing
1245
+ self.post_init()
1246
+
1247
+ def forward(
1248
+ self,
1249
+ input_ids: Optional[torch.Tensor] = None,
1250
+ attention_mask: Optional[torch.Tensor] = None,
1251
+ token_type_ids: Optional[torch.Tensor] = None,
1252
+ position_ids: Optional[torch.Tensor] = None,
1253
+ head_mask: Optional[torch.Tensor] = None,
1254
+ inputs_embeds: Optional[torch.Tensor] = None,
1255
+ labels: Optional[torch.Tensor] = None,
1256
+ output_attentions: Optional[bool] = None,
1257
+ output_hidden_states: Optional[bool] = None,
1258
+ return_dict: Optional[bool] = None,
1259
+ unpad_inputs: Optional[bool] = None,
1260
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
1261
+ r"""
1262
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1263
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1264
+ """
1265
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1266
+
1267
+ outputs = self.new(
1268
+ input_ids,
1269
+ attention_mask=attention_mask,
1270
+ token_type_ids=token_type_ids,
1271
+ position_ids=position_ids,
1272
+ head_mask=head_mask,
1273
+ inputs_embeds=inputs_embeds,
1274
+ output_attentions=output_attentions,
1275
+ output_hidden_states=output_hidden_states,
1276
+ return_dict=return_dict,
1277
+ unpad_inputs=unpad_inputs,
1278
+ )
1279
+
1280
+ sequence_output = outputs[0]
1281
+
1282
+ sequence_output = self.dropout(sequence_output)
1283
+ logits = self.classifier(sequence_output)
1284
+
1285
+ loss = None
1286
+ if labels is not None:
1287
+ loss_fct = nn.CrossEntropyLoss()
1288
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1289
+
1290
+ if not return_dict:
1291
+ output = (logits,) + outputs[2:]
1292
+ return ((loss,) + output) if loss is not None else output
1293
+
1294
+ return TokenClassifierOutput(
1295
+ loss=loss,
1296
+ logits=logits,
1297
+ hidden_states=outputs.hidden_states,
1298
+ attentions=outputs.attentions,
1299
+ )
1300
+
1301
+
1302
+ class NewForQuestionAnswering(NewPreTrainedModel):
1303
+ def __init__(self, config):
1304
+ super().__init__(config)
1305
+ self.num_labels = config.num_labels
1306
+
1307
+ self.new = NewModel(config, add_pooling_layer=False)
1308
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1309
+
1310
+ # Initialize weights and apply final processing
1311
+ self.post_init()
1312
+
1313
+ def forward(
1314
+ self,
1315
+ input_ids: Optional[torch.Tensor] = None,
1316
+ attention_mask: Optional[torch.Tensor] = None,
1317
+ token_type_ids: Optional[torch.Tensor] = None,
1318
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        start_positions: Optional[torch.Tensor] = None,
+        end_positions: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside the sequence
+            are not taken into account for computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside the sequence
+            are not taken into account for computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        sequence_output = outputs[0]
+
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # On multi-GPU, the batch split can add a trailing dimension; squeeze it away
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # Start/end positions sometimes fall outside the model inputs; clamp them to
+            # `ignored_index` so that the loss skips those terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
+
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
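The loss computation at the end of this `forward` hunk clamps out-of-range labels to `sequence_length` and passes that same value as `ignore_index`, so such examples drop out of the averaged loss. The following is a minimal, dependency-free sketch of that arithmetic, assuming plain Python lists in place of tensors. `span_cross_entropy` is a hypothetical helper name used only for illustration and is not part of this commit.

```python
import math


def span_cross_entropy(logits, positions):
    """Mean cross-entropy over a batch, skipping labels outside the sequence.

    logits: list of per-example lists, one score per token.
    positions: list of integer labels, one per example.
    """
    ignored_index = len(logits[0])  # one past the last valid token index
    losses = []
    for row, pos in zip(logits, positions):
        # Same effect as tensor.clamp(0, ignored_index) on a single label
        pos = max(0, min(pos, ignored_index))
        if pos == ignored_index:
            continue  # out-of-range label: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_z - row[pos])
    # No valid labels means there is nothing to average
    return sum(losses) / len(losses) if losses else float("nan")


# Second example's label (5) lies outside a 2-token sequence and is skipped
start_loss = span_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 5])
end_loss = span_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [1, 5])
total_loss = (start_loss + end_loss) / 2
print(total_loss)
```

Note that negative labels are clamped to 0 rather than ignored, so they are scored against the first token; only labels at or beyond `sequence_length` are skipped.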
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
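`modules.json` lists the pipeline stages in order: a `Transformer` module followed by a `Pooling` module whose settings live under `1_Pooling`. As a rough sketch, the ordering can be read back with the standard library alone. The embedded string copies the file's contents, `module_order` is a hypothetical helper, and treating an empty `path` as the repository root is an assumption about how the loader interprets it.

```python
import json

# Contents of modules.json from this commit, inlined for a self-contained example
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def module_order(raw):
    """Return (path, class name) pairs sorted by the `idx` field."""
    modules = sorted(json.loads(raw), key=lambda m: m["idx"])
    # Assumption: an empty path refers to the repository root, shown here as "."
    return [(m["path"] or ".", m["type"].rsplit(".", 1)[-1]) for m in modules]


for path, cls_name in module_order(MODULES_JSON):
    print(f"{path}: {cls_name}")
```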
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 8192,
+  "do_lower_case": false
+}
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 8192,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
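The `added_tokens_decoder` map in `tokenizer_config.json` is keyed by token id, while the top-level `cls_token`, `sep_token`, and similar fields name tokens by their text. A small sketch of inverting that map to look up ids by token text, using only the standard library. The inlined JSON is trimmed to the fields the example needs, and `special_token_ids` is a hypothetical helper, not an API of any tokenizer library.

```python
import json

# Trimmed copy of tokenizer_config.json from this commit (only the fields used below)
TOKENIZER_CONFIG = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "[PAD]", "special": true},
    "100": {"content": "[UNK]", "special": true},
    "101": {"content": "[CLS]", "special": true},
    "102": {"content": "[SEP]", "special": true},
    "103": {"content": "[MASK]", "special": true}
  },
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]"
}
""")


def special_token_ids(config):
    """Invert added_tokens_decoder into a token-text -> integer-id mapping."""
    decoder = config["added_tokens_decoder"]
    return {
        entry["content"]: int(token_id)
        for token_id, entry in decoder.items()
        if entry.get("special")
    }


ids = special_token_ids(TOKENIZER_CONFIG)
cls_id = ids[TOKENIZER_CONFIG["cls_token"]]
sep_id = ids[TOKENIZER_CONFIG["sep_token"]]
print(cls_id, sep_id)
```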
vocab.txt ADDED
The diff for this file is too large to render. See raw diff