Corran committed on
Commit
cee2228
1 Parent(s): c83d11e

Add SetFit model

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 512,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
README.md ADDED
@@ -0,0 +1,252 @@
1
+ ---
2
+ library_name: setfit
3
+ tags:
4
+ - setfit
5
+ - sentence-transformers
6
+ - text-classification
7
+ - generated_from_setfit_trainer
8
+ metrics:
9
+ - accuracy
10
+ widget:
11
+ - text: This paper focuses on mining association rules between sets of items in large
12
+ databases, which can reveal interesting patterns and relationships among the data.
13
+ - text: In this paper, the authors explore the economic concepts of fairness and retaliation
14
+ within the context of reciprocity, demonstrating how these principles shape market
15
+ behaviors and interactions.
16
+ - text: Further research is needed to explore the applicability of the proposed model
17
+ to more complex multi-echelon inventory systems with additional features, such
18
+ as lead time variability and supplier reliability.
19
+ - text: The NCEP/NCAR 40-Year Reanalysis Project provides retrospective atmospheric
20
+ data sets by assimilating observational data into a model, resulting in improved
21
+ estimates of historical weather patterns for meteorological research and applications.
22
+ - text: This study aims to assess the accuracy of aerosol optical properties retrieved
23
+ from Aerosol Robotic Network (AERONET) Sun and sky radiance measurements using
24
+ ground-based reference data.
25
+ pipeline_tag: text-classification
26
+ inference: true
27
+ base_model: jinaai/jina-embeddings-v2-small-en
28
+ model-index:
29
+ - name: SetFit with jinaai/jina-embeddings-v2-small-en
30
+ results:
31
+ - task:
32
+ type: text-classification
33
+ name: Text Classification
34
+ dataset:
35
+ name: Unknown
36
+ type: unknown
37
+ split: test
38
+ metrics:
39
+ - type: accuracy
40
+ value: 0.8492307692307692
41
+ name: Accuracy
42
+ ---
43
+
44
+ # SetFit with jinaai/jina-embeddings-v2-small-en
45
+
46
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [jinaai/jina-embeddings-v2-small-en](https://huggingface.co/jinaai/jina-embeddings-v2-small-en) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
47
+
48
+ The model has been trained using an efficient few-shot learning technique that involves:
49
+
50
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
51
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
52
+
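As a minimal sketch of those two stages (illustrative only: the base checkpoint and the toy dataset below are placeholders, not part of this commit), the SetFit 1.0 API drives both steps from a single `Trainer`:

```python
# Minimal sketch of the two-stage SetFit recipe described above (illustrative only).
# The base model and the tiny toy dataset are placeholders, not from this repository.
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

train_ds = Dataset.from_dict({
    "text": [
        "This study aims to develop a mathematical model ...",
        "Previous studies have shown that statins can reduce the risk of coronary events ...",
    ],
    "label": ["Aims", "Background"],
})

# Stage 1: the Sentence Transformer body is fine-tuned with contrastive pairs.
# Stage 2: a LogisticRegression head is fitted on the resulting embeddings.
model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()  # runs both stages

print(model.predict(["Further studies are needed to confirm these findings."]))
```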
53
+ ## Model Details
54
+
55
+ ### Model Description
56
+ - **Model Type:** SetFit
57
+ - **Sentence Transformer body:** [jinaai/jina-embeddings-v2-small-en](https://huggingface.co/jinaai/jina-embeddings-v2-small-en)
58
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
59
+ - **Maximum Sequence Length:** 8192 tokens
60
+ - **Number of Classes:** 13 classes
61
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
62
+ <!-- - **Language:** Unknown -->
63
+ <!-- - **License:** Unknown -->
64
+
65
+ ### Model Sources
66
+
67
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
68
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
69
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
70
+
71
+ ### Model Labels
72
+ | Label | Examples |
73
+ |:----------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
74
+ | Aims | <ul><li>'This study aims to provide an in-depth analysis of the impact of Coronavirus Disease 2019 (COVID-19) on Italy, focusing on the early stages of the outbreak and the subsequent government response.'</li><li>'In this paper, we propose SegNet, a deep convolutional encoder-decoder architecture for real-time image segmentation.'</li><li>'This study aims to develop a mathematical model for analyzing genetic variation using restriction endonucleases.'</li></ul> |
75
+ | Background | <ul><li>'Previous studies have demonstrated that statins, including pravastatin, can reduce the risk of coronary events in patients with elevated cholesterol levels. However, the efficacy of pravastatin in patients with average cholesterol levels is less clear.'</li><li>'Previous studies have shown that statins, including pravastatin, can reduce the risk of coronary events in patients with elevated cholesterol levels. However, this study investigates the effect of pravastatin on patients with average cholesterol levels.'</li><li>'Previous studies have shown that statins, including pravastatin, can reduce the risk of coronary events in patients with elevated cholesterol levels. However, this trial investigates the effect of pravastatin on patients with average cholesterol levels.'</li></ul> |
76
+ | Hypothesis | <ul><li>'Despite having average cholesterol levels, patients who received Pravastatin experienced a significant reduction in coronary events, suggesting a potential role for statins in preventing cardiovascular events beyond cholesterol level management in internal medicine.'</li><li>'This prospective observational study aimed to investigate the association between glycaemia levels and the risk of developing macrovascular and microvascular complications in individuals with type 2 diabetes, as previously identified in the UKPDS 35 study.'</li><li>'The results suggest that self-regulatory skills, particularly in the area of attention, significantly impact academic performance in elementary school students.'</li></ul> |
77
+ | Implications | <ul><li>'From 1995 to 1998, the UK Prospective Diabetes Study (UKPDS) 35 observed a significant association between higher glycaemia levels and increased risk of both macrovascular and microvascular complications in patients with type 2 diabetes.'</li><li>'The UKPDS 35 study provides robust evidence that every 1 mmol/L increase in HbA1c is associated with a 25% increased risk of macrovascular events and a 37% increased risk of microvascular complications in patients with type 2 diabetes, highlighting the importance of strict glycaemic control in internal medicine.'</li><li>"This study provides valuable insights into the early dynamics of the COVID-19 outbreak in Italy, contributing to the understanding of the disease's transmission patterns and impact on public health."</li></ul> |
78
+ | Importance | <ul><li>'Stroke and transient ischemic attack (TIA) are leading causes of long-term disability and mortality in internal medicine, with an estimated 15 million survivors worldwide.'</li><li>'The accurate assessment of insulin resistance and beta-cell function is crucial in the diagnosis and management of various metabolic disorders, including type 2 diabetes and metabolic syndrome.'</li><li>'The COVID-19 outbreak in Italy, which began in late February 2020, quickly became one of the most severe epidemic hotspots in Europe.'</li></ul> |
79
+ | Keywords | <ul><li>'Pravastatin is a statin drug commonly used in the treatment of hypercholesterolemia, specifically to lower low-density lipoprotein (LDL) cholesterol levels and reduce the risk of cardiovascular events in internal medicine.'</li><li>'Self-regulation refers to the ability of students to manage their emotions, behavior, and cognitive processes to achieve optimal learning (Zimmerman & Kitsantas, 2005).'</li><li>'The proposed method utilizes deep convolutional neural networks to extract rich features from input images, enabling both object detection and semantic segmentation with high accuracy in the field of artificial intelligence.'</li></ul> |
80
+ | Limitations | <ul><li>'However, it is important to note that the Homeostasis Model Assessment (HOMA) index does not directly measure insulin sensitivity or β-cell function, but rather provides an estimate based on fasting plasma glucose and insulin concentrations.'</li><li>'Despite providing a useful estimate of insulin resistance and beta-cell function, the Homeostasis Model Assessment has limitations in its applicability to individuals with extreme glucose or insulin levels, as well as those with certain diseases such as liver disease or pregnancy.'</li><li>'Despite the large sample size and long follow-up period, the observational nature of the study limits the ability to establish causality between glycaemia and the observed complications in type 2 diabetes.'</li></ul> |
81
+ | Method | <ul><li>'The study employed a randomized, double-blind, placebo-controlled design to investigate the effect of Pravastatin on coronary events in patients with average cholesterol levels.'</li><li>'Patients with a history of myocardial infarction and an average cholesterol level between 180 and 240 mg/dL were included in the study.'</li><li>'The study aimed to assess the impact of Pravastatin administration on the incidence of coronary events in internal medicine patients with average cholesterol levels.'</li></ul> |
82
+ | None | <ul><li>'The study enrolled patients with a recent myocardial infarction and an average cholesterol level, who were then randomly assigned to receive either pravastatin or placebo.'</li><li>'This systematic review and meta-analysis aimed to assess the efficacy and safety of dual antiplatelet therapy with aspirin and clopidogrel in the secondary prevention of stroke and transient ischemic attack in the field of internal medicine.'</li><li>'This study aims to evaluate the effectiveness of the Homeostasis Model Assessment (HOMA) in estimating insulin resistance and pancreatic beta-cell function in internal medicine, offering valuable insights for the diagnosis and management of metabolic disorders.'</li></ul> |
83
+ | Purpose | <ul><li>'This study investigates the impact of Pravastatin on reducing coronary events in internal medicine patients with average cholesterol levels after a myocardial infarction.'</li><li>'This systematic review and meta-analysis aimed to assess the efficacy and safety of dual antiplatelet therapy with aspirin and clopidogrel in the secondary prevention of stroke and transient ischemic attack in internal medicine.'</li><li>'This study aims to evaluate the effectiveness of the Homeostasis Model Assessment (HOMA) in estimating insulin resistance and beta-cell function in internal medicine patients, addressing the need for a simple and widely applicable method for diagnosing and monitoring these conditions.'</li></ul> |
84
+ | Reccomendations | <ul><li>'Further studies are needed to investigate the optimal duration of dual antiplatelet therapy in secondary prevention of stroke and transient ischemic attack, as well as the role of individual patient characteristics in determining the most effective treatment regimen.'</li><li>'Further research is warranted to explore the underlying mechanisms linking glycaemia to macrovascular and microvascular complications in type 2 diabetes, particularly in multi-ethnic populations.'</li><li>'Further studies are needed to investigate the potential role of IL-6 signaling in the prevention of bone loss in postmenopausal women.'</li></ul> |
85
+ | Result | <ul><li>'Despite having average cholesterol levels, patients treated with Pravastatin did not experience a significant reduction in coronary events compared to the placebo group.'</li><li>'In interviews with patients who experienced a reduction in coronary events after Pravastatin treatment, themes included improved energy levels and increased confidence in managing their heart health.'</li><li>'The study found that Pravastatin significantly reduced the risk of coronary events in patients with average cholesterol levels, consistent with previous research suggesting that statins benefit a wider population beyond those with hypercholesterolemia.'</li></ul> |
86
+ | Uncertainty | <ul><li>'Despite the widespread use of pravastatin in post-myocardial infarction patients with average cholesterol levels, the evidence regarding its impact on coronary events remains inconclusive and sometimes contradictory.'</li><li>'Despite the findings of this study showing a reduction in coronary events with Pravastatin use in patients with average cholesterol levels, contrasting evidence exists suggesting no significant benefit in similar patient populations (Miller et al., 2018).'</li><li>'Despite the proven benefits of dual antiplatelet therapy with aspirin and clopidogrel in the secondary prevention of cardiovascular events, particularly in coronary artery disease, there is a paucity of data specifically addressing its use in stroke or transient ischemic attack (TIA) patients.'</li></ul> |
87
+
88
+ ## Evaluation
89
+
90
+ ### Metrics
91
+ | Label | Accuracy |
92
+ |:--------|:---------|
93
+ | **all** | 0.8492 |
94
+
95
+ ## Uses
96
+
97
+ ### Direct Use for Inference
98
+
99
+ First install the SetFit library:
100
+
101
+ ```bash
102
+ pip install setfit
103
+ ```
104
+
105
+ Then you can load this model and run inference.
106
+
107
+ ```python
108
+ from setfit import SetFitModel
109
+
110
+ # Download from the 🤗 Hub
111
+ model = SetFitModel.from_pretrained("Corran/SciGenSetfit3")
112
+ # Run inference
113
+ preds = model("This paper focuses on mining association rules between sets of items in large databases, which can reveal interesting patterns and relationships among the data.")
114
+ ```
115
+
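Since the class names ship in `config_setfit.json`, the call above returns the rhetorical-role label as a string (e.g. `'Aims'`). As a hedged follow-up, assuming the standard SetFit 1.0 API and the scikit-learn `LogisticRegression` head, per-class probabilities can be inspected as well:

```python
# Per-class probabilities from the LogisticRegression head (assumes SetFit >= 1.0).
probs = model.predict_proba(
    ["Further research is needed to explore the applicability of the proposed model."]
)
# model.labels holds the 13 class names stored in config_setfit.json
for label, p in zip(model.labels, probs[0]):
    print(f"{label}: {float(p):.3f}")
```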
116
+ <!--
117
+ ### Downstream Use
118
+
119
+ *List how someone could finetune this model on their own dataset.*
120
+ -->
121
+
122
+ <!--
123
+ ### Out-of-Scope Use
124
+
125
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
126
+ -->
127
+
128
+ <!--
129
+ ## Bias, Risks and Limitations
130
+
131
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
132
+ -->
133
+
134
+ <!--
135
+ ### Recommendations
136
+
137
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
138
+ -->
139
+
140
+ ## Training Details
141
+
142
+ ### Training Set Metrics
143
+ | Training set | Min | Median | Max |
144
+ |:-------------|:----|:--------|:----|
145
+ | Word count | 11 | 28.3123 | 71 |
146
+
147
+ | Label | Training Sample Count |
148
+ |:----------------|:----------------------|
149
+ | Aims | 200 |
150
+ | Background | 200 |
151
+ | Hypothesis | 200 |
152
+ | Implications | 200 |
153
+ | Importance | 200 |
154
+ | Keywords | 200 |
155
+ | Limitations | 200 |
156
+ | Method | 200 |
157
+ | None | 200 |
158
+ | Purpose | 200 |
159
+ | Reccomendations | 200 |
160
+ | Result | 200 |
161
+ | Uncertainty | 200 |
162
+
163
+ ### Training Hyperparameters
164
+ - batch_size: (256, 256)
165
+ - num_epochs: (1, 1)
166
+ - max_steps: -1
167
+ - sampling_strategy: oversampling
168
+ - num_iterations: 40
169
+ - body_learning_rate: (2e-05, 1e-05)
170
+ - head_learning_rate: 0.01
171
+ - loss: CosineSimilarityLoss
172
+ - distance_metric: cosine_distance
173
+ - margin: 0.25
174
+ - end_to_end: False
175
+ - use_amp: False
176
+ - warmup_proportion: 0.1
177
+ - seed: 42
178
+ - eval_max_steps: -1
179
+ - load_best_model_at_end: False
180
+
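For reference, the values above correspond to fields on SetFit's `TrainingArguments`; a hedged reconstruction under the SetFit 1.0 API (the triplet-only options `distance_metric` and `margin` are left at their defaults here) would look like:

```python
# Hedged reconstruction of the listed hyperparameters with the SetFit 1.0 API.
# distance_metric and margin only affect triplet-style losses and are omitted here.
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import TrainingArguments

args = TrainingArguments(
    batch_size=(256, 256),            # (embedding fine-tuning, classifier head)
    num_epochs=(1, 1),
    max_steps=-1,
    sampling_strategy="oversampling",
    num_iterations=40,
    body_learning_rate=(2e-05, 1e-05),
    head_learning_rate=0.01,
    loss=CosineSimilarityLoss,
    end_to_end=False,
    use_amp=False,
    warmup_proportion=0.1,
    seed=42,
    eval_max_steps=-1,
    load_best_model_at_end=False,
)
```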
181
+ ### Training Results
182
+ | Epoch | Step | Training Loss | Validation Loss |
183
+ |:------:|:----:|:-------------:|:---------------:|
184
+ | 0.0025 | 1 | 0.2913 | - |
185
+ | 0.1229 | 50 | 0.2365 | - |
186
+ | 0.2457 | 100 | 0.185 | - |
187
+ | 0.3686 | 150 | 0.159 | - |
188
+ | 0.4914 | 200 | 0.1456 | - |
189
+ | 0.6143 | 250 | 0.1658 | - |
190
+ | 0.7371 | 300 | 0.1189 | - |
191
+ | 0.8600 | 350 | 0.1235 | - |
192
+ | 0.9828 | 400 | 0.1282 | - |
193
+ | 0.0049 | 1 | 0.1257 | - |
194
+ | 0.0615 | 50 | 0.1371 | - |
195
+ | 0.1230 | 100 | 0.1226 | - |
196
+ | 0.1845 | 150 | 0.1099 | - |
197
+ | 0.2460 | 200 | 0.0897 | - |
198
+ | 0.3075 | 250 | 0.1009 | - |
199
+ | 0.3690 | 300 | 0.0659 | - |
200
+ | 0.4305 | 350 | 0.0711 | - |
201
+ | 0.4920 | 400 | 0.0745 | - |
202
+ | 0.5535 | 450 | 0.0807 | - |
203
+ | 0.6150 | 500 | 0.0736 | - |
204
+ | 0.6765 | 550 | 0.0571 | - |
205
+ | 0.7380 | 600 | 0.0649 | - |
206
+ | 0.7995 | 650 | 0.0672 | - |
207
+ | 0.8610 | 700 | 0.0586 | - |
208
+ | 0.9225 | 750 | 0.0624 | - |
209
+ | 0.9840 | 800 | 0.0614 | - |
210
+
211
+ ### Framework Versions
212
+ - Python: 3.10.12
213
+ - SetFit: 1.0.3
214
+ - Sentence Transformers: 2.2.2
215
+ - Transformers: 4.36.2
216
+ - PyTorch: 2.1.0+cu121
217
+ - Datasets: 2.16.1
218
+ - Tokenizers: 0.15.0
219
+
220
+ ## Citation
221
+
222
+ ### BibTeX
223
+ ```bibtex
224
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
225
+ doi = {10.48550/ARXIV.2209.11055},
226
+ url = {https://arxiv.org/abs/2209.11055},
227
+ author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
228
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
229
+ title = {Efficient Few-Shot Learning Without Prompts},
230
+ publisher = {arXiv},
231
+ year = {2022},
232
+ copyright = {Creative Commons Attribution 4.0 International}
233
+ }
234
+ ```
235
+
236
+ <!--
237
+ ## Glossary
238
+
239
+ *Clearly define terms in order to be accessible across audiences.*
240
+ -->
241
+
242
+ <!--
243
+ ## Model Card Authors
244
+
245
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
246
+ -->
247
+
248
+ <!--
249
+ ## Model Card Contact
250
+
251
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
252
+ -->
config.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "_name_or_path": "/root/.cache/torch/sentence_transformers/jinaai_jina-embeddings-v2-small-en/",
3
+ "architectures": [
4
+ "JinaBertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.0,
7
+ "attn_implementation": null,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_bert.JinaBertConfig",
10
+ "AutoModel": "modeling_bert.JinaBertModel",
11
+ "AutoModelForMaskedLM": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForMaskedLM",
12
+ "AutoModelForSequenceClassification": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForSequenceClassification"
13
+ },
14
+ "classifier_dropout": null,
15
+ "emb_pooler": "mean",
16
+ "feed_forward_type": "geglu",
17
+ "gradient_checkpointing": false,
18
+ "hidden_act": "gelu",
19
+ "hidden_dropout_prob": 0.1,
20
+ "hidden_size": 512,
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 2048,
23
+ "layer_norm_eps": 1e-12,
24
+ "max_position_embeddings": 8192,
25
+ "model_max_length": 8192,
26
+ "model_type": "bert",
27
+ "num_attention_heads": 8,
28
+ "num_hidden_layers": 4,
29
+ "pad_token_id": 0,
30
+ "position_embedding_type": "alibi",
31
+ "torch_dtype": "float32",
32
+ "transformers_version": "4.36.2",
33
+ "type_vocab_size": 2,
34
+ "use_cache": true,
35
+ "vocab_size": 30528
36
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.2.2",
4
+ "transformers": "4.31.0",
5
+ "pytorch": "2.0.1"
6
+ }
7
+ }
config_setfit.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "normalize_embeddings": false,
3
+ "labels": [
4
+ "Aims",
5
+ "Background",
6
+ "Hypothesis",
7
+ "Implications",
8
+ "Importance",
9
+ "Keywords",
10
+ "Limitations",
11
+ "Method",
12
+ "None",
13
+ "Purpose",
14
+ "Reccomendations",
15
+ "Result",
16
+ "Uncertainty"
17
+ ]
18
+ }
configuration_bert.py ADDED
@@ -0,0 +1,168 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ # Copyright (c) 2023 Jina AI GmbH. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """ BERT model configuration"""
18
+ from collections import OrderedDict
19
+ from typing import Mapping
20
+
21
+ from transformers.configuration_utils import PretrainedConfig
22
+ from transformers.onnx import OnnxConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+
29
+ class JinaBertConfig(PretrainedConfig):
30
+ r"""
31
+ This is the configuration class to store the configuration of a [`JinaBertModel`]. It is used to
32
+ instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a
33
+ configuration with the defaults will yield a similar configuration to that of the BERT
34
+ [bert-base-uncased](https://huggingface.co/bert-base-uncased) architecture.
35
+
36
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
37
+ documentation from [`PretrainedConfig`] for more information.
38
+
39
+
40
+ Args:
41
+ vocab_size (`int`, *optional*, defaults to 30522):
42
+ Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
43
+ `inputs_ids` passed when calling [`BertModel`] or [`TFBertModel`].
44
+ hidden_size (`int`, *optional*, defaults to 768):
45
+ Dimensionality of the encoder layers and the pooler layer.
46
+ num_hidden_layers (`int`, *optional*, defaults to 12):
47
+ Number of hidden layers in the Transformer encoder.
48
+ num_attention_heads (`int`, *optional*, defaults to 12):
49
+ Number of attention heads for each attention layer in the Transformer encoder.
50
+ intermediate_size (`int`, *optional*, defaults to 3072):
51
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
52
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
53
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
54
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
55
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
56
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
57
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
58
+ The dropout ratio for the attention probabilities.
59
+ max_position_embeddings (`int`, *optional*, defaults to 512):
60
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
61
+ just in case (e.g., 512 or 1024 or 2048).
62
+ type_vocab_size (`int`, *optional*, defaults to 2):
63
+ The vocabulary size of the `token_type_ids` passed when calling [`BertModel`] or [`TFBertModel`].
64
+ initializer_range (`float`, *optional*, defaults to 0.02):
65
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
66
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
67
+ The epsilon used by the layer normalization layers.
68
+ position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
69
+ Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
70
+ positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
71
+ [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
72
+ For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
73
+ with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
74
+ is_decoder (`bool`, *optional*, defaults to `False`):
75
+ Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
76
+ use_cache (`bool`, *optional*, defaults to `True`):
77
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
78
+ relevant if `config.is_decoder=True`.
79
+ classifier_dropout (`float`, *optional*):
80
+ The dropout ratio for the classification head.
81
+ feed_forward_type (`str`, *optional*, defaults to `"original"`):
82
+ The type of feed forward layer to use in the bert layers.
83
+ Can be one of GLU variants, e.g. `"reglu"`, `"geglu"`
84
+ emb_pooler (`str`, *optional*, defaults to `None`):
85
+ The function to use for pooling the last layer embeddings to get the sentence embeddings.
86
+ Should be one of `None`, `"mean"`.
87
+ attn_implementation (`str`, *optional*, defaults to `"torch"`):
88
+ The implementation of the self-attention layer. Can be one of:
89
+ - `None` for the original implementation,
90
+ - `torch` for the PyTorch SDPA implementation,
91
+
92
+ Examples:
93
+
94
+ ```python
95
+ >>> from transformers import JinaBertConfig, JinaBertModel
96
+
97
+ >>> # Initializing a JinaBert configuration
98
+ >>> configuration = JinaBertConfig()
99
+
100
+ >>> # Initializing a model (with random weights) from the configuration
101
+ >>> model = JinaBertModel(configuration)
102
+
103
+ >>> # Accessing the model configuration
104
+ >>> configuration = model.config
105
+
106
+ >>> # Encode text inputs
107
+ >>> embeddings = model.encode(text_inputs)
108
+ ```"""
109
+ model_type = "bert"
110
+
111
+ def __init__(
112
+ self,
113
+ vocab_size=30522,
114
+ hidden_size=768,
115
+ num_hidden_layers=12,
116
+ num_attention_heads=12,
117
+ intermediate_size=3072,
118
+ hidden_act="gelu",
119
+ hidden_dropout_prob=0.1,
120
+ attention_probs_dropout_prob=0.1,
121
+ max_position_embeddings=512,
122
+ type_vocab_size=2,
123
+ initializer_range=0.02,
124
+ layer_norm_eps=1e-12,
125
+ pad_token_id=0,
126
+ position_embedding_type="absolute",
127
+ use_cache=True,
128
+ classifier_dropout=None,
129
+ feed_forward_type="original",
130
+ emb_pooler=None,
131
+ attn_implementation='torch',
132
+ **kwargs,
133
+ ):
134
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
135
+
136
+ self.vocab_size = vocab_size
137
+ self.hidden_size = hidden_size
138
+ self.num_hidden_layers = num_hidden_layers
139
+ self.num_attention_heads = num_attention_heads
140
+ self.hidden_act = hidden_act
141
+ self.intermediate_size = intermediate_size
142
+ self.hidden_dropout_prob = hidden_dropout_prob
143
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
144
+ self.max_position_embeddings = max_position_embeddings
145
+ self.type_vocab_size = type_vocab_size
146
+ self.initializer_range = initializer_range
147
+ self.layer_norm_eps = layer_norm_eps
148
+ self.position_embedding_type = position_embedding_type
149
+ self.use_cache = use_cache
150
+ self.classifier_dropout = classifier_dropout
151
+ self.feed_forward_type = feed_forward_type
152
+ self.emb_pooler = emb_pooler
153
+ self.attn_implementation = attn_implementation
154
+
155
+ class JinaBertOnnxConfig(OnnxConfig):
156
+ @property
157
+ def inputs(self) -> Mapping[str, Mapping[int, str]]:
158
+ if self.task == "multiple-choice":
159
+ dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
160
+ else:
161
+ dynamic_axis = {0: "batch", 1: "sequence"}
162
+ return OrderedDict(
163
+ [
164
+ ("input_ids", dynamic_axis),
165
+ ("attention_mask", dynamic_axis),
166
+ ("token_type_ids", dynamic_axis),
167
+ ]
168
+ )
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0c1529f97f7d63f60cb1caf5043049c5be4b244a452b7596283781b007c81a7b
3
+ size 130769960
model_head.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f033529ac4cb485d155c9a7f4466798a188186c7c1c599327842835c47c9c7a3
3
+ size 54959
modeling_bert.py ADDED
@@ -0,0 +1,2355 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ # Copyright (c) 2023 Jina AI GmbH. All rights reserved.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """PyTorch BERT model."""
18
+
19
+
20
+ import math
21
+ import os
22
+ import warnings
23
+ from dataclasses import dataclass
24
+ from typing import List, Optional, Tuple, Union
25
+ import numpy as np
26
+
27
+ import torch
28
+ import torch.utils.checkpoint
29
+ from torch import nn
30
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
31
+
32
+ from transformers.activations import ACT2FN
33
+ from transformers.modeling_outputs import (
34
+ BaseModelOutputWithPastAndCrossAttentions,
35
+ BaseModelOutputWithPoolingAndCrossAttentions,
36
+ CausalLMOutputWithCrossAttentions,
37
+ MaskedLMOutput,
38
+ MultipleChoiceModelOutput,
39
+ NextSentencePredictorOutput,
40
+ QuestionAnsweringModelOutput,
41
+ SequenceClassifierOutput,
42
+ TokenClassifierOutput,
43
+ )
44
+ from transformers.modeling_utils import PreTrainedModel
45
+ from transformers.pytorch_utils import (
46
+ apply_chunking_to_forward,
47
+ find_pruneable_heads_and_indices,
48
+ prune_linear_layer,
49
+ )
50
+ from transformers.utils import (
51
+ ModelOutput,
52
+ add_code_sample_docstrings,
53
+ add_start_docstrings,
54
+ add_start_docstrings_to_model_forward,
55
+ logging,
56
+ replace_return_docstrings,
57
+ )
58
+ from .configuration_bert import JinaBertConfig
59
+
60
+ # Torch implementation
61
+ try:
62
+ from torch.nn.functional import scaled_dot_product_attention
63
+ except ImportError:
64
+ scaled_dot_product_attention = None
65
+
66
+ # This is used by encode but user may not have it installed
67
+ try:
68
+ from tqdm.autonotebook import trange
69
+
70
+ has_tqdm = True
71
+ except ImportError:
72
+ has_tqdm = False
73
+
74
+ logger = logging.get_logger(__name__)
75
+
76
+ _CHECKPOINT_FOR_DOC = "bert-base-uncased"
77
+ _CONFIG_FOR_DOC = "JinaBertConfig"
78
+
79
+ # TokenClassification docstring
80
+ _CHECKPOINT_FOR_TOKEN_CLASSIFICATION = (
81
+ "dbmdz/bert-large-cased-finetuned-conll03-english"
82
+ )
83
+ _TOKEN_CLASS_EXPECTED_OUTPUT = "['O', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'I-LOC', 'O', 'I-LOC', 'I-LOC'] "
84
+ _TOKEN_CLASS_EXPECTED_LOSS = 0.01
85
+
86
+ # QuestionAnswering docstring
87
+ _CHECKPOINT_FOR_QA = "deepset/bert-base-cased-squad2"
88
+ _QA_EXPECTED_OUTPUT = "'a nice puppet'"
89
+ _QA_EXPECTED_LOSS = 7.41
90
+ _QA_TARGET_START_INDEX = 14
91
+ _QA_TARGET_END_INDEX = 15
92
+
93
+ # SequenceClassification docstring
94
+ _CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION = "textattack/bert-base-uncased-yelp-polarity"
95
+ _SEQ_CLASS_EXPECTED_OUTPUT = "'LABEL_1'"
96
+ _SEQ_CLASS_EXPECTED_LOSS = 0.01
97
+
98
+
99
+ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
100
+ """Load tf checkpoints in a pytorch model."""
101
+ try:
102
+ import re
103
+
104
+ import numpy as np
105
+ import tensorflow as tf
106
+ except ImportError:
107
+ logger.error(
108
+ "Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
109
+ "https://www.tensorflow.org/install/ for installation instructions."
110
+ )
111
+ raise
112
+ tf_path = os.path.abspath(tf_checkpoint_path)
113
+ logger.info(f"Converting TensorFlow checkpoint from {tf_path}")
114
+ # Load weights from TF model
115
+ init_vars = tf.train.list_variables(tf_path)
116
+ names = []
117
+ arrays = []
118
+ for name, shape in init_vars:
119
+ logger.info(f"Loading TF weight {name} with shape {shape}")
120
+ array = tf.train.load_variable(tf_path, name)
121
+ names.append(name)
122
+ arrays.append(array)
123
+
124
+ for name, array in zip(names, arrays):
125
+ name = name.split("/")
126
+ # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v
127
+ # which are not required for using pretrained model
128
+ if any(
129
+ n
130
+ in [
131
+ "adam_v",
132
+ "adam_m",
133
+ "AdamWeightDecayOptimizer",
134
+ "AdamWeightDecayOptimizer_1",
135
+ "global_step",
136
+ ]
137
+ for n in name
138
+ ):
139
+ logger.info(f"Skipping {'/'.join(name)}")
140
+ continue
141
+ pointer = model
142
+ for m_name in name:
143
+ if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
144
+ scope_names = re.split(r"_(\d+)", m_name)
145
+ else:
146
+ scope_names = [m_name]
147
+ if scope_names[0] == "kernel" or scope_names[0] == "gamma":
148
+ pointer = getattr(pointer, "weight")
149
+ elif scope_names[0] == "output_bias" or scope_names[0] == "beta":
150
+ pointer = getattr(pointer, "bias")
151
+ elif scope_names[0] == "output_weights":
152
+ pointer = getattr(pointer, "weight")
153
+ elif scope_names[0] == "squad":
154
+ pointer = getattr(pointer, "classifier")
155
+ else:
156
+ try:
157
+ pointer = getattr(pointer, scope_names[0])
158
+ except AttributeError:
159
+ logger.info(f"Skipping {'/'.join(name)}")
160
+ continue
161
+ if len(scope_names) >= 2:
162
+ num = int(scope_names[1])
163
+ pointer = pointer[num]
164
+ if m_name[-11:] == "_embeddings":
165
+ pointer = getattr(pointer, "weight")
166
+ elif m_name == "kernel":
167
+ array = np.transpose(array)
168
+ try:
169
+ if pointer.shape != array.shape:
170
+ raise ValueError(
171
+ f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
172
+ )
173
+ except ValueError as e:
174
+ e.args += (pointer.shape, array.shape)
175
+ raise
176
+ logger.info(f"Initialize PyTorch weight {name}")
177
+ pointer.data = torch.from_numpy(array)
178
+ return model
179
+
180
+
181
+ class JinaBertEmbeddings(nn.Module):
182
+ """Construct the embeddings from word, position and token_type embeddings."""
183
+
184
+ def __init__(self, config: JinaBertConfig):
185
+ super().__init__()
186
+ self.word_embeddings = nn.Embedding(
187
+ config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
188
+ )
189
+ if config.position_embedding_type != "alibi":
190
+ self.position_embeddings = nn.Embedding(
191
+ config.max_position_embeddings, config.hidden_size
192
+ )
193
+ self.token_type_embeddings = nn.Embedding(
194
+ config.type_vocab_size, config.hidden_size
195
+ )
196
+
197
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
198
+ # any TensorFlow checkpoint file
199
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
200
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
201
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
202
+ self.position_embedding_type = getattr(
203
+ config, "position_embedding_type", "absolute"
204
+ )
205
+ self.register_buffer(
206
+ "position_ids",
207
+ torch.arange(config.max_position_embeddings).expand((1, -1)),
208
+ persistent=False,
209
+ )
210
+ self.register_buffer(
211
+ "token_type_ids",
212
+ torch.zeros(self.position_ids.size(), dtype=torch.long),
213
+ persistent=False,
214
+ )
215
+
216
+ def forward(
217
+ self,
218
+ input_ids: Optional[torch.LongTensor] = None,
219
+ token_type_ids: Optional[torch.LongTensor] = None,
220
+ position_ids: Optional[torch.LongTensor] = None,
221
+ inputs_embeds: Optional[torch.FloatTensor] = None,
222
+ past_key_values_length: int = 0,
223
+ ) -> torch.Tensor:
224
+ if input_ids is not None:
225
+ input_shape = input_ids.size()
226
+ else:
227
+ input_shape = inputs_embeds.size()[:-1]
228
+
229
+ seq_length = input_shape[1]
230
+
231
+ if position_ids is None:
232
+ position_ids = self.position_ids[
233
+ :, past_key_values_length : seq_length + past_key_values_length
234
+ ]
235
+
236
+ # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
237
+ # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
238
+ # issue #5664
239
+ if token_type_ids is None:
240
+ if hasattr(self, "token_type_ids"):
241
+ buffered_token_type_ids = self.token_type_ids[:, :seq_length]
242
+ buffered_token_type_ids_expanded = buffered_token_type_ids.expand(
243
+ input_shape[0], seq_length
244
+ )
245
+ token_type_ids = buffered_token_type_ids_expanded
246
+ else:
247
+ token_type_ids = torch.zeros(
248
+ input_shape, dtype=torch.long, device=self.position_ids.device
249
+ )
250
+
251
+ if inputs_embeds is None:
252
+ inputs_embeds = self.word_embeddings(input_ids)
253
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
254
+
255
+ embeddings = inputs_embeds + token_type_embeddings
256
+ if self.position_embedding_type == "absolute":
257
+ position_embeddings = self.position_embeddings(position_ids)
258
+ embeddings += position_embeddings
259
+ embeddings = self.LayerNorm(embeddings)
260
+ embeddings = self.dropout(embeddings)
261
+ return embeddings
262
+
263
+
264
+ class JinaBertSelfAttention(nn.Module):
265
+ def __init__(self, config: JinaBertConfig, position_embedding_type=None):
266
+ super().__init__()
267
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(
268
+ config, "embedding_size"
269
+ ):
270
+ raise ValueError(
271
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
272
+ f"heads ({config.num_attention_heads})"
273
+ )
274
+
275
+ self.attn_implementation = config.attn_implementation
276
+ self.num_attention_heads = config.num_attention_heads
277
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
278
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
279
+
280
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
281
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
282
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
283
+
284
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
285
+ self.position_embedding_type = position_embedding_type or getattr(
286
+ config, "position_embedding_type", "absolute"
287
+ )
288
+ if (
289
+ self.position_embedding_type == "relative_key"
290
+ or self.position_embedding_type == "relative_key_query"
291
+ ):
292
+ self.max_position_embeddings = config.max_position_embeddings
293
+ self.distance_embedding = nn.Embedding(
294
+ 2 * config.max_position_embeddings - 1, self.attention_head_size
295
+ )
296
+
297
+ self.is_decoder = config.is_decoder
298
+
299
+ def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
300
+ new_x_shape = x.size()[:-1] + (
301
+ self.num_attention_heads,
302
+ self.attention_head_size,
303
+ )
304
+ x = x.view(new_x_shape)
305
+ return x.permute(0, 2, 1, 3)
306
+
307
+ def forward(
308
+ self,
309
+ hidden_states: torch.Tensor,
310
+ attention_mask: Optional[torch.FloatTensor] = None,
311
+ head_mask: Optional[torch.FloatTensor] = None,
312
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
313
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
314
+ past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
315
+ output_attentions: Optional[bool] = False,
316
+ bias: Optional[torch.FloatTensor] = None,
317
+ ) -> Tuple[torch.Tensor]:
318
+ mixed_query_layer = self.query(hidden_states)
319
+
320
+ # If this is instantiated as a cross-attention module, the keys
321
+ # and values come from an encoder; the attention mask needs to be
322
+ # such that the encoder's padding tokens are not attended to.
323
+ is_cross_attention = encoder_hidden_states is not None
324
+
325
+ if is_cross_attention and past_key_value is not None:
326
+ # reuse k,v, cross_attentions
327
+ key_layer = past_key_value[0]
328
+ value_layer = past_key_value[1]
329
+ attention_mask = encoder_attention_mask
330
+ elif is_cross_attention:
331
+ key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
332
+ value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
333
+ attention_mask = encoder_attention_mask
334
+ elif past_key_value is not None:
335
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
336
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
337
+ key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
338
+ value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
339
+ else:
340
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
341
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
342
+
343
+ query_layer = self.transpose_for_scores(mixed_query_layer)
344
+
345
+ use_cache = past_key_value is not None
346
+ if self.is_decoder:
347
+ # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
348
+ # Further calls to cross_attention layer can then reuse all cross-attention
349
+ # key/value_states (first "if" case)
350
+ # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
351
+ # all previous decoder key/value_states. Further calls to uni-directional self-attention
352
+ # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
353
+ # if encoder bi-directional self-attention `past_key_value` is always `None`
354
+ past_key_value = (key_layer, value_layer)
355
+
356
+ if self.attn_implementation == 'torch' and scaled_dot_product_attention is not None:
357
+ b, _, s, _ = query_layer.shape
358
+ new_bias = attention_mask + bias
359
+ attn = scaled_dot_product_attention(query_layer, key_layer, value_layer, new_bias)
360
+ attn = attn.permute(0, 2, 1, 3).contiguous()
361
+ return (attn.view(b, s, self.all_head_size),)
362
+
363
+ # Take the dot product between "query" and "key" to get the raw attention scores.
364
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
365
+
366
+ if (
367
+ self.position_embedding_type == "relative_key"
368
+ or self.position_embedding_type == "relative_key_query"
369
+ ):
370
+ query_length, key_length = query_layer.shape[2], key_layer.shape[2]
371
+ if use_cache:
372
+ position_ids_l = torch.tensor(
373
+ key_length - 1, dtype=torch.long, device=hidden_states.device
374
+ ).view(-1, 1)
375
+ else:
376
+ position_ids_l = torch.arange(
377
+ query_length, dtype=torch.long, device=hidden_states.device
378
+ ).view(-1, 1)
379
+ position_ids_r = torch.arange(
380
+ key_length, dtype=torch.long, device=hidden_states.device
381
+ ).view(1, -1)
382
+ distance = position_ids_l - position_ids_r
383
+
384
+ positional_embedding = self.distance_embedding(
385
+ distance + self.max_position_embeddings - 1
386
+ )
387
+ positional_embedding = positional_embedding.to(
388
+ dtype=query_layer.dtype
389
+ ) # fp16 compatibility
390
+
391
+ if self.position_embedding_type == "relative_key":
392
+ relative_position_scores = torch.einsum(
393
+ "bhld,lrd->bhlr", query_layer, positional_embedding
394
+ )
395
+ attention_scores = attention_scores + relative_position_scores
396
+ elif self.position_embedding_type == "relative_key_query":
397
+ relative_position_scores_query = torch.einsum(
398
+ "bhld,lrd->bhlr", query_layer, positional_embedding
399
+ )
400
+ relative_position_scores_key = torch.einsum(
401
+ "bhrd,lrd->bhlr", key_layer, positional_embedding
402
+ )
403
+ attention_scores = (
404
+ attention_scores
405
+ + relative_position_scores_query
406
+ + relative_position_scores_key
407
+ )
408
+
409
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
410
+ if attention_mask is not None:
411
+ # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
412
+ attention_scores = attention_scores + attention_mask
413
+
414
+ # Normalize the attention scores to probabilities.
415
+ attention_probs = nn.functional.softmax(attention_scores + bias, dim=-1)
416
+
417
+ # This is actually dropping out entire tokens to attend to, which might
418
+ # seem a bit unusual, but is taken from the original Transformer paper.
419
+ attention_probs = self.dropout(attention_probs)
420
+
421
+ # Mask heads if we want to
422
+ if head_mask is not None:
423
+ attention_probs = attention_probs * head_mask
424
+
425
+ context_layer = torch.matmul(attention_probs, value_layer)
426
+
427
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
428
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
429
+ context_layer = context_layer.view(new_context_layer_shape)
430
+
431
+ outputs = (
432
+ (context_layer, attention_probs) if output_attentions else (context_layer,)
433
+ )
434
+
435
+ if self.is_decoder:
436
+ outputs = outputs + (past_key_value,)
437
+ return outputs
438
+
439
+
440
+ class JinaBertSelfOutput(nn.Module):
441
+ def __init__(self, config):
442
+ super().__init__()
443
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
444
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
445
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
446
+
447
+ def forward(
448
+ self, hidden_states: torch.Tensor, input_tensor: torch.Tensor
449
+ ) -> torch.Tensor:
450
+ hidden_states = self.dense(hidden_states)
451
+ hidden_states = self.dropout(hidden_states)
452
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
453
+ return hidden_states
454
+
455
+
456
+ class JinaBertAttention(nn.Module):
457
+ def __init__(self, config, position_embedding_type=None):
458
+ super().__init__()
459
+ self.self = JinaBertSelfAttention(
460
+ config, position_embedding_type=position_embedding_type
461
+ )
462
+ self.output = JinaBertSelfOutput(config)
463
+ self.pruned_heads = set()
464
+
465
+ def prune_heads(self, heads):
466
+ if len(heads) == 0:
467
+ return
468
+ heads, index = find_pruneable_heads_and_indices(
469
+ heads,
470
+ self.self.num_attention_heads,
471
+ self.self.attention_head_size,
472
+ self.pruned_heads,
473
+ )
474
+
475
+ # Prune linear layers
476
+ self.self.query = prune_linear_layer(self.self.query, index)
477
+ self.self.key = prune_linear_layer(self.self.key, index)
478
+ self.self.value = prune_linear_layer(self.self.value, index)
479
+ self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
480
+
481
+ # Update hyper params and store pruned heads
482
+ self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
483
+ self.self.all_head_size = (
484
+ self.self.attention_head_size * self.self.num_attention_heads
485
+ )
486
+ self.pruned_heads = self.pruned_heads.union(heads)
487
+
488
+ def forward(
489
+ self,
490
+ hidden_states: torch.Tensor,
491
+ attention_mask: Optional[torch.FloatTensor] = None,
492
+ head_mask: Optional[torch.FloatTensor] = None,
493
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
494
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
495
+ past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
496
+ output_attentions: Optional[bool] = False,
497
+ bias: Optional[torch.FloatTensor] = None,
498
+ ) -> Tuple[torch.Tensor]:
499
+ self_outputs = self.self(
500
+ hidden_states,
501
+ attention_mask,
502
+ head_mask,
503
+ encoder_hidden_states,
504
+ encoder_attention_mask,
505
+ past_key_value,
506
+ output_attentions,
507
+ bias,
508
+ )
509
+ attention_output = self.output(self_outputs[0], hidden_states)
510
+ outputs = (attention_output,) + self_outputs[
511
+ 1:
512
+ ] # add attentions if we output them
513
+ return outputs
514
+
515
+
516
+ class JinaBertIntermediate(nn.Module):
517
+ def __init__(self, config):
518
+ super().__init__()
519
+ self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
520
+ if isinstance(config.hidden_act, str):
521
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
522
+ else:
523
+ self.intermediate_act_fn = config.hidden_act
524
+
525
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
526
+ hidden_states = self.dense(hidden_states)
527
+ hidden_states = self.intermediate_act_fn(hidden_states)
528
+ return hidden_states
529
+
530
+
531
+ class JinaBertOutput(nn.Module):
532
+ def __init__(self, config: JinaBertConfig):
533
+ super().__init__()
534
+ self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
535
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
536
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
537
+
538
+ def forward(
539
+ self, hidden_states: torch.Tensor, input_tensor: torch.Tensor
540
+ ) -> torch.Tensor:
541
+ hidden_states = self.dense(hidden_states)
542
+ hidden_states = self.dropout(hidden_states)
543
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
544
+ return hidden_states
545
+
546
+
547
+ class JinaBertGLUMLP(nn.Module):
548
+ def __init__(self, config: JinaBertConfig):
549
+ super().__init__()
550
+ self.config = config
551
+ self.gated_layers = nn.Linear(
552
+ config.hidden_size, config.intermediate_size * 2, bias=False
553
+ )
554
+ if config.feed_forward_type == 'reglu':
555
+ self.act = nn.ReLU()
556
+ elif config.feed_forward_type == 'geglu':
557
+ self.act = nn.GELU()
558
+ else:
559
+ raise ValueError(
560
+ f"feed_forward_type {config.feed_forward_type} not supported"
561
+ )
562
+ self.wo = nn.Linear(config.intermediate_size, config.hidden_size)
563
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
564
+ self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
565
+
566
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
567
+ residual_connection = hidden_states
568
+ # compute the activation
569
+ hidden_states = self.gated_layers(hidden_states)
570
+ gated = hidden_states[:, :, : self.config.intermediate_size]
571
+ non_gated = hidden_states[:, :, self.config.intermediate_size :]
572
+ hidden_states = self.act(gated) * non_gated
573
+ hidden_states = self.dropout(hidden_states)
574
+ # multiply by the second matrix
575
+ hidden_states = self.wo(hidden_states)
576
+ # add the residual connection and post-LN
577
+ hidden_states = self.layernorm(hidden_states + residual_connection)
578
+ return hidden_states
579
+
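+ # Shape sketch for the GLU MLP above (illustrative; hidden_size=512 and
+ # intermediate_size=2048 are example values, not taken from the actual config):
+ #   x: (batch, seq, 512)
+ #   gated_layers(x): (batch, seq, 4096) -> split into gated | non_gated, 2048 each
+ #   act(gated) * non_gated: (batch, seq, 2048)
+ #   wo(...): (batch, seq, 512), then dropout, residual add and post-LayerNorm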
580
+
581
+ class JinaBertLayer(nn.Module):
582
+ def __init__(self, config: JinaBertConfig):
583
+ super().__init__()
584
+ self.chunk_size_feed_forward = config.chunk_size_feed_forward
585
+ self.seq_len_dim = 1
586
+ self.attention = JinaBertAttention(config)
587
+ self.is_decoder = config.is_decoder
588
+ self.add_cross_attention = config.add_cross_attention
589
+ self.feed_forward_type = config.feed_forward_type
590
+ if self.add_cross_attention:
591
+ if not self.is_decoder:
592
+ raise ValueError(
593
+ f"{self} should be used as a decoder model if cross attention is added"
594
+ )
595
+ self.crossattention = JinaBertAttention(
596
+ config, position_embedding_type="absolute"
597
+ )
598
+ if self.feed_forward_type.endswith('glu'):
599
+ self.mlp = JinaBertGLUMLP(config)
600
+ else:
601
+ self.intermediate = JinaBertIntermediate(config)
602
+ self.output = JinaBertOutput(config)
603
+
604
+ def forward(
605
+ self,
606
+ hidden_states: torch.Tensor,
607
+ attention_mask: Optional[torch.FloatTensor] = None,
608
+ head_mask: Optional[torch.FloatTensor] = None,
609
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
610
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
611
+ bias: Optional[torch.FloatTensor] = None,
612
+ past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
613
+ output_attentions: Optional[bool] = False,
614
+ ) -> Tuple[torch.Tensor]:
615
+ # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
616
+ self_attn_past_key_value = (
617
+ past_key_value[:2] if past_key_value is not None else None
618
+ )
619
+ self_attention_outputs = self.attention(
620
+ hidden_states,
621
+ attention_mask,
622
+ head_mask,
623
+ output_attentions=output_attentions,
624
+ past_key_value=self_attn_past_key_value,
625
+ bias=bias,
626
+ )
627
+ attention_output = self_attention_outputs[0]
628
+
629
+ # if decoder, the last output is tuple of self-attn cache
630
+ if self.is_decoder:
631
+ outputs = self_attention_outputs[1:-1]
632
+ present_key_value = self_attention_outputs[-1]
633
+ else:
634
+ outputs = self_attention_outputs[
635
+ 1:
636
+ ] # add self attentions if we output attention weights
637
+
638
+ cross_attn_present_key_value = None
639
+ if self.is_decoder and encoder_hidden_states is not None:
640
+ if not hasattr(self, "crossattention"):
641
+ raise ValueError(
642
+ f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
643
+ " by setting `config.add_cross_attention=True`"
644
+ )
645
+
646
+ # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
647
+ cross_attn_past_key_value = (
648
+ past_key_value[-2:] if past_key_value is not None else None
649
+ )
650
+ cross_attention_outputs = self.crossattention(
651
+ attention_output,
652
+ attention_mask,
653
+ head_mask,
654
+ encoder_hidden_states,
655
+ encoder_attention_mask,
656
+ cross_attn_past_key_value,
657
+ output_attentions,
658
+ )
659
+ attention_output = cross_attention_outputs[0]
660
+ outputs = (
661
+ outputs + cross_attention_outputs[1:-1]
662
+ ) # add cross attentions if we output attention weights
663
+
664
+ # add cross-attn cache to positions 3,4 of present_key_value tuple
665
+ cross_attn_present_key_value = cross_attention_outputs[-1]
666
+ present_key_value = present_key_value + cross_attn_present_key_value
667
+
668
+ if self.feed_forward_type.endswith('glu'):
669
+ layer_output = self.mlp(attention_output)
670
+ else:
671
+ layer_output = apply_chunking_to_forward(
672
+ self.feed_forward_chunk,
673
+ self.chunk_size_feed_forward,
674
+ self.seq_len_dim,
675
+ attention_output,
676
+ )
677
+ outputs = (layer_output,) + outputs
678
+
679
+ # if decoder, return the attn key/values as the last output
680
+ if self.is_decoder:
681
+ outputs = outputs + (present_key_value,)
682
+
683
+ return outputs
684
+
685
+ def feed_forward_chunk(self, attention_output):
686
+ intermediate_output = self.intermediate(attention_output)
687
+ layer_output = self.output(intermediate_output, attention_output)
688
+ return layer_output
689
+
690
+
691
+ class JinaBertEncoder(nn.Module):
692
+ def __init__(self, config: JinaBertConfig):
693
+ super().__init__()
694
+ self.config = config
695
+ self.layer = nn.ModuleList(
696
+ [JinaBertLayer(config) for _ in range(config.num_hidden_layers)]
697
+ )
698
+ self.gradient_checkpointing = False
699
+ self.num_attention_heads = config.num_attention_heads
700
+ self.register_buffer(
701
+ "alibi",
702
+ self.rebuild_alibi_tensor(size=config.max_position_embeddings),
703
+ persistent=False,
704
+ )
705
+
706
+ def rebuild_alibi_tensor(
707
+ self, size: int, device: Optional[Union[torch.device, str]] = None
708
+ ):
709
+ # Alibi
710
+ # Following https://github.com/ofirpress/attention_with_linear_biases/issues/5 (Implementation 1)
711
+ # In the causal case, you can exploit the fact that softmax is invariant to a uniform translation
712
+ # of the logits, which makes the math work out *after* applying causal masking. If no causal masking
713
+ # will be applied, it is necessary to construct the diagonal mask.
714
+ n_heads = self.num_attention_heads
715
+
716
+ def _get_alibi_head_slopes(n_heads: int) -> List[float]:
717
+ def get_slopes_power_of_2(n):
718
+ start = 2 ** (-(2 ** -(math.log2(n) - 3)))
719
+ ratio = start
720
+ return [start * ratio**i for i in range(n)]
721
+
722
+ if math.log2(n_heads).is_integer():
723
+ return get_slopes_power_of_2(
724
+ n_heads
725
+ ) # In the paper, we only train models that have 2^a heads for some a. This function has
726
+ else: # some good properties that only occur when the input is a power of 2. To maintain that even
727
+ closest_power_of_2 = 2 ** math.floor(
728
+ math.log2(n_heads)
729
+ ) # when the number of heads is not a power of 2, we use this workaround.
730
+ return (
731
+ get_slopes_power_of_2(closest_power_of_2)
732
+ + _get_alibi_head_slopes(2 * closest_power_of_2)[0::2][
733
+ : n_heads - closest_power_of_2
734
+ ]
735
+ )
736
+
737
+ context_position = torch.arange(size, device=device)[:, None]
738
+ memory_position = torch.arange(size, device=device)[None, :]
739
+ relative_position = torch.abs(memory_position - context_position)
740
+ # [n_heads, max_token_length, max_token_length]
741
+ relative_position = relative_position.unsqueeze(0).expand(n_heads, -1, -1)
742
+ slopes = torch.Tensor(_get_alibi_head_slopes(n_heads)).to(device) * -1
743
+ alibi = slopes.unsqueeze(1).unsqueeze(1) * relative_position
744
+ # [1, n_heads, max_token_length, max_token_length]
745
+ alibi = alibi.unsqueeze(0)
746
+ assert alibi.shape == torch.Size([1, n_heads, size, size])
747
+
748
+ self._current_alibi_size = size
749
+ return alibi
750
+
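+ # Worked example for the slope schedule above (illustrative): for n_heads = 8,
+ # get_slopes_power_of_2(8) starts at 2 ** -(2 ** -(log2(8) - 3)) = 2 ** -1 and
+ # keeps multiplying by the same ratio, giving slopes [2**-1, 2**-2, ..., 2**-8].
+ # Head h then receives a bias of -slope_h * |i - j| between positions i and j.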
751
+ def forward(
752
+ self,
753
+ hidden_states: torch.Tensor,
754
+ attention_mask: Optional[torch.FloatTensor] = None,
755
+ head_mask: Optional[torch.FloatTensor] = None,
756
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
757
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
758
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
759
+ use_cache: Optional[bool] = None,
760
+ output_attentions: Optional[bool] = False,
761
+ output_hidden_states: Optional[bool] = False,
762
+ return_dict: Optional[bool] = True,
763
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]:
764
+ all_hidden_states = () if output_hidden_states else None
765
+ all_self_attentions = () if output_attentions else None
766
+ all_cross_attentions = (
767
+ () if output_attentions and self.config.add_cross_attention else None
768
+ )
769
+
770
+ # Add alibi matrix to extended_attention_mask
771
+ _, seqlen, _ = hidden_states.size()
772
+ if self._current_alibi_size < seqlen:
773
+ # Rebuild the alibi tensor when needed
774
+ warnings.warn(
775
+ f'Increasing alibi size from {self._current_alibi_size} to {seqlen}.'
776
+ )
777
+ self.register_buffer(
778
+ "alibi",
779
+ self.rebuild_alibi_tensor(size=seqlen, device=hidden_states.device).to(
780
+ hidden_states.dtype
781
+ ),
782
+ persistent=False,
783
+ )
784
+ elif self.alibi.device != hidden_states.device:
785
+ # Device catch-up
786
+ self.alibi = self.alibi.to(hidden_states.device)
787
+
788
+ alibi_bias = self.alibi[:, :, :seqlen, :seqlen]
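+ # Shape sketch (illustrative): with seqlen = 128 and 8 heads, alibi_bias is
+ # (1, 8, 128, 128); it is handed to every layer as the `bias` argument, where
+ # the ALiBi bias is added to the attention scores.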
789
+ if self.gradient_checkpointing and self.training:
790
+ if use_cache:
791
+ logger.warning_once(
792
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
793
+ )
794
+ use_cache = False
795
+
796
+ next_decoder_cache = () if use_cache else None
797
+ for i, layer_module in enumerate(self.layer):
798
+ if output_hidden_states:
799
+ all_hidden_states = all_hidden_states + (hidden_states,)
800
+
801
+ layer_head_mask = head_mask[i] if head_mask is not None else None
802
+ past_key_value = past_key_values[i] if past_key_values is not None else None
803
+
804
+ if self.gradient_checkpointing and self.training:
805
+
806
+ def create_custom_forward(module):
807
+ def custom_forward(*inputs):
808
+ return module(*inputs, past_key_value, output_attentions)
809
+
810
+ return custom_forward
811
+
812
+ layer_outputs = torch.utils.checkpoint.checkpoint(
813
+ create_custom_forward(layer_module),
814
+ hidden_states,
815
+ attention_mask,
816
+ layer_head_mask,
817
+ encoder_hidden_states,
818
+ encoder_attention_mask,
819
+ alibi_bias,
820
+ )
821
+ else:
822
+ layer_outputs = layer_module(
823
+ hidden_states,
824
+ attention_mask,
825
+ layer_head_mask,
826
+ encoder_hidden_states,
827
+ encoder_attention_mask,
828
+ alibi_bias,
829
+ past_key_value,
830
+ output_attentions,
831
+ )
832
+
833
+ hidden_states = layer_outputs[0]
834
+ if use_cache:
835
+ next_decoder_cache += (layer_outputs[-1],)
836
+ if output_attentions:
837
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
838
+ if self.config.add_cross_attention:
839
+ all_cross_attentions = all_cross_attentions + (layer_outputs[2],)
840
+
841
+ if output_hidden_states:
842
+ all_hidden_states = all_hidden_states + (hidden_states,)
843
+
844
+ if not return_dict:
845
+ return tuple(
846
+ v
847
+ for v in [
848
+ hidden_states,
849
+ next_decoder_cache,
850
+ all_hidden_states,
851
+ all_self_attentions,
852
+ all_cross_attentions,
853
+ ]
854
+ if v is not None
855
+ )
856
+ return BaseModelOutputWithPastAndCrossAttentions(
857
+ last_hidden_state=hidden_states,
858
+ past_key_values=next_decoder_cache,
859
+ hidden_states=all_hidden_states,
860
+ attentions=all_self_attentions,
861
+ cross_attentions=all_cross_attentions,
862
+ )
863
+
864
+
865
+ class JinaBertPooler(nn.Module):
866
+ def __init__(self, config):
867
+ super().__init__()
868
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
869
+ self.activation = nn.Tanh()
870
+
871
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
872
+ # We "pool" the model by simply taking the hidden state corresponding
873
+ # to the first token.
874
+ first_token_tensor = hidden_states[:, 0]
875
+ pooled_output = self.dense(first_token_tensor)
876
+ pooled_output = self.activation(pooled_output)
877
+ return pooled_output
878
+
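+ # Pooling sketch (illustrative): hidden_states (batch, seq, hidden) ->
+ # hidden_states[:, 0] (batch, hidden) -> dense + tanh -> pooled_output
+ # (batch, hidden). Note that `encode` below uses `mean_pooling` rather than
+ # this first-token pooler.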
879
+
880
+ class JinaBertPredictionHeadTransform(nn.Module):
881
+ def __init__(self, config):
882
+ super().__init__()
883
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
884
+ if isinstance(config.hidden_act, str):
885
+ self.transform_act_fn = ACT2FN[config.hidden_act]
886
+ else:
887
+ self.transform_act_fn = config.hidden_act
888
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
889
+
890
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
891
+ hidden_states = self.dense(hidden_states)
892
+ hidden_states = self.transform_act_fn(hidden_states)
893
+ hidden_states = self.LayerNorm(hidden_states)
894
+ return hidden_states
895
+
896
+
897
+ class JinaBertLMPredictionHead(nn.Module):
898
+ def __init__(self, config):
899
+ super().__init__()
900
+ self.transform = JinaBertPredictionHeadTransform(config)
901
+
902
+ # The output weights are the same as the input embeddings, but there is
903
+ # an output-only bias for each token.
904
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
905
+
906
+ self.bias = nn.Parameter(torch.zeros(config.vocab_size))
907
+
908
+ # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
909
+ self.decoder.bias = self.bias
910
+
911
+ def forward(self, hidden_states):
912
+ hidden_states = self.transform(hidden_states)
913
+ hidden_states = self.decoder(hidden_states)
914
+ return hidden_states
915
+
916
+
917
+ class JinaBertOnlyMLMHead(nn.Module):
918
+ def __init__(self, config):
919
+ super().__init__()
920
+ self.predictions = JinaBertLMPredictionHead(config)
921
+
922
+ def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
923
+ prediction_scores = self.predictions(sequence_output)
924
+ return prediction_scores
925
+
926
+
927
+ class JinaBertOnlyNSPHead(nn.Module):
928
+ def __init__(self, config):
929
+ super().__init__()
930
+ self.seq_relationship = nn.Linear(config.hidden_size, 2)
931
+
932
+ def forward(self, pooled_output):
933
+ seq_relationship_score = self.seq_relationship(pooled_output)
934
+ return seq_relationship_score
935
+
936
+
937
+ class JinaBertPreTrainingHeads(nn.Module):
938
+ def __init__(self, config):
939
+ super().__init__()
940
+ self.predictions = JinaBertLMPredictionHead(config)
941
+ self.seq_relationship = nn.Linear(config.hidden_size, 2)
942
+
943
+ def forward(self, sequence_output, pooled_output):
944
+ prediction_scores = self.predictions(sequence_output)
945
+ seq_relationship_score = self.seq_relationship(pooled_output)
946
+ return prediction_scores, seq_relationship_score
947
+
948
+
949
+ class JinaBertPreTrainedModel(PreTrainedModel):
950
+ """
951
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
952
+ models.
953
+ """
954
+
955
+ config_class = JinaBertConfig
956
+ load_tf_weights = load_tf_weights_in_bert
957
+ base_model_prefix = "bert"
958
+ supports_gradient_checkpointing = True
959
+ _no_split_modules = ["JinaBertLayer"]
960
+
961
+ def _init_weights(self, module):
962
+ """Initialize the weights"""
963
+ if isinstance(module, nn.Linear):
964
+ # Slightly different from the TF version which uses truncated_normal for initialization
965
+ # cf https://github.com/pytorch/pytorch/pull/5617
966
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
967
+ if module.bias is not None:
968
+ module.bias.data.zero_()
969
+ elif isinstance(module, nn.Embedding):
970
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
971
+ if module.padding_idx is not None:
972
+ module.weight.data[module.padding_idx].zero_()
973
+ elif isinstance(module, nn.LayerNorm):
974
+ module.bias.data.zero_()
975
+ module.weight.data.fill_(1.0)
976
+
977
+ def _set_gradient_checkpointing(self, module, value=False):
978
+ if isinstance(module, JinaBertEncoder):
979
+ module.gradient_checkpointing = value
980
+
981
+
982
+ @dataclass
983
+ class JinaBertForPreTrainingOutput(ModelOutput):
984
+ """
985
+ Output type of [`BertForPreTraining`].
986
+
987
+ Args:
988
+ loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
989
+ Total loss as the sum of the masked language modeling loss and the next sequence prediction
990
+ (classification) loss.
991
+ prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
992
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
993
+ seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
994
+ Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
995
+ before SoftMax).
996
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
997
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
998
+ shape `(batch_size, sequence_length, hidden_size)`.
999
+
1000
+ Hidden-states of the model at the output of each layer plus the initial embedding outputs.
1001
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
1002
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
1003
+ sequence_length)`.
1004
+
1005
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
1006
+ heads.
1007
+ """
1008
+
1009
+ loss: Optional[torch.FloatTensor] = None
1010
+ prediction_logits: torch.FloatTensor = None
1011
+ seq_relationship_logits: torch.FloatTensor = None
1012
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
1013
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
1014
+
1015
+
1016
+ BERT_START_DOCSTRING = r"""
1017
+
1018
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1019
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
1020
+ etc.)
1021
+
1022
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1023
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
1024
+ and behavior.
1025
+
1026
+ Parameters:
1027
+ config ([`BertConfig`]): Model configuration class with all the parameters of the model.
1028
+ Initializing with a config file does not load the weights associated with the model, only the
1029
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1030
+ """
1031
+
1032
+ BERT_INPUTS_DOCSTRING = r"""
1033
+ Args:
1034
+ input_ids (`torch.LongTensor` of shape `({0})`):
1035
+ Indices of input sequence tokens in the vocabulary.
1036
+
1037
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1038
+ [`PreTrainedTokenizer.__call__`] for details.
1039
+
1040
+ [What are input IDs?](../glossary#input-ids)
1041
+ attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
1042
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1043
+
1044
+ - 1 for tokens that are **not masked**,
1045
+ - 0 for tokens that are **masked**.
1046
+
1047
+ [What are attention masks?](../glossary#attention-mask)
1048
+ token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
1049
+ Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1050
+ 1]`:
1051
+
1052
+ - 0 corresponds to a *sentence A* token,
1053
+ - 1 corresponds to a *sentence B* token.
1054
+
1055
+ [What are token type IDs?](../glossary#token-type-ids)
1056
+ position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
1057
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
1058
+ config.max_position_embeddings - 1]`.
1059
+
1060
+ [What are position IDs?](../glossary#position-ids)
1061
+ head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
1062
+ Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
1063
+
1064
+ - 1 indicates the head is **not masked**,
1065
+ - 0 indicates the head is **masked**.
1066
+
1067
+ inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
1068
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1069
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1070
+ model's internal embedding lookup matrix.
1071
+ output_attentions (`bool`, *optional*):
1072
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1073
+ tensors for more detail.
1074
+ output_hidden_states (`bool`, *optional*):
1075
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1076
+ more detail.
1077
+ return_dict (`bool`, *optional*):
1078
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1079
+ """
1080
+
1081
+
1082
+ @add_start_docstrings(
1083
+ "The bare Bert Model transformer outputting raw hidden-states without any specific head on top.",
1084
+ BERT_START_DOCSTRING,
1085
+ )
1086
+ class JinaBertModel(JinaBertPreTrainedModel):
1087
+ """
1088
+
1089
+ The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
1090
+ cross-attention is added between the self-attention layers, following the architecture described in [Attention is
1091
+ all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
1092
+ Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
1093
+
1094
+ To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
1095
+ to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and
1096
+ `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
1097
+ """
1098
+
1099
+ def __init__(self, config: JinaBertConfig, add_pooling_layer=True):
1100
+ super().__init__(config)
1101
+ self.config = config
1102
+
1103
+ self.emb_pooler = config.emb_pooler
1104
+ self._name_or_path = config._name_or_path
1105
+ if self.emb_pooler:
1106
+ from transformers import AutoTokenizer
1107
+
1108
+ self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path)
1109
+
1110
+ self.embeddings = JinaBertEmbeddings(config)
1111
+ self.encoder = JinaBertEncoder(config)
1112
+
1113
+ self.pooler = JinaBertPooler(config) if add_pooling_layer else None
1114
+
1115
+ # Initialize weights and apply final processing
1116
+ self.post_init()
1117
+
1118
+ @torch.inference_mode()
1119
+ def encode(
1120
+ self: 'JinaBertModel',
1121
+ sentences: Union[str, List[str]],
1122
+ batch_size: int = 32,
1123
+ show_progress_bar: Optional[bool] = None,
1124
+ output_value: str = 'sentence_embedding',
1125
+ convert_to_numpy: bool = True,
1126
+ convert_to_tensor: bool = False,
1127
+ device: Optional[torch.device] = None,
1128
+ normalize_embeddings: bool = False,
1129
+ **tokenizer_kwargs,
1130
+ ) -> Union[List[torch.Tensor], np.ndarray, torch.Tensor]:
1131
+ """
1132
+ Computes sentence embeddings
1133
+
1134
+ Args:
1135
+ sentences(`str` or `List[str]`):
1136
+ Sentence or sentences to be encoded
1137
+ batch_size(`int`, *optional*, defaults to 32):
1138
+ Batch size for the computation
1139
+ show_progress_bar(`bool`, *optional*, defaults to None):
1140
+ Show a progress bar when encoding sentences.
1141
+ If set to None, progress bar is only shown when `logger.level == logging.INFO` or `logger.level == logging.DEBUG`.
1142
+ output_value(`str`, *optional*, defaults to 'sentence_embedding'):
1143
+ Default sentence_embedding, to get sentence embeddings.
1144
+ Can be set to token_embeddings to get wordpiece token embeddings.
1145
+ Set to None to get all output values.
1146
+ convert_to_numpy(`bool`, *optional*, defaults to True):
1147
+ If true, the output is a list of numpy vectors.
1148
+ Else, it is a list of pytorch tensors.
1149
+ convert_to_tensor(`bool`, *optional*, defaults to False):
1150
+ If true, a single stacked tensor is returned.
1151
+ Overrides any setting from convert_to_numpy
1152
+ device(`torch.device`, *optional*, defaults to None):
1153
+ Which torch.device to use for the computation
1154
+ normalize_embeddings(`bool`, *optional*, defaults to False):
1155
+ If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) can be used instead of cosine similarity.
1156
+ tokenizer_kwargs(`Dict[str, Any]`, *optional*, defaults to {}):
1157
+ Keyword arguments for the tokenizer
1158
+
1159
+ Returns:
1160
+ By default, a list of tensors is returned.
1161
+ If convert_to_tensor, a stacked tensor is returned.
1162
+ If convert_to_numpy, a numpy matrix is returned.
1163
+ """
1164
+ if not self.emb_pooler:
1165
+ warnings.warn("No emb_pooler specified, defaulting to mean pooling.")
1166
+ self.emb_pooler = 'mean'
1167
+ from transformers import AutoTokenizer
1168
+
1169
+ self.tokenizer = AutoTokenizer.from_pretrained(self._name_or_path)
1170
+ is_training = self.training
1171
+ self.eval()
1172
+
1173
+ if show_progress_bar is None:
1174
+ show_progress_bar = (
1175
+ logger.getEffectiveLevel() == logging.INFO
1176
+ or logger.getEffectiveLevel() == logging.DEBUG
1177
+ )
1178
+
1179
+ if convert_to_tensor:
1180
+ convert_to_numpy = False
1181
+
1182
+ if output_value != 'sentence_embedding':
1183
+ convert_to_tensor = False
1184
+ convert_to_numpy = False
1185
+
1186
+ input_was_string = False
1187
+ if isinstance(sentences, str) or not hasattr(sentences, '__len__'):
1188
+ sentences = [sentences]
1189
+ input_was_string = True
1190
+
1191
+ if device is not None:
1192
+ self.to(device)
1193
+
1194
+ # TODO: Maybe use better length heuristic?
1195
+ permutation = np.argsort([-len(i) for i in sentences])
1196
+ inverse_permutation = np.argsort(permutation)
1197
+ sentences = [sentences[idx] for idx in permutation]
1198
+
1199
+ tokenizer_kwargs['padding'] = tokenizer_kwargs.get('padding', True)
1200
+ tokenizer_kwargs['max_length'] = tokenizer_kwargs.get('max_length', 8192)
1201
+ tokenizer_kwargs['truncation'] = tokenizer_kwargs.get('truncation', True)
1202
+
1203
+ all_embeddings = []
1204
+
1205
+ if has_tqdm:
1206
+ range_iter = trange(
1207
+ 0,
1208
+ len(sentences),
1209
+ batch_size,
1210
+ desc="Encoding",
1211
+ disable=not show_progress_bar,
1212
+ )
1213
+ else:
1214
+ range_iter = range(0, len(sentences), batch_size)
1215
+
1216
+ for i in range_iter:
1217
+ encoded_input = self.tokenizer(
1218
+ sentences[i : i + batch_size],
1219
+ return_tensors='pt',
1220
+ **tokenizer_kwargs,
1221
+ ).to(self.device)
1222
+ token_embs = self.forward(**encoded_input)[0]
1223
+
1224
+ # Accumulate in fp32 to avoid overflow
1225
+ token_embs = token_embs.float()
1226
+
1227
+ if output_value == 'token_embeddings':
1228
+ raise NotImplementedError
1229
+ elif output_value is None:
1230
+ raise NotImplementedError
1231
+ else:
1232
+ embeddings = self.mean_pooling(
1233
+ token_embs, encoded_input['attention_mask']
1234
+ )
1235
+
1236
+ if normalize_embeddings:
1237
+ embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
1238
+
1239
+ if convert_to_numpy:
1240
+ embeddings = embeddings.cpu()
1241
+ all_embeddings.extend(embeddings)
1242
+
1243
+ all_embeddings = [all_embeddings[idx] for idx in inverse_permutation]
1244
+
1245
+ if convert_to_tensor:
1246
+ all_embeddings = torch.stack(all_embeddings)
1247
+ elif convert_to_numpy:
1248
+ all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
1249
+
1250
+ if input_was_string:
1251
+ all_embeddings = all_embeddings[0]
1252
+
1253
+ self.train(is_training)
1254
+ return all_embeddings
1255
+
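+ # Usage sketch for `encode` (illustrative; assumes the model was loaded with
+ # `AutoModel.from_pretrained(..., trust_remote_code=True)` and that
+ # `config.emb_pooler` is set):
+ #   >>> embeddings = model.encode(
+ #   ...     ["first sentence", "second sentence"],
+ #   ...     batch_size=2,
+ #   ...     convert_to_numpy=True,
+ #   ...     normalize_embeddings=True,
+ #   ... )
+ #   >>> embeddings.shape   # (2, hidden_size)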
1256
+ def mean_pooling(
1257
+ self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor
1258
+ ):
1259
+ input_mask_expanded = (
1260
+ attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
1261
+ )
1262
+ return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
1263
+ input_mask_expanded.sum(1), min=1e-9
1264
+ )
1265
+
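+ # Mean-pooling sketch (illustrative): for token_embeddings of shape
+ # (batch, seq_len, hidden) and attention_mask [[1, 1, 0]], only the first two
+ # token vectors contribute; their sum is divided by clamp(mask.sum(), 1e-9) = 2,
+ # so padding positions are excluded from the sentence embedding.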
1266
+ def get_input_embeddings(self):
1267
+ return self.embeddings.word_embeddings
1268
+
1269
+ def set_input_embeddings(self, value):
1270
+ self.embeddings.word_embeddings = value
1271
+
1272
+ def _prune_heads(self, heads_to_prune):
1273
+ """
1274
+ Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
1275
+ class PreTrainedModel
1276
+ """
1277
+ for layer, heads in heads_to_prune.items():
1278
+ self.encoder.layer[layer].attention.prune_heads(heads)
1279
+
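+ # Illustrative call (not in the upstream code): `PreTrainedModel.prune_heads`
+ # forwards to this method, so something like
+ #   model.prune_heads({0: [0, 1], 5: [2]})
+ # would drop heads 0 and 1 of layer 0 and head 2 of layer 5.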
1280
+ @add_start_docstrings_to_model_forward(
1281
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1282
+ )
1283
+ @add_code_sample_docstrings(
1284
+ checkpoint=_CHECKPOINT_FOR_DOC,
1285
+ output_type=BaseModelOutputWithPoolingAndCrossAttentions,
1286
+ config_class=_CONFIG_FOR_DOC,
1287
+ )
1288
+ def forward(
1289
+ self,
1290
+ input_ids: Optional[torch.Tensor] = None,
1291
+ attention_mask: Optional[torch.Tensor] = None,
1292
+ token_type_ids: Optional[torch.Tensor] = None,
1293
+ position_ids: Optional[torch.Tensor] = None,
1294
+ head_mask: Optional[torch.Tensor] = None,
1295
+ inputs_embeds: Optional[torch.Tensor] = None,
1296
+ encoder_hidden_states: Optional[torch.Tensor] = None,
1297
+ encoder_attention_mask: Optional[torch.Tensor] = None,
1298
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1299
+ use_cache: Optional[bool] = None,
1300
+ output_attentions: Optional[bool] = None,
1301
+ output_hidden_states: Optional[bool] = None,
1302
+ return_dict: Optional[bool] = None,
1303
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
1304
+ r"""
1305
+ encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1306
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
1307
+ the model is configured as a decoder.
1308
+ encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
1309
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
1310
+ the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:
1311
+
1312
+ - 1 for tokens that are **not masked**,
1313
+ - 0 for tokens that are **masked**.
1314
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
1315
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
1316
+
1317
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
1318
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
1319
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
1320
+ use_cache (`bool`, *optional*):
1321
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1322
+ `past_key_values`).
1323
+ """
1324
+ output_attentions = (
1325
+ output_attentions
1326
+ if output_attentions is not None
1327
+ else self.config.output_attentions
1328
+ )
1329
+ output_hidden_states = (
1330
+ output_hidden_states
1331
+ if output_hidden_states is not None
1332
+ else self.config.output_hidden_states
1333
+ )
1334
+ return_dict = (
1335
+ return_dict if return_dict is not None else self.config.use_return_dict
1336
+ )
1337
+
1338
+ if self.config.is_decoder:
1339
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1340
+ else:
1341
+ use_cache = False
1342
+
1343
+ if input_ids is not None and inputs_embeds is not None:
1344
+ raise ValueError(
1345
+ "You cannot specify both input_ids and inputs_embeds at the same time"
1346
+ )
1347
+ elif input_ids is not None:
1348
+ # self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
1349
+ input_shape = input_ids.size()
1350
+ elif inputs_embeds is not None:
1351
+ input_shape = inputs_embeds.size()[:-1]
1352
+ else:
1353
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1354
+
1355
+ batch_size, seq_length = input_shape
1356
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1357
+
1358
+ # past_key_values_length
1359
+ past_key_values_length = (
1360
+ past_key_values[0][0].shape[2] if past_key_values is not None else 0
1361
+ )
1362
+
1363
+ if attention_mask is None:
1364
+ attention_mask = torch.ones(
1365
+ ((batch_size, seq_length + past_key_values_length)), device=device
1366
+ )
1367
+
1368
+ if token_type_ids is None:
1369
+ if hasattr(self.embeddings, "token_type_ids"):
1370
+ buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
1371
+ buffered_token_type_ids_expanded = buffered_token_type_ids.expand(
1372
+ batch_size, seq_length
1373
+ )
1374
+ token_type_ids = buffered_token_type_ids_expanded
1375
+ else:
1376
+ token_type_ids = torch.zeros(
1377
+ input_shape, dtype=torch.long, device=device
1378
+ )
1379
+
1380
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
1381
+ # ourselves in which case we just need to make it broadcastable to all heads.
1382
+ extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
1383
+ attention_mask, input_shape
1384
+ )
1385
+
1386
+ # If a 2D or 3D attention mask is provided for the cross-attention
1387
+ # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
1388
+ if self.config.is_decoder and encoder_hidden_states is not None:
1389
+ (
1390
+ encoder_batch_size,
1391
+ encoder_sequence_length,
1392
+ _,
1393
+ ) = encoder_hidden_states.size()
1394
+ encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
1395
+ if encoder_attention_mask is None:
1396
+ encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
1397
+ encoder_extended_attention_mask = self.invert_attention_mask(
1398
+ encoder_attention_mask
1399
+ )
1400
+ else:
1401
+ encoder_extended_attention_mask = None
1402
+
1403
+ # Prepare head mask if needed
1404
+ # 1.0 in head_mask indicate we keep the head
1405
+ # attention_probs has shape bsz x n_heads x N x N
1406
+ # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
1407
+ # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
1408
+ head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
1409
+
1410
+ embedding_output = self.embeddings(
1411
+ input_ids=input_ids,
1412
+ position_ids=position_ids,
1413
+ token_type_ids=token_type_ids,
1414
+ inputs_embeds=inputs_embeds,
1415
+ past_key_values_length=past_key_values_length,
1416
+ )
1417
+ encoder_outputs = self.encoder(
1418
+ embedding_output,
1419
+ attention_mask=extended_attention_mask,
1420
+ head_mask=head_mask,
1421
+ encoder_hidden_states=encoder_hidden_states,
1422
+ encoder_attention_mask=encoder_extended_attention_mask,
1423
+ past_key_values=past_key_values,
1424
+ use_cache=use_cache,
1425
+ output_attentions=output_attentions,
1426
+ output_hidden_states=output_hidden_states,
1427
+ return_dict=return_dict,
1428
+ )
1429
+ sequence_output = encoder_outputs[0]
1430
+ pooled_output = (
1431
+ self.pooler(sequence_output) if self.pooler is not None else None
1432
+ )
1433
+
1434
+ if not return_dict:
1435
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
1436
+
1437
+ return BaseModelOutputWithPoolingAndCrossAttentions(
1438
+ last_hidden_state=sequence_output,
1439
+ pooler_output=pooled_output,
1440
+ past_key_values=encoder_outputs.past_key_values,
1441
+ hidden_states=encoder_outputs.hidden_states,
1442
+ attentions=encoder_outputs.attentions,
1443
+ cross_attentions=encoder_outputs.cross_attentions,
1444
+ )
1445
+
1446
+
1447
+ @add_start_docstrings(
1448
+ """
1449
+ Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next
1450
+ sentence prediction (classification)` head.
1451
+ """,
1452
+ BERT_START_DOCSTRING,
1453
+ )
1454
+ class JinaBertForPreTraining(JinaBertPreTrainedModel):
1455
+ _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
1456
+
1457
+ def __init__(self, config):
1458
+ super().__init__(config)
1459
+
1460
+ self.bert = JinaBertModel(config)
1461
+ self.cls = JinaBertPreTrainingHeads(config)
1462
+
1463
+ # Initialize weights and apply final processing
1464
+ self.post_init()
1465
+
1466
+ def get_output_embeddings(self):
1467
+ return self.cls.predictions.decoder
1468
+
1469
+ def set_output_embeddings(self, new_embeddings):
1470
+ self.cls.predictions.decoder = new_embeddings
1471
+
1472
+ @add_start_docstrings_to_model_forward(
1473
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1474
+ )
1475
+ @replace_return_docstrings(
1476
+ output_type=JinaBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC
1477
+ )
1478
+ def forward(
1479
+ self,
1480
+ input_ids: Optional[torch.Tensor] = None,
1481
+ attention_mask: Optional[torch.Tensor] = None,
1482
+ token_type_ids: Optional[torch.Tensor] = None,
1483
+ position_ids: Optional[torch.Tensor] = None,
1484
+ head_mask: Optional[torch.Tensor] = None,
1485
+ inputs_embeds: Optional[torch.Tensor] = None,
1486
+ labels: Optional[torch.Tensor] = None,
1487
+ next_sentence_label: Optional[torch.Tensor] = None,
1488
+ output_attentions: Optional[bool] = None,
1489
+ output_hidden_states: Optional[bool] = None,
1490
+ return_dict: Optional[bool] = None,
1491
+ ) -> Union[Tuple[torch.Tensor], JinaBertForPreTrainingOutput]:
1492
+ r"""
1493
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1494
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1495
+ config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked),
1496
+ the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1497
+ next_sentence_label (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1498
+ Labels for computing the next sequence prediction (classification) loss. Input should be a sequence
1499
+ pair (see `input_ids` docstring). Indices should be in `[0, 1]`:
1500
+
1501
+ - 0 indicates sequence B is a continuation of sequence A,
1502
+ - 1 indicates sequence B is a random sequence.
1503
+ kwargs (`Dict[str, any]`, optional, defaults to *{}*):
1504
+ Used to hide legacy arguments that have been deprecated.
1505
+
1506
+ Returns:
1507
+ """
1508
+ return_dict = (
1509
+ return_dict if return_dict is not None else self.config.use_return_dict
1510
+ )
1511
+
1512
+ outputs = self.bert(
1513
+ input_ids,
1514
+ attention_mask=attention_mask,
1515
+ token_type_ids=token_type_ids,
1516
+ position_ids=position_ids,
1517
+ head_mask=head_mask,
1518
+ inputs_embeds=inputs_embeds,
1519
+ output_attentions=output_attentions,
1520
+ output_hidden_states=output_hidden_states,
1521
+ return_dict=return_dict,
1522
+ )
1523
+
1524
+ sequence_output, pooled_output = outputs[:2]
1525
+ prediction_scores, seq_relationship_score = self.cls(
1526
+ sequence_output, pooled_output
1527
+ )
1528
+
1529
+ total_loss = None
1530
+ if labels is not None and next_sentence_label is not None:
1531
+ loss_fct = CrossEntropyLoss()
1532
+ masked_lm_loss = loss_fct(
1533
+ prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)
1534
+ )
1535
+ next_sentence_loss = loss_fct(
1536
+ seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)
1537
+ )
1538
+ total_loss = masked_lm_loss + next_sentence_loss
1539
+
1540
+ if not return_dict:
1541
+ output = (prediction_scores, seq_relationship_score) + outputs[2:]
1542
+ return ((total_loss,) + output) if total_loss is not None else output
1543
+
1544
+ return JinaBertForPreTrainingOutput(
1545
+ loss=total_loss,
1546
+ prediction_logits=prediction_scores,
1547
+ seq_relationship_logits=seq_relationship_score,
1548
+ hidden_states=outputs.hidden_states,
1549
+ attentions=outputs.attentions,
1550
+ )
1551
+
1552
+
1553
+ @add_start_docstrings(
1554
+ """JinaBert Model with a `language modeling` head on top for CLM fine-tuning.""",
1555
+ BERT_START_DOCSTRING,
1556
+ )
1557
+ class JinaBertLMHeadModel(JinaBertPreTrainedModel):
1558
+ _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
1559
+
1560
+ def __init__(self, config):
1561
+ super().__init__(config)
1562
+
1563
+ if not config.is_decoder:
1564
+ logger.warning(
1565
+ "If you want to use `JinaBertLMHeadModel` as a standalone, add `is_decoder=True.`"
1566
+ )
1567
+
1568
+ self.bert = JinaBertModel(config, add_pooling_layer=False)
1569
+ self.cls = JinaBertOnlyMLMHead(config)
1570
+
1571
+ # Initialize weights and apply final processing
1572
+ self.post_init()
1573
+
1574
+ def get_output_embeddings(self):
1575
+ return self.cls.predictions.decoder
1576
+
1577
+ def set_output_embeddings(self, new_embeddings):
1578
+ self.cls.predictions.decoder = new_embeddings
1579
+
1580
+ @add_start_docstrings_to_model_forward(
1581
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1582
+ )
1583
+ @add_code_sample_docstrings(
1584
+ checkpoint=_CHECKPOINT_FOR_DOC,
1585
+ output_type=CausalLMOutputWithCrossAttentions,
1586
+ config_class=_CONFIG_FOR_DOC,
1587
+ )
1588
+ def forward(
1589
+ self,
1590
+ input_ids: Optional[torch.Tensor] = None,
1591
+ attention_mask: Optional[torch.Tensor] = None,
1592
+ token_type_ids: Optional[torch.Tensor] = None,
1593
+ position_ids: Optional[torch.Tensor] = None,
1594
+ head_mask: Optional[torch.Tensor] = None,
1595
+ inputs_embeds: Optional[torch.Tensor] = None,
1596
+ encoder_hidden_states: Optional[torch.Tensor] = None,
1597
+ encoder_attention_mask: Optional[torch.Tensor] = None,
1598
+ labels: Optional[torch.Tensor] = None,
1599
+ past_key_values: Optional[List[torch.Tensor]] = None,
1600
+ use_cache: Optional[bool] = None,
1601
+ output_attentions: Optional[bool] = None,
1602
+ output_hidden_states: Optional[bool] = None,
1603
+ return_dict: Optional[bool] = None,
1604
+ ) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
1605
+ r"""
1606
+ encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1607
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
1608
+ the model is configured as a decoder.
1609
+ encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
1610
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
1611
+ the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:
1612
+
1613
+ - 1 for tokens that are **not masked**,
1614
+ - 0 for tokens that are **masked**.
1615
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1616
+ Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
1617
+ `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are
1618
+ ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1619
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
1620
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
1621
+
1622
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
1623
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
1624
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
1625
+ use_cache (`bool`, *optional*):
1626
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1627
+ `past_key_values`).
1628
+ """
1629
+ return_dict = (
1630
+ return_dict if return_dict is not None else self.config.use_return_dict
1631
+ )
1632
+ if labels is not None:
1633
+ use_cache = False
1634
+
1635
+ outputs = self.bert(
1636
+ input_ids,
1637
+ attention_mask=attention_mask,
1638
+ token_type_ids=token_type_ids,
1639
+ position_ids=position_ids,
1640
+ head_mask=head_mask,
1641
+ inputs_embeds=inputs_embeds,
1642
+ encoder_hidden_states=encoder_hidden_states,
1643
+ encoder_attention_mask=encoder_attention_mask,
1644
+ past_key_values=past_key_values,
1645
+ use_cache=use_cache,
1646
+ output_attentions=output_attentions,
1647
+ output_hidden_states=output_hidden_states,
1648
+ return_dict=return_dict,
1649
+ )
1650
+
1651
+ sequence_output = outputs[0]
1652
+ prediction_scores = self.cls(sequence_output)
1653
+
1654
+ lm_loss = None
1655
+ if labels is not None:
1656
+ # we are doing next-token prediction; shift prediction scores and input ids by one
1657
+ shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
1658
+ labels = labels[:, 1:].contiguous()
1659
+ loss_fct = CrossEntropyLoss()
1660
+ lm_loss = loss_fct(
1661
+ shifted_prediction_scores.view(-1, self.config.vocab_size),
1662
+ labels.view(-1),
1663
+ )
1664
+
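+ # Shift sketch (illustrative): for labels [t0, t1, t2, t3], the scores at
+ # positions 0..2 are matched against labels t1..t3, i.e. each position is
+ # trained to predict the next token; the final position has no target and is
+ # dropped by the [:, :-1, :] / [:, 1:] slicing above.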
1665
+ if not return_dict:
1666
+ output = (prediction_scores,) + outputs[2:]
1667
+ return ((lm_loss,) + output) if lm_loss is not None else output
1668
+
1669
+ return CausalLMOutputWithCrossAttentions(
1670
+ loss=lm_loss,
1671
+ logits=prediction_scores,
1672
+ past_key_values=outputs.past_key_values,
1673
+ hidden_states=outputs.hidden_states,
1674
+ attentions=outputs.attentions,
1675
+ cross_attentions=outputs.cross_attentions,
1676
+ )
1677
+
1678
+ def prepare_inputs_for_generation(
1679
+ self,
1680
+ input_ids,
1681
+ past_key_values=None,
1682
+ attention_mask=None,
1683
+ use_cache=True,
1684
+ **model_kwargs,
1685
+ ):
1686
+ input_shape = input_ids.shape
1687
+ # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
1688
+ if attention_mask is None:
1689
+ attention_mask = input_ids.new_ones(input_shape)
1690
+
1691
+ # cut decoder_input_ids if past_key_values is used
1692
+ if past_key_values is not None:
1693
+ input_ids = input_ids[:, -1:]
1694
+
1695
+ return {
1696
+ "input_ids": input_ids,
1697
+ "attention_mask": attention_mask,
1698
+ "past_key_values": past_key_values,
1699
+ "use_cache": use_cache,
1700
+ }
1701
+
1702
+ def _reorder_cache(self, past_key_values, beam_idx):
1703
+ reordered_past = ()
1704
+ for layer_past in past_key_values:
1705
+ reordered_past += (
1706
+ tuple(
1707
+ past_state.index_select(0, beam_idx) for past_state in layer_past
1708
+ ),
1709
+ )
1710
+ return reordered_past
1711
+
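+ # Reordering sketch (illustrative): during beam search, `beam_idx` selects the
+ # surviving beams, so each cached key/value tensor (typically of shape
+ # (batch * num_beams, heads, seq, head_dim)) is index_select-ed along dim 0 to
+ # stay aligned with the reordered hypotheses.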
1712
+
1713
+ @add_start_docstrings(
1714
+ """JinaBert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING
1715
+ )
1716
+ class JinaBertForMaskedLM(JinaBertPreTrainedModel):
1717
+ _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
1718
+
1719
+ def __init__(self, config):
1720
+ super().__init__(config)
1721
+
1722
+ if config.is_decoder:
1723
+ logger.warning(
1724
+ "If you want to use `JinaBertForMaskedLM` make sure `config.is_decoder=False` for "
1725
+ "bi-directional self-attention."
1726
+ )
1727
+
1728
+ self.bert = JinaBertModel(config, add_pooling_layer=False)
1729
+ self.cls = JinaBertOnlyMLMHead(config)
1730
+
1731
+ # Initialize weights and apply final processing
1732
+ self.post_init()
1733
+
1734
+ def get_output_embeddings(self):
1735
+ return self.cls.predictions.decoder
1736
+
1737
+ def set_output_embeddings(self, new_embeddings):
1738
+ self.cls.predictions.decoder = new_embeddings
1739
+
1740
+ @add_start_docstrings_to_model_forward(
1741
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1742
+ )
1743
+ @add_code_sample_docstrings(
1744
+ checkpoint=_CHECKPOINT_FOR_DOC,
1745
+ output_type=MaskedLMOutput,
1746
+ config_class=_CONFIG_FOR_DOC,
1747
+ expected_output="'paris'",
1748
+ expected_loss=0.88,
1749
+ )
1750
+ def forward(
1751
+ self,
1752
+ input_ids: Optional[torch.Tensor] = None,
1753
+ attention_mask: Optional[torch.Tensor] = None,
1754
+ token_type_ids: Optional[torch.Tensor] = None,
1755
+ position_ids: Optional[torch.Tensor] = None,
1756
+ head_mask: Optional[torch.Tensor] = None,
1757
+ inputs_embeds: Optional[torch.Tensor] = None,
1758
+ encoder_hidden_states: Optional[torch.Tensor] = None,
1759
+ encoder_attention_mask: Optional[torch.Tensor] = None,
1760
+ labels: Optional[torch.Tensor] = None,
1761
+ output_attentions: Optional[bool] = None,
1762
+ output_hidden_states: Optional[bool] = None,
1763
+ return_dict: Optional[bool] = None,
1764
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
1765
+ r"""
1766
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1767
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1768
+ config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
1769
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1770
+ """
1771
+
1772
+ return_dict = (
1773
+ return_dict if return_dict is not None else self.config.use_return_dict
1774
+ )
1775
+
1776
+ outputs = self.bert(
1777
+ input_ids,
1778
+ attention_mask=attention_mask,
1779
+ token_type_ids=token_type_ids,
1780
+ position_ids=position_ids,
1781
+ head_mask=head_mask,
1782
+ inputs_embeds=inputs_embeds,
1783
+ encoder_hidden_states=encoder_hidden_states,
1784
+ encoder_attention_mask=encoder_attention_mask,
1785
+ output_attentions=output_attentions,
1786
+ output_hidden_states=output_hidden_states,
1787
+ return_dict=return_dict,
1788
+ )
1789
+
1790
+ sequence_output = outputs[0]
1791
+ prediction_scores = self.cls(sequence_output)
1792
+
1793
+ masked_lm_loss = None
1794
+ if labels is not None:
1795
+ loss_fct = CrossEntropyLoss() # -100 index = padding token
1796
+ masked_lm_loss = loss_fct(
1797
+ prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)
1798
+ )
1799
+
1800
+ if not return_dict:
1801
+ output = (prediction_scores,) + outputs[2:]
1802
+ return (
1803
+ ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1804
+ )
1805
+
1806
+ return MaskedLMOutput(
1807
+ loss=masked_lm_loss,
1808
+ logits=prediction_scores,
1809
+ hidden_states=outputs.hidden_states,
1810
+ attentions=outputs.attentions,
1811
+ )
1812
+
1813
+ def prepare_inputs_for_generation(
1814
+ self, input_ids, attention_mask=None, **model_kwargs
1815
+ ):
1816
+ input_shape = input_ids.shape
1817
+ effective_batch_size = input_shape[0]
1818
+
1819
+ # add a dummy token
1820
+ if self.config.pad_token_id is None:
1821
+ raise ValueError("The PAD token should be defined for generation")
1822
+
1823
+ attention_mask = torch.cat(
1824
+ [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))],
1825
+ dim=-1,
1826
+ )
1827
+ dummy_token = torch.full(
1828
+ (effective_batch_size, 1),
1829
+ self.config.pad_token_id,
1830
+ dtype=torch.long,
1831
+ device=input_ids.device,
1832
+ )
1833
+ input_ids = torch.cat([input_ids, dummy_token], dim=1)
1834
+
1835
+ return {"input_ids": input_ids, "attention_mask": attention_mask}
1836
+
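+ # Sketch (illustrative): for generation with an MLM head, a PAD token is
+ # appended to each sequence, e.g. input_ids [[t0, t1, t2]] becomes
+ # [[t0, t1, t2, PAD]] with attention_mask [[1, 1, 1, 0]], and the model is
+ # asked to fill the dummy slot.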
1837
+
1838
+ @add_start_docstrings(
1839
+ """JinaBert Model with a `next sentence prediction (classification)` head on top.""",
1840
+ BERT_START_DOCSTRING,
1841
+ )
1842
+ class JinaBertForNextSentencePrediction(JinaBertPreTrainedModel):
1843
+ def __init__(self, config):
1844
+ super().__init__(config)
1845
+
1846
+ self.bert = JinaBertModel(config)
1847
+ self.cls = JinaBertOnlyNSPHead(config)
1848
+
1849
+ # Initialize weights and apply final processing
1850
+ self.post_init()
1851
+
1852
+ @add_start_docstrings_to_model_forward(
1853
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1854
+ )
1855
+ @replace_return_docstrings(
1856
+ output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC
1857
+ )
1858
+ def forward(
1859
+ self,
1860
+ input_ids: Optional[torch.Tensor] = None,
1861
+ attention_mask: Optional[torch.Tensor] = None,
1862
+ token_type_ids: Optional[torch.Tensor] = None,
1863
+ position_ids: Optional[torch.Tensor] = None,
1864
+ head_mask: Optional[torch.Tensor] = None,
1865
+ inputs_embeds: Optional[torch.Tensor] = None,
1866
+ labels: Optional[torch.Tensor] = None,
1867
+ output_attentions: Optional[bool] = None,
1868
+ output_hidden_states: Optional[bool] = None,
1869
+ return_dict: Optional[bool] = None,
1870
+ **kwargs,
1871
+ ) -> Union[Tuple[torch.Tensor], NextSentencePredictorOutput]:
1872
+ r"""
1873
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1874
+ Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
1875
+ (see `input_ids` docstring). Indices should be in `[0, 1]`:
1876
+
1877
+ - 0 indicates sequence B is a continuation of sequence A,
1878
+ - 1 indicates sequence B is a random sequence.
1879
+
1880
+ Returns:
1881
+ """
1882
+
1883
+ if "next_sentence_label" in kwargs:
1884
+ warnings.warn(
1885
+ "The `next_sentence_label` argument is deprecated and will be removed in a future version, use"
1886
+ " `labels` instead.",
1887
+ FutureWarning,
1888
+ )
1889
+ labels = kwargs.pop("next_sentence_label")
1890
+
1891
+ return_dict = (
1892
+ return_dict if return_dict is not None else self.config.use_return_dict
1893
+ )
1894
+
1895
+ outputs = self.bert(
1896
+ input_ids,
1897
+ attention_mask=attention_mask,
1898
+ token_type_ids=token_type_ids,
1899
+ position_ids=position_ids,
1900
+ head_mask=head_mask,
1901
+ inputs_embeds=inputs_embeds,
1902
+ output_attentions=output_attentions,
1903
+ output_hidden_states=output_hidden_states,
1904
+ return_dict=return_dict,
1905
+ )
1906
+
1907
+ pooled_output = outputs[1]
1908
+
1909
+ seq_relationship_scores = self.cls(pooled_output)
1910
+
1911
+ next_sentence_loss = None
1912
+ if labels is not None:
1913
+ loss_fct = CrossEntropyLoss()
1914
+ next_sentence_loss = loss_fct(
1915
+ seq_relationship_scores.view(-1, 2), labels.view(-1)
1916
+ )
1917
+
1918
+ if not return_dict:
1919
+ output = (seq_relationship_scores,) + outputs[2:]
1920
+ return (
1921
+ ((next_sentence_loss,) + output)
1922
+ if next_sentence_loss is not None
1923
+ else output
1924
+ )
1925
+
1926
+ return NextSentencePredictorOutput(
1927
+ loss=next_sentence_loss,
1928
+ logits=seq_relationship_scores,
1929
+ hidden_states=outputs.hidden_states,
1930
+ attentions=outputs.attentions,
1931
+ )
1932
+
1933
+
1934
+ @add_start_docstrings(
1935
+ """
1936
+ JinaBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled
1937
+ output) e.g. for GLUE tasks.
1938
+ """,
1939
+ BERT_START_DOCSTRING,
1940
+ )
1941
+ class JinaBertForSequenceClassification(JinaBertPreTrainedModel):
1942
+ def __init__(self, config):
1943
+ super().__init__(config)
1944
+ self.num_labels = config.num_labels
1945
+ self.config = config
1946
+
1947
+ self.bert = JinaBertModel(config)
1948
+ classifier_dropout = (
1949
+ config.classifier_dropout
1950
+ if config.classifier_dropout is not None
1951
+ else config.hidden_dropout_prob
1952
+ )
1953
+ self.dropout = nn.Dropout(classifier_dropout)
1954
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1955
+
1956
+ # Initialize weights and apply final processing
1957
+ self.post_init()
1958
+
1959
+ @add_start_docstrings_to_model_forward(
1960
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
1961
+ )
1962
+ @add_code_sample_docstrings(
1963
+ checkpoint=_CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION,
1964
+ output_type=SequenceClassifierOutput,
1965
+ config_class=_CONFIG_FOR_DOC,
1966
+ expected_output=_SEQ_CLASS_EXPECTED_OUTPUT,
1967
+ expected_loss=_SEQ_CLASS_EXPECTED_LOSS,
1968
+ )
1969
+ def forward(
1970
+ self,
1971
+ input_ids: Optional[torch.Tensor] = None,
1972
+ attention_mask: Optional[torch.Tensor] = None,
1973
+ token_type_ids: Optional[torch.Tensor] = None,
1974
+ position_ids: Optional[torch.Tensor] = None,
1975
+ head_mask: Optional[torch.Tensor] = None,
1976
+ inputs_embeds: Optional[torch.Tensor] = None,
1977
+ labels: Optional[torch.Tensor] = None,
1978
+ output_attentions: Optional[bool] = None,
1979
+ output_hidden_states: Optional[bool] = None,
1980
+ return_dict: Optional[bool] = None,
1981
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1982
+ r"""
1983
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1984
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1985
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
1986
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1987
+ """
1988
+ return_dict = (
1989
+ return_dict if return_dict is not None else self.config.use_return_dict
1990
+ )
1991
+
1992
+ outputs = self.bert(
1993
+ input_ids,
1994
+ attention_mask=attention_mask,
1995
+ token_type_ids=token_type_ids,
1996
+ position_ids=position_ids,
1997
+ head_mask=head_mask,
1998
+ inputs_embeds=inputs_embeds,
1999
+ output_attentions=output_attentions,
2000
+ output_hidden_states=output_hidden_states,
2001
+ return_dict=return_dict,
2002
+ )
2003
+
2004
+ pooled_output = outputs[1]
2005
+
2006
+ pooled_output = self.dropout(pooled_output)
2007
+ logits = self.classifier(pooled_output)
2008
+
2009
+ loss = None
2010
+ if labels is not None:
2011
+ if self.config.problem_type is None:
2012
+ if self.num_labels == 1:
2013
+ self.config.problem_type = "regression"
2014
+ elif self.num_labels > 1 and (
2015
+ labels.dtype == torch.long or labels.dtype == torch.int
2016
+ ):
2017
+ self.config.problem_type = "single_label_classification"
2018
+ else:
2019
+ self.config.problem_type = "multi_label_classification"
2020
+
2021
+ if self.config.problem_type == "regression":
2022
+ loss_fct = MSELoss()
2023
+ if self.num_labels == 1:
2024
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
2025
+ else:
2026
+ loss = loss_fct(logits, labels)
2027
+ elif self.config.problem_type == "single_label_classification":
2028
+ loss_fct = CrossEntropyLoss()
2029
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
2030
+ elif self.config.problem_type == "multi_label_classification":
2031
+ loss_fct = BCEWithLogitsLoss()
2032
+ loss = loss_fct(logits, labels)
2033
+ if not return_dict:
2034
+ output = (logits,) + outputs[2:]
2035
+ return ((loss,) + output) if loss is not None else output
2036
+
2037
+ return SequenceClassifierOutput(
2038
+ loss=loss,
2039
+ logits=logits,
2040
+ hidden_states=outputs.hidden_states,
2041
+ attentions=outputs.attentions,
2042
+ )
2043
+
2044
+
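The `problem_type` dispatch above is the only branching logic in this head, so a small standalone sketch may help. It mirrors the same three branches (regression, single-label, multi-label) with stand-in tensors rather than a real JinaBert forward pass:

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

def classification_loss(logits, labels, num_labels, problem_type=None):
    # Mirrors the dispatch in JinaBertForSequenceClassification.forward above.
    if problem_type is None:
        if num_labels == 1:
            problem_type = "regression"
        elif labels.dtype in (torch.long, torch.int):
            problem_type = "single_label_classification"
        else:
            problem_type = "multi_label_classification"
    if problem_type == "regression":
        return MSELoss()(logits.squeeze(), labels.squeeze())
    if problem_type == "single_label_classification":
        return CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))
    return BCEWithLogitsLoss()(logits, labels)

# Hypothetical 2-label batch: integer labels select the cross-entropy branch.
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(classification_loss(logits, labels, num_labels=2))
```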
2045
+ @add_start_docstrings(
2046
+ """
2047
+ JinaBert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
2048
+ softmax) e.g. for RocStories/SWAG tasks.
2049
+ """,
2050
+ BERT_START_DOCSTRING,
2051
+ )
2052
+ class JinaBertForMultipleChoice(JinaBertPreTrainedModel):
2053
+ def __init__(self, config):
2054
+ super().__init__(config)
2055
+
2056
+ self.bert = JinaBertModel(config)
2057
+ classifier_dropout = (
2058
+ config.classifier_dropout
2059
+ if config.classifier_dropout is not None
2060
+ else config.hidden_dropout_prob
2061
+ )
2062
+ self.dropout = nn.Dropout(classifier_dropout)
2063
+ self.classifier = nn.Linear(config.hidden_size, 1)
2064
+
2065
+ # Initialize weights and apply final processing
2066
+ self.post_init()
2067
+
2068
+ @add_start_docstrings_to_model_forward(
2069
+ BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")
2070
+ )
2071
+ @add_code_sample_docstrings(
2072
+ checkpoint=_CHECKPOINT_FOR_DOC,
2073
+ output_type=MultipleChoiceModelOutput,
2074
+ config_class=_CONFIG_FOR_DOC,
2075
+ )
2076
+ def forward(
2077
+ self,
2078
+ input_ids: Optional[torch.Tensor] = None,
2079
+ attention_mask: Optional[torch.Tensor] = None,
2080
+ token_type_ids: Optional[torch.Tensor] = None,
2081
+ position_ids: Optional[torch.Tensor] = None,
2082
+ head_mask: Optional[torch.Tensor] = None,
2083
+ inputs_embeds: Optional[torch.Tensor] = None,
2084
+ labels: Optional[torch.Tensor] = None,
2085
+ output_attentions: Optional[bool] = None,
2086
+ output_hidden_states: Optional[bool] = None,
2087
+ return_dict: Optional[bool] = None,
2088
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
2089
+ r"""
2090
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
2091
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
2092
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
2093
+ `input_ids` above)
2094
+ """
2095
+ return_dict = (
2096
+ return_dict if return_dict is not None else self.config.use_return_dict
2097
+ )
2098
+ num_choices = (
2099
+ input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
2100
+ )
2101
+
2102
+ input_ids = (
2103
+ input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
2104
+ )
2105
+ attention_mask = (
2106
+ attention_mask.view(-1, attention_mask.size(-1))
2107
+ if attention_mask is not None
2108
+ else None
2109
+ )
2110
+ token_type_ids = (
2111
+ token_type_ids.view(-1, token_type_ids.size(-1))
2112
+ if token_type_ids is not None
2113
+ else None
2114
+ )
2115
+ position_ids = (
2116
+ position_ids.view(-1, position_ids.size(-1))
2117
+ if position_ids is not None
2118
+ else None
2119
+ )
2120
+ inputs_embeds = (
2121
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
2122
+ if inputs_embeds is not None
2123
+ else None
2124
+ )
2125
+
2126
+ outputs = self.bert(
2127
+ input_ids,
2128
+ attention_mask=attention_mask,
2129
+ token_type_ids=token_type_ids,
2130
+ position_ids=position_ids,
2131
+ head_mask=head_mask,
2132
+ inputs_embeds=inputs_embeds,
2133
+ output_attentions=output_attentions,
2134
+ output_hidden_states=output_hidden_states,
2135
+ return_dict=return_dict,
2136
+ )
2137
+
2138
+ pooled_output = outputs[1]
2139
+
2140
+ pooled_output = self.dropout(pooled_output)
2141
+ logits = self.classifier(pooled_output)
2142
+ reshaped_logits = logits.view(-1, num_choices)
2143
+
2144
+ loss = None
2145
+ if labels is not None:
2146
+ loss_fct = CrossEntropyLoss()
2147
+ loss = loss_fct(reshaped_logits, labels)
2148
+
2149
+ if not return_dict:
2150
+ output = (reshaped_logits,) + outputs[2:]
2151
+ return ((loss,) + output) if loss is not None else output
2152
+
2153
+ return MultipleChoiceModelOutput(
2154
+ loss=loss,
2155
+ logits=reshaped_logits,
2156
+ hidden_states=outputs.hidden_states,
2157
+ attentions=outputs.attentions,
2158
+ )
2159
+
2160
+
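The multiple-choice head works by folding the choice dimension into the batch before the encoder and reshaping the per-choice scores back afterwards. A small sketch with stand-in tensors (shapes only, no real encoder):

```python
import torch

batch_size, num_choices, seq_len, hidden = 2, 4, 16, 8

# Hypothetical inputs shaped (batch_size, num_choices, sequence_length).
input_ids = torch.randint(0, 1000, (batch_size, num_choices, seq_len))

# As in the forward above, choices are folded into the batch dimension
# before the encoder sees them ...
flat_input_ids = input_ids.view(-1, input_ids.size(-1))        # (8, 16)

# ... and the per-choice scores are reshaped back so each row is one example.
pooled_output = torch.randn(batch_size * num_choices, hidden)  # stand-in pooled output
classifier = torch.nn.Linear(hidden, 1)
reshaped_logits = classifier(pooled_output).view(-1, num_choices)  # (2, 4)
print(reshaped_logits.shape, reshaped_logits.argmax(dim=-1))
```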
2161
+ @add_start_docstrings(
2162
+ """
2163
+ JinaBert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
2164
+ Named-Entity-Recognition (NER) tasks.
2165
+ """,
2166
+ BERT_START_DOCSTRING,
2167
+ )
2168
+ class JinaBertForTokenClassification(JinaBertPreTrainedModel):
2169
+ def __init__(self, config):
2170
+ super().__init__(config)
2171
+ self.num_labels = config.num_labels
2172
+
2173
+ self.bert = JinaBertModel(config, add_pooling_layer=False)
2174
+ classifier_dropout = (
2175
+ config.classifier_dropout
2176
+ if config.classifier_dropout is not None
2177
+ else config.hidden_dropout_prob
2178
+ )
2179
+ self.dropout = nn.Dropout(classifier_dropout)
2180
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
2181
+
2182
+ # Initialize weights and apply final processing
2183
+ self.post_init()
2184
+
2185
+ @add_start_docstrings_to_model_forward(
2186
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
2187
+ )
2188
+ @add_code_sample_docstrings(
2189
+ checkpoint=_CHECKPOINT_FOR_TOKEN_CLASSIFICATION,
2190
+ output_type=TokenClassifierOutput,
2191
+ config_class=_CONFIG_FOR_DOC,
2192
+ expected_output=_TOKEN_CLASS_EXPECTED_OUTPUT,
2193
+ expected_loss=_TOKEN_CLASS_EXPECTED_LOSS,
2194
+ )
2195
+ def forward(
2196
+ self,
2197
+ input_ids: Optional[torch.Tensor] = None,
2198
+ attention_mask: Optional[torch.Tensor] = None,
2199
+ token_type_ids: Optional[torch.Tensor] = None,
2200
+ position_ids: Optional[torch.Tensor] = None,
2201
+ head_mask: Optional[torch.Tensor] = None,
2202
+ inputs_embeds: Optional[torch.Tensor] = None,
2203
+ labels: Optional[torch.Tensor] = None,
2204
+ output_attentions: Optional[bool] = None,
2205
+ output_hidden_states: Optional[bool] = None,
2206
+ return_dict: Optional[bool] = None,
2207
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
2208
+ r"""
2209
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
2210
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
2211
+ """
2212
+ return_dict = (
2213
+ return_dict if return_dict is not None else self.config.use_return_dict
2214
+ )
2215
+
2216
+ outputs = self.bert(
2217
+ input_ids,
2218
+ attention_mask=attention_mask,
2219
+ token_type_ids=token_type_ids,
2220
+ position_ids=position_ids,
2221
+ head_mask=head_mask,
2222
+ inputs_embeds=inputs_embeds,
2223
+ output_attentions=output_attentions,
2224
+ output_hidden_states=output_hidden_states,
2225
+ return_dict=return_dict,
2226
+ )
2227
+
2228
+ sequence_output = outputs[0]
2229
+
2230
+ sequence_output = self.dropout(sequence_output)
2231
+ logits = self.classifier(sequence_output)
2232
+
2233
+ loss = None
2234
+ if labels is not None:
2235
+ loss_fct = CrossEntropyLoss()
2236
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
2237
+
2238
+ if not return_dict:
2239
+ output = (logits,) + outputs[2:]
2240
+ return ((loss,) + output) if loss is not None else output
2241
+
2242
+ return TokenClassifierOutput(
2243
+ loss=loss,
2244
+ logits=logits,
2245
+ hidden_states=outputs.hidden_states,
2246
+ attentions=outputs.attentions,
2247
+ )
2248
+
2249
+
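The token-classification head simply flattens the `(batch, sequence)` grid of per-token logits for the cross-entropy loss. A sketch with hypothetical labels; masking padded positions with `-100` is an assumption here (it relies on `CrossEntropyLoss`'s default `ignore_index`), not something this diff sets explicitly:

```python
import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, num_labels = 2, 6, 5

# Hypothetical per-token logits and labels; -100 marks positions to skip
# (padding / special tokens) via CrossEntropyLoss's default ignore_index.
logits = torch.randn(batch_size, seq_len, num_labels)
labels = torch.tensor([[1, 2, 0, 0, -100, -100],
                       [3, 0, 0, 4, 1, -100]])

loss = CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))
print(loss.item())
```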
2250
+ @add_start_docstrings(
2251
+ """
2252
+ JinaBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear
2253
+ layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
2254
+ """,
2255
+ BERT_START_DOCSTRING,
2256
+ )
2257
+ class JinaBertForQuestionAnswering(JinaBertPreTrainedModel):
2258
+ def __init__(self, config):
2259
+ super().__init__(config)
2260
+ self.num_labels = config.num_labels
2261
+
2262
+ self.bert = JinaBertModel(config, add_pooling_layer=False)
2263
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
2264
+
2265
+ # Initialize weights and apply final processing
2266
+ self.post_init()
2267
+
2268
+ @add_start_docstrings_to_model_forward(
2269
+ BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")
2270
+ )
2271
+ @add_code_sample_docstrings(
2272
+ checkpoint=_CHECKPOINT_FOR_QA,
2273
+ output_type=QuestionAnsweringModelOutput,
2274
+ config_class=_CONFIG_FOR_DOC,
2275
+ qa_target_start_index=_QA_TARGET_START_INDEX,
2276
+ qa_target_end_index=_QA_TARGET_END_INDEX,
2277
+ expected_output=_QA_EXPECTED_OUTPUT,
2278
+ expected_loss=_QA_EXPECTED_LOSS,
2279
+ )
2280
+ def forward(
2281
+ self,
2282
+ input_ids: Optional[torch.Tensor] = None,
2283
+ attention_mask: Optional[torch.Tensor] = None,
2284
+ token_type_ids: Optional[torch.Tensor] = None,
2285
+ position_ids: Optional[torch.Tensor] = None,
2286
+ head_mask: Optional[torch.Tensor] = None,
2287
+ inputs_embeds: Optional[torch.Tensor] = None,
2288
+ start_positions: Optional[torch.Tensor] = None,
2289
+ end_positions: Optional[torch.Tensor] = None,
2290
+ output_attentions: Optional[bool] = None,
2291
+ output_hidden_states: Optional[bool] = None,
2292
+ return_dict: Optional[bool] = None,
2293
+ ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
2294
+ r"""
2295
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
2296
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
2297
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
2298
+ are not taken into account for computing the loss.
2299
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
2300
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
2301
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
2302
+ are not taken into account for computing the loss.
2303
+ """
2304
+ return_dict = (
2305
+ return_dict if return_dict is not None else self.config.use_return_dict
2306
+ )
2307
+
2308
+ outputs = self.bert(
2309
+ input_ids,
2310
+ attention_mask=attention_mask,
2311
+ token_type_ids=token_type_ids,
2312
+ position_ids=position_ids,
2313
+ head_mask=head_mask,
2314
+ inputs_embeds=inputs_embeds,
2315
+ output_attentions=output_attentions,
2316
+ output_hidden_states=output_hidden_states,
2317
+ return_dict=return_dict,
2318
+ )
2319
+
2320
+ sequence_output = outputs[0]
2321
+
2322
+ logits = self.qa_outputs(sequence_output)
2323
+ start_logits, end_logits = logits.split(1, dim=-1)
2324
+ start_logits = start_logits.squeeze(-1).contiguous()
2325
+ end_logits = end_logits.squeeze(-1).contiguous()
2326
+
2327
+ total_loss = None
2328
+ if start_positions is not None and end_positions is not None:
2329
+ # If we are on multi-GPU, split adds a dimension
2330
+ if len(start_positions.size()) > 1:
2331
+ start_positions = start_positions.squeeze(-1)
2332
+ if len(end_positions.size()) > 1:
2333
+ end_positions = end_positions.squeeze(-1)
2334
+ # sometimes the start/end positions are outside our model inputs; we ignore these terms
2335
+ ignored_index = start_logits.size(1)
2336
+ start_positions = start_positions.clamp(0, ignored_index)
2337
+ end_positions = end_positions.clamp(0, ignored_index)
2338
+
2339
+ loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
2340
+ start_loss = loss_fct(start_logits, start_positions)
2341
+ end_loss = loss_fct(end_logits, end_positions)
2342
+ total_loss = (start_loss + end_loss) / 2
2343
+
2344
+ if not return_dict:
2345
+ output = (start_logits, end_logits) + outputs[2:]
2346
+ return ((total_loss,) + output) if total_loss is not None else output
2347
+
2348
+ return QuestionAnsweringModelOutput(
2349
+ loss=total_loss,
2350
+ start_logits=start_logits,
2351
+ end_logits=end_logits,
2352
+ hidden_states=outputs.hidden_states,
2353
+ attentions=outputs.attentions,
2354
+ )
2355
+
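The QA head emits two logits per token, which the forward above splits into start and end scores. A minimal decoding sketch with hypothetical logits (independent argmax per side); production decoding usually also enforces `start <= end` and a maximum answer length:

```python
import torch

seq_len = 10
# Hypothetical output of qa_outputs above: two logits per token.
logits = torch.randn(1, seq_len, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()

# Naive span decoding: take the best start and end positions independently.
start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
print(f"predicted answer span: tokens {start}..{end}")
```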
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
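modules.json tells sentence-transformers how to assemble the encoder: module 0 is the Transformer backbone loaded from the repository root, module 1 is the pooling layer configured under `1_Pooling/`. A loading sketch; the repo id is a placeholder, and `trust_remote_code=True` is assumed to be required for the custom Jina backbone:

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id; substitute the actual model repository.
model = SentenceTransformer("your-username/your-setfit-model", trust_remote_code=True)

# Encoding runs module 0 (Transformer) then module 1 (Pooling) from modules.json.
embeddings = model.encode(["a first sentence", "a second sentence"])
print(embeddings.shape)
```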
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 8192,
3
+ "do_lower_case": false
4
+ }
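sentence_bert_config.json caps inputs at 8192 tokens and disables extra lower-casing at this layer; longer texts are truncated at encode time. Continuing the sketch above (assuming `model` is the `SentenceTransformer` instance just loaded), the limit can be inspected or lowered:

```python
# Sequence-length cap read from sentence_bert_config.json.
print(model.max_seq_length)   # 8192

# Optionally trade context for speed/memory by lowering the cap.
model.max_seq_length = 2048
```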
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
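special_tokens_map.json declares the standard BERT control tokens, which the tokenizer inserts around every sequence or sequence pair. A quick check, again with a placeholder repo id:

```python
from transformers import AutoTokenizer

# Placeholder repo id; any checkout containing the tokenizer files added in
# this commit behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("your-username/your-setfit-model")

encoded = tokenizer("first sentence", "second sentence")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# expected pattern: ['[CLS]', ..., '[SEP]', ..., '[SEP]']
```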
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": true,
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 2147483648,
50
+ "never_split": null,
51
+ "pad_token": "[PAD]",
52
+ "sep_token": "[SEP]",
53
+ "strip_accents": null,
54
+ "tokenize_chinese_chars": true,
55
+ "tokenizer_class": "BertTokenizer",
56
+ "unk_token": "[UNK]"
57
+ }
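tokenizer_config.json pins the special-token ids (0, 100, 101, 102, 103) in `added_tokens_decoder` and enables lower-casing in the `BertTokenizer`. Continuing the tokenizer sketch above:

```python
# Special-token ids as declared in added_tokens_decoder.
print(tokenizer.pad_token_id, tokenizer.unk_token_id, tokenizer.cls_token_id,
      tokenizer.sep_token_id, tokenizer.mask_token_id)   # 0 100 101 102 103

# do_lower_case=true: cased input is folded before WordPiece splitting.
print(tokenizer.tokenize("Hello World"))
```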
vocab.txt ADDED
The diff for this file is too large to render. See raw diff