vince62s committed on
Commit
bdeff51
1 Parent(s): b395d14

Upload 9 files

README.md CHANGED
@@ -1,3 +1,189 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - multilingual
4
+ - af
5
+ - am
6
+ - ar
7
+ - as
8
+ - az
9
+ - be
10
+ - bg
11
+ - bn
12
+ - br
13
+ - bs
14
+ - ca
15
+ - cs
16
+ - cy
17
+ - da
18
+ - de
19
+ - el
20
+ - en
21
+ - eo
22
+ - es
23
+ - et
24
+ - eu
25
+ - fa
26
+ - fi
27
+ - fr
28
+ - fy
29
+ - ga
30
+ - gd
31
+ - gl
32
+ - gu
33
+ - ha
34
+ - he
35
+ - hi
36
+ - hr
37
+ - hu
38
+ - hy
39
+ - id
40
+ - is
41
+ - it
42
+ - ja
43
+ - jv
44
+ - ka
45
+ - kk
46
+ - km
47
+ - kn
48
+ - ko
49
+ - ku
50
+ - ky
51
+ - la
52
+ - lo
53
+ - lt
54
+ - lv
55
+ - mg
56
+ - mk
57
+ - ml
58
+ - mn
59
+ - mr
60
+ - ms
61
+ - my
62
+ - ne
63
+ - nl
64
+ - no
65
+ - om
66
+ - or
67
+ - pa
68
+ - pl
69
+ - ps
70
+ - pt
71
+ - ro
72
+ - ru
73
+ - sa
74
+ - sd
75
+ - si
76
+ - sk
77
+ - sl
78
+ - so
79
+ - sq
80
+ - sr
81
+ - su
82
+ - sv
83
+ - sw
84
+ - ta
85
+ - te
86
+ - th
87
+ - tl
88
+ - tr
89
+ - ug
90
+ - uk
91
+ - ur
92
+ - uz
93
+ - vi
94
+ - xh
95
+ - yi
96
+ - zh
97
+ license: mit
98
  ---
99
+
100
+ # XLM-RoBERTa-XL (xlarge-sized model)
101
+
102
+ XLM-RoBERTa-XL is a model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, and Alexis Conneau, and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
103
+
104
+ Disclaimer: The team releasing XLM-RoBERTa-XL did not write a model card for this model, so this model card has been written by the Hugging Face team.
105
+
106
+ ## Model description
107
+
108
+ XLM-RoBERTa-XL is an extra-large multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
109
+
110
+ RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts.
111
+
112
+ More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
113
+
114
+ This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa-XL model as inputs.
115
+
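For instance, here is a minimal sketch of that feature-based approach (the texts, labels and scikit-learn classifier below are placeholders, not part of the original card):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = AutoModel.from_pretrained("facebook/xlm-roberta-xl")
model.eval()

# placeholder labeled data, in two different languages
texts = ["I really enjoyed this film.", "Das war eine furchtbare Erfahrung."]
labels = [1, 0]

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # use the hidden state of the first token (<s>) of each sentence as its feature vector
    features = model(**enc).last_hidden_state[:, 0, :].numpy()

clf = LogisticRegression().fit(features, labels)
```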
116
+ ## Intended uses & limitations
117
+
118
+ You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta-xl) to look for fine-tuned versions on a task that interests you.
119
+
120
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
121
+
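For example, a minimal fine-tuning sketch for sequence classification (the number of labels, the sentences and the labels below are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/xlm-roberta-xl", num_labels=2  # num_labels depends on your task
)

batch = tokenizer(
    ["A first example sentence.", "Une deuxième phrase d'exemple."],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug this into your optimizer / Trainer loop
```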
122
+ ## Usage
123
+
124
+ You can use this model directly with a pipeline for masked language modeling:
125
+
126
+ ```python
127
+ >>> from transformers import pipeline
128
+ >>> unmasker = pipeline('fill-mask', model='facebook/xlm-roberta-xl')
129
+ >>> unmasker("Europe is a <mask> continent.")
130
+
131
+ [{'score': 0.08562745153903961,
132
+ 'token': 38043,
133
+ 'token_str': 'living',
134
+ 'sequence': 'Europe is a living continent.'},
135
+ {'score': 0.0799778401851654,
136
+ 'token': 103494,
137
+ 'token_str': 'dead',
138
+ 'sequence': 'Europe is a dead continent.'},
139
+ {'score': 0.046154674142599106,
140
+ 'token': 72856,
141
+ 'token_str': 'lost',
142
+ 'sequence': 'Europe is a lost continent.'},
143
+ {'score': 0.04358183592557907,
144
+ 'token': 19336,
145
+ 'token_str': 'small',
146
+ 'sequence': 'Europe is a small continent.'},
147
+ {'score': 0.040570393204689026,
148
+ 'token': 34923,
149
+ 'token_str': 'beautiful',
150
+ 'sequence': 'Europe is a beautiful continent.'}]
151
+ ```
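Because the checkpoint is multilingual, the same pipeline can be prompted in any of the covered languages (call shown as a sketch only; outputs omitted):

```python
>>> unmasker("L'Europe est un continent <mask>.")  # same pipeline, French prompt
```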
152
+
153
+ Here is how to use this model to get the features of a given text in PyTorch:
154
+
155
+ ```python
156
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
157
+
158
+ tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-roberta-xl')
159
+ model = AutoModelForMaskedLM.from_pretrained("facebook/xlm-roberta-xl")
160
+
161
+ # prepare input
162
+ text = "Replace me by any text you'd like."
163
+ encoded_input = tokenizer(text, return_tensors='pt')
164
+
165
+ # forward pass
166
+ output = model(**encoded_input)
167
+ ```
168
+
169
+ ### BibTeX entry and citation info
170
+
171
+ ```bibtex
172
+ @article{DBLP:journals/corr/abs-2105-00572,
173
+ author = {Naman Goyal and
174
+ Jingfei Du and
175
+ Myle Ott and
176
+ Giri Anantharaman and
177
+ Alexis Conneau},
178
+ title = {Larger-Scale Transformers for Multilingual Masked Language Modeling},
179
+ journal = {CoRR},
180
+ volume = {abs/2105.00572},
181
+ year = {2021},
182
+ url = {https://arxiv.org/abs/2105.00572},
183
+ eprinttype = {arXiv},
184
+ eprint = {2105.00572},
185
+ timestamp = {Wed, 12 May 2021 15:54:31 +0200},
186
+ biburl = {https://dblp.org/rec/journals/corr/abs-2105-00572.bib},
187
+ bibsource = {dblp computer science bibliography, https://dblp.org}
188
+ }
189
+ ```
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "architectures": [
3
+ "XLMRobertaXLForMaskedLM"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 2560,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 10240,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "xlm-roberta-xl",
17
+ "num_attention_heads": 32,
18
+ "num_hidden_layers": 36,
19
+ "pad_token_id": 1,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "type_vocab_size": 1,
23
+ "use_cache": true,
24
+ "vocab_size": 250880,
25
+ "tokenizer_class": "XLMRobertaTokenizer",
26
+ "layer_transformation": "softmax",
27
+ "layer_norm": true,
28
+ "dropout": 0.1,
29
+ "estimator_sizes": [2560, 1280]
30
+ }
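Note that this configuration carries keys the stock `XLMRobertaXLConfig` does not declare (`layer_transformation`, `layer_norm`, `dropout`, `estimator_sizes`). `PretrainedConfig` keeps such unknown keys as plain attributes, so they survive loading; using the bundled `modeling_xlm_roberta_xl.py` through the Auto classes would typically also require an `auto_map` entry and `trust_remote_code=True`. A minimal sketch (the repository id below is a placeholder):

```python
from transformers import AutoConfig

repo_id = "<this-repository>"  # placeholder: the Hub repo this commit was pushed to

config = AutoConfig.from_pretrained(repo_id)
# extra keys are preserved as plain attributes on the loaded config object
print(config.estimator_sizes, config.layer_transformation)
```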
configuration_xlm_roberta_xl.py ADDED
@@ -0,0 +1,154 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 The HuggingFace Inc. team.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ XLM_ROBERTa_XL configuration"""
16
+
17
+ from collections import OrderedDict
18
+ from typing import Mapping
19
+
20
+ from transformers.configuration_utils import PretrainedConfig
21
+ from transformers.onnx import OnnxConfig
22
+ from transformers.utils import logging
23
+
24
+
25
+ logger = logging.get_logger(__name__)
26
+
27
+
28
+ #from ..deprecated._archive_maps import XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP # noqa: F401, E402
29
+
30
+
31
+ class XLMRobertaXLConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`XLMRobertaXLModel`] or a [`TFXLMRobertaXLModel`].
34
+ It is used to instantiate a XLM_ROBERTA_XL model according to the specified arguments, defining the model
35
+ architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
36
+ XLM_ROBERTA_XL [facebook/xlm-roberta-xl](https://huggingface.co/facebook/xlm-roberta-xl) architecture.
37
+
38
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
39
+ documentation from [`PretrainedConfig`] for more information.
40
+
41
+
42
+ Args:
43
+ vocab_size (`int`, *optional*, defaults to 250880):
44
+ Vocabulary size of the XLM_ROBERTA_XL model. Defines the number of different tokens that can be represented
45
+ by the `inputs_ids` passed when calling [`XLMRobertaXLModel`].
46
+ hidden_size (`int`, *optional*, defaults to 2560):
47
+ Dimensionality of the encoder layers and the pooler layer.
48
+ num_hidden_layers (`int`, *optional*, defaults to 36):
49
+ Number of hidden layers in the Transformer encoder.
50
+ num_attention_heads (`int`, *optional*, defaults to 32):
51
+ Number of attention heads for each attention layer in the Transformer encoder.
52
+ intermediate_size (`int`, *optional*, defaults to 10240):
53
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
54
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
55
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
56
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
57
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
58
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
59
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
60
+ The dropout ratio for the attention probabilities.
61
+ max_position_embeddings (`int`, *optional*, defaults to 514):
62
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
63
+ just in case (e.g., 512 or 1024 or 2048).
64
+ type_vocab_size (`int`, *optional*, defaults to 1):
65
+ The vocabulary size of the `token_type_ids` passed when calling [`XLMRobertaXLModel`] or
66
+ [`TFXLMRobertaXLModel`].
67
+ initializer_range (`float`, *optional*, defaults to 0.02):
68
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
69
+ layer_norm_eps (`float`, *optional*, defaults to 1e-5):
70
+ The epsilon used by the layer normalization layers.
71
+ position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
72
+ Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
73
+ positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
74
+ [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
75
+ For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
76
+ with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
77
+ use_cache (`bool`, *optional*, defaults to `True`):
78
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
79
+ relevant if `config.is_decoder=True`.
80
+ classifier_dropout (`float`, *optional*):
81
+ The dropout ratio for the classification head.
82
+
83
+ Examples:
84
+
85
+ ```python
86
+ >>> from transformers import XLMRobertaXLConfig, XLMRobertaXLModel
87
+
88
+ >>> # Initializing a XLM_ROBERTA_XL facebook/xlm-roberta-xl style configuration
89
+ >>> configuration = XLMRobertaXLConfig()
90
+
91
+ >>> # Initializing a model (with random weights) from the facebook/xlm-roberta-xl style configuration
92
+ >>> model = XLMRobertaXLModel(configuration)
93
+
94
+ >>> # Accessing the model configuration
95
+ >>> configuration = model.config
96
+ ```"""
97
+
98
+ model_type = "xlm-roberta-xl"
99
+
100
+ def __init__(
101
+ self,
102
+ vocab_size=250880,
103
+ hidden_size=2560,
104
+ num_hidden_layers=36,
105
+ num_attention_heads=32,
106
+ intermediate_size=10240,
107
+ hidden_act="gelu",
108
+ hidden_dropout_prob=0.1,
109
+ attention_probs_dropout_prob=0.1,
110
+ max_position_embeddings=514,
111
+ type_vocab_size=1,
112
+ initializer_range=0.02,
113
+ layer_norm_eps=1e-05,
114
+ pad_token_id=1,
115
+ bos_token_id=0,
116
+ eos_token_id=2,
117
+ position_embedding_type="absolute",
118
+ use_cache=True,
119
+ classifier_dropout=None,
120
+ **kwargs,
121
+ ):
122
+ super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_eps = layer_norm_eps
135
+ self.position_embedding_type = position_embedding_type
136
+ self.use_cache = use_cache
137
+ self.classifier_dropout = classifier_dropout
138
+
139
+
140
+ # Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->XLMRobertaXL
141
+ class XLMRobertaXLOnnxConfig(OnnxConfig):
142
+ @property
143
+ def inputs(self) -> Mapping[str, Mapping[int, str]]:
144
+ if self.task == "multiple-choice":
145
+ dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
146
+ else:
147
+ dynamic_axis = {0: "batch", 1: "sequence"}
148
+ return OrderedDict(
149
+ [
150
+ ("input_ids", dynamic_axis),
151
+ ("attention_mask", dynamic_axis),
152
+ ]
153
+ )
154
+
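As a quick illustration (not part of the uploaded file), instantiating the ONNX config above shows the dynamic axes the exporter would receive:

```python
from configuration_xlm_roberta_xl import XLMRobertaXLConfig, XLMRobertaXLOnnxConfig

config = XLMRobertaXLConfig()

onnx_config = XLMRobertaXLOnnxConfig(config, task="default")
print(onnx_config.inputs)
# OrderedDict([('input_ids', {0: 'batch', 1: 'sequence'}),
#              ('attention_mask', {0: 'batch', 1: 'sequence'})])

mc_config = XLMRobertaXLOnnxConfig(config, task="multiple-choice")
print(mc_config.inputs)  # adds a 'choice' axis at position 1
```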
modeling_xlm_roberta_xl.py ADDED
@@ -0,0 +1,1777 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 The HuggingFace Inc. team.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """PyTorch XLM RoBERTa xl,xxl model."""
16
+
17
+ import math
18
+ from typing import List, Optional, Tuple, Union
19
+
20
+ import torch
21
+ import torch.utils.checkpoint
22
+ from torch import nn
23
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
24
+ from torch.nn import Parameter, ParameterList
25
+
26
+ from transformers.activations import ACT2FN, gelu
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutputWithPastAndCrossAttentions,
29
+ BaseModelOutputWithPoolingAndCrossAttentions,
30
+ CausalLMOutputWithCrossAttentions,
31
+ MaskedLMOutput,
32
+ MultipleChoiceModelOutput,
33
+ QuestionAnsweringModelOutput,
34
+ SequenceClassifierOutput,
35
+ TokenClassifierOutput,
36
+ )
37
+ from transformers import PreTrainedModel
38
+ from transformers.pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
39
+ from transformers.utils import (
40
+ add_code_sample_docstrings,
41
+ add_start_docstrings,
42
+ add_start_docstrings_to_model_forward,
43
+ logging,
44
+ replace_return_docstrings,
45
+ )
46
+ from .configuration_xlm_roberta_xl import XLMRobertaXLConfig
47
+
48
+
49
+ logger = logging.get_logger(__name__)
50
+
51
+ _CHECKPOINT_FOR_DOC = "facebook/xlm-roberta-xl"
52
+ _CONFIG_FOR_DOC = "XLMRobertaXLConfig"
53
+
54
+
55
+ #from ..deprecated._archive_maps import XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST # noqa: F401, E402
56
+
57
+
58
+ class XLMRobertaXLEmbeddings(nn.Module):
59
+ """
60
+ Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
61
+ """
62
+
63
+ def __init__(self, config):
64
+ super().__init__()
65
+ self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
66
+ self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
67
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
68
+
69
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
70
+ # any TensorFlow checkpoint file
71
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
72
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
73
+ self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
74
+ self.register_buffer(
75
+ "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
76
+ )
77
+ self.register_buffer(
78
+ "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
79
+ )
80
+
81
+ # End copy
82
+ self.padding_idx = config.pad_token_id
83
+ self.position_embeddings = nn.Embedding(
84
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
85
+ )
86
+
87
+ def forward(
88
+ self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
89
+ ):
90
+ if position_ids is None:
91
+ if input_ids is not None:
92
+ # Create the position ids from the input token ids. Any padded tokens remain padded.
93
+ position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length)
94
+ else:
95
+ position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
96
+
97
+ if input_ids is not None:
98
+ input_shape = input_ids.size()
99
+ else:
100
+ input_shape = inputs_embeds.size()[:-1]
101
+
102
+ seq_length = input_shape[1]
103
+
104
+ # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
105
+ # when it is auto-generated. The registered buffer helps users when tracing the model without passing token_type_ids; it solves
106
+ # issue #5664
107
+ if token_type_ids is None:
108
+ if hasattr(self, "token_type_ids"):
109
+ buffered_token_type_ids = self.token_type_ids[:, :seq_length]
110
+ buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
111
+ token_type_ids = buffered_token_type_ids_expanded
112
+ else:
113
+ token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
114
+
115
+ if inputs_embeds is None:
116
+ inputs_embeds = self.word_embeddings(input_ids)
117
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
118
+
119
+ embeddings = inputs_embeds + token_type_embeddings
120
+ if self.position_embedding_type == "absolute":
121
+ position_embeddings = self.position_embeddings(position_ids)
122
+ embeddings += position_embeddings
123
+
124
+ embeddings = self.dropout(embeddings)
125
+ return embeddings
126
+
127
+ # Copied from transformers.models.roberta.modeling_roberta.RobertaEmbeddings.create_position_ids_from_inputs_embeds
128
+ def create_position_ids_from_inputs_embeds(self, inputs_embeds):
129
+ """
130
+ We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids.
131
+
132
+ Args:
133
+ inputs_embeds: torch.Tensor
134
+
135
+ Returns: torch.Tensor
136
+ """
137
+ input_shape = inputs_embeds.size()[:-1]
138
+ sequence_length = input_shape[1]
139
+
140
+ position_ids = torch.arange(
141
+ self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device
142
+ )
143
+ return position_ids.unsqueeze(0).expand(input_shape)
144
+
145
+
146
+ # Copied from transformers.models.bert.modeling_bert.BertSelfAttention with Bert->XLMRobertaXL
147
+ class XLMRobertaXLSelfAttention(nn.Module):
148
+ def __init__(self, config, position_embedding_type=None):
149
+ super().__init__()
150
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
151
+ raise ValueError(
152
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
153
+ f"heads ({config.num_attention_heads})"
154
+ )
155
+
156
+ self.num_attention_heads = config.num_attention_heads
157
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
158
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
159
+
160
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
161
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
162
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
163
+
164
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
165
+ self.position_embedding_type = position_embedding_type or getattr(
166
+ config, "position_embedding_type", "absolute"
167
+ )
168
+ if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
169
+ self.max_position_embeddings = config.max_position_embeddings
170
+ self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
171
+
172
+ self.is_decoder = config.is_decoder
173
+
174
+ def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
175
+ new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
176
+ x = x.view(new_x_shape)
177
+ return x.permute(0, 2, 1, 3)
178
+
179
+ def forward(
180
+ self,
181
+ hidden_states: torch.Tensor,
182
+ attention_mask: Optional[torch.FloatTensor] = None,
183
+ head_mask: Optional[torch.FloatTensor] = None,
184
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
185
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
186
+ past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
187
+ output_attentions: Optional[bool] = False,
188
+ ) -> Tuple[torch.Tensor]:
189
+ mixed_query_layer = self.query(hidden_states)
190
+
191
+ # If this is instantiated as a cross-attention module, the keys
192
+ # and values come from an encoder; the attention mask needs to be
193
+ # such that the encoder's padding tokens are not attended to.
194
+ is_cross_attention = encoder_hidden_states is not None
195
+
196
+ if is_cross_attention and past_key_value is not None:
197
+ # reuse k,v, cross_attentions
198
+ key_layer = past_key_value[0]
199
+ value_layer = past_key_value[1]
200
+ attention_mask = encoder_attention_mask
201
+ elif is_cross_attention:
202
+ key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
203
+ value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
204
+ attention_mask = encoder_attention_mask
205
+ elif past_key_value is not None:
206
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
207
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
208
+ key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
209
+ value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
210
+ else:
211
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
212
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
213
+
214
+ query_layer = self.transpose_for_scores(mixed_query_layer)
215
+
216
+ use_cache = past_key_value is not None
217
+ if self.is_decoder:
218
+ # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
219
+ # Further calls to cross_attention layer can then reuse all cross-attention
220
+ # key/value_states (first "if" case)
221
+ # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
222
+ # all previous decoder key/value_states. Further calls to uni-directional self-attention
223
+ # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
224
+ # if encoder bi-directional self-attention `past_key_value` is always `None`
225
+ past_key_value = (key_layer, value_layer)
226
+
227
+ # Take the dot product between "query" and "key" to get the raw attention scores.
228
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
229
+
230
+ if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
231
+ query_length, key_length = query_layer.shape[2], key_layer.shape[2]
232
+ if use_cache:
233
+ position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
234
+ -1, 1
235
+ )
236
+ else:
237
+ position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
238
+ position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
239
+ distance = position_ids_l - position_ids_r
240
+
241
+ positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
242
+ positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
243
+
244
+ if self.position_embedding_type == "relative_key":
245
+ relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
246
+ attention_scores = attention_scores + relative_position_scores
247
+ elif self.position_embedding_type == "relative_key_query":
248
+ relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
249
+ relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
250
+ attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
251
+
252
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
253
+ if attention_mask is not None:
254
+ # Apply the attention mask is (precomputed for all layers in XLMRobertaXLModel forward() function)
255
+ attention_scores = attention_scores + attention_mask
256
+
257
+ # Normalize the attention scores to probabilities.
258
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
259
+
260
+ # This is actually dropping out entire tokens to attend to, which might
261
+ # seem a bit unusual, but is taken from the original Transformer paper.
262
+ attention_probs = self.dropout(attention_probs)
263
+
264
+ # Mask heads if we want to
265
+ if head_mask is not None:
266
+ attention_probs = attention_probs * head_mask
267
+
268
+ context_layer = torch.matmul(attention_probs, value_layer)
269
+
270
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
271
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
272
+ context_layer = context_layer.view(new_context_layer_shape)
273
+
274
+ outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
275
+
276
+ if self.is_decoder:
277
+ outputs = outputs + (past_key_value,)
278
+ return outputs
279
+
280
+
281
+ class XLMRobertaXLSelfOutput(nn.Module):
282
+ def __init__(self, config):
283
+ super().__init__()
284
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
285
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
286
+
287
+ def forward(self, hidden_states, input_tensor):
288
+ hidden_states = self.dense(hidden_states)
289
+ hidden_states = self.dropout(hidden_states)
290
+ hidden_states = hidden_states + input_tensor
291
+ return hidden_states
292
+
293
+
294
+ class XLMRobertaXLAttention(nn.Module):
295
+ def __init__(self, config, position_embedding_type=None):
296
+ super().__init__()
297
+ self.self_attn_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
298
+ self.self = XLMRobertaXLSelfAttention(config, position_embedding_type=position_embedding_type)
299
+ self.output = XLMRobertaXLSelfOutput(config)
300
+ self.pruned_heads = set()
301
+
302
+ def prune_heads(self, heads):
303
+ if len(heads) == 0:
304
+ return
305
+ heads, index = find_pruneable_heads_and_indices(
306
+ heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
307
+ )
308
+
309
+ # Prune linear layers
310
+ self.self.query = prune_linear_layer(self.self.query, index)
311
+ self.self.key = prune_linear_layer(self.self.key, index)
312
+ self.self.value = prune_linear_layer(self.self.value, index)
313
+ self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
314
+
315
+ # Update hyper params and store pruned heads
316
+ self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
317
+ self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
318
+ self.pruned_heads = self.pruned_heads.union(heads)
319
+
320
+ def forward(
321
+ self,
322
+ hidden_states,
323
+ attention_mask=None,
324
+ head_mask=None,
325
+ encoder_hidden_states=None,
326
+ encoder_attention_mask=None,
327
+ past_key_value=None,
328
+ output_attentions=False,
329
+ ):
330
+ intermediate = self.self_attn_layer_norm(hidden_states)
331
+ self_outputs = self.self(
332
+ intermediate,
333
+ attention_mask,
334
+ head_mask,
335
+ encoder_hidden_states,
336
+ encoder_attention_mask,
337
+ past_key_value,
338
+ output_attentions,
339
+ )
340
+ attention_output = self.output(self_outputs[0], hidden_states)
341
+ outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
342
+ return outputs
343
+
344
+
345
+ # Copied from transformers.models.bert.modeling_bert.BertIntermediate
346
+ class XLMRobertaXLIntermediate(nn.Module):
347
+ def __init__(self, config):
348
+ super().__init__()
349
+ self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
350
+ if isinstance(config.hidden_act, str):
351
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
352
+ else:
353
+ self.intermediate_act_fn = config.hidden_act
354
+
355
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
356
+ hidden_states = self.dense(hidden_states)
357
+ hidden_states = self.intermediate_act_fn(hidden_states)
358
+ return hidden_states
359
+
360
+
361
+ class XLMRobertaXLOutput(nn.Module):
362
+ def __init__(self, config):
363
+ super().__init__()
364
+ self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
365
+
366
+ def forward(self, hidden_states, input_tensor):
367
+ hidden_states = self.dense(hidden_states)
368
+ hidden_states = hidden_states + input_tensor
369
+ return hidden_states
370
+
371
+
372
+ class XLMRobertaXLLayer(nn.Module):
373
+ def __init__(self, config):
374
+ super().__init__()
375
+ self.chunk_size_feed_forward = config.chunk_size_feed_forward
376
+ self.seq_len_dim = 1
377
+ self.attention = XLMRobertaXLAttention(config)
378
+ self.is_decoder = config.is_decoder
379
+ self.add_cross_attention = config.add_cross_attention
380
+ if self.add_cross_attention:
381
+ if not self.is_decoder:
382
+ raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
383
+ self.crossattention = XLMRobertaXLAttention(config, position_embedding_type="absolute")
384
+ self.intermediate = XLMRobertaXLIntermediate(config)
385
+ self.output = XLMRobertaXLOutput(config)
386
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
387
+
388
+ def forward(
389
+ self,
390
+ hidden_states,
391
+ attention_mask=None,
392
+ head_mask=None,
393
+ encoder_hidden_states=None,
394
+ encoder_attention_mask=None,
395
+ past_key_value=None,
396
+ output_attentions=False,
397
+ ):
398
+ # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
399
+ self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
400
+ self_attention_outputs = self.attention(
401
+ hidden_states,
402
+ attention_mask,
403
+ head_mask,
404
+ output_attentions=output_attentions,
405
+ past_key_value=self_attn_past_key_value,
406
+ )
407
+ attention_output = self_attention_outputs[0]
408
+
409
+ # if decoder, the last output is tuple of self-attn cache
410
+ if self.is_decoder:
411
+ outputs = self_attention_outputs[1:-1]
412
+ present_key_value = self_attention_outputs[-1]
413
+ else:
414
+ outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
415
+
416
+ cross_attn_present_key_value = None
417
+ if self.is_decoder and encoder_hidden_states is not None:
418
+ if not hasattr(self, "crossattention"):
419
+ raise ValueError(
420
+ f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
421
+ " by setting `config.add_cross_attention=True`"
422
+ )
423
+
424
+ # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
425
+ cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
426
+ cross_attention_outputs = self.crossattention(
427
+ attention_output,
428
+ attention_mask,
429
+ head_mask,
430
+ encoder_hidden_states,
431
+ encoder_attention_mask,
432
+ cross_attn_past_key_value,
433
+ output_attentions,
434
+ )
435
+ attention_output = cross_attention_outputs[0]
436
+ outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights
437
+
438
+ # add cross-attn cache to positions 3,4 of present_key_value tuple
439
+ cross_attn_present_key_value = cross_attention_outputs[-1]
440
+ present_key_value = present_key_value + cross_attn_present_key_value
441
+
442
+ layer_output = apply_chunking_to_forward(
443
+ self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
444
+ )
445
+ outputs = (layer_output,) + outputs
446
+
447
+ # if decoder, return the attn key/values as the last output
448
+ if self.is_decoder:
449
+ outputs = outputs + (present_key_value,)
450
+
451
+ return outputs
452
+
453
+ def feed_forward_chunk(self, attention_output):
454
+ intermediate_output = self.LayerNorm(attention_output)
455
+ intermediate_output = self.intermediate(intermediate_output)
456
+ layer_output = self.output(intermediate_output, attention_output)
457
+ return layer_output
458
+
459
+
460
+ class XLMRobertaXLEncoder(nn.Module):
461
+ def __init__(self, config):
462
+ super().__init__()
463
+ self.config = config
464
+ self.layer = nn.ModuleList([XLMRobertaXLLayer(config) for _ in range(config.num_hidden_layers)])
465
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
466
+ self.gradient_checkpointing = False
467
+
468
+ def forward(
469
+ self,
470
+ hidden_states,
471
+ attention_mask=None,
472
+ head_mask=None,
473
+ encoder_hidden_states=None,
474
+ encoder_attention_mask=None,
475
+ past_key_values=None,
476
+ use_cache=None,
477
+ output_attentions=False,
478
+ output_hidden_states=False,
479
+ return_dict=True,
480
+ ):
481
+ if self.gradient_checkpointing and self.training:
482
+ if use_cache:
483
+ logger.warning_once(
484
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
485
+ )
486
+ use_cache = False
487
+ all_hidden_states = () if output_hidden_states else None
488
+ all_self_attentions = () if output_attentions else None
489
+ all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
490
+
491
+ next_decoder_cache = () if use_cache else None
492
+ for i, layer_module in enumerate(self.layer):
493
+ if output_hidden_states:
494
+ all_hidden_states = all_hidden_states + (hidden_states,)
495
+
496
+ layer_head_mask = head_mask[i] if head_mask is not None else None
497
+ past_key_value = past_key_values[i] if past_key_values is not None else None
498
+
499
+ if self.gradient_checkpointing and self.training:
500
+ layer_outputs = self._gradient_checkpointing_func(
501
+ layer_module.__call__,
502
+ hidden_states,
503
+ attention_mask,
504
+ layer_head_mask,
505
+ encoder_hidden_states,
506
+ encoder_attention_mask,
507
+ past_key_value,
508
+ output_attentions,
509
+ )
510
+ else:
511
+ layer_outputs = layer_module(
512
+ hidden_states,
513
+ attention_mask,
514
+ layer_head_mask,
515
+ encoder_hidden_states,
516
+ encoder_attention_mask,
517
+ past_key_value,
518
+ output_attentions,
519
+ )
520
+
521
+ hidden_states = layer_outputs[0]
522
+ if use_cache:
523
+ next_decoder_cache += (layer_outputs[-1],)
524
+ if output_attentions:
525
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
526
+ if self.config.add_cross_attention:
527
+ all_cross_attentions = all_cross_attentions + (layer_outputs[2],)
528
+
529
+ hidden_states = self.LayerNorm(hidden_states)
530
+
531
+ if output_hidden_states:
532
+ all_hidden_states = all_hidden_states + (hidden_states,)
533
+
534
+ if not return_dict:
535
+ return tuple(
536
+ v
537
+ for v in [
538
+ hidden_states,
539
+ next_decoder_cache,
540
+ all_hidden_states,
541
+ all_self_attentions,
542
+ all_cross_attentions,
543
+ ]
544
+ if v is not None
545
+ )
546
+ return BaseModelOutputWithPastAndCrossAttentions(
547
+ last_hidden_state=hidden_states,
548
+ past_key_values=next_decoder_cache,
549
+ hidden_states=all_hidden_states,
550
+ attentions=all_self_attentions,
551
+ cross_attentions=all_cross_attentions,
552
+ )
553
+
554
+
555
+ # Copied from transformers.models.bert.modeling_bert.BertPooler
556
+ class XLMRobertaXLPooler(nn.Module):
557
+ def __init__(self, config):
558
+ super().__init__()
559
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
560
+ self.activation = nn.Tanh()
561
+
562
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
563
+ # We "pool" the model by simply taking the hidden state corresponding
564
+ # to the first token.
565
+ first_token_tensor = hidden_states[:, 0]
566
+ pooled_output = self.dense(first_token_tensor)
567
+ pooled_output = self.activation(pooled_output)
568
+ return pooled_output
569
+
570
+
571
+ class XLMRobertaXLPreTrainedModel(PreTrainedModel):
572
+ """
573
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
574
+ models.
575
+ """
576
+
577
+ config_class = XLMRobertaXLConfig
578
+ base_model_prefix = "roberta"
579
+
580
+ # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
581
+ def _init_weights(self, module):
582
+ """Initialize the weights"""
583
+ if isinstance(module, nn.Linear):
584
+ # Slightly different from the TF version which uses truncated_normal for initialization
585
+ # cf https://github.com/pytorch/pytorch/pull/5617
586
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
587
+ if module.bias is not None:
588
+ module.bias.data.zero_()
589
+ elif isinstance(module, nn.Embedding):
590
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
591
+ if module.padding_idx is not None:
592
+ module.weight.data[module.padding_idx].zero_()
593
+ elif isinstance(module, nn.LayerNorm):
594
+ module.bias.data.zero_()
595
+ module.weight.data.fill_(1.0)
596
+
597
+
598
+ XLM_ROBERTA_XL_START_DOCSTRING = r"""
599
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
600
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
601
+ etc.) This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
602
+ subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
603
+ general usage and behavior.
604
+
605
+ Parameters:
606
+ config ([`XLMRobertaXLConfig`]): Model configuration class with all the parameters of the
607
+ model. Initializing with a config file does not load the weights associated with the model, only the
608
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
609
+ """
610
+
611
+ XLM_ROBERTA_XL_INPUTS_DOCSTRING = r"""
612
+ Args:
613
+ input_ids (`torch.LongTensor` of shape `({0})`):
614
+ Indices of input sequence tokens in the vocabulary. Indices can be obtained using [`AutoTokenizer`]. See
615
+ [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. [What are input
616
+ IDs?](../glossary#input-ids)
617
+ attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
618
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
619
+
620
+ - 1 for tokens that are **not masked**,
621
+ - 0 for tokens that are **masked**.
622
+ [What are attention masks?](../glossary#attention-mask)
623
+ token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
624
+ Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
625
+ 1]`:
626
+
627
+ - 0 corresponds to a *sentence A* token,
628
+ - 1 corresponds to a *sentence B* token.
629
+ [What are token type IDs?](../glossary#token-type-ids)
630
+ position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
631
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
632
+ config.max_position_embeddings - 1]`. [What are position IDs?](../glossary#position-ids)
633
+ head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
634
+ Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
635
+
636
+ - 1 indicates the head is **not masked**,
637
+ - 0 indicates the head is **masked**.
638
+ inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
639
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
640
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
641
+ model's internal embedding lookup matrix.
642
+ output_attentions (`bool`, *optional*):
643
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
644
+ tensors for more detail.
645
+ output_hidden_states (`bool`, *optional*):
646
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
647
+ more detail.
648
+ return_dict (`bool`, *optional*):
649
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
650
+ """
651
+
652
+
653
+ @add_start_docstrings(
654
+ "The bare XLM-RoBERTa-XL Model transformer outputting raw hidden-states without any specific head on top.",
655
+ XLM_ROBERTA_XL_START_DOCSTRING,
656
+ )
657
+ class XLMRobertaXLModel(XLMRobertaXLPreTrainedModel):
658
+ """
659
+ The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
660
+ cross-attention is added between the self-attention layers, following the architecture described in *Attention is
661
+ all you need*_ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
662
+ Kaiser and Illia Polosukhin. To behave as a decoder the model needs to be initialized with the `is_decoder`
663
+ argument of the configuration set to `True`. To be used in a Seq2Seq model, the model needs to be initialized with
664
+ both `is_decoder` argument and `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as
665
+ an input to the forward pass. .. _*Attention is all you need*: https://arxiv.org/abs/1706.03762
666
+ """
667
+
668
+ # Copied from transformers.models.bert.modeling_bert.BertModel.__init__ with Bert->XLMRobertaXL
669
+ def __init__(self, config, add_pooling_layer=True):
670
+ super().__init__(config)
671
+ self.config = config
672
+
673
+ self.embeddings = XLMRobertaXLEmbeddings(config)
674
+ self.encoder = XLMRobertaXLEncoder(config)
675
+
676
+ self.pooler = XLMRobertaXLPooler(config) if add_pooling_layer else None
677
+
678
+ # Initialize weights and apply final processing
679
+ self.post_init()
680
+
681
+ def get_input_embeddings(self):
682
+ return self.embeddings.word_embeddings
683
+
684
+ def set_input_embeddings(self, value):
685
+ self.embeddings.word_embeddings = value
686
+
687
+ def _prune_heads(self, heads_to_prune):
688
+ """
689
+ Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
690
+ class PreTrainedModel
691
+ """
692
+ for layer, heads in heads_to_prune.items():
693
+ self.encoder.layer[layer].attention.prune_heads(heads)
694
+
695
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
696
+ @add_code_sample_docstrings(
697
+ checkpoint=_CHECKPOINT_FOR_DOC,
698
+ output_type=BaseModelOutputWithPoolingAndCrossAttentions,
699
+ config_class=_CONFIG_FOR_DOC,
700
+ )
701
+ # Copied from transformers.models.bert.modeling_bert.BertModel.forward
702
+ def forward(
703
+ self,
704
+ input_ids: Optional[torch.Tensor] = None,
705
+ attention_mask: Optional[torch.Tensor] = None,
706
+ token_type_ids: Optional[torch.Tensor] = None,
707
+ position_ids: Optional[torch.Tensor] = None,
708
+ head_mask: Optional[torch.Tensor] = None,
709
+ inputs_embeds: Optional[torch.Tensor] = None,
710
+ encoder_hidden_states: Optional[torch.Tensor] = None,
711
+ encoder_attention_mask: Optional[torch.Tensor] = None,
712
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
713
+ use_cache: Optional[bool] = None,
714
+ output_attentions: Optional[bool] = None,
715
+ output_hidden_states: Optional[bool] = None,
716
+ return_dict: Optional[bool] = None,
717
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
718
+ r"""
719
+ encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
720
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
721
+ the model is configured as a decoder.
722
+ encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
723
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
724
+ the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:
725
+
726
+ - 1 for tokens that are **not masked**,
727
+ - 0 for tokens that are **masked**.
728
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
729
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
730
+
731
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
732
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
733
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
734
+ use_cache (`bool`, *optional*):
735
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
736
+ `past_key_values`).
737
+ """
738
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
739
+ output_hidden_states = (
740
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
741
+ )
742
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
743
+
744
+ if self.config.is_decoder:
745
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
746
+ else:
747
+ use_cache = False
748
+
749
+ if input_ids is not None and inputs_embeds is not None:
750
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
751
+ elif input_ids is not None:
752
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
753
+ input_shape = input_ids.size()
754
+ elif inputs_embeds is not None:
755
+ input_shape = inputs_embeds.size()[:-1]
756
+ else:
757
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
758
+
759
+ batch_size, seq_length = input_shape
760
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
761
+
762
+ # past_key_values_length
763
+ past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
764
+
765
+ if attention_mask is None:
766
+ attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
767
+
768
+ if token_type_ids is None:
769
+ if hasattr(self.embeddings, "token_type_ids"):
770
+ buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
771
+ buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
772
+ token_type_ids = buffered_token_type_ids_expanded
773
+ else:
774
+ token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
775
+
776
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
777
+ # ourselves in which case we just need to make it broadcastable to all heads.
778
+ extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
779
+
780
+ # If a 2D or 3D attention mask is provided for the cross-attention
781
+ # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
782
+ if self.config.is_decoder and encoder_hidden_states is not None:
783
+ encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
784
+ encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
785
+ if encoder_attention_mask is None:
786
+ encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
787
+ encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
788
+ else:
789
+ encoder_extended_attention_mask = None
790
+
791
+ # Prepare head mask if needed
792
+ # 1.0 in head_mask indicate we keep the head
793
+ # attention_probs has shape bsz x n_heads x N x N
794
+ # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
795
+ # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
796
+ head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
797
+
798
+ embedding_output = self.embeddings(
799
+ input_ids=input_ids,
800
+ position_ids=position_ids,
801
+ token_type_ids=token_type_ids,
802
+ inputs_embeds=inputs_embeds,
803
+ past_key_values_length=past_key_values_length,
804
+ )
805
+ encoder_outputs = self.encoder(
806
+ embedding_output,
807
+ attention_mask=extended_attention_mask,
808
+ head_mask=head_mask,
809
+ encoder_hidden_states=encoder_hidden_states,
810
+ encoder_attention_mask=encoder_extended_attention_mask,
811
+ past_key_values=past_key_values,
812
+ use_cache=use_cache,
813
+ output_attentions=output_attentions,
814
+ output_hidden_states=output_hidden_states,
815
+ return_dict=return_dict,
816
+ )
817
+ sequence_output = encoder_outputs[0]
818
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
819
+
820
+ if not return_dict:
821
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
822
+
823
+ return BaseModelOutputWithPoolingAndCrossAttentions(
824
+ last_hidden_state=sequence_output,
825
+ pooler_output=pooled_output,
826
+ past_key_values=encoder_outputs.past_key_values,
827
+ hidden_states=encoder_outputs.hidden_states,
828
+ attentions=encoder_outputs.attentions,
829
+ cross_attentions=encoder_outputs.cross_attentions,
830
+ )
831
+
832
+
833
+ @add_start_docstrings(
834
+ """XLM-RoBERTa-XL Model with a `language modeling` head on top for CLM fine-tuning.""",
835
+ XLM_ROBERTA_XL_START_DOCSTRING,
836
+ )
837
+ class XLMRobertaXLForCausalLM(XLMRobertaXLPreTrainedModel):
838
+ _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]
839
+
840
+ def __init__(self, config):
841
+ super().__init__(config)
842
+
843
+ if not config.is_decoder:
844
+ logger.warning("If you want to use `XLMRobertaXLForCausalLM` as a standalone, add `is_decoder=True.`")
845
+
846
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
847
+ self.lm_head = XLMRobertaXLLMHead(config)
848
+
849
+ self.init_weights()
850
+
851
+ def get_output_embeddings(self):
852
+ return self.lm_head.decoder
853
+
854
+ def set_output_embeddings(self, new_embeddings):
855
+ self.lm_head.decoder = new_embeddings
856
+
857
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
858
+ @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)
859
+ def forward(
860
+ self,
861
+ input_ids: Optional[torch.LongTensor] = None,
862
+ attention_mask: Optional[torch.FloatTensor] = None,
863
+ token_type_ids: Optional[torch.LongTensor] = None,
864
+ position_ids: Optional[torch.LongTensor] = None,
865
+ head_mask: Optional[torch.FloatTensor] = None,
866
+ inputs_embeds: Optional[torch.FloatTensor] = None,
867
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
868
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
869
+ labels: Optional[torch.LongTensor] = None,
870
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
871
+ use_cache: Optional[bool] = None,
872
+ output_attentions: Optional[bool] = None,
873
+ output_hidden_states: Optional[bool] = None,
874
+ return_dict: Optional[bool] = None,
875
+ ) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
876
+ r"""
877
+ encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
878
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
879
+ the model is configured as a decoder.
880
+ encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
881
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
882
+ the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:
883
+
884
+ - 1 for tokens that are **not masked**,
885
+ - 0 for tokens that are **masked**.
886
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
887
+ Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
888
+ `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are
889
+ ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
890
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
891
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
892
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
893
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
894
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
895
+ use_cache (`bool`, *optional*):
896
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
897
+ `past_key_values`).
898
+
899
+ Returns:
900
+
901
+ Example:
902
+
903
+ ```python
904
+ >>> from transformers import AutoTokenizer, RobertaForCausalLM, RobertaConfig
905
+ >>> import torch
906
+
907
+ >>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
908
+ >>> config = RobertaConfig.from_pretrained("FacebookAI/roberta-base")
909
+ >>> config.is_decoder = True
910
+ >>> model = RobertaForCausalLM.from_pretrained("FacebookAI/roberta-base", config=config)
911
+ >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
912
+ >>> outputs = model(**inputs)
913
+ >>> prediction_logits = outputs.logits
914
+ ```
915
+ """
916
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
917
+ if labels is not None:
918
+ use_cache = False
919
+
920
+ outputs = self.roberta(
921
+ input_ids,
922
+ attention_mask=attention_mask,
923
+ token_type_ids=token_type_ids,
924
+ position_ids=position_ids,
925
+ head_mask=head_mask,
926
+ inputs_embeds=inputs_embeds,
927
+ encoder_hidden_states=encoder_hidden_states,
928
+ encoder_attention_mask=encoder_attention_mask,
929
+ past_key_values=past_key_values,
930
+ use_cache=use_cache,
931
+ output_attentions=output_attentions,
932
+ output_hidden_states=output_hidden_states,
933
+ return_dict=return_dict,
934
+ )
935
+
936
+ sequence_output = outputs[0]
937
+ prediction_scores = self.lm_head(sequence_output)
938
+
939
+ lm_loss = None
940
+ if labels is not None:
941
+ # we are doing next-token prediction; shift prediction scores and input ids by one
942
+ shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
943
+ labels = labels[:, 1:].contiguous()
944
+ loss_fct = CrossEntropyLoss()
945
+ lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
946
+
947
+ if not return_dict:
948
+ output = (prediction_scores,) + outputs[2:]
949
+ return ((lm_loss,) + output) if lm_loss is not None else output
950
+
951
+ return CausalLMOutputWithCrossAttentions(
952
+ loss=lm_loss,
953
+ logits=prediction_scores,
954
+ past_key_values=outputs.past_key_values,
955
+ hidden_states=outputs.hidden_states,
956
+ attentions=outputs.attentions,
957
+ cross_attentions=outputs.cross_attentions,
958
+ )
959
+
960
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, **model_kwargs):
961
+ input_shape = input_ids.shape
962
+ # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
963
+ if attention_mask is None:
964
+ attention_mask = input_ids.new_ones(input_shape)
965
+
966
+ # cut decoder_input_ids if past_key_values is used
967
+ if past_key_values is not None:
968
+ past_length = past_key_values[0][0].shape[2]
969
+
970
+ # Some generation methods already pass only the last input ID
971
+ if input_ids.shape[1] > past_length:
972
+ remove_prefix_length = past_length
973
+ else:
974
+ # Default to old behavior: keep only final ID
975
+ remove_prefix_length = input_ids.shape[1] - 1
976
+
977
+ input_ids = input_ids[:, remove_prefix_length:]
978
+
979
+ return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values}
980
+
981
+ def _reorder_cache(self, past_key_values, beam_idx):
982
+ reordered_past = ()
983
+ for layer_past in past_key_values:
984
+ reordered_past += (
985
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
986
+ )
987
+ return reordered_past
988
+
989
+
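For causal fine-tuning, the loss above shifts logits and labels by one position so that the prediction at step t is scored against the token at step t+1. A small self-contained illustration of that shift with dummy tensors (not the real model):

```python
import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, vocab_size = 2, 5, 11
prediction_scores = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# drop the last prediction (nothing follows it) and the first label
# (nothing predicts it), exactly as in the forward pass above
shifted_scores = prediction_scores[:, :-1, :].contiguous()
shifted_labels = labels[:, 1:].contiguous()

loss = CrossEntropyLoss()(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))
print(loss.item())
```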
990
+ @add_start_docstrings(
991
+ """XLM-RoBERTa-XL Model with a `language modeling` head on top.""", XLM_ROBERTA_XL_START_DOCSTRING
992
+ )
993
+ class XLMRobertaXLForMaskedLM(XLMRobertaXLPreTrainedModel):
994
+ _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]
995
+
996
+ def __init__(self, config):
997
+ super().__init__(config)
998
+
999
+ if config.is_decoder:
1000
+ logger.warning(
1001
+ "If you want to use `RobertaForMaskedLM` make sure `config.is_decoder=False` for "
1002
+ "bi-directional self-attention."
1003
+ )
1004
+
1005
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
1006
+ self.lm_head = XLMRobertaXLLMHead(config)
1007
+
1008
+ self.init_weights()
1009
+
1010
+ def get_output_embeddings(self):
1011
+ return self.lm_head.decoder
1012
+
1013
+ def set_output_embeddings(self, new_embeddings):
1014
+ self.lm_head.decoder = new_embeddings
1015
+
1016
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1017
+ @add_code_sample_docstrings(
1018
+ checkpoint=_CHECKPOINT_FOR_DOC,
1019
+ output_type=MaskedLMOutput,
1020
+ config_class=_CONFIG_FOR_DOC,
1021
+ mask="<mask>",
1022
+ )
1023
+ def forward(
1024
+ self,
1025
+ input_ids: Optional[torch.LongTensor] = None,
1026
+ attention_mask: Optional[torch.FloatTensor] = None,
1027
+ token_type_ids: Optional[torch.LongTensor] = None,
1028
+ position_ids: Optional[torch.LongTensor] = None,
1029
+ head_mask: Optional[torch.FloatTensor] = None,
1030
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1031
+ encoder_hidden_states: Optional[torch.Tensor] = None,
1032
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
1033
+ labels: Optional[torch.LongTensor] = None,
1034
+ output_attentions: Optional[bool] = None,
1035
+ output_hidden_states: Optional[bool] = None,
1036
+ return_dict: Optional[bool] = None,
1037
+ ) -> Union[Tuple, MaskedLMOutput]:
1038
+ r"""
1039
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1040
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1041
+ config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
1042
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1043
+ kwargs (`Dict[str, any]`, optional, defaults to *{}*):
1044
+ Used to hide legacy arguments that have been deprecated.
1045
+ """
1046
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1047
+
1048
+ outputs = self.roberta(
1049
+ input_ids,
1050
+ attention_mask=attention_mask,
1051
+ token_type_ids=token_type_ids,
1052
+ position_ids=position_ids,
1053
+ head_mask=head_mask,
1054
+ inputs_embeds=inputs_embeds,
1055
+ encoder_hidden_states=encoder_hidden_states,
1056
+ encoder_attention_mask=encoder_attention_mask,
1057
+ output_attentions=output_attentions,
1058
+ output_hidden_states=output_hidden_states,
1059
+ return_dict=return_dict,
1060
+ )
1061
+ sequence_output = outputs[0]
1062
+ prediction_scores = self.lm_head(sequence_output)
1063
+
1064
+ masked_lm_loss = None
1065
+ if labels is not None:
1066
+ loss_fct = CrossEntropyLoss()
1067
+ masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
1068
+
1069
+ if not return_dict:
1070
+ output = (prediction_scores,) + outputs[2:]
1071
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1072
+
1073
+ return MaskedLMOutput(
1074
+ loss=masked_lm_loss,
1075
+ logits=prediction_scores,
1076
+ hidden_states=outputs.hidden_states,
1077
+ attentions=outputs.attentions,
1078
+ )
1079
+
1080
+
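A hedged usage sketch for the masked-LM head. It assumes the public checkpoint id `facebook/xlm-roberta-xl` and the standard `<mask>` token; any XLM-R tokenizer/checkpoint pair with the same vocabulary would work the same way:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = AutoModelForMaskedLM.from_pretrained("facebook/xlm-roberta-xl")

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# pick the most likely token at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```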
1081
+ class XLMRobertaXLLMHead(nn.Module):
1082
+ """XLM-RoBERTa-XL Head for masked language modeling."""
1083
+
1084
+ def __init__(self, config):
1085
+ super().__init__()
1086
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
1087
+ self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
1088
+
1089
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
1090
+ self.bias = nn.Parameter(torch.zeros(config.vocab_size))
1091
+ self.decoder.bias = self.bias
1092
+
1093
+ def forward(self, features, **kwargs):
1094
+ x = self.dense(features)
1095
+ x = gelu(x)
1096
+ x = self.layer_norm(x)
1097
+
1098
+ # project back to size of vocabulary with bias
1099
+ x = self.decoder(x)
1100
+
1101
+ return x
1102
+
1103
+ def _tie_weights(self):
1104
+ # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
1105
+ self.bias = self.decoder.bias
1106
+
1107
+
1108
+ @add_start_docstrings(
1109
+ """
1110
+ XLM-RoBERTa-XL Model transformer with a sequence classification/regression head on top (a linear layer on top
1111
+ of the pooled output) e.g. for GLUE tasks.
1112
+ """,
1113
+ XLM_ROBERTA_XL_START_DOCSTRING,
1114
+ )
1115
+ class XLMRobertaXLForSequenceClassification(XLMRobertaXLPreTrainedModel):
1116
+ def __init__(self, config):
1117
+ super().__init__(config)
1118
+ self.num_labels = config.num_labels
1119
+ self.config = config
1120
+
1121
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
1122
+ self.classifier = XLMRobertaXLClassificationHead(config)
1123
+
1124
+ self.init_weights()
1125
+
1126
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1127
+ @add_code_sample_docstrings(
1128
+ checkpoint=_CHECKPOINT_FOR_DOC,
1129
+ output_type=SequenceClassifierOutput,
1130
+ config_class=_CONFIG_FOR_DOC,
1131
+ )
1132
+ def forward(
1133
+ self,
1134
+ input_ids: Optional[torch.LongTensor] = None,
1135
+ attention_mask: Optional[torch.FloatTensor] = None,
1136
+ token_type_ids: Optional[torch.LongTensor] = None,
1137
+ position_ids: Optional[torch.LongTensor] = None,
1138
+ head_mask: Optional[torch.FloatTensor] = None,
1139
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1140
+ labels: Optional[torch.LongTensor] = None,
1141
+ output_attentions: Optional[bool] = None,
1142
+ output_hidden_states: Optional[bool] = None,
1143
+ return_dict: Optional[bool] = None,
1144
+ ) -> Union[Tuple, SequenceClassifierOutput]:
1145
+ r"""
1146
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1147
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1148
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1149
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1150
+ """
1151
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1152
+
1153
+ outputs = self.roberta(
1154
+ input_ids,
1155
+ attention_mask=attention_mask,
1156
+ token_type_ids=token_type_ids,
1157
+ position_ids=position_ids,
1158
+ head_mask=head_mask,
1159
+ inputs_embeds=inputs_embeds,
1160
+ output_attentions=output_attentions,
1161
+ output_hidden_states=output_hidden_states,
1162
+ return_dict=return_dict,
1163
+ )
1164
+ sequence_output = outputs[0]
1165
+ logits = self.classifier(sequence_output)
1166
+
1167
+ loss = None
1168
+ if labels is not None:
1169
+ if self.config.problem_type is None:
1170
+ if self.num_labels == 1:
1171
+ self.config.problem_type = "regression"
1172
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1173
+ self.config.problem_type = "single_label_classification"
1174
+ else:
1175
+ self.config.problem_type = "multi_label_classification"
1176
+
1177
+ if self.config.problem_type == "regression":
1178
+ loss_fct = MSELoss()
1179
+ if self.num_labels == 1:
1180
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1181
+ else:
1182
+ loss = loss_fct(logits, labels)
1183
+ elif self.config.problem_type == "single_label_classification":
1184
+ loss_fct = CrossEntropyLoss()
1185
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1186
+ elif self.config.problem_type == "multi_label_classification":
1187
+ loss_fct = BCEWithLogitsLoss()
1188
+ loss = loss_fct(logits, labels)
1189
+
1190
+ if not return_dict:
1191
+ output = (logits,) + outputs[2:]
1192
+ return ((loss,) + output) if loss is not None else output
1193
+
1194
+ return SequenceClassifierOutput(
1195
+ loss=loss,
1196
+ logits=logits,
1197
+ hidden_states=outputs.hidden_states,
1198
+ attentions=outputs.attentions,
1199
+ )
1200
+
1201
+
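The branching on `config.problem_type` above picks the loss from the label dtype and `num_labels`. A small sketch of that decision rule outside the model, with dummy tensors only:

```python
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

def pick_loss(logits: torch.Tensor, labels: torch.Tensor, num_labels: int):
    if num_labels == 1:
        # regression: one score per example
        return MSELoss()(logits.squeeze(), labels.squeeze().float())
    if labels.dtype in (torch.long, torch.int):
        # single-label classification: one class index per example
        return CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))
    # multi-label classification: a float target per class
    return BCEWithLogitsLoss()(logits, labels)

logits = torch.randn(4, 3)
print(pick_loss(logits, torch.randint(0, 3, (4,)), num_labels=3))
```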
1202
+ @add_start_docstrings(
1203
+ """
1204
+ XLM-RoBERTa-XL Model with a multiple choice classification head on top (a linear layer on top of the pooled
1205
+ output and a softmax) e.g. for RocStories/SWAG tasks.
1206
+ """,
1207
+ XLM_ROBERTA_XL_START_DOCSTRING,
1208
+ )
1209
+ class XLMRobertaXLForMultipleChoice(XLMRobertaXLPreTrainedModel):
1210
+ def __init__(self, config):
1211
+ super().__init__(config)
1212
+
1213
+ self.roberta = XLMRobertaXLModel(config)
1214
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
1215
+ self.classifier = nn.Linear(config.hidden_size, 1)
1216
+
1217
+ self.init_weights()
1218
+
1219
+ @add_start_docstrings_to_model_forward(
1220
+ XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")
1221
+ )
1222
+ @add_code_sample_docstrings(
1223
+ checkpoint=_CHECKPOINT_FOR_DOC,
1224
+ output_type=MultipleChoiceModelOutput,
1225
+ config_class=_CONFIG_FOR_DOC,
1226
+ )
1227
+ def forward(
1228
+ self,
1229
+ input_ids: Optional[torch.LongTensor] = None,
1230
+ token_type_ids: Optional[torch.LongTensor] = None,
1231
+ attention_mask: Optional[torch.FloatTensor] = None,
1232
+ labels: Optional[torch.LongTensor] = None,
1233
+ position_ids: Optional[torch.LongTensor] = None,
1234
+ head_mask: Optional[torch.FloatTensor] = None,
1235
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1236
+ output_attentions: Optional[bool] = None,
1237
+ output_hidden_states: Optional[bool] = None,
1238
+ return_dict: Optional[bool] = None,
1239
+ ) -> Union[Tuple, MultipleChoiceModelOutput]:
1240
+ r"""
1241
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1242
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1243
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1244
+ `input_ids` above)
1245
+ """
1246
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1247
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1248
+
1249
+ flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1250
+ flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1251
+ flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1252
+ flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1253
+ flat_inputs_embeds = (
1254
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1255
+ if inputs_embeds is not None
1256
+ else None
1257
+ )
1258
+
1259
+ outputs = self.roberta(
1260
+ flat_input_ids,
1261
+ position_ids=flat_position_ids,
1262
+ token_type_ids=flat_token_type_ids,
1263
+ attention_mask=flat_attention_mask,
1264
+ head_mask=head_mask,
1265
+ inputs_embeds=flat_inputs_embeds,
1266
+ output_attentions=output_attentions,
1267
+ output_hidden_states=output_hidden_states,
1268
+ return_dict=return_dict,
1269
+ )
1270
+ pooled_output = outputs[1]
1271
+
1272
+ pooled_output = self.dropout(pooled_output)
1273
+ logits = self.classifier(pooled_output)
1274
+ reshaped_logits = logits.view(-1, num_choices)
1275
+
1276
+ loss = None
1277
+ if labels is not None:
1278
+ loss_fct = CrossEntropyLoss()
1279
+ loss = loss_fct(reshaped_logits, labels)
1280
+
1281
+ if not return_dict:
1282
+ output = (reshaped_logits,) + outputs[2:]
1283
+ return ((loss,) + output) if loss is not None else output
1284
+
1285
+ return MultipleChoiceModelOutput(
1286
+ loss=loss,
1287
+ logits=reshaped_logits,
1288
+ hidden_states=outputs.hidden_states,
1289
+ attentions=outputs.attentions,
1290
+ )
1291
+
1292
+
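The multiple-choice head works by flattening the `(batch, num_choices, seq_len)` inputs into `(batch * num_choices, seq_len)`, scoring each choice with a single linear unit, and reshaping back. A shape-only sketch with dummy tensors:

```python
import torch

batch_size, num_choices, seq_len, hidden = 2, 4, 8, 16
input_ids = torch.randint(0, 100, (batch_size, num_choices, seq_len))

# flatten choices into the batch dimension before running the encoder
flat_input_ids = input_ids.view(-1, input_ids.size(-1))   # (8, 8)

# pretend pooled encoder output: one vector per (example, choice)
pooled_output = torch.randn(batch_size * num_choices, hidden)
classifier = torch.nn.Linear(hidden, 1)

logits = classifier(pooled_output)                         # (8, 1)
reshaped_logits = logits.view(-1, num_choices)             # (2, 4)
predicted_choice = reshaped_logits.argmax(dim=-1)
print(flat_input_ids.shape, reshaped_logits.shape, predicted_choice)
```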
1293
+ class LayerwiseAttention(torch.nn.Module):
1294
+ def __init__(
1295
+ self,
1296
+ num_hidden_layers: int,
1297
+ layer_norm: bool = False,
1298
+ layer_weights: Optional[List[int]] = None,
1299
+ dropout: Optional[float] = None,
1300
+ layer_transformation: str = "softmax",
1301
+ ) -> None:
1302
+ super(LayerwiseAttention, self).__init__()
1303
+ self.num_layers = num_hidden_layers + 1
1304
+ self.layer_norm = layer_norm
1305
+ self.dropout = dropout
1306
+
1307
+ self.transform_fn = torch.softmax
1308
+ if layer_transformation == "sparsemax":
1309
+ from entmax import sparsemax
1310
+
1311
+ self.transform_fn = sparsemax
1312
+
1313
+ if layer_weights is None:
1314
+ layer_weights = [0.0] * self.num_layers
1315
+ elif len(layer_weights) != self.num_layers:
1316
+ raise Exception(
+ "Length of layer_weights {} differs from num_layers {}".format(layer_weights, self.num_layers)
+ )
1322
+ self.gam = Parameter(torch.FloatTensor([1.0]), requires_grad=True)
1323
+ self.scalar_parameters = ParameterList(
1324
+ [
1325
+ Parameter(
1326
+ torch.FloatTensor([layer_weights[i]]),
1327
+ requires_grad=True,
1328
+ )
1329
+ for i in range(self.num_layers)
1330
+ ]
1331
+ )
+
+ if self.dropout:
1336
+ dropout_mask = torch.zeros(len(self.scalar_parameters))
1337
+ dropout_fill = torch.empty(len(self.scalar_parameters)).fill_(-1e20)
1338
+ self.register_buffer("dropout_mask", dropout_mask)
1339
+ self.register_buffer("dropout_fill", dropout_fill)
1340
+
1341
+ def forward(
1342
+ self,
1343
+ tensors: List[torch.Tensor], # pylint: disable=arguments-differ
1344
+ mask: torch.Tensor = None,
1345
+ ) -> torch.Tensor:
1346
+ if len(tensors) != self.num_layers:
1347
+ raise Exception(
+ "{} tensors were passed, but the module was initialized to mix {} tensors.".format(len(tensors), self.num_layers)
+ )
1353
+
1354
+ def _layer_norm(tensor, broadcast_mask, mask):
1355
+ tensor_masked = tensor * broadcast_mask
1356
+ batch_size, _, input_dim = tensors[0].size()
1357
+
1358
+ # mean for each sentence
1359
+ num_elements_not_masked = mask.sum(1) * input_dim
1360
+ mean = tensor_masked.view(batch_size, -1).sum(1)
1361
+ mean = (mean / num_elements_not_masked).view(batch_size, 1, 1)
1362
+
1363
+ variance = (((tensor_masked - mean) * broadcast_mask) ** 2).view(
1364
+ batch_size, -1
1365
+ ).sum(1) / num_elements_not_masked
1366
+ normalized_tensor = (tensor - mean) / torch.sqrt(variance + 1e-12).view(
1367
+ batch_size, 1, 1
1368
+ )
1369
+ return normalized_tensor
1370
+
1371
+ # BUG: Pytorch bug fix when Parameters are not well copied across GPUs
1372
+ # https://github.com/pytorch/pytorch/issues/36035
1373
+ if len([parameter for parameter in self.scalar_parameters]) != self.num_layers:
1374
+ weights = torch.tensor(self.weights, device=tensors[0].device)
1375
+ gamma = torch.tensor(self.gam, device=tensors[0].device)
1376
+ else:
1377
+ weights = torch.cat([parameter for parameter in self.scalar_parameters])
1378
+ gamma = self.gam
1379
+
1380
+ if self.training and self.dropout:
1381
+ weights = torch.where(
1382
+ self.dropout_mask.uniform_() > self.dropout, weights, self.dropout_fill
1383
+ )
1384
+
1385
+ normed_weights = self.transform_fn(weights, dim=0)
1386
+ normed_weights = torch.split(normed_weights, split_size_or_sections=1)
1387
+
1388
+ if not self.layer_norm:
1389
+ pieces = []
1390
+ for weight, tensor in zip(normed_weights, tensors):
1391
+ pieces.append(weight * tensor)
1392
+ return gamma * sum(pieces)
1393
+
1394
+ else:
1395
+ mask_float = mask.float()
1396
+ broadcast_mask = mask_float.unsqueeze(-1)
1397
+
1398
+ pieces = []
1399
+ for weight, tensor in zip(normed_weights, tensors):
1400
+ pieces.append(weight * _layer_norm(tensor, broadcast_mask, mask_float))
1401
+ return gamma * sum(pieces)
1402
+
1403
+
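`LayerwiseAttention` is essentially an ELMo/COMET-style scalar mix: a softmax over one learnable weight per layer, applied to the stack of hidden states and scaled by a global gamma. A minimal standalone sketch of that mix, without masking, dropout, or layer norm:

```python
import torch

num_layers, batch_size, seq_len, hidden = 25, 2, 7, 16

# one hidden-state tensor per layer (embeddings + each transformer layer)
tensors = [torch.randn(batch_size, seq_len, hidden) for _ in range(num_layers)]

scalar_parameters = torch.zeros(num_layers, requires_grad=True)  # learnable, initialized to 0
gamma = torch.ones(1, requires_grad=True)

normed_weights = torch.softmax(scalar_parameters, dim=0)
mixed = gamma * sum(w * t for w, t in zip(normed_weights, tensors))
print(mixed.shape)  # torch.Size([2, 7, 16])
```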
1404
+ class FeedForward(nn.Module):
1405
+ """Feed Forward Neural Network.
1406
+
1407
+ Args:
1408
+ in_dim (int): Number of input features.
1409
+ out_dim (int): Number of output features. Default is just a score.
1410
+ hidden_sizes (List[int]): List with hidden layer sizes. Defaults to [3072,1024]
1411
+ activations (str): Name of the activation function to be used in the hidden
1412
+ layers. Defaults to 'Tanh'.
1413
+ final_activation (Optional[str]): Final activation if any.
1414
+ dropout (float): dropout to be used in the hidden layers.
1415
+ """
1416
+
1417
+ def __init__(
1418
+ self,
1419
+ in_dim: int = 1024,
1420
+ out_dim: int = 1,
1421
+ hidden_sizes: List[int] = [3072, 1024],
1422
+ activations: str = "Tanh",
1423
+ final_activation: Optional[str] = None,
1424
+ dropout: float = 0.0,
1425
+ ) -> None:
1426
+ super().__init__()
1427
+ modules = []
1428
+ modules.append(nn.Linear(in_dim, hidden_sizes[0]))
1429
+ modules.append(self.build_activation(activations))
1430
+ modules.append(nn.Dropout(dropout))
1431
+
1432
+ for i in range(1, len(hidden_sizes)):
1433
+ modules.append(nn.Linear(hidden_sizes[i - 1], hidden_sizes[i]))
1434
+ modules.append(self.build_activation(activations))
1435
+ modules.append(nn.Dropout(dropout))
1436
+
1437
+ modules.append(nn.Linear(hidden_sizes[-1], int(out_dim)))
1438
+ if final_activation is not None:
1439
+ modules.append(self.build_activation(final_activation))
1440
+
1441
+ self.ff = nn.Sequential(*modules)
1442
+
1443
+ def build_activation(self, activation: str) -> nn.Module:
1444
+ if hasattr(nn, activation.title()):
1445
+ return getattr(nn, activation.title())()
1446
+ else:
1447
+ raise Exception(f"{activation} is not a valid activation function!")
1448
+
1449
+ def forward(self, in_features: torch.Tensor) -> torch.Tensor:
1450
+ return self.ff(in_features)
1451
+
1452
+
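With the defaults above, the estimator MLP is simply Linear(1024→3072) → Tanh → Dropout → Linear(3072→1024) → Tanh → Dropout → Linear(1024→1). A sketch of the equivalent `nn.Sequential` built by hand:

```python
import torch.nn as nn

estimator = nn.Sequential(
    nn.Linear(1024, 3072), nn.Tanh(), nn.Dropout(0.0),
    nn.Linear(3072, 1024), nn.Tanh(), nn.Dropout(0.0),
    nn.Linear(1024, 1),
)
print(estimator)
```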
1453
+ @add_start_docstrings(
1454
+ """
1455
+ XLM-RoBERTa-XL Model with a COMET-style estimation head on top (a layerwise attention over all hidden states
+ followed by a feed-forward regressor on the `<s>` token) e.g. for sentence-level quality estimation.
1457
+ """,
1458
+ XLM_ROBERTA_XL_START_DOCSTRING,
1459
+ )
1460
+ class XLMRobertaXLForEstimation(XLMRobertaXLPreTrainedModel):
1461
+ def __init__(self, config):
1462
+ super().__init__(config)
1463
+
1464
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
1465
+ self.layerwise_attention = LayerwiseAttention(
1466
+ layer_transformation=config.layer_transformation,
1467
+ num_hidden_layers=config.num_hidden_layers,
1468
+ dropout=config.dropout,
1469
+ layer_norm=config.layer_norm
1470
+ )
1471
+
1472
+ self.estimator = FeedForward(
1473
+ in_dim=config.hidden_size,
1474
+ hidden_sizes=config.estimator_sizes,
1475
+ )
1476
+
1477
+ self.init_weights()
1478
+
1479
+ def forward(
1480
+ self,
1481
+ input_ids: Optional[torch.LongTensor] = None,
1482
+ token_type_ids: Optional[torch.LongTensor] = None,
1483
+ attention_mask: Optional[torch.FloatTensor] = None,
1484
+ labels: Optional[torch.LongTensor] = None,
1485
+ position_ids: Optional[torch.LongTensor] = None,
1486
+ head_mask: Optional[torch.FloatTensor] = None,
1487
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1488
+ output_attentions: Optional[bool] = None,
1489
+ output_hidden_states: Optional[bool] = None,
1490
+ return_dict: Optional[bool] = None,
1491
+ ) -> Union[Tuple, MultipleChoiceModelOutput]:
1492
+ r"""
1493
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+ Labels for computing the estimation loss on the sentence-level score predicted from the first (`<s>`)
+ token representation.
1497
+ """
1498
+ return_dict = False
1499
+ output_hidden_states = True
1500
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1501
+
1502
+ flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1503
+ flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1504
+ flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1505
+ flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1506
+ flat_inputs_embeds = (
1507
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1508
+ if inputs_embeds is not None
1509
+ else None
1510
+ )
1511
+
1512
+ outputs = self.roberta(
1513
+ flat_input_ids,
1514
+ position_ids=flat_position_ids,
1515
+ token_type_ids=flat_token_type_ids,
1516
+ attention_mask=flat_attention_mask,
1517
+ head_mask=head_mask,
1518
+ inputs_embeds=flat_inputs_embeds,
1519
+ output_attentions=output_attentions,
1520
+ output_hidden_states=output_hidden_states,
1521
+ return_dict=return_dict,
1522
+ )
1523
+
1524
+ if self.layerwise_attention:
1525
+ embeddings = self.layerwise_attention(
1526
+ outputs[2], attention_mask
1527
+ )
1528
+ else:
1529
+ embeddings = outputs[0]
1530
+
1531
+ CLS_tok = embeddings[:, 0, :]  # sentence-level representation: the first (<s>/CLS) token, following COMET
1532
+
1533
+ logits = self.estimator(CLS_tok)
1534
+ reshaped_logits = logits  # no reshaping needed: the estimator already returns one score per example
1535
+
1536
+ loss = None
1537
+ if labels is not None:
1538
+ loss_fct = CrossEntropyLoss()
1539
+ loss = loss_fct(reshaped_logits, labels)
1540
+
1541
+ if not return_dict:
1542
+ output = (reshaped_logits,) + outputs[2:]
1543
+ return ((loss,) + output) if loss is not None else output
1544
+
1545
+ return MultipleChoiceModelOutput(
1546
+ loss=loss,
1547
+ logits=reshaped_logits,
1548
+ hidden_states=outputs.hidden_states,
1549
+ attentions=outputs.attentions,
1550
+ )
1551
+
1552
+
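Putting the pieces together, the estimation forward pass is: all hidden states → layerwise scalar mix → first (`<s>`) token → feed-forward → one score per sentence. A dummy-tensor sketch of that data flow (the sizes are placeholders, not the real XL config, and this is an illustration of the shapes rather than a way to load the trained checkpoint):

```python
import torch
import torch.nn as nn

num_layers, batch_size, seq_len, hidden = 13, 2, 10, 1024   # dummy sizes

hidden_states = [torch.randn(batch_size, seq_len, hidden) for _ in range(num_layers)]

# layerwise scalar mix over all hidden states (see LayerwiseAttention above)
weights = torch.softmax(torch.zeros(num_layers), dim=0)
mixed = sum(w * h for w, h in zip(weights, hidden_states))   # (2, 10, 1024)

cls_tok = mixed[:, 0, :]                                     # (2, 1024): the <s> token

# feed-forward estimator producing one score per sentence (see FeedForward above)
estimator = nn.Sequential(
    nn.Linear(hidden, 3072), nn.Tanh(),
    nn.Linear(3072, 1024), nn.Tanh(),
    nn.Linear(1024, 1),
)
score = estimator(cls_tok)                                   # (2, 1)
print(score.shape)
```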
1553
+ @add_start_docstrings(
1554
+ """
1555
+ XLM-RoBERTa-XL Model with a token classification head on top (a linear layer on top of the hidden-states
1556
+ output) e.g. for Named-Entity-Recognition (NER) tasks.
1557
+ """,
1558
+ XLM_ROBERTA_XL_START_DOCSTRING,
1559
+ )
1560
+ class XLMRobertaXLForTokenClassification(XLMRobertaXLPreTrainedModel):
1561
+ def __init__(self, config):
1562
+ super().__init__(config)
1563
+ self.num_labels = config.num_labels
1564
+
1565
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
1566
+ classifier_dropout = (
1567
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1568
+ )
1569
+ self.dropout = nn.Dropout(classifier_dropout)
1570
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1571
+
1572
+ self.init_weights()
1573
+
1574
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1575
+ @add_code_sample_docstrings(
1576
+ checkpoint=_CHECKPOINT_FOR_DOC,
1577
+ output_type=TokenClassifierOutput,
1578
+ config_class=_CONFIG_FOR_DOC,
1579
+ )
1580
+ def forward(
1581
+ self,
1582
+ input_ids: Optional[torch.LongTensor] = None,
1583
+ attention_mask: Optional[torch.FloatTensor] = None,
1584
+ token_type_ids: Optional[torch.LongTensor] = None,
1585
+ position_ids: Optional[torch.LongTensor] = None,
1586
+ head_mask: Optional[torch.FloatTensor] = None,
1587
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1588
+ labels: Optional[torch.LongTensor] = None,
1589
+ output_attentions: Optional[bool] = None,
1590
+ output_hidden_states: Optional[bool] = None,
1591
+ return_dict: Optional[bool] = None,
1592
+ ) -> Union[Tuple, TokenClassifierOutput]:
1593
+ r"""
1594
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1595
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1596
+ """
1597
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1598
+
1599
+ outputs = self.roberta(
1600
+ input_ids,
1601
+ attention_mask=attention_mask,
1602
+ token_type_ids=token_type_ids,
1603
+ position_ids=position_ids,
1604
+ head_mask=head_mask,
1605
+ inputs_embeds=inputs_embeds,
1606
+ output_attentions=output_attentions,
1607
+ output_hidden_states=output_hidden_states,
1608
+ return_dict=return_dict,
1609
+ )
1610
+
1611
+ sequence_output = outputs[0]
1612
+
1613
+ sequence_output = self.dropout(sequence_output)
1614
+ logits = self.classifier(sequence_output)
1615
+
1616
+ loss = None
1617
+ if labels is not None:
1618
+ loss_fct = CrossEntropyLoss()
1619
+ # Only keep active parts of the loss
1620
+ if attention_mask is not None:
1621
+ active_loss = attention_mask.view(-1) == 1
1622
+ active_logits = logits.view(-1, self.num_labels)
1623
+ active_labels = torch.where(
1624
+ active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
1625
+ )
1626
+ loss = loss_fct(active_logits, active_labels)
1627
+ else:
1628
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1629
+
1630
+ if not return_dict:
1631
+ output = (logits,) + outputs[2:]
1632
+ return ((loss,) + output) if loss is not None else output
1633
+
1634
+ return TokenClassifierOutput(
1635
+ loss=loss,
1636
+ logits=logits,
1637
+ hidden_states=outputs.hidden_states,
1638
+ attentions=outputs.attentions,
1639
+ )
1640
+
1641
+
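The token-classification loss above only counts positions where `attention_mask == 1`; padded positions are rewritten to the loss's `ignore_index`. A dummy-tensor sketch of that masking:

```python
import torch
from torch.nn import CrossEntropyLoss

num_labels, seq_len = 3, 5
logits = torch.randn(1, seq_len, num_labels)
labels = torch.tensor([[0, 2, 1, 1, 1]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])  # last two tokens are padding

loss_fct = CrossEntropyLoss()
active_loss = attention_mask.view(-1) == 1
active_labels = torch.where(
    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
)
loss = loss_fct(logits.view(-1, num_labels), active_labels)
print(active_labels)  # tensor([   0,    2,    1, -100, -100])
print(loss)
```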
1642
+ class XLMRobertaXLClassificationHead(nn.Module):
1643
+ """Head for sentence-level classification tasks."""
1644
+
1645
+ def __init__(self, config):
1646
+ super().__init__()
1647
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
1648
+ classifier_dropout = (
1649
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1650
+ )
1651
+ self.dropout = nn.Dropout(classifier_dropout)
1652
+ self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
1653
+
1654
+ def forward(self, features, **kwargs):
1655
+ x = features[:, 0, :] # take <s> token (equiv. to [CLS])
1656
+ x = self.dropout(x)
1657
+ x = self.dense(x)
1658
+ x = torch.tanh(x)
1659
+ x = self.dropout(x)
1660
+ x = self.out_proj(x)
1661
+ return x
1662
+
1663
+
1664
+ @add_start_docstrings(
1665
+ """
1666
+ XLM-RoBERTa-XL Model with a span classification head on top for extractive question-answering tasks like SQuAD
1667
+ (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
1668
+ """,
1669
+ XLM_ROBERTA_XL_START_DOCSTRING,
1670
+ )
1671
+ class XLMRobertaXLForQuestionAnswering(XLMRobertaXLPreTrainedModel):
1672
+ def __init__(self, config):
1673
+ super().__init__(config)
1674
+ self.num_labels = config.num_labels
1675
+
1676
+ self.roberta = XLMRobertaXLModel(config, add_pooling_layer=False)
1677
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1678
+
1679
+ self.init_weights()
1680
+
1681
+ @add_start_docstrings_to_model_forward(XLM_ROBERTA_XL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1682
+ @add_code_sample_docstrings(
1683
+ checkpoint=_CHECKPOINT_FOR_DOC,
1684
+ output_type=QuestionAnsweringModelOutput,
1685
+ config_class=_CONFIG_FOR_DOC,
1686
+ )
1687
+ def forward(
1688
+ self,
1689
+ input_ids: Optional[torch.LongTensor] = None,
1690
+ attention_mask: Optional[torch.FloatTensor] = None,
1691
+ token_type_ids: Optional[torch.LongTensor] = None,
1692
+ position_ids: Optional[torch.LongTensor] = None,
1693
+ head_mask: Optional[torch.FloatTensor] = None,
1694
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1695
+ start_positions: Optional[torch.LongTensor] = None,
1696
+ end_positions: Optional[torch.LongTensor] = None,
1697
+ output_attentions: Optional[bool] = None,
1698
+ output_hidden_states: Optional[bool] = None,
1699
+ return_dict: Optional[bool] = None,
1700
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1701
+ r"""
1702
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1703
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1704
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1705
+ are not taken into account for computing the loss.
1706
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1707
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1708
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1709
+ are not taken into account for computing the loss.
1710
+ """
1711
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1712
+
1713
+ outputs = self.roberta(
1714
+ input_ids,
1715
+ attention_mask=attention_mask,
1716
+ token_type_ids=token_type_ids,
1717
+ position_ids=position_ids,
1718
+ head_mask=head_mask,
1719
+ inputs_embeds=inputs_embeds,
1720
+ output_attentions=output_attentions,
1721
+ output_hidden_states=output_hidden_states,
1722
+ return_dict=return_dict,
1723
+ )
1724
+
1725
+ sequence_output = outputs[0]
1726
+
1727
+ logits = self.qa_outputs(sequence_output)
1728
+ start_logits, end_logits = logits.split(1, dim=-1)
1729
+ start_logits = start_logits.squeeze(-1).contiguous()
1730
+ end_logits = end_logits.squeeze(-1).contiguous()
1731
+
1732
+ total_loss = None
1733
+ if start_positions is not None and end_positions is not None:
1734
+ # If we are on multi-GPU, split add a dimension
1735
+ if len(start_positions.size()) > 1:
1736
+ start_positions = start_positions.squeeze(-1)
1737
+ if len(end_positions.size()) > 1:
1738
+ end_positions = end_positions.squeeze(-1)
1739
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
1740
+ ignored_index = start_logits.size(1)
1741
+ start_positions = start_positions.clamp(0, ignored_index)
1742
+ end_positions = end_positions.clamp(0, ignored_index)
1743
+
1744
+ loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
1745
+ start_loss = loss_fct(start_logits, start_positions)
1746
+ end_loss = loss_fct(end_logits, end_positions)
1747
+ total_loss = (start_loss + end_loss) / 2
1748
+
1749
+ if not return_dict:
1750
+ output = (start_logits, end_logits) + outputs[2:]
1751
+ return ((total_loss,) + output) if total_loss is not None else output
1752
+
1753
+ return QuestionAnsweringModelOutput(
1754
+ loss=total_loss,
1755
+ start_logits=start_logits,
1756
+ end_logits=end_logits,
1757
+ hidden_states=outputs.hidden_states,
1758
+ attentions=outputs.attentions,
1759
+ )
1760
+
1761
+
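At inference time the two logit vectors are typically turned into an answer span by taking the argmax of the start and end scores. A dummy-tensor sketch of that greedy decoding (no validity checks such as start ≤ end or a maximum answer length, which real QA pipelines add):

```python
import torch

seq_len = 8
start_logits = torch.randn(1, seq_len)
end_logits = torch.randn(1, seq_len)

start_index = int(start_logits.argmax(dim=-1))
end_index = int(end_logits.argmax(dim=-1))

# the predicted answer is the token span [start_index, end_index] of the input
print(start_index, end_index)
```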
1762
+ # Copied from transformers.models.roberta.modeling_roberta.create_position_ids_from_input_ids
1763
+ def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
1764
+ """
1765
+ Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
1766
+ are ignored. This is modified from fairseq's `utils.make_positions`.
1767
+
1768
+ Args:
1769
+ input_ids: torch.Tensor
1770
+
1771
+ Returns: torch.Tensor
1772
+ """
1773
+ # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
1774
+ mask = input_ids.ne(padding_idx).int()
1775
+ incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
1776
+ return incremental_indices.long() + padding_idx
1777
+
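A tiny worked example of `create_position_ids_from_input_ids` (the function is repeated here only to make the snippet self-contained): with `padding_idx = 1`, positions of real tokens start at 2 and padding keeps the padding index, which is the RoBERTa/XLM-R convention.

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx

input_ids = torch.tensor([[0, 42, 77, 2, 1, 1]])  # 1 is the <pad> id
print(create_position_ids_from_input_ids(input_ids, padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])
```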
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e1bafc1984c123fcc153e970961d014fecb3026d731458ddecd2a24eae85c46
+ size 6971794694
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "sp_model_kwargs": {}, "special_tokens_map_file": "./special_tokens_map.json", "name_or_path": "./", "tokenizer_class": "XLMRobertaTokenizer"}