Commit 786364e (1 parent: a48cb8c), committed by pszemraj and autoevaluator (HF staff)

Add verifyToken field to verify evaluation results are produced by Hugging Face's automatic model evaluator (#6)


- Add verifyToken field to verify evaluation results are produced by Hugging Face's automatic model evaluator (8484d1e2edd0e5e3e7b308363f30ad86931692bd)


Co-authored-by: Evaluation Bot <autoevaluator@users.noreply.huggingface.co>
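
The `verifyToken` values added in this commit are JSON Web Tokens: the shared `eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.` prefix is the base64url encoding of an `{"alg": "EdDSA", "typ": "JWT"}` header. As a minimal sketch (not part of this commit), the claims of any token in the diff below can be inspected with the Python standard library; this only decodes the payload, it does not verify the EdDSA signature, which would additionally require the evaluator's public key and a JWT library.

```python
import base64
import json

def decode_segment(segment: str) -> dict:
    """Base64url-decode one JWT segment (header or payload) without checking the signature."""
    padded = segment + "=" * (-len(segment) % 4)  # JWTs strip base64 padding; restore it
    return json.loads(base64.urlsafe_b64decode(padded))

# First verifyToken from the diff below (ROUGE-1 on kmfoda/booksum).
token = (
    "eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9."
    "eyJoYXNoIjoiYzk3NmI2ODg0MDM3MzY3ZjMyYzhmNTYyZjBmNTJlM2M3MjZjMzI0YzMxNmRmODhhMzI2MDMzMzMzMmJhMGIyMCIsInZlcnNpb24iOjF9."
    "gM1ClaQdlrDE9q3CGF164WhhlTpg8Ym1cpvN1RARK8FGKDSR37EWmgdg-PSSHgB_l9NuvZ3BgoC7hKxfpcnKCQ"
)
header_b64, payload_b64, _signature = token.split(".")
print(decode_segment(header_b64))   # {'alg': 'EdDSA', 'typ': 'JWT'}
print(decode_segment(payload_b64))  # a 'hash' hex string identifying the result, plus 'version': 1
```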

Files changed (1)
  1. README.md +125 -102
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  language:
  - en
+ license: apache-2.0
  tags:
  - summarization
  - summarisation
@@ -9,7 +10,6 @@ tags:
  - bigbird_pegasus_
  - pegasus
  - bigbird
- license: apache-2.0
  datasets:
  - kmfoda/booksum
  metrics:
@@ -28,39 +28,38 @@ widget:
  deviation of the average recurrence interval, the more specific could be the long
  term prediction of a future mainshock.
  example_title: earthquakes
- - text: " A typical feed-forward neural field algorithm. Spatiotemporal coordinates\
- \ are fed into a neural network that predicts values in the reconstructed domain.\
- \ Then, this domain is mapped to the sensor domain where sensor measurements are\
- \ available as supervision. Class and Section Problems Addressed Generalization\
- \ (Section 2) Inverse problems, ill-posed problems, editability; symmetries. Hybrid\
- \ Representations (Section 3) Computation & memory efficiency, representation\
- \ capacity, editability: Forward Maps (Section 4) Inverse problems Network Architecture\
- \ (Section 5) Spectral bias, integration & derivatives. Manipulating Neural Fields\
- \ (Section 6) Edit ability, constraints, regularization. Table 2: The five classes\
- \ of techniques in the neural field toolbox each addresses problems that arise\
- \ in learning, inference, and control. (Section 3). We can supervise reconstruction\
- \ via differentiable forward maps that transform Or project our domain (e.g, 3D\
- \ reconstruction via 2D images; Section 4) With appropriate network architecture\
- \ choices, we can overcome neural network spectral biases (blurriness) and efficiently\
- \ compute derivatives and integrals (Section 5). Finally, we can manipulate neural\
- \ fields to add constraints and regularizations, and to achieve editable representations\
- \ (Section 6). Collectively, these classes constitute a 'toolbox' of techniques\
- \ to help solve problems with neural fields There are three components in a conditional\
- \ neural field: (1) An encoder or inference function \u20AC that outputs the conditioning\
- \ latent variable 2 given an observation 0 E(0) =2. 2 is typically a low-dimensional\
- \ vector, and is often referred to aS a latent code Or feature code_ (2) A mapping\
- \ function 4 between Z and neural field parameters O: Y(z) = O; (3) The neural\
- \ field itself $. The encoder \u20AC finds the most probable z given the observations\
- \ O: argmaxz P(2/0). The decoder maximizes the inverse conditional probability\
- \ to find the most probable 0 given Z: arg- max P(Olz). We discuss different encoding\
- \ schemes with different optimality guarantees (Section 2.1.1), both global and\
- \ local conditioning (Section 2.1.2), and different mapping functions Y (Section\
- \ 2.1.3) 2. Generalization Suppose we wish to estimate a plausible 3D surface\
- \ shape given a partial or noisy point cloud. We need a suitable prior over the\
- \ sur- face in its reconstruction domain to generalize to the partial observations.\
- \ A neural network expresses a prior via the function space of its architecture\
- \ and parameters 0, and generalization is influenced by the inductive bias of\
- \ this function space (Section 5)."
+ - text: ' A typical feed-forward neural field algorithm. Spatiotemporal coordinates
+ are fed into a neural network that predicts values in the reconstructed domain.
+ Then, this domain is mapped to the sensor domain where sensor measurements are
+ available as supervision. Class and Section Problems Addressed Generalization
+ (Section 2) Inverse problems, ill-posed problems, editability; symmetries. Hybrid
+ Representations (Section 3) Computation & memory efficiency, representation capacity,
+ editability: Forward Maps (Section 4) Inverse problems Network Architecture (Section
+ 5) Spectral bias, integration & derivatives. Manipulating Neural Fields (Section
+ 6) Edit ability, constraints, regularization. Table 2: The five classes of techniques
+ in the neural field toolbox each addresses problems that arise in learning, inference,
+ and control. (Section 3). We can supervise reconstruction via differentiable forward
+ maps that transform Or project our domain (e.g, 3D reconstruction via 2D images;
+ Section 4) With appropriate network architecture choices, we can overcome neural
+ network spectral biases (blurriness) and efficiently compute derivatives and integrals
+ (Section 5). Finally, we can manipulate neural fields to add constraints and regularizations,
+ and to achieve editable representations (Section 6). Collectively, these classes
+ constitute a ''toolbox'' of techniques to help solve problems with neural fields
+ There are three components in a conditional neural field: (1) An encoder or inference
+ function that outputs the conditioning latent variable 2 given an observation
+ 0 E(0) =2. 2 is typically a low-dimensional vector, and is often referred to aS
+ a latent code Or feature code_ (2) A mapping function 4 between Z and neural field
+ parameters O: Y(z) = O; (3) The neural field itself $. The encoder € finds the
+ most probable z given the observations O: argmaxz P(2/0). The decoder maximizes
+ the inverse conditional probability to find the most probable 0 given Z: arg-
+ max P(Olz). We discuss different encoding schemes with different optimality guarantees
+ (Section 2.1.1), both global and local conditioning (Section 2.1.2), and different
+ mapping functions Y (Section 2.1.3) 2. Generalization Suppose we wish to estimate
+ a plausible 3D surface shape given a partial or noisy point cloud. We need a suitable
+ prior over the sur- face in its reconstruction domain to generalize to the partial
+ observations. A neural network expresses a prior via the function space of its
+ architecture and parameters 0, and generalization is influenced by the inductive
+ bias of this function space (Section 5).'
  example_title: scientific paper
  - text: ' the big variety of data coming from diverse sources is one of the key properties
  of the big data phenomenon. It is, therefore, beneficial to understand how data
@@ -105,50 +104,62 @@ widget:
  in their business An important area of data analytics on the edge of corporate
  IT and the Internet is Web Analytics.'
  example_title: data science textbook
- - text: "Transformer-based models have shown to be very useful for many NLP tasks.\
- \ However, a major limitation of transformers-based models is its O(n^2)O(n 2)\
- \ time & memory complexity (where nn is sequence length). Hence, it's computationally\
- \ very expensive to apply transformer-based models on long sequences n > 512n>512.\
- \ Several recent papers, e.g. Longformer, Performer, Reformer, Clustered attention\
- \ try to remedy this problem by approximating the full attention matrix. You can\
- \ checkout \U0001F917's recent blog post in case you are unfamiliar with these\
- \ models.\nBigBird (introduced in paper) is one of such recent models to address\
- \ this issue. BigBird relies on block sparse attention instead of normal attention\
- \ (i.e. BERT's attention) and can handle sequences up to a length of 4096 at a\
- \ much lower computational cost compared to BERT. It has achieved SOTA on various\
- \ tasks involving very long sequences such as long documents summarization, question-answering\
- \ with long contexts.\nBigBird RoBERTa-like model is now available in \U0001F917\
- Transformers. The goal of this post is to give the reader an in-depth understanding\
- \ of big bird implementation & ease one's life in using BigBird with \U0001F917\
- Transformers. But, before going into more depth, it is important to remember that\
- \ the BigBird's attention is an approximation of BERT's full attention and therefore\
- \ does not strive to be better than BERT's full attention, but rather to be more\
- \ efficient. It simply allows to apply transformer-based models to much longer\
- \ sequences since BERT's quadratic memory requirement quickly becomes unbearable.\
- \ Simply put, if we would have \u221E compute & \u221E time, BERT's attention\
- \ would be preferred over block sparse attention (which we are going to discuss\
- \ in this post).\nIf you wonder why we need more compute when working with longer\
- \ sequences, this blog post is just right for you!\nSome of the main questions\
- \ one might have when working with standard BERT-like attention include:\nDo all\
- \ tokens really have to attend to all other tokens? Why not compute attention\
- \ only over important tokens? How to decide what tokens are important? How to\
- \ attend to just a few tokens in a very efficient way? In this blog post, we will\
- \ try to answer those questions.\nWhat tokens should be attended to? We will give\
- \ a practical example of how attention works by considering the sentence 'BigBird\
- \ is now available in HuggingFace for extractive question answering'. In BERT-like\
- \ attention, every word would simply attend to all other tokens.\nLet's think\
- \ about a sensible choice of key tokens that a queried token actually only should\
- \ attend to by writing some pseudo-code. Will will assume that the token available\
- \ is queried and build a sensible list of key tokens to attend to.\n>>> # let's\
- \ consider following sentence as an example >>> example = ['BigBird', 'is', 'now',\
- \ 'available', 'in', 'HuggingFace', 'for', 'extractive', 'question', 'answering']\n\
- >>> # further let's assume, we're trying to understand the representation of 'available'\
- \ i.e. >>> query_token = 'available' >>> # We will initialize an empty `set` and\
- \ fill up the tokens of our interest as we proceed in this section. >>> key_tokens\
- \ = [] # => currently 'available' token doesn't have anything to attend Nearby\
- \ tokens should be important because, in a sentence (sequence of words), the current\
- \ word is highly dependent on neighboring past & future tokens. This intuition\
- \ is the idea behind the concept of sliding attention."
+ - text: 'Transformer-based models have shown to be very useful for many NLP tasks.
+ However, a major limitation of transformers-based models is its O(n^2)O(n 2) time
+ & memory complexity (where nn is sequence length). Hence, it''s computationally
+ very expensive to apply transformer-based models on long sequences n > 512n>512.
+ Several recent papers, e.g. Longformer, Performer, Reformer, Clustered attention
+ try to remedy this problem by approximating the full attention matrix. You can
+ checkout 🤗''s recent blog post in case you are unfamiliar with these models.
+
+ BigBird (introduced in paper) is one of such recent models to address this issue.
+ BigBird relies on block sparse attention instead of normal attention (i.e. BERT''s
+ attention) and can handle sequences up to a length of 4096 at a much lower computational
+ cost compared to BERT. It has achieved SOTA on various tasks involving very long
+ sequences such as long documents summarization, question-answering with long contexts.
+
+ BigBird RoBERTa-like model is now available in 🤗Transformers. The goal of this
+ post is to give the reader an in-depth understanding of big bird implementation
+ & ease one''s life in using BigBird with 🤗Transformers. But, before going into
+ more depth, it is important to remember that the BigBird''s attention is an approximation
+ of BERT''s full attention and therefore does not strive to be better than BERT''s
+ full attention, but rather to be more efficient. It simply allows to apply transformer-based
+ models to much longer sequences since BERT''s quadratic memory requirement quickly
+ becomes unbearable. Simply put, if we would have compute & time, BERT''s attention
+ would be preferred over block sparse attention (which we are going to discuss
+ in this post).
+
+ If you wonder why we need more compute when working with longer sequences, this
+ blog post is just right for you!
+
+ Some of the main questions one might have when working with standard BERT-like
+ attention include:
+
+ Do all tokens really have to attend to all other tokens? Why not compute attention
+ only over important tokens? How to decide what tokens are important? How to attend
+ to just a few tokens in a very efficient way? In this blog post, we will try to
+ answer those questions.
+
+ What tokens should be attended to? We will give a practical example of how attention
+ works by considering the sentence ''BigBird is now available in HuggingFace for
+ extractive question answering''. In BERT-like attention, every word would simply
+ attend to all other tokens.
+
+ Let''s think about a sensible choice of key tokens that a queried token actually
+ only should attend to by writing some pseudo-code. Will will assume that the token
+ available is queried and build a sensible list of key tokens to attend to.
+
+ >>> # let''s consider following sentence as an example >>> example = [''BigBird'',
+ ''is'', ''now'', ''available'', ''in'', ''HuggingFace'', ''for'', ''extractive'',
+ ''question'', ''answering'']
+
+ >>> # further let''s assume, we''re trying to understand the representation of
+ ''available'' i.e. >>> query_token = ''available'' >>> # We will initialize an
+ empty `set` and fill up the tokens of our interest as we proceed in this section.
+ >>> key_tokens = [] # => currently ''available'' token doesn''t have anything
+ to attend Nearby tokens should be important because, in a sentence (sequence of
+ words), the current word is highly dependent on neighboring past & future tokens.
+ This intuition is the idea behind the concept of sliding attention.'
  example_title: bigbird blog intro
  inference:
  parameters:
@@ -171,30 +182,36 @@ model-index:
  config: kmfoda--booksum
  split: test
  metrics:
- - name: ROUGE-1
- type: rouge
+ - type: rouge
  value: 34.0757
+ name: ROUGE-1
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzk3NmI2ODg0MDM3MzY3ZjMyYzhmNTYyZjBmNTJlM2M3MjZjMzI0YzMxNmRmODhhMzI2MDMzMzMzMmJhMGIyMCIsInZlcnNpb24iOjF9.gM1ClaQdlrDE9q3CGF164WhhlTpg8Ym1cpvN1RARK8FGKDSR37EWmgdg-PSSHgB_l9NuvZ3BgoC7hKxfpcnKCQ
- - name: ROUGE-2
- type: rouge
+ - type: rouge
  value: 5.9177
+ name: ROUGE-2
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzdmMGU5ODhiMjcxZTJjODk3ZWI3NjY0NWJkMDFjYWI1ZDIyN2YwMDBjODE2ODQzY2I4ZTA1NWI0MTZiZGQwYSIsInZlcnNpb24iOjF9.ZkX-5RfN9cR1y56TUJWFtMRkHRRIzh9bEApa08ClR1ybgHvsnTjhSnNaNSjpXBR4jOVV9075qV38MJpqO8U8Bg
- - name: ROUGE-L
- type: rouge
+ - type: rouge
  value: 16.3874
+ name: ROUGE-L
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMWU4ODExMjEwZjcyOWQ3NGJkYzM4NDgyMGQ2YzM5OThkNWIyMmVhMDNkNjA5OGRkM2UyMDE1MGIxZGVhMjUzZSIsInZlcnNpb24iOjF9.2pDo80GWdIAeyWZ4js7PAf_tJCsRceZTX0MoBINGsdjFBI864C1MkgB1s8aJx5Q47oZMkeFoFoAu0Vs21KF4Cg
- - name: ROUGE-LSUM
- type: rouge
+ - type: rouge
  value: 31.6118
+ name: ROUGE-LSUM
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYjY2ODJiZDg2MzI3N2M5NTU5YzIyZmQ0NzkwM2NlY2U0ZDQ5OTM0NmM5ZmI5NjUxYjA3N2IwYWViOTkxN2MxZCIsInZlcnNpb24iOjF9.9c6Spmci31HdkfXUqKyju1X-Z9HOHSSnZNgC4JDyN6csLaDWkyVwWs5xWvC0mvEnaEnigmkSX1Uy3i355ELmBw
- - name: loss
- type: loss
+ - type: loss
  value: 3.522040605545044
+ name: loss
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODAyZTFiMjUzYTIzNWI0YjQxOWNlZjdkYjcxNDY3ZjMyNTg3ZDdkOTg3YmEzMjFiYzk2NTM4ZTExZjJiZmI3MCIsInZlcnNpb24iOjF9.n-L_DOkTlkbipJWIQQA-cQqeWJ9Q_b1d2zm7RhLxSpjzXegFxJgkC25hTEhqvanGYZwzahn950ikyyxa4JevAw
- - name: gen_len
- type: gen_len
+ - type: gen_len
  value: 254.3676
+ name: gen_len
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzdlY2U1ZTgwNGUyNGM4ZGJlNDNlY2RjOWViYmFkOWE0ZjMzYTU0ZTg2NTlkN2EyMTYyMjE0NjcwOTU4NzY2NiIsInZlcnNpb24iOjF9.YnwkkcCRnZWbh48BX0fktufQk5pb0qfQvjNrIbARYx7w0PTd-6Fjn6FKwCJ1MOfyeZDI1sd6xckm_Wt8XsReAg
  - task:
  type: summarization
  name: Summarization
@@ -204,30 +221,36 @@ model-index:
  config: plain_text
  split: test
  metrics:
- - name: ROUGE-1
- type: rouge
+ - type: rouge
  value: 40.015
+ name: ROUGE-1
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzE1MGM3ZDYzMDgwZGRlZDRkYmFmZGI4ODg0N2NhMGUyYmU1YmI5Njg0MzMxNzAxZGUxYjc3NTZjYjMwZDhmOCIsInZlcnNpb24iOjF9.7-SojdX5JiNAK31FpAHfkic0S2iziZiYWHCTnb4VTjsDnrDP3xfow1BWsC1N9aNAN_Pi-7FDh_BhDMp89csoCQ
- - name: ROUGE-2
- type: rouge
+ - type: rouge
  value: 10.7406
+ name: ROUGE-2
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZjEwOTRjOTA4N2E0OGQ3OGY0OThjNjlkN2VlZDBlNTI4OGYxNDFiN2YxYTI2YjBjOTJhYWJiNGE1NzcyOWE5YyIsInZlcnNpb24iOjF9.SrMCtxOkMabMELFr5_yqG52zTKGk81oqnqczrovgsko1bGhqpR-83nE7dc8oZ_tmTsbTUF3i7cQ3Eb_8EvPhDg
- - name: ROUGE-L
- type: rouge
+ - type: rouge
  value: 20.1344
+ name: ROUGE-L
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzkxZmJkYzdmOGI3Yzc1ZDliNGY3ZjE5OWFiYmFmMTU4ZWU2ZDUyNzE0YmY3MmUyMTQyNjkyMTMwYTM2OWU2ZSIsInZlcnNpb24iOjF9.FPX3HynlHurNYlgK1jjocJHZIZ2t8OLFS_qN8skIwbzw1mGb8ST3tVebE9qeXZWY9TbNfWsGERShJH1giw2qDw
- - name: ROUGE-LSUM
- type: rouge
+ - type: rouge
  value: 36.7743
+ name: ROUGE-LSUM
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYjgxNmQ1MmEwY2VlYTAzMTVhMDBlODFjMDNlMjA4NjRiOTNkNjkxZWNiNDg4ODM1NWUwNjk1ODFkMzI3YmM5ZCIsInZlcnNpb24iOjF9.uK7C2bGmOGEWzc8D2Av_WYSqn2epqqiXXq2ybJmoHAT8GYc80jpEGTKjyhjf00lCLw-kOxeSG5Qpr_JihR5kAg
- - name: loss
- type: loss
+ - type: loss
  value: 3.8273396492004395
+ name: loss
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzI4OTcwOGYzYmM5MmM2NmViNjc4MTkyYzJlYjAwODM4ODRmZTAyZTVmMjJlY2JiYjY0YjA5OWY4NDhjOWQ0ZiIsInZlcnNpb24iOjF9.p46FdAgmW5t3KtP4kBhcoVynTQJj1abV4LqM6MQ-o--c46yMlafmtA4mgMEqsJK_CZl7Iv5SSP_n8GiVMpgmAQ
- - name: gen_len
- type: gen_len
+ - type: gen_len
  value: 228.1285
+ name: gen_len
  verified: true
+ verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODY2OGUzNDlhNzM5NzBiMmNmMDZiNjNkNDI0MDkxMzNkZDE4ZjU4OWM1NGQ5Yjk3ZjgzZjk2MDk0NWI0NGI4YiIsInZlcnNpb24iOjF9.Jb61P9-a31VBbwdOD-8ahNgf5Tpln0vjxd4uQtR7vxGu0Ovfa1T9Y8rKXBApTSigrmqBjRdsLfoAU7LqLiL6Cg
  ---