RichardErkhov committed
Commit 26a867c (1 parent: d18a64f)

uploaded readme

Files changed (1): README.md (added, +338 lines)

Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

bigbird-pegasus-large-K-booksum - bnb 8bits
- Model creator: https://huggingface.co/pszemraj/
- Original model: https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum/

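A minimal sketch of loading the bnb 8-bit weights with 🤗 Transformers and `bitsandbytes` (8-bit loading also needs `accelerate` installed). The repository id below is a placeholder, not a verified name; substitute the actual id shown on this repo's page:

```python
# minimal sketch, assuming the 8-bit weights live in this (placeholder) repo id
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "RichardErkhov/bigbird-pegasus-large-K-booksum-8bits"  # placeholder, check the repo page

model = AutoModelForSeq2SeqLM.from_pretrained(
    repo_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # bitsandbytes 8-bit
    device_map="auto",  # requires accelerate
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```
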
Original model description:
---
language:
- en
license: apache-2.0
tags:
- summarization
- summarisation
- summary
- notes
- bigbird_pegasus_
- pegasus
- bigbird
datasets:
- kmfoda/booksum
metrics:
- rouge
widget:
- text: large earthquakes along a given fault segment do not occur at random intervals
    because it takes time to accumulate the strain energy for the rupture. The rates
    at which tectonic plates move and accumulate strain at their boundaries are approximately
    uniform. Therefore, in first approximation, one may expect that large ruptures
    of the same fault segment will occur at approximately constant time intervals.
    If subsequent main shocks have different amounts of slip across the fault, then
    the recurrence time may vary, and the basic idea of periodic mainshocks must be
    modified. For great plate boundary ruptures the length and slip often vary by
    a factor of 2. Along the southern segment of the San Andreas fault the recurrence
    interval is 145 years with variations of several decades. The smaller the standard
    deviation of the average recurrence interval, the more specific could be the long
    term prediction of a future mainshock.
  example_title: earthquakes
- text: ' A typical feed-forward neural field algorithm. Spatiotemporal coordinates
    are fed into a neural network that predicts values in the reconstructed domain.
    Then, this domain is mapped to the sensor domain where sensor measurements are
    available as supervision. Class and Section Problems Addressed Generalization
    (Section 2) Inverse problems, ill-posed problems, editability; symmetries. Hybrid
    Representations (Section 3) Computation & memory efficiency, representation capacity,
    editability: Forward Maps (Section 4) Inverse problems Network Architecture (Section
    5) Spectral bias, integration & derivatives. Manipulating Neural Fields (Section
    6) Edit ability, constraints, regularization. Table 2: The five classes of techniques
    in the neural field toolbox each addresses problems that arise in learning, inference,
    and control. (Section 3). We can supervise reconstruction via differentiable forward
    maps that transform Or project our domain (e.g, 3D reconstruction via 2D images;
    Section 4) With appropriate network architecture choices, we can overcome neural
    network spectral biases (blurriness) and efficiently compute derivatives and integrals
    (Section 5). Finally, we can manipulate neural fields to add constraints and regularizations,
    and to achieve editable representations (Section 6). Collectively, these classes
    constitute a ''toolbox'' of techniques to help solve problems with neural fields
    There are three components in a conditional neural field: (1) An encoder or inference
    function € that outputs the conditioning latent variable 2 given an observation
    0 E(0) =2. 2 is typically a low-dimensional vector, and is often referred to aS
    a latent code Or feature code_ (2) A mapping function 4 between Z and neural field
    parameters O: Y(z) = O; (3) The neural field itself $. The encoder € finds the
    most probable z given the observations O: argmaxz P(2/0). The decoder maximizes
    the inverse conditional probability to find the most probable 0 given Z: arg-
    max P(Olz). We discuss different encoding schemes with different optimality guarantees
    (Section 2.1.1), both global and local conditioning (Section 2.1.2), and different
    mapping functions Y (Section 2.1.3) 2. Generalization Suppose we wish to estimate
    a plausible 3D surface shape given a partial or noisy point cloud. We need a suitable
    prior over the sur- face in its reconstruction domain to generalize to the partial
    observations. A neural network expresses a prior via the function space of its
    architecture and parameters 0, and generalization is influenced by the inductive
    bias of this function space (Section 5).'
  example_title: scientific paper
- text: ' the big variety of data coming from diverse sources is one of the key properties
    of the big data phenomenon. It is, therefore, beneficial to understand how data
    is generated in various environments and scenarios, before looking at what should
    be done with this data and how to design the best possible architecture to accomplish
    this The evolution of IT architectures, described in Chapter 2, means that the
    data is no longer processed by a few big monolith systems, but rather by a group
    of services In parallel to the processing layer, the underlying data storage has
    also changed and became more distributed This, in turn, required a significant
    paradigm shift as the traditional approach to transactions (ACID) could no longer
    be supported. On top of this, cloud computing is becoming a major approach with
    the benefits of reducing costs and providing on-demand scalability but at the
    same time introducing concerns about privacy, data ownership, etc In the meantime
    the Internet continues its exponential growth: Every day both structured and unstructured
    data is published and available for processing: To achieve competitive advantage
    companies have to relate their corporate resources to external services, e.g.
    financial markets, weather forecasts, social media, etc While several of the sites
    provide some sort of API to access the data in a more orderly fashion; countless
    sources require advanced web mining and Natural Language Processing (NLP) processing
    techniques: Advances in science push researchers to construct new instruments
    for observing the universe O conducting experiments to understand even better
    the laws of physics and other domains. Every year humans have at their disposal
    new telescopes, space probes, particle accelerators, etc These instruments generate
    huge streams of data, which need to be stored and analyzed. The constant drive
    for efficiency in the industry motivates the introduction of new automation techniques
    and process optimization: This could not be done without analyzing the precise
    data that describe these processes. As more and more human tasks are automated,
    machines provide rich data sets, which can be analyzed in real-time to drive efficiency
    to new levels. Finally, it is now evident that the growth of the Internet of Things
    is becoming a major source of data. More and more of the devices are equipped
    with significant computational power and can generate a continuous data stream
    from their sensors. In the subsequent sections of this chapter, we will look at
    the domains described above to see what they generate in terms of data sets. We
    will compare the volumes but will also look at what is characteristic and important
    from their respective points of view. 3.1 The Internet is undoubtedly the largest
    database ever created by humans. While several well described; cleaned, and structured
    data sets have been made available through this medium, most of the resources
    are of an ambiguous, unstructured, incomplete or even erroneous nature. Still,
    several examples in the areas such as opinion mining, social media analysis, e-governance,
    etc, clearly show the potential lying in these resources. Those who can successfully
    mine and interpret the Internet data can gain unique insight and competitive advantage
    in their business An important area of data analytics on the edge of corporate
    IT and the Internet is Web Analytics.'
  example_title: data science textbook
- text: 'Transformer-based models have shown to be very useful for many NLP tasks.
    However, a major limitation of transformers-based models is its O(n^2)O(n 2) time
    & memory complexity (where nn is sequence length). Hence, it''s computationally
    very expensive to apply transformer-based models on long sequences n > 512n>512.
    Several recent papers, e.g. Longformer, Performer, Reformer, Clustered attention
    try to remedy this problem by approximating the full attention matrix. You can
    checkout 🤗''s recent blog post in case you are unfamiliar with these models.

    BigBird (introduced in paper) is one of such recent models to address this issue.
    BigBird relies on block sparse attention instead of normal attention (i.e. BERT''s
    attention) and can handle sequences up to a length of 4096 at a much lower computational
    cost compared to BERT. It has achieved SOTA on various tasks involving very long
    sequences such as long documents summarization, question-answering with long contexts.

    BigBird RoBERTa-like model is now available in 🤗Transformers. The goal of this
    post is to give the reader an in-depth understanding of big bird implementation
    & ease one''s life in using BigBird with 🤗Transformers. But, before going into
    more depth, it is important to remember that the BigBird''s attention is an approximation
    of BERT''s full attention and therefore does not strive to be better than BERT''s
    full attention, but rather to be more efficient. It simply allows to apply transformer-based
    models to much longer sequences since BERT''s quadratic memory requirement quickly
    becomes unbearable. Simply put, if we would have ∞ compute & ∞ time, BERT''s attention
    would be preferred over block sparse attention (which we are going to discuss
    in this post).

    If you wonder why we need more compute when working with longer sequences, this
    blog post is just right for you!

    Some of the main questions one might have when working with standard BERT-like
    attention include:

    Do all tokens really have to attend to all other tokens? Why not compute attention
    only over important tokens? How to decide what tokens are important? How to attend
    to just a few tokens in a very efficient way? In this blog post, we will try to
    answer those questions.

    What tokens should be attended to? We will give a practical example of how attention
    works by considering the sentence ''BigBird is now available in HuggingFace for
    extractive question answering''. In BERT-like attention, every word would simply
    attend to all other tokens.

    Let''s think about a sensible choice of key tokens that a queried token actually
    only should attend to by writing some pseudo-code. Will will assume that the token
    available is queried and build a sensible list of key tokens to attend to.

    >>> # let''s consider following sentence as an example >>> example = [''BigBird'',
    ''is'', ''now'', ''available'', ''in'', ''HuggingFace'', ''for'', ''extractive'',
    ''question'', ''answering'']

    >>> # further let''s assume, we''re trying to understand the representation of
    ''available'' i.e. >>> query_token = ''available'' >>> # We will initialize an
    empty `set` and fill up the tokens of our interest as we proceed in this section.
    >>> key_tokens = [] # => currently ''available'' token doesn''t have anything
    to attend Nearby tokens should be important because, in a sentence (sequence of
    words), the current word is highly dependent on neighboring past & future tokens.
    This intuition is the idea behind the concept of sliding attention.'
  example_title: bigbird blog intro
inference:
  parameters:
    max_length: 64
    no_repeat_ngram_size: 2
    encoder_no_repeat_ngram_size: 3
    repetition_penalty: 2.4
    length_penalty: 0.5
    num_beams: 4
    early_stopping: true
model-index:
- name: pszemraj/bigbird-pegasus-large-K-booksum
  results:
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: kmfoda/booksum
      type: kmfoda/booksum
      config: kmfoda--booksum
      split: test
    metrics:
    - type: rouge
      value: 34.0757
      name: ROUGE-1
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzk3NmI2ODg0MDM3MzY3ZjMyYzhmNTYyZjBmNTJlM2M3MjZjMzI0YzMxNmRmODhhMzI2MDMzMzMzMmJhMGIyMCIsInZlcnNpb24iOjF9.gM1ClaQdlrDE9q3CGF164WhhlTpg8Ym1cpvN1RARK8FGKDSR37EWmgdg-PSSHgB_l9NuvZ3BgoC7hKxfpcnKCQ
    - type: rouge
      value: 5.9177
      name: ROUGE-2
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzdmMGU5ODhiMjcxZTJjODk3ZWI3NjY0NWJkMDFjYWI1ZDIyN2YwMDBjODE2ODQzY2I4ZTA1NWI0MTZiZGQwYSIsInZlcnNpb24iOjF9.ZkX-5RfN9cR1y56TUJWFtMRkHRRIzh9bEApa08ClR1ybgHvsnTjhSnNaNSjpXBR4jOVV9075qV38MJpqO8U8Bg
    - type: rouge
      value: 16.3874
      name: ROUGE-L
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMWU4ODExMjEwZjcyOWQ3NGJkYzM4NDgyMGQ2YzM5OThkNWIyMmVhMDNkNjA5OGRkM2UyMDE1MGIxZGVhMjUzZSIsInZlcnNpb24iOjF9.2pDo80GWdIAeyWZ4js7PAf_tJCsRceZTX0MoBINGsdjFBI864C1MkgB1s8aJx5Q47oZMkeFoFoAu0Vs21KF4Cg
    - type: rouge
      value: 31.6118
      name: ROUGE-LSUM
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYjY2ODJiZDg2MzI3N2M5NTU5YzIyZmQ0NzkwM2NlY2U0ZDQ5OTM0NmM5ZmI5NjUxYjA3N2IwYWViOTkxN2MxZCIsInZlcnNpb24iOjF9.9c6Spmci31HdkfXUqKyju1X-Z9HOHSSnZNgC4JDyN6csLaDWkyVwWs5xWvC0mvEnaEnigmkSX1Uy3i355ELmBw
    - type: loss
      value: 3.522040605545044
      name: loss
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODAyZTFiMjUzYTIzNWI0YjQxOWNlZjdkYjcxNDY3ZjMyNTg3ZDdkOTg3YmEzMjFiYzk2NTM4ZTExZjJiZmI3MCIsInZlcnNpb24iOjF9.n-L_DOkTlkbipJWIQQA-cQqeWJ9Q_b1d2zm7RhLxSpjzXegFxJgkC25hTEhqvanGYZwzahn950ikyyxa4JevAw
    - type: gen_len
      value: 254.3676
      name: gen_len
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzdlY2U1ZTgwNGUyNGM4ZGJlNDNlY2RjOWViYmFkOWE0ZjMzYTU0ZTg2NTlkN2EyMTYyMjE0NjcwOTU4NzY2NiIsInZlcnNpb24iOjF9.YnwkkcCRnZWbh48BX0fktufQk5pb0qfQvjNrIbARYx7w0PTd-6Fjn6FKwCJ1MOfyeZDI1sd6xckm_Wt8XsReAg
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: launch/gov_report
      type: launch/gov_report
      config: plain_text
      split: test
    metrics:
    - type: rouge
      value: 40.015
      name: ROUGE-1
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzE1MGM3ZDYzMDgwZGRlZDRkYmFmZGI4ODg0N2NhMGUyYmU1YmI5Njg0MzMxNzAxZGUxYjc3NTZjYjMwZDhmOCIsInZlcnNpb24iOjF9.7-SojdX5JiNAK31FpAHfkic0S2iziZiYWHCTnb4VTjsDnrDP3xfow1BWsC1N9aNAN_Pi-7FDh_BhDMp89csoCQ
    - type: rouge
      value: 10.7406
      name: ROUGE-2
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZjEwOTRjOTA4N2E0OGQ3OGY0OThjNjlkN2VlZDBlNTI4OGYxNDFiN2YxYTI2YjBjOTJhYWJiNGE1NzcyOWE5YyIsInZlcnNpb24iOjF9.SrMCtxOkMabMELFr5_yqG52zTKGk81oqnqczrovgsko1bGhqpR-83nE7dc8oZ_tmTsbTUF3i7cQ3Eb_8EvPhDg
    - type: rouge
      value: 20.1344
      name: ROUGE-L
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzkxZmJkYzdmOGI3Yzc1ZDliNGY3ZjE5OWFiYmFmMTU4ZWU2ZDUyNzE0YmY3MmUyMTQyNjkyMTMwYTM2OWU2ZSIsInZlcnNpb24iOjF9.FPX3HynlHurNYlgK1jjocJHZIZ2t8OLFS_qN8skIwbzw1mGb8ST3tVebE9qeXZWY9TbNfWsGERShJH1giw2qDw
    - type: rouge
      value: 36.7743
      name: ROUGE-LSUM
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYjgxNmQ1MmEwY2VlYTAzMTVhMDBlODFjMDNlMjA4NjRiOTNkNjkxZWNiNDg4ODM1NWUwNjk1ODFkMzI3YmM5ZCIsInZlcnNpb24iOjF9.uK7C2bGmOGEWzc8D2Av_WYSqn2epqqiXXq2ybJmoHAT8GYc80jpEGTKjyhjf00lCLw-kOxeSG5Qpr_JihR5kAg
    - type: loss
      value: 3.8273396492004395
      name: loss
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzI4OTcwOGYzYmM5MmM2NmViNjc4MTkyYzJlYjAwODM4ODRmZTAyZTVmMjJlY2JiYjY0YjA5OWY4NDhjOWQ0ZiIsInZlcnNpb24iOjF9.p46FdAgmW5t3KtP4kBhcoVynTQJj1abV4LqM6MQ-o--c46yMlafmtA4mgMEqsJK_CZl7Iv5SSP_n8GiVMpgmAQ
    - type: gen_len
      value: 228.1285
      name: gen_len
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODY2OGUzNDlhNzM5NzBiMmNmMDZiNjNkNDI0MDkxMzNkZDE4ZjU4OWM1NGQ5Yjk3ZjgzZjk2MDk0NWI0NGI4YiIsInZlcnNpb24iOjF9.Jb61P9-a31VBbwdOD-8ahNgf5Tpln0vjxd4uQtR7vxGu0Ovfa1T9Y8rKXBApTSigrmqBjRdsLfoAU7LqLiL6Cg
---

# bigbird pegasus on the booksum dataset

>_this is the "latest" version of the model, i.e. the checkpoint that has been trained the longest (currently at 70k steps)_

- **GOAL:** A summarization model that 1) summarizes the source content accurately and 2) _more importantly, IMO_ produces summaries that are easy to read and understand (* cough * unlike arXiv * cough *)
- This model attempts to help with that by using the [booksum](https://arxiv.org/abs/2105.08209) dataset to provide **explanatory summarization**
- Explanatory Summary - a summary that both consolidates information and explains why the consolidated information is important.
- This model was trained for seven epochs total (approx. 70,000 steps) and is closer to finished.
- It will continue to improve (slowly, now that it has been trained for a long time) based on any findings/feedback.
- The starting checkpoint was `google/bigbird-pegasus-large-bigpatent`.

---

# example usage

> An extended example, including a demo of batch summarization, is [here](https://colab.research.google.com/gist/pszemraj/2c8c0aecbcd4af6e9cbb51e195be10e2/bigbird-pegasus-large-booksum-20k-example.ipynb).

- create the summarizer object:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)

summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
)
```

- define the text to be summarized, and pass it through the pipeline. Boom, done.

```python
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]["summary_text"])
```
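
- to get output closer to the hosted inference widget, the generation settings listed under `inference: parameters:` in the metadata above can be passed straight through the pipeline call (they are forwarded to `generate()`). A sketch using those exact values:

```python
# mirror the widget's inference parameters from the model card metadata above
result = summarizer(
    wall_of_text,
    max_length=64,
    num_beams=4,
    no_repeat_ngram_size=2,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=2.4,
    length_penalty=0.5,
    early_stopping=True,
)

print(result[0]["summary_text"])
```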

## Alternate Checkpoint

- if experiencing runtime/memory issues, try [this earlier checkpoint](https://huggingface.co/pszemraj/bigbird-pegasus-large-booksum-40k-K) at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster.
- see similar summarization models fine-tuned on booksum but using different architectures: [long-t5 base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) and [LED-Large](https://huggingface.co/pszemraj/led-large-book-summary)

---
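
For a rough sanity check against the booksum ROUGE numbers reported in the metadata above, something like the sketch below can be used with the `summarizer` pipeline defined earlier. This is only a sketch: the `chapter` and `summary_text` column names are assumptions about the `kmfoda/booksum` schema (verify them against the dataset card), and scoring a handful of examples will not reproduce the full test-set figures.

```python
# sketch: re-score a few booksum test examples with the pipeline defined above.
# NOTE: the column names "chapter" and "summary_text" are assumptions about
# kmfoda/booksum and should be checked against the dataset card.
from datasets import load_dataset
import evaluate

sample = load_dataset("kmfoda/booksum", split="test").select(range(4))  # tiny sample
rouge = evaluate.load("rouge")

predictions = [
    summarizer(ex["chapter"], truncation=True, min_length=16, max_length=256)[0]["summary_text"]
    for ex in sample
]
references = [ex["summary_text"] for ex in sample]

print(rouge.compute(predictions=predictions, references=references))
```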