patrickvonplaten committed on
Commit
fa3a8ef
1 Parent(s): e794dfc

Update README.md

Files changed (1)
  1. README.md +710 -0
README.md CHANGED
@@ -17,3 +17,713 @@ This is the official *led-large-16384* checkpoint that is fine-tuned on the arXi
17
  ## Evaluation on downstream task
18
 
19
  [This notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing) shows how *led-large-16384-arxiv* can be evaluated on the [arxiv dataset](https://huggingface.co/datasets/scientific_papers)
20
+
21
+ ## Usage
22
+
23
+ The model can be used as follows. The input is taken from the test data of the [arxiv dataset](https://huggingface.co/datasets/scientific_papers).
24
+
25
+ ```python
26
+ LONG_ARTICLE = """for about 20 years the problem of properties of
27
+ short - term changes of solar activity has been
28
+ considered extensively . many investigators
29
+ studied the short - term periodicities of the
30
+ various indices of solar activity . several
31
+ periodicities were detected , but the
32
+ periodicities about 155 days and from the interval
33
+ of @xmath3 $ ] days ( @xmath4 $ ] years ) are
34
+ mentioned most often . first of them was
35
+ discovered by @xcite in the occurence rate of
36
+ gamma - ray flares detected by the gamma - ray
37
+ spectrometer aboard the _ solar maximum mission (
38
+ smm ) . this periodicity was confirmed for other
39
+ solar flares data and for the same time period
40
+ @xcite . it was also found in proton flares during
41
+ solar cycles 19 and 20 @xcite , but it was not
42
+ found in the solar flares data during solar cycles
43
+ 22 @xcite . _ several autors confirmed above
44
+ results for the daily sunspot area data . @xcite
45
+ studied the sunspot data from 18741984 . she found
46
+ the 155-day periodicity in data records from 31
47
+ years . this periodicity is always characteristic
48
+ for one of the solar hemispheres ( the southern
49
+ hemisphere for cycles 1215 and the northern
50
+ hemisphere for cycles 1621 ) . moreover , it is
51
+ only present during epochs of maximum activity (
52
+ in episodes of 13 years ) .
53
+ similarinvestigationswerecarriedoutby + @xcite .
54
+ they applied the same power spectrum method as
55
+ lean , but the daily sunspot area data ( cycles
56
+ 1221 ) were divided into 10 shorter time series .
57
+ the periodicities were searched for the frequency
58
+ interval 57115 nhz ( 100200 days ) and for each of
59
+ 10 time series . the authors showed that the
60
+ periodicity between 150160 days is statistically
61
+ significant during all cycles from 16 to 21 . the
62
+ considered peaks were remained unaltered after
63
+ removing the 11-year cycle and applying the power
64
+ spectrum analysis . @xcite used the wavelet
65
+ technique for the daily sunspot areas between 1874
66
+ and 1993 . they determined the epochs of
67
+ appearance of this periodicity and concluded that
68
+ it presents around the maximum activity period in
69
+ cycles 16 to 21 . moreover , the power of this
70
+ periodicity started growing at cycle 19 ,
71
+ decreased in cycles 20 and 21 and disappered after
72
+ cycle 21 . similaranalyseswerepresentedby + @xcite
73
+ , but for sunspot number , solar wind plasma ,
74
+ interplanetary magnetic field and geomagnetic
75
+ activity index @xmath5 . during 1964 - 2000 the
76
+ sunspot number wavelet power of periods less than
77
+ one year shows a cyclic evolution with the phase
78
+ of the solar cycle.the 154-day period is prominent
79
+ and its strenth is stronger around the 1982 - 1984
80
+ interval in almost all solar wind parameters . the
81
+ existence of the 156-day periodicity in sunspot
82
+ data were confirmed by @xcite . they considered
83
+ the possible relation between the 475-day (
84
+ 1.3-year ) and 156-day periodicities . the 475-day
85
+ ( 1.3-year ) periodicity was also detected in
86
+ variations of the interplanetary magnetic field ,
87
+ geomagnetic activity helioseismic data and in the
88
+ solar wind speed @xcite . @xcite concluded that
89
+ the region of larger wavelet power shifts from
90
+ 475-day ( 1.3-year ) period to 620-day ( 1.7-year
91
+ ) period and then back to 475-day ( 1.3-year ) .
92
+ the periodicities from the interval @xmath6 $ ]
93
+ days ( @xmath4 $ ] years ) have been considered
94
+ from 1968 . @xcite mentioned a 16.3-month (
95
+ 490-day ) periodicity in the sunspot numbers and
96
+ in the geomagnetic data . @xcite analysed the
97
+ occurrence rate of major flares during solar
98
+ cycles 19 . they found a 18-month ( 540-day )
99
+ periodicity in flare rate of the norhern
100
+ hemisphere . @xcite confirmed this result for the
101
+ @xmath7 flare data for solar cycles 20 and 21 and
102
+ found a peak in the power spectra near 510540 days
103
+ . @xcite found a 17-month ( 510-day ) periodicity
104
+ of sunspot groups and their areas from 1969 to
105
+ 1986 . these authors concluded that the length of
106
+ this period is variable and the reason of this
107
+ periodicity is still not understood . @xcite and +
108
+ @xcite obtained statistically significant peaks of
109
+ power at around 158 days for daily sunspot data
110
+ from 1923 - 1933 ( cycle 16 ) . in this paper the
111
+ problem of the existence of this periodicity for
112
+ sunspot data from cycle 16 is considered . the
113
+ daily sunspot areas , the mean sunspot areas per
114
+ carrington rotation , the monthly sunspot numbers
115
+ and their fluctuations , which are obtained after
116
+ removing the 11-year cycle are analysed . in
117
+ section 2 the properties of the power spectrum
118
+ methods are described . in section 3 a new
119
+ approach to the problem of aliases in the power
120
+ spectrum analysis is presented . in section 4
121
+ numerical results of the new method of the
122
+ diagnosis of an echo - effect for sunspot area
123
+ data are discussed . in section 5 the problem of
124
+ the existence of the periodicity of about 155 days
125
+ during the maximum activity period for sunspot
126
+ data from the whole solar disk and from each solar
127
+ hemisphere separately is considered . to find
128
+ periodicities in a given time series the power
129
+ spectrum analysis is applied . in this paper two
130
+ methods are used : the fast fourier transformation
131
+ algorithm with the hamming window function ( fft )
132
+ and the blackman - tukey ( bt ) power spectrum
133
+ method @xcite . the bt method is used for the
134
+ diagnosis of the reasons of the existence of peaks
135
+ , which are obtained by the fft method . the bt
136
+ method consists in the smoothing of a cosine
137
+ transform of an autocorrelation function using a
138
+ 3-point weighting average . such an estimator is
139
+ consistent and unbiased . moreover , the peaks are
140
+ uncorrelated and their sum is a variance of a
141
+ considered time series . the main disadvantage of
142
+ this method is a weak resolution of the
143
+ periodogram points , particularly for low
144
+ frequences . for example , if the autocorrelation
145
+ function is evaluated for @xmath8 , then the
146
+ distribution points in the time domain are :
147
+ @xmath9 thus , it is obvious that this method
148
+ should not be used for detecting low frequency
149
+ periodicities with a fairly good resolution .
150
+ however , because of an application of the
151
+ autocorrelation function , the bt method can be
152
+ used to verify a reality of peaks which are
153
+ computed using a method giving the better
154
+ resolution ( for example the fft method ) . it is
155
+ valuable to remember that the power spectrum
156
+ methods should be applied very carefully . the
157
+ difficulties in the interpretation of significant
158
+ peaks could be caused by at least four effects : a
159
+ sampling of a continuos function , an echo -
160
+ effect , a contribution of long - term
161
+ periodicities and a random noise . first effect
162
+ exists because periodicities , which are shorter
163
+ than the sampling interval , may mix with longer
164
+ periodicities . in result , this effect can be
165
+ reduced by an decrease of the sampling interval
166
+ between observations . the echo - effect occurs
167
+ when there is a latent harmonic of frequency
168
+ @xmath10 in the time series , giving a spectral
169
+ peak at @xmath10 , and also periodic terms of
170
+ frequency @xmath11 etc . this may be detected by
171
+ the autocorrelation function for time series with
172
+ a large variance . time series often contain long
173
+ - term periodicities , that influence short - term
174
+ peaks . they could rise periodogram s peaks at
175
+ lower frequencies . however , it is also easy to
176
+ notice the influence of the long - term
177
+ periodicities on short - term peaks in the graphs
178
+ of the autocorrelation functions . this effect is
179
+ observed for the time series of solar activity
180
+ indexes which are limited by the 11-year cycle .
181
+ to find statistically significant periodicities it
182
+ is reasonable to use the autocorrelation function
183
+ and the power spectrum method with a high
184
+ resolution . in the case of a stationary time
185
+ series they give similar results . moreover , for
186
+ a stationary time series with the mean zero the
187
+ fourier transform is equivalent to the cosine
188
+ transform of an autocorrelation function @xcite .
189
+ thus , after a comparison of a periodogram with an
190
+ appropriate autocorrelation function one can
191
+ detect peaks which are in the graph of the first
192
+ function and do not exist in the graph of the
193
+ second function . the reasons of their existence
194
+ could be explained by the long - term
195
+ periodicities and the echo - effect . below method
196
+ enables one to detect these effects . ( solid line
197
+ ) and the 95% confidence level basing on thered
198
+ noise ( dotted line ) . the periodogram values are
199
+ presented on the left axis . the lower curve
200
+ illustrates the autocorrelation function of the
201
+ same time series ( solid line ) . the dotted lines
202
+ represent two standard errors of the
203
+ autocorrelation function . the dashed horizontal
204
+ line shows the zero level . the autocorrelation
205
+ values are shown in the right axis . ] because
206
+ the statistical tests indicate that the time
207
+ series is a white noise the confidence level is
208
+ not marked . ] . ] the method of the diagnosis
209
+ of an echo - effect in the power spectrum ( de )
210
+ consists in an analysis of a periodogram of a
211
+ given time series computed using the bt method .
212
+ the bt method bases on the cosine transform of the
213
+ autocorrelation function which creates peaks which
214
+ are in the periodogram , but not in the
215
+ autocorrelation function . the de method is used
216
+ for peaks which are computed by the fft method (
217
+ with high resolution ) and are statistically
218
+ significant . the time series of sunspot activity
219
+ indexes with the spacing interval one rotation or
220
+ one month contain a markov - type persistence ,
221
+ which means a tendency for the successive values
222
+ of the time series to remember their antecendent
223
+ values . thus , i use a confidence level basing on
224
+ the red noise of markov @xcite for the choice of
225
+ the significant peaks of the periodogram computed
226
+ by the fft method . when a time series does not
227
+ contain the markov - type persistence i apply the
228
+ fisher test and the kolmogorov - smirnov test at
229
+ the significance level @xmath12 @xcite to verify a
230
+ statistically significance of periodograms peaks .
231
+ the fisher test checks the null hypothesis that
232
+ the time series is white noise agains the
233
+ alternative hypothesis that the time series
234
+ contains an added deterministic periodic component
235
+ of unspecified frequency . because the fisher test
236
+ tends to be severe in rejecting peaks as
237
+ insignificant the kolmogorov - smirnov test is
238
+ also used . the de method analyses raw estimators
239
+ of the power spectrum . they are given as follows
240
+ @xmath13 for @xmath14 + where @xmath15 for
241
+ @xmath16 + @xmath17 is the length of the time
242
+ series @xmath18 and @xmath19 is the mean value .
243
+ the first term of the estimator @xmath20 is
244
+ constant . the second term takes two values (
245
+ depending on odd or even @xmath21 ) which are not
246
+ significant because @xmath22 for large m. thus ,
247
+ the third term of ( 1 ) should be analysed .
248
+ looking for intervals of @xmath23 for which
249
+ @xmath24 has the same sign and different signs one
250
+ can find such parts of the function @xmath25 which
251
+ create the value @xmath20 . let the set of values
252
+ of the independent variable of the autocorrelation
253
+ function be called @xmath26 and it can be divided
254
+ into the sums of disjoint sets : @xmath27 where +
255
+ @xmath28 + @xmath29 @xmath30 @xmath31 + @xmath32 +
256
+ @xmath33 @xmath34 @xmath35 @xmath36 @xmath37
257
+ @xmath38 @xmath39 @xmath40 well , the set
258
+ @xmath41 contains all integer values of @xmath23
259
+ from the interval of @xmath42 for which the
260
+ autocorrelation function and the cosinus function
261
+ with the period @xmath43 $ ] are positive . the
262
+ index @xmath44 indicates successive parts of the
263
+ cosinus function for which the cosinuses of
264
+ successive values of @xmath23 have the same sign .
265
+ however , sometimes the set @xmath41 can be empty
266
+ . for example , for @xmath45 and @xmath46 the set
267
+ @xmath47 should contain all @xmath48 $ ] for which
268
+ @xmath49 and @xmath50 , but for such values of
269
+ @xmath23 the values of @xmath51 are negative .
270
+ thus , the set @xmath47 is empty . . the
271
+ periodogram values are presented on the left axis
272
+ . the lower curve illustrates the autocorrelation
273
+ function of the same time series . the
274
+ autocorrelation values are shown in the right axis
275
+ . ] let us take into consideration all sets
276
+ \{@xmath52 } , \{@xmath53 } and \{@xmath41 } which
277
+ are not empty . because numberings and power of
278
+ these sets depend on the form of the
279
+ autocorrelation function of the given time series
280
+ , it is impossible to establish them arbitrary .
281
+ thus , the sets of appropriate indexes of the sets
282
+ \{@xmath52 } , \{@xmath53 } and \{@xmath41 } are
283
+ called @xmath54 , @xmath55 and @xmath56
284
+ respectively . for example the set @xmath56
285
+ contains all @xmath44 from the set @xmath57 for
286
+ which the sets @xmath41 are not empty . to
287
+ separate quantitatively in the estimator @xmath20
288
+ the positive contributions which are originated by
289
+ the cases described by the formula ( 5 ) from the
290
+ cases which are described by the formula ( 3 ) the
291
+ following indexes are introduced : @xmath58
292
+ @xmath59 @xmath60 @xmath61 where @xmath62 @xmath63
293
+ @xmath64 taking for the empty sets \{@xmath53 }
294
+ and \{@xmath41 } the indices @xmath65 and @xmath66
295
+ equal zero . the index @xmath65 describes a
296
+ percentage of the contribution of the case when
297
+ @xmath25 and @xmath51 are positive to the positive
298
+ part of the third term of the sum ( 1 ) . the
299
+ index @xmath66 describes a similar contribution ,
300
+ but for the case when the both @xmath25 and
301
+ @xmath51 are simultaneously negative . thanks to
302
+ these one can decide which the positive or the
303
+ negative values of the autocorrelation function
304
+ have a larger contribution to the positive values
305
+ of the estimator @xmath20 . when the difference
306
+ @xmath67 is positive , the statement the
307
+ @xmath21-th peak really exists can not be rejected
308
+ . thus , the following formula should be satisfied
309
+ : @xmath68 because the @xmath21-th peak could
310
+ exist as a result of the echo - effect , it is
311
+ necessary to verify the second condition :
312
+ @xmath69\in c_m.\ ] ] . the periodogram values
313
+ are presented on the left axis . the lower curve
314
+ illustrates the autocorrelation function of the
315
+ same time series ( solid line ) . the dotted lines
316
+ represent two standard errors of the
317
+ autocorrelation function . the dashed horizontal
318
+ line shows the zero level . the autocorrelation
319
+ values are shown in the right axis . ] to
320
+ verify the implication ( 8) firstly it is
321
+ necessary to evaluate the sets @xmath41 for
322
+ @xmath70 of the values of @xmath23 for which the
323
+ autocorrelation function and the cosine function
324
+ with the period @xmath71 $ ] are positive and the
325
+ sets @xmath72 of values of @xmath23 for which the
326
+ autocorrelation function and the cosine function
327
+ with the period @xmath43 $ ] are negative .
328
+ secondly , a percentage of the contribution of the
329
+ sum of products of positive values of @xmath25 and
330
+ @xmath51 to the sum of positive products of the
331
+ values of @xmath25 and @xmath51 should be
332
+ evaluated . as a result the indexes @xmath65 for
333
+ each set @xmath41 where @xmath44 is the index from
334
+ the set @xmath56 are obtained . thirdly , from all
335
+ sets @xmath41 such that @xmath70 the set @xmath73
336
+ for which the index @xmath65 is the greatest
337
+ should be chosen . the implication ( 8) is true
338
+ when the set @xmath73 includes the considered
339
+ period @xmath43 $ ] . this means that the greatest
340
+ contribution of positive values of the
341
+ autocorrelation function and positive cosines with
342
+ the period @xmath43 $ ] to the periodogram value
343
+ @xmath20 is caused by the sum of positive products
344
+ of @xmath74 for each @xmath75-\frac{m}{2k},[\frac{
345
+ 2m}{k}]+\frac{m}{2k})$ ] . when the implication
346
+ ( 8) is false , the peak @xmath20 is mainly
347
+ created by the sum of positive products of
348
+ @xmath74 for each @xmath76-\frac{m}{2k},\big [
349
+ \frac{2m}{n}\big ] + \frac{m}{2k } \big ) $ ] ,
350
+ where @xmath77 is a multiple or a divisor of
351
+ @xmath21 . it is necessary to add , that the de
352
+ method should be applied to the periodograms peaks
353
+ , which probably exist because of the echo -
354
+ effect . it enables one to find such parts of the
355
+ autocorrelation function , which have the
356
+ significant contribution to the considered peak .
357
+ the fact , that the conditions ( 7 ) and ( 8) are
358
+ satisfied , can unambiguously decide about the
359
+ existence of the considered periodicity in the
360
+ given time series , but if at least one of them is
361
+ not satisfied , one can doubt about the existence
362
+ of the considered periodicity . thus , in such
363
+ cases the sentence the peak can not be treated as
364
+ true should be used . using the de method it is
365
+ necessary to remember about the power of the set
366
+ @xmath78 . if @xmath79 is too large , errors of an
367
+ autocorrelation function estimation appear . they
368
+ are caused by the finite length of the given time
369
+ series and as a result additional peaks of the
370
+ periodogram occur . if @xmath79 is too small ,
371
+ there are less peaks because of a low resolution
372
+ of the periodogram . in applications @xmath80 is
373
+ used . in order to evaluate the value @xmath79 the
374
+ fft method is used . the periodograms computed by
375
+ the bt and the fft method are compared . the
376
+ conformity of them enables one to obtain the value
377
+ @xmath79 . . the fft periodogram values are
378
+ presented on the left axis . the lower curve
379
+ illustrates the bt periodogram of the same time
380
+ series ( solid line and large black circles ) .
381
+ the bt periodogram values are shown in the right
382
+ axis . ] in this paper the sunspot activity data (
383
+ august 1923 - october 1933 ) provided by the
384
+ greenwich photoheliographic results ( gpr ) are
385
+ analysed . firstly , i consider the monthly
386
+ sunspot number data . to eliminate the 11-year
387
+ trend from these data , the consecutively smoothed
388
+ monthly sunspot number @xmath81 is subtracted from
389
+ the monthly sunspot number @xmath82 where the
390
+ consecutive mean @xmath83 is given by @xmath84 the
391
+ values @xmath83 for @xmath85 and @xmath86 are
392
+ calculated using additional data from last six
393
+ months of cycle 15 and first six months of cycle
394
+ 17 . because of the north - south asymmetry of
395
+ various solar indices @xcite , the sunspot
396
+ activity is considered for each solar hemisphere
397
+ separately . analogously to the monthly sunspot
398
+ numbers , the time series of sunspot areas in the
399
+ northern and southern hemispheres with the spacing
400
+ interval @xmath87 rotation are denoted . in order
401
+ to find periodicities , the following time series
402
+ are used : + @xmath88 + @xmath89 + @xmath90
403
+ + in the lower part of figure [ f1 ] the
404
+ autocorrelation function of the time series for
405
+ the northern hemisphere @xmath88 is shown . it is
406
+ easy to notice that the prominent peak falls at 17
407
+ rotations interval ( 459 days ) and @xmath25 for
408
+ @xmath91 $ ] rotations ( [ 81 , 162 ] days ) are
409
+ significantly negative . the periodogram of the
410
+ time series @xmath88 ( see the upper curve in
411
+ figures [ f1 ] ) does not show the significant
412
+ peaks at @xmath92 rotations ( 135 , 162 days ) ,
413
+ but there is the significant peak at @xmath93 (
414
+ 243 days ) . the peaks at @xmath94 are close to
415
+ the peaks of the autocorrelation function . thus ,
416
+ the result obtained for the periodicity at about
417
+ @xmath0 days are contradict to the results
418
+ obtained for the time series of daily sunspot
419
+ areas @xcite . for the southern hemisphere (
420
+ the lower curve in figure [ f2 ] ) @xmath25 for
421
+ @xmath95 $ ] rotations ( [ 54 , 189 ] days ) is
422
+ not positive except @xmath96 ( 135 days ) for
423
+ which @xmath97 is not statistically significant .
424
+ the upper curve in figures [ f2 ] presents the
425
+ periodogram of the time series @xmath89 . this
426
+ time series does not contain a markov - type
427
+ persistence . moreover , the kolmogorov - smirnov
428
+ test and the fisher test do not reject a null
429
+ hypothesis that the time series is a white noise
430
+ only . this means that the time series do not
431
+ contain an added deterministic periodic component
432
+ of unspecified frequency . the autocorrelation
433
+ function of the time series @xmath90 ( the lower
434
+ curve in figure [ f3 ] ) has only one
435
+ statistically significant peak for @xmath98 months
436
+ ( 480 days ) and negative values for @xmath99 $ ]
437
+ months ( [ 90 , 390 ] days ) . however , the
438
+ periodogram of this time series ( the upper curve
439
+ in figure [ f3 ] ) has two significant peaks the
440
+ first at 15.2 and the second at 5.3 months ( 456 ,
441
+ 159 days ) . thus , the periodogram contains the
442
+ significant peak , although the autocorrelation
443
+ function has the negative value at @xmath100
444
+ months . to explain these problems two
445
+ following time series of daily sunspot areas are
446
+ considered : + @xmath101 + @xmath102 + where
447
+ @xmath103 the values @xmath104 for @xmath105
448
+ and @xmath106 are calculated using additional
449
+ daily data from the solar cycles 15 and 17 .
450
+ and the cosine function for @xmath45 ( the period
451
+ at about 154 days ) . the horizontal line ( dotted
452
+ line ) shows the zero level . the vertical dotted
453
+ lines evaluate the intervals where the sets
454
+ @xmath107 ( for @xmath108 ) are searched . the
455
+ percentage values show the index @xmath65 for each
456
+ @xmath41 for the time series @xmath102 ( in
457
+ parentheses for the time series @xmath101 ) . in
458
+ the right bottom corner the values of @xmath65 for
459
+ the time series @xmath102 , for @xmath109 are
460
+ written . ] ( the 500-day period ) ] the
461
+ comparison of the functions @xmath25 of the time
462
+ series @xmath101 ( the lower curve in figure [ f4
463
+ ] ) and @xmath102 ( the lower curve in figure [ f5
464
+ ] ) suggests that the positive values of the
465
+ function @xmath110 of the time series @xmath101 in
466
+ the interval of @xmath111 $ ] days could be caused
467
+ by the 11-year cycle . this effect is not visible
468
+ in the case of periodograms of the both time
469
+ series computed using the fft method ( see the
470
+ upper curves in figures [ f4 ] and [ f5 ] ) or the
471
+ bt method ( see the lower curve in figure [ f6 ] )
472
+ . moreover , the periodogram of the time series
473
+ @xmath102 has the significant values at @xmath112
474
+ days , but the autocorrelation function is
475
+ negative at these points . @xcite showed that the
476
+ lomb - scargle periodograms for the both time
477
+ series ( see @xcite , figures 7 a - c ) have a
478
+ peak at 158.8 days which stands over the fap level
479
+ by a significant amount . using the de method the
480
+ above discrepancies are obvious . to establish the
481
+ @xmath79 value the periodograms computed by the
482
+ fft and the bt methods are shown in figure [ f6 ]
483
+ ( the upper and the lower curve respectively ) .
484
+ for @xmath46 and for periods less than 166 days
485
+ there is a good comformity of the both
486
+ periodograms ( but for periods greater than 166
487
+ days the points of the bt periodogram are not
488
+ linked because the bt periodogram has much worse
489
+ resolution than the fft periodogram ( no one know
490
+ how to do it ) ) . for @xmath46 and @xmath113 the
491
+ value of @xmath21 is 13 ( @xmath71=153 $ ] ) . the
492
+ inequality ( 7 ) is satisfied because @xmath114 .
493
+ this means that the value of @xmath115 is mainly
494
+ created by positive values of the autocorrelation
495
+ function . the implication ( 8) needs an
496
+ evaluation of the greatest value of the index
497
+ @xmath65 where @xmath70 , but the solar data
498
+ contain the most prominent period for @xmath116
499
+ days because of the solar rotation . thus ,
500
+ although @xmath117 for each @xmath118 , all sets
501
+ @xmath41 ( see ( 5 ) and ( 6 ) ) without the set
502
+ @xmath119 ( see ( 4 ) ) , which contains @xmath120
503
+ $ ] , are considered . this situation is presented
504
+ in figure [ f7 ] . in this figure two curves
505
+ @xmath121 and @xmath122 are plotted . the vertical
506
+ dotted lines evaluate the intervals where the sets
507
+ @xmath107 ( for @xmath123 ) are searched . for
508
+ such @xmath41 two numbers are written : in
509
+ parentheses the value of @xmath65 for the time
510
+ series @xmath101 and above it the value of
511
+ @xmath65 for the time series @xmath102 . to make
512
+ this figure clear the curves are plotted for the
513
+ set @xmath124 only . ( in the right bottom corner
514
+ information about the values of @xmath65 for the
515
+ time series @xmath102 , for @xmath109 are written
516
+ . ) the implication ( 8) is not true , because
517
+ @xmath125 for @xmath126 . therefore ,
518
+ @xmath43=153\notin c_6=[423,500]$ ] . moreover ,
519
+ the autocorrelation function for @xmath127 $ ] is
520
+ negative and the set @xmath128 is empty . thus ,
521
+ @xmath129 . on the basis of these information one
522
+ can state , that the periodogram peak at @xmath130
523
+ days of the time series @xmath102 exists because
524
+ of positive @xmath25 , but for @xmath23 from the
525
+ intervals which do not contain this period .
526
+ looking at the values of @xmath65 of the time
527
+ series @xmath101 , one can notice that they
528
+ decrease when @xmath23 increases until @xmath131 .
529
+ this indicates , that when @xmath23 increases ,
530
+ the contribution of the 11-year cycle to the peaks
531
+ of the periodogram decreases . an increase of the
532
+ value of @xmath65 is for @xmath132 for the both
533
+ time series , although the contribution of the
534
+ 11-year cycle for the time series @xmath101 is
535
+ insignificant . thus , this part of the
536
+ autocorrelation function ( @xmath133 for the time
537
+ series @xmath102 ) influences the @xmath21-th peak
538
+ of the periodogram . this suggests that the
539
+ periodicity at about 155 days is a harmonic of the
540
+ periodicity from the interval of @xmath1 $ ] days
541
+ . ( solid line ) and consecutively smoothed
542
+ sunspot areas of the one rotation time interval
543
+ @xmath134 ( dotted line ) . both indexes are
544
+ presented on the left axis . the lower curve
545
+ illustrates fluctuations of the sunspot areas
546
+ @xmath135 . the dotted and dashed horizontal lines
547
+ represent levels zero and @xmath136 respectively .
548
+ the fluctuations are shown on the right axis . ]
549
+ the described reasoning can be carried out for
550
+ other values of the periodogram . for example ,
551
+ the condition ( 8) is not satisfied for @xmath137
552
+ ( 250 , 222 , 200 days ) . moreover , the
553
+ autocorrelation function at these points is
554
+ negative . these suggest that there are not a true
555
+ periodicity in the interval of [ 200 , 250 ] days
556
+ . it is difficult to decide about the existence of
557
+ the periodicities for @xmath138 ( 333 days ) and
558
+ @xmath139 ( 286 days ) on the basis of above
559
+ analysis . the implication ( 8) is not satisfied
560
+ for @xmath139 and the condition ( 7 ) is not
561
+ satisfied for @xmath138 , although the function
562
+ @xmath25 of the time series @xmath102 is
563
+ significantly positive for @xmath140 . the
564
+ conditions ( 7 ) and ( 8) are satisfied for
565
+ @xmath141 ( figure [ f8 ] ) and @xmath142 .
566
+ therefore , it is possible to exist the
567
+ periodicity from the interval of @xmath1 $ ] days
568
+ . similar results were also obtained by @xcite for
569
+ daily sunspot numbers and daily sunspot areas .
570
+ she considered the means of three periodograms of
571
+ these indexes for data from @xmath143 years and
572
+ found statistically significant peaks from the
573
+ interval of @xmath1 $ ] ( see @xcite , figure 2 )
574
+ . @xcite studied sunspot areas from 1876 - 1999
575
+ and sunspot numbers from 1749 - 2001 with the help
576
+ of the wavelet transform . they pointed out that
577
+ the 154 - 158-day period could be the third
578
+ harmonic of the 1.3-year ( 475-day ) period .
579
+ moreover , the both periods fluctuate considerably
580
+ with time , being stronger during stronger sunspot
581
+ cycles . therefore , the wavelet analysis suggests
582
+ a common origin of the both periodicities . this
583
+ conclusion confirms the de method result which
584
+ indicates that the periodogram peak at @xmath144
585
+ days is an alias of the periodicity from the
586
+ interval of @xmath1 $ ] in order to verify the
587
+ existence of the periodicity at about 155 days i
588
+ consider the following time series : + @xmath145
589
+ + @xmath146 + @xmath147 + the value @xmath134
590
+ is calculated analogously to @xmath83 ( see sect .
591
+ the values @xmath148 and @xmath149 are evaluated
592
+ from the formula ( 9 ) . in the upper part of
593
+ figure [ f9 ] the time series of sunspot areas
594
+ @xmath150 of the one rotation time interval from
595
+ the whole solar disk and the time series of
596
+ consecutively smoothed sunspot areas @xmath151 are
597
+ showed . in the lower part of figure [ f9 ] the
598
+ time series of sunspot area fluctuations @xmath145
599
+ is presented . on the basis of these data the
600
+ maximum activity period of cycle 16 is evaluated .
601
+ it is an interval between two strongest
602
+ fluctuations e.a . @xmath152 $ ] rotations . the
603
+ length of the time interval @xmath153 is 54
604
+ rotations . if the about @xmath0-day ( 6 solar
605
+ rotations ) periodicity existed in this time
606
+ interval and it was characteristic for strong
607
+ fluctuations from this time interval , 10 local
608
+ maxima in the set of @xmath154 would be seen .
609
+ then it should be necessary to find such a value
610
+ of p for which @xmath155 for @xmath156 and the
611
+ number of the local maxima of these values is 10 .
612
+ as it can be seen in the lower part of figure [ f9
613
+ ] this is for the case of @xmath157 ( in this
614
+ figure the dashed horizontal line is the level of
615
+ @xmath158 ) . figure [ f10 ] presents nine time
616
+ distances among the successive fluctuation local
617
+ maxima and the horizontal line represents the
618
+ 6-rotation periodicity . it is immediately
619
+ apparent that the dispersion of these points is 10
620
+ and it is difficult to find even few points which
621
+ oscillate around the value of 6 . such an analysis
622
+ was carried out for smaller and larger @xmath136
623
+ and the results were similar . therefore , the
624
+ fact , that the about @xmath0-day periodicity
625
+ exists in the time series of sunspot area
626
+ fluctuations during the maximum activity period is
627
+ questionable . . the horizontal line represents
628
+ the 6-rotation ( 162-day ) period . ] ] ]
629
+ to verify again the existence of the about
630
+ @xmath0-day periodicity during the maximum
631
+ activity period in each solar hemisphere
632
+ separately , the time series @xmath88 and @xmath89
633
+ were also cut down to the maximum activity period
634
+ ( january 1925december 1930 ) . the comparison of
635
+ the autocorrelation functions of these time series
636
+ with the appriopriate autocorrelation functions of
637
+ the time series @xmath88 and @xmath89 , which are
638
+ computed for the whole 11-year cycle ( the lower
639
+ curves of figures [ f1 ] and [ f2 ] ) , indicates
640
+ that there are not significant differences between
641
+ them especially for @xmath23=5 and 6 rotations (
642
+ 135 and 162 days ) ) . this conclusion is
643
+ confirmed by the analysis of the time series
644
+ @xmath146 for the maximum activity period . the
645
+ autocorrelation function ( the lower curve of
646
+ figure [ f11 ] ) is negative for the interval of [
647
+ 57 , 173 ] days , but the resolution of the
648
+ periodogram is too low to find the significant
649
+ peak at @xmath159 days . the autocorrelation
650
+ function gives the same result as for daily
651
+ sunspot area fluctuations from the whole solar
652
+ disk ( @xmath160 ) ( see also the lower curve of
653
+ figures [ f5 ] ) . in the case of the time series
654
+ @xmath89 @xmath161 is zero for the fluctuations
655
+ from the whole solar cycle and it is almost zero (
656
+ @xmath162 ) for the fluctuations from the maximum
657
+ activity period . the value @xmath163 is negative
658
+ . similarly to the case of the northern hemisphere
659
+ the autocorrelation function and the periodogram
660
+ of southern hemisphere daily sunspot area
661
+ fluctuations from the maximum activity period
662
+ @xmath147 are computed ( see figure [ f12 ] ) .
663
+ the autocorrelation function has the statistically
664
+ significant positive peak in the interval of [ 155
665
+ , 165 ] days , but the periodogram has too low
666
+ resolution to decide about the possible
667
+ periodicities . the correlative analysis indicates
668
+ that there are positive fluctuations with time
669
+ distances about @xmath0 days in the maximum
670
+ activity period . the results of the analyses of
671
+ the time series of sunspot area fluctuations from
672
+ the maximum activity period are contradict with
673
+ the conclusions of @xcite . she uses the power
674
+ spectrum analysis only . the periodogram of daily
675
+ sunspot fluctuations contains peaks , which could
676
+ be harmonics or subharmonics of the true
677
+ periodicities . they could be treated as real
678
+ periodicities . this effect is not visible for
679
+ sunspot data of the one rotation time interval ,
680
+ but averaging could lose true periodicities . this
681
+ is observed for data from the southern hemisphere
682
+ . there is the about @xmath0-day peak in the
683
+ autocorrelation function of daily fluctuations ,
684
+ but the correlation for data of the one rotation
685
+ interval is almost zero or negative at the points
686
+ @xmath164 and 6 rotations . thus , it is
687
+ reasonable to research both time series together
688
+ using the correlative and the power spectrum
689
+ analyses . the following results are obtained :
690
+ 1 . a new method of the detection of statistically
691
+ significant peaks of the periodograms enables one
692
+ to identify aliases in the periodogram . 2 . two
693
+ effects cause the existence of the peak of the
694
+ periodogram of the time series of sunspot area
695
+ fluctuations at about @xmath0 days : the first is
696
+ caused by the 27-day periodicity , which probably
697
+ creates the 162-day periodicity ( it is a
698
+ subharmonic frequency of the 27-day periodicity )
699
+ and the second is caused by statistically
700
+ significant positive values of the autocorrelation
701
+ function from the intervals of @xmath165 $ ] and
702
+ @xmath166 $ ] days . the existence of the
703
+ periodicity of about @xmath0 days of the time
704
+ series of sunspot area fluctuations and sunspot
705
+ area fluctuations from the northern hemisphere
706
+ during the maximum activity period is questionable
707
+ . the autocorrelation analysis of the time series
708
+ of sunspot area fluctuations from the southern
709
+ hemisphere indicates that the periodicity of about
710
+ 155 days exists during the maximum activity period
711
+ . i appreciate valuable comments from professor j.
712
+ jakimiec ."""
713
+
714
+ from transformers import LEDForConditionalGeneration, LEDTokenizer
715
+ import torch
716
+
717
+ tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
718
+
719
+ input_ids = tokenizer(LONG_ARTICLE, return_tensors="pt").input_ids.to("cuda")
720
+ global_attention_mask = torch.zeros_like(input_ids)
721
+ # put global attention on the first token (<s>); all other tokens use local attention
722
+ global_attention_mask[:, 0] = 1
723
+
724
+ model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv", return_dict_in_generate=True).to("cuda")
725
+
726
+ sequences = model.generate(input_ids, global_attention_mask=global_attention_mask).sequences
727
+
728
+ summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)
729
+ ```
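
The `global_attention_mask` step in the snippet above can be illustrated without downloading the checkpoint. The following is only a minimal sketch in plain Python: the helper name `build_global_attention_mask` and the toy token ids are assumptions made for illustration and are not part of `transformers`. It shows the convention the snippet relies on, where `1` marks a token with global attention and `0` a token with local attention only.

```python
from typing import Iterable, List, Sequence


def build_global_attention_mask(
    input_ids: Sequence[Sequence[int]],
    global_positions: Iterable[int] = (0,),
) -> List[List[int]]:
    """Return a 0/1 mask with the same shape as `input_ids`.

    1 marks a token that attends globally, 0 a token that only uses
    local (windowed) attention. Positions past the end of a row are ignored.
    """
    positions = set(global_positions)
    mask = []
    for row in input_ids:
        mask.append([1 if i in positions else 0 for i in range(len(row))])
    return mask


# Toy batch of two already-tokenized sequences (the ids are made up).
toy_input_ids = [[0, 713, 16, 2], [0, 42, 2, 1]]
toy_mask = build_global_attention_mask(toy_input_ids)
print(toy_mask)
```

To use such a mask with the model, it would still need to be converted to a tensor on the same device as `input_ids`, for example with `torch.tensor(toy_mask)`. The `torch.zeros_like` approach in the snippet above does the same thing directly on tensors.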