UNIST-Eunchan committed on
Commit
7fa246a
1 Parent(s): 4787bea

Update README.md

Files changed (1)
  1. README.md +243 -0
README.md CHANGED
@@ -4,6 +4,249 @@ tags:
4
  - generated_from_trainer
5
  datasets:
6
  - arxiv-summarization
7
  model-index:
8
  - name: Long-paper-summarization-pegasus-x-b
9
  results:
 
4
  - generated_from_trainer
5
  datasets:
6
  - arxiv-summarization
7
+
8
+ widget:
9
+ - text: >-
10
+
11
+ [Abstract] The dominant sequence transduction models are based on complex
12
+ recurrent or convolutional neural networks in an encoder-decoder
13
+ configuration. The best performing models also connect the encoder and
14
+ decoder through an attention mechanism. We propose a new simple network
15
+ architecture, the Transformer, based solely on attention mechanisms,
16
+ dispensing with recurrence and convolutions entirely. Experiments on two
17
+ machine translation tasks show these models to be superior in quality while
18
+ being more parallelizable and requiring significantly less time to train.
19
+ Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
20
+ task, improving over the existing best results, including ensembles by over
21
+ 2 BLEU. On the WMT 2014 English-to-French translation task, our model
22
+ establishes a new single-model state-of-the-art BLEU score of 41.8 after
23
+ training for 3.5 days on eight GPUs, a small fraction of the training costs
24
+ of the best models from the literature. We show that the Transformer
25
+ generalizes well to other tasks by applying it successfully to English
26
+ constituency parsing both with large and limited training data.
27
+ [Introduction] Recurrent neural networks, long short-term memory [13] and
28
+ gated recurrent [7] neural networks in particular, have been firmly
29
+ established as state of the art approaches in sequence modeling and
30
+ transduction problems such as language modeling and machine translation [35,
31
+ 2, 5]. Numerous efforts have since continued to push the boundaries of
32
+ recurrent language models and encoder-decoder architectures [38, 24, 15].
33
+ Recurrent models typically factor computation along the symbol positions of
34
+ the input and output sequences. Aligning the positions to steps in
35
+ computation time, they generate a sequence of hidden states h_t, as a
35
+ function of the previous hidden state h_{t-1} and the input for position t.
37
+ This inherently sequential nature precludes parallelization within training
38
+ examples, which becomes critical at longer sequence lengths, as memory
39
+ constraints limit batching across examples. Recent work has achieved
40
+ significant improvements in computational efficiency through factorization
41
+ tricks [21] and conditional computation [32], while also improving model
42
+ performance in case of the latter. The fundamental constraint of sequential
43
+ computation, however, remains. Attention mechanisms have become an integral
44
+ part of compelling sequence modeling and transduction models in various
45
+ tasks, allowing modeling of dependencies without regard to their distance in
46
+ the input or output sequences [2, 19]. In all but a few cases [27], however,
47
+ such attention mechanisms are used in conjunction with a recurrent network.
48
+ In this work we propose the Transformer, a model architecture eschewing
49
+ recurrence and instead relying entirely on an attention mechanism to draw
50
+ global dependencies between input and output. The Transformer allows for
51
+ significantly more parallelization and can reach a new state of the art in
52
+ translation quality after being trained for as little as twelve hours on
53
+ eight P100 GPUs.
54
+ example_title: Attention Is All You Need
55
+ - text: >-
56
+ [Abstract] In this work, we explore prompt tuning, a simple yet effective
57
+ mechanism for learning soft prompts to condition frozen language models to
58
+ perform specific downstream tasks. Unlike the discrete text prompts used by
59
+ GPT-3, soft prompts are learned through backpropagation and can be tuned to
60
+ incorporate signal from any number of labeled examples. Our end-to-end
61
+ learned approach outperforms GPT-3's few-shot learning by a large margin.
62
+ More remarkably, through ablations on model size using T5, we show that
63
+ prompt tuning becomes more competitive with scale: as models exceed billions
64
+ of parameters, our method closes the gap and matches the strong performance
65
+ of model tuning (where all model weights are tuned). This finding is
66
+ especially relevant in that large models are costly to share and serve, and
67
+ the ability to reuse one frozen model for multiple downstream tasks can ease
68
+ this burden. Our method can be seen as a simplification of the recently
69
+ proposed prefix tuning of Li and Liang (2021), and we provide a comparison
70
+ to this and other similar approaches. Finally, we show that conditioning a
71
+ frozen model with soft prompts confers benefits in robustness to domain
72
+ transfer, as compared to full model tuning. [Introduction] With the wide
73
+ success of pre-trained large language models, a range of techniques has
74
+ arisen to adapt these general-purpose models to downstream tasks. ELMo
75
+ (Peters et al., 2018) proposed freezing the pre-trained model and learning a
76
+ task-specific weighting of its per-layer representations. However, since GPT
77
+ (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
78
+ adaptation technique has been model tuning (or fine-tuning), where all model
79
+ parameters are tuned during adaptation, as proposed by Howard and Ruder
80
+ (2018). More recently, Brown et al. (2020) showed that prompt design (or
81
+ priming) is surprisingly effective at modulating a frozen GPT-3 model’s
82
+ behavior through text prompts. Prompts are typically composed of a task
83
+ description and/or several canonical examples. This return to freezing
84
+ pre-trained models is appealing, especially as model size continues to
85
+ increase. Rather than requiring a separate copy of the model for each
86
+ downstream task, a single generalist model can simultaneously serve many
87
+ different tasks. Unfortunately, prompt-based adaptation has several key
88
+ drawbacks. Task description is error-prone and requires human involvement,
89
+ and the effectiveness of a prompt is limited by how much conditioning text
90
+ can fit into the model’s input. As a result, downstream task quality still
91
+ lags far behind that of tuned models. For instance, GPT-3 175B few-shot
92
+ performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
93
+ al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
94
+ efforts to automate prompt design have been recently proposed. Shin et al.
95
+ (2020) propose a search algorithm over the discrete space of words, guided
96
+ by the downstream application training data. While this technique
97
+ outperforms manual prompt design, there is still a gap relative to model
98
+ tuning. Li and Liang (2021) propose prefix tuning and show strong results on
99
+ generative tasks. This method freezes the model parameters and
100
+ backpropagates the error during tuning to prefix activations prepended to
101
+ each layer in the encoder stack, including the input layer. Hambardzumyan et
102
+ al. (2021) simplify this recipe by restricting the trainable parameters to
103
+ the input and output subnetworks of a masked language model, and show
104
+ reasonable results on classifications tasks. In this paper, we propose
105
+ prompt tuning as a further simplification for adapting language models. We
106
+ freeze the entire pre-trained model and only allow an additional k tunable
107
+ tokens per downstream task to be prepended to the input text. This soft
108
+ prompt is trained end-to-end and can condense the signal from a full labeled
109
+ dataset, allowing our method to outperform few-shot prompts and close the
110
+ quality gap with model tuning (Figure 1). At the same time, since a single
111
+ pre-trained model is recycled for all downstream tasks, we retain the
112
+ efficient serving benefits of frozen models (Figure 2). While we developed
113
+ our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
114
+ (2021), we are the first to show that prompt tuning alone (with no
115
+ intermediate-layer prefixes or task-specific output layers) is sufficient to
116
+ be competitive with model tuning. Through detailed experiments in sections
117
+ 2–3, we demonstrate that language model capacity is a key ingredient for
118
+ these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
119
+ competitive with scale. We compare with similar approaches in Section 4.
120
+ Explicitly separating task-specific parameters from the generalist
121
+ parameters needed for general language-understanding has a range of
122
+ additional benefits. We show in Section 5 that by capturing the task
123
+ definition in the prompt while keeping the generalist parameters fixed, we
124
+ are able to achieve better resilience to domain shifts. In Section 6, we
125
+ show that prompt ensembling, learning multiple prompts for the same task,
126
+ can boost quality and is more efficient than classic model ensembling.
127
+ Finally, in Section 7, we investigate the interpretability of our learned
128
+ soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
129
+ and showing its competitiveness with model tuning in the regime of large
130
+ language models. 2. Ablating many design choices, and showing quality and
131
+ robustness improve with scale. 3. Showing prompt tuning outperforms model
132
+ tuning on domain shift problems. 4. Proposing prompt ensembling and showing
133
+ its effectiveness.
134
+ example_title: PEFT (2104.08691)
135
+ - text: >-
136
+ [Abstract] For the first time in the world, we succeeded in synthesizing the
137
+ room-temperature superconductor (Tc≥400 K, 127°C) working at ambient
138
+ pressure with a modified lead-apatite (LK-99) structure. The
139
+ superconductivity of LK-99 is proved with the Critical temperature (Tc),
140
+ Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
141
+ the Meissner effect. The superconductivity of LK-99 originates from minute
142
+ structural distortion by a slight volume shrinkage (0.48 %), not by external
143
+ factors such as temperature and pressure. The shrinkage is caused by Cu2+
144
+ substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
145
+ and it generates the stress. It concurrently transfers to Pb(1) of the
146
+ cylindrical column resulting in distortion of the cylindrical column
147
+ interface, which creates superconducting quantum wells (SQWs) in the
148
+ interface. The heat capacity results indicated that the new model is
149
+ suitable for explaining the superconductivity of LK-99. The unique structure
150
+ of LK-99 that allows the minute distorted structure to be maintained in the
151
+ interfaces is the most important factor that LK-99 maintains and exhibits
152
+ superconductivity at room temperatures and ambient pressure. [Introduction]
153
+ Since the discovery of the first superconductor(1), many efforts to search
154
+ for new room-temperature superconductors have been carried out worldwide(2,
155
+ 3) through their experimental clarity or/and theoretical perspectives(4-8).
156
+ The recent success of developing room-temperature superconductors with
157
+ hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
158
+ worldwide, which is expected by strong electron-phonon coupling theory with
159
+ high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
160
+ apply them to actual application devices in daily life because of the
161
+ tremendously high pressure, and more efforts are being made to overcome the
162
+ high-pressure problem(13). For the first time in the world, we report the
163
+ success in synthesizing a room-temperature and ambient-pressure
164
+ superconductor with a chemical approach to solve the temperature and
165
+ pressure problem. We named the first room temperature and ambient pressure
166
+ superconductor LK-99. The superconductivity of LK-99 proved with the
167
+ Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
168
+ magnetic field (Hc), and Meissner effect(14, 15). Several data were
169
+ collected and analyzed in detail to figure out the puzzle of
170
+ superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
171
+ spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
172
+ Capacity, and Superconducting quantum interference device (SQUID) data.
173
+ Henceforth in this paper, we will report and discuss our new findings
174
+ including superconducting quantum wells associated with the
175
+ superconductivity of LK-99.
176
+ example_title: LK-99 (Not NLP)
177
+ - text: >-
178
+ [Abstract] Evaluation practices in natural language generation
179
+ (NLG) have many known flaws, but improved evaluation approaches are rarely
180
+ widely adopted. This issue has become more urgent, since neural NLG models
181
+ have improved to the point where they can often no longer be distinguished
182
+ based on the surface-level features that older metrics rely on. This paper
183
+ surveys the issues with human and automatic model evaluations and with
184
+ commonly used datasets in NLG that have been pointed out over the past 20
185
+ years. We summarize, categorize, and discuss how researchers have been
186
+ addressing these issues and what their findings mean for the current state
187
+ of model evaluations. Building on those insights, we lay out a long-term
188
+ vision for NLG evaluation and propose concrete steps for researchers to
189
+ improve their evaluation processes. Finally, we analyze 66 NLG papers from
190
+ recent NLP conferences in how well they already follow these suggestions and
191
+ identify which areas require more drastic changes to the status quo.
192
+ [Introduction] There are many issues with the evaluation of models that
193
+ generate natural language. For example, datasets are often constructed in a
194
+ way that prevents measuring tail effects of robustness, and they almost
195
+ exclusively cover English. Most automated metrics measure only similarity
196
+ between model output and references instead of fine-grained quality aspects
197
+ (and even that poorly). Human evaluations have a high variance and, due to
198
+ insufficient documentation, rarely produce replicable results. These issues
199
+ have become more urgent as the nature of models that generate language has
200
+ changed without significant changes to how they are being evaluated. While
201
+ evaluation methods can capture surface-level improvements in text generated
202
+ by state-of-the-art models (such as increased fluency) to some extent, they
203
+ are ill-suited to detect issues with the content of model outputs, for
204
+ example if they are not attributable to input information. These ineffective
205
+ evaluations lead to overestimates of model capabilities. Deeper analyses
206
+ uncover that popular models fail even at simple tasks by taking shortcuts,
207
+ overfitting, hallucinating, and not being in accordance with their
208
+ communicative goals. Identifying these shortcomings, many recent papers
209
+ critique evaluation techniques or propose new ones. But almost none of the
210
+ suggestions are followed or new techniques used. There is an incentive
211
+ mismatch between conducting high-quality evaluations and publishing new
212
+ models or modeling techniques. While general-purpose evaluation techniques
213
+ could lower the barrier of entry for incorporating evaluation advances into
214
+ model development, their development requires resources that are hard to
215
+ come by, including model outputs on validation and test sets or large
216
+ quantities of human assessments of such outputs. Moreover, some issues, like
217
+ the refinement of datasets, require iterative processes where many
218
+ researchers collaborate. All this leads to a circular dependency where
219
+ evaluations of generation models can be improved only if generation models
220
+ use better evaluations. We find that there is a systemic difference between
221
+ selecting the best model and characterizing how good this model really is.
222
+ Current evaluation techniques focus on the first, while the second is
223
+ required to detect crucial issues. More emphasis needs to be put on
224
+ measuring and reporting model limitations, rather than focusing on producing
225
+ the highest performance numbers. To that end, this paper surveys analyses
226
+ and critiques of evaluation approaches (sections 3 and 4) and of commonly
227
+ used NLG datasets (section 5). Drawing on their insights, we describe how
228
+ researchers developing modeling techniques can help to improve and
229
+ subsequently benefit from better evaluations with methods available today
230
+ (section 6). Expanding on existing work on model documentation and formal
231
+ evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
232
+ propose releasing evaluation reports which focus on demonstrating NLG model
233
+ shortcomings using evaluation suites. These reports should apply a
234
+ complementary set of automatic metrics, include rigorous human evaluations,
235
+ and be accompanied by data releases that allow for re-analysis with improved
236
+ metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
237
+ dimensions related to our suggestions (section 7), we find that the first
238
+ steps toward an improved evaluation are already frequently taken at an
239
+ average rate of 27%. The analysis uncovers the dimensions that require more
240
+ drastic changes in the NLG community. For example, 84% of papers already
241
+ report results on multiple datasets and more than 28% point out issues in
242
+ them, but we found only a single paper that contributed to the dataset
243
+ documentation, leaving future researchers to re-identify those issues. We
244
+ further highlight typical unsupported claims and a need for more consistent
245
+ data release practices. Following the suggestions and results, we discuss
246
+ how incorporating the suggestions can improve evaluation research, how the
247
+ suggestions differ from similar ones made for NLU, and how better metrics
248
+ can benefit model development itself (section 8).
249
+ example_title: NLG-Eval (2202.06935)
250
  model-index:
251
  - name: Long-paper-summarization-pegasus-x-b
252
  results: