---
license: apache-2.0
base_model: google/flan-t5-large
tags:
- generated_from_trainer
- NLPPaper_to_Question_Generation
- Summarization
- Long Document Summarization
model-index:
- name: FLAN-T5-NLP-Paper-to-Question-Generation
results: []
widget:
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] The dominant sequence transduction models are based on complex
recurrent or convolutional neural networks in an encoder-decoder
configuration. The best performing models also connect the encoder and
decoder through an attention mechanism. We propose a new simple network
architecture, the Transformer, based solely on attention mechanisms,
dispensing with recurrence and convolutions entirely. Experiments on two
machine translation tasks show these models to be superior in quality while
being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
task, improving over the existing best results, including ensembles by over
2 BLEU. On the WMT 2014 English-to-French translation task, our model
establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs
of the best models from the literature. We show that the Transformer
generalizes well to other tasks by applying it successfully to English
constituency parsing both with large and limited training data.
[Introduction] Recurrent neural networks, long short-term memory [13] and
gated recurrent [7] neural networks in particular, have been firmly
established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35,
2, 5]. Numerous efforts have since continued to push the boundaries of
recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of
the input and output sequences. Aligning the positions to steps in
computation time, they generate a sequence of hidden states ht, as a
function of the previous hidden state ht−1 and the input for position t.
This inherently sequential nature precludes parallelization within training
examples, which becomes critical at longer sequence lengths, as memory
constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization
tricks [21] and conditional computation [32], while also improving model
performance in case of the latter. The fundamental constraint of sequential
computation, however, remains. Attention mechanisms have become an integral
part of compelling sequence modeling and transduction models in various
tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however,
such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing
recurrence and instead relying entirely on an attention mechanism to draw
global dependencies between input and output. The Transformer allows for
significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on
eight P100 GPUs.
Question, Answer:
example_title: Attention Is All You Need
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] In this work, we explore prompt tuning, a simple yet effective
mechanism for learning soft prompts to condition frozen language models to
perform specific downstream tasks. Unlike the discrete text prompts used by
GPT-3, soft prompts are learned through backpropagation and can be tuned to
incorporate signal from any number of labeled examples. Our end-to-end
learned approach outperforms GPT-3's few-shot learning by a large margin.
More remarkably, through ablations on model size using T5, we show that
prompt tuning becomes more competitive with scale: as models exceed billions
of parameters, our method closes the gap and matches the strong performance
of model tuning (where all model weights are tuned). This finding is
especially relevant in that large models are costly to share and serve, and
the ability to reuse one frozen model for multiple downstream tasks can ease
this burden. Our method can be seen as a simplification of the recently
proposed prefix tuning of Li and Liang (2021), and we provide a comparison
to this and other similar approaches. Finally, we show that conditioning a
frozen model with soft prompts confers benefits in robustness to domain
transfer, as compared to full model tuning. [Introduction] With the wide
success of pre-trained large language models, a range of techniques has
arisen to adapt these general-purpose models to downstream tasks. ELMo
(Peters et al., 2018) proposed freezing the pre-trained model and learning a
task-specific weighting of its per-layer representations. However, since GPT
(Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
adaptation technique has been model tuning (or fine-tuning), where all model
parameters are tuned during adaptation, as proposed by Howard and Ruder
(2018).More recently, Brown et al. (2020) showed that prompt design (or
priming) is surprisingly effective at modulating a frozen GPT-3 model’s
behavior through text prompts. Prompts are typically composed of a task
description and/or several canonical examples. This return to freezing
pre-trained models is appealing, especially as model size continues to
increase. Rather than requiring a separate copy of the model for each
downstream task, a single generalist model can simultaneously serve many
different tasks. Unfortunately, prompt-based adaptation has several key
drawbacks. Task description is error-prone and requires human involvement,
and the effectiveness of a prompt is limited by how much conditioning text
can fit into the model’s input. As a result, downstream task quality still
lags far behind that of tuned models. For instance, GPT-3 175B fewshot
performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
efforts to automate prompt design have been recently proposed. Shin et al.
(2020) propose a search algorithm over the discrete space of words, guided
by the downstream application training data. While this technique
outperforms manual prompt design, there is still a gap relative to model
tuning. Li and Liang (2021) propose prefix tuning and show strong results on
generative tasks. This method freezes the model parameters and
backpropagates the error during tuning to prefix activations prepended to
each layer in the encoder stack, including the input layer. Hambardzumyan et
al. (2021) simplify this recipe by restricting the trainable parameters to
the input and output subnetworks of a masked language model, and show
reasonable results on classifications tasks. In this paper, we propose
prompt tuning as a further simplification for adapting language models. We
freeze the entire pre-trained model and only allow an additional k tunable
tokens per downstream task to be prepended to the input text. This soft
prompt is trained end-to-end and can condense the signal from a full labeled
dataset, allowing our method to outperform few-shot prompts and close the
quality gap with model tuning (Figure 1). At the same time, since a single
pre-trained model is recycled for all downstream tasks, we retain the
efficient serving benefits of frozen models (Figure 2). While we developed
our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
(2021), we are the first to show that prompt tuning alone (with no
intermediate-layer prefixes or task-specific output layers) is sufficient to
be competitive with model tuning. Through detailed experiments in sections
2–3, we demonstrate that language model capacity is a key ingredient for
these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
competitive with scale. We compare with similar approaches in Section 4.
Explicitly separating task-specific parameters from the generalist
parameters needed for general language-understanding has a range of
additional benefits. We show in Section 5 that by capturing the task
definition in the prompt while keeping the generalist parameters fixed, we
are able to achieve better resilience to domain shifts. In Section 6, we
show that prompt ensembling, learning multiple prompts for the same task,
can boost quality and is more efficient than classic model ensembling.
Finally, in Section 7, we investigate the interpretability of our learned
soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
and showing its competitiveness with model tuning in the regime of large
language models. 2. Ablating many design choices, and showing quality and
robustness improve with scale. 3. Showing prompt tuning outperforms model
tuning on domain shift problems. 4. Proposing prompt ensembling and showing
its effectiveness.
Question, Answer:
example_title: PEFT (2104.08691)
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] For the first time in the world, we succeeded in synthesizing the
room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
pressure with a modified lead-apatite (LK-99) structure. The
superconductivity of LK-99 is proved with the Critical temperature (Tc),
Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
the Meissner effect. The superconductivity of LK-99 originates from minute
structural distortion by a slight volume shrinkage (0.48 %), not by external
factors such as temperature and pressure. The shrinkage is caused by Cu2+
substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
and it generates the stress. It concurrently transfers to Pb(1) of the
cylindrical column resulting in distortion of the cylindrical column
interface, which creates superconducting quantum wells (SQWs) in the
interface. The heat capacity results indicated that the new model is
suitable for explaining the superconductivity of LK-99. The unique structure
of LK-99 that allows the minute distorted structure to be maintained in the
interfaces is the most important factor that LK-99 maintains and exhibits
superconductivity at room temperatures and ambient pressure. [Introduction]
Since the discovery of the first superconductor(1), many efforts to search
for new roomtemperature superconductors have been carried out worldwide(2,
3) through their experimental clarity or/and theoretical perspectives(4-8).
The recent success of developing room-temperature superconductors with
hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
worldwide, which is expected by strong electron-phonon coupling theory with
high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
apply them to actual application devices in daily life because of the
tremendously high pressure, and more efforts are being made to overcome the
high-pressure problem(13). For the first time in the world, we report the
success in synthesizing a room-temperature and ambient-pressure
superconductor with a chemical approach to solve the temperature and
pressure problem. We named the first room temperature and ambient pressure
superconductor LK-99. The superconductivity of LK-99 proved with the
Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
magnetic field (Hc), and Meissner effect(14, 15). Several data were
collected and analyzed in detail to figure out the puzzle of
superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
Capacity, and Superconducting quantum interference device (SQUID) data.
Henceforth in this paper, we will report and discuss our new findings
including superconducting quantum wells associated with the
superconductivity of LK-99.
Question, Answer:
example_title: LK-99 (Not NLP)
- text: >-
Generate Question, Answer pair correspond to the following research paper.
[Abstract] Abstract Evaluation practices in natural language generation
(NLG) have many known flaws, but improved evaluation approaches are rarely
widely adopted. This issue has become more urgent, since neural NLG models
have improved to the point where they can often no longer be distinguished
based on the surfacelevel features that older metrics rely on. This paper
surveys the issues with human and automatic model evaluations and with
commonly used datasets in NLG that have been pointed out over the past 20
years. We summarize, categorize, and discuss how researchers have been
addressing these issues and what their findings mean for the current state
of model evaluations. Building on those insights, we lay out a long-term
vision for NLG evaluation and propose concrete steps for researchers to
improve their evaluation processes. Finally, we analyze 66 NLG papers from
recent NLP conferences in how well they already follow these suggestions and
identify which areas require more drastic changes to the status quo.
[Introduction] There are many issues with the evaluation of models that
generate natural language. For example, datasets are often constructed in a
way that prevents measuring tail effects of robustness, and they almost
exclusively cover English. Most automated metrics measure only similarity
between model output and references instead of fine-grained quality aspects
(and even that poorly). Human evaluations have a high variance and, due to
insufficient documentation, rarely produce replicable results. These issues
have become more urgent as the nature of models that generate language has
changed without significant changes to how they are being evaluated. While
evaluation methods can capture surface-level improvements in text generated
by state-of-the-art models (such as increased fluency) to some extent, they
are ill-suited to detect issues with the content of model outputs, for
example if they are not attributable to input information. These ineffective
evaluations lead to overestimates of model capabilities. Deeper analyses
uncover that popular models fail even at simple tasks by taking shortcuts,
overfitting, hallucinating, and not being in accordance with their
communicative goals. Identifying these shortcomings, many recent papers
critique evaluation techniques or propose new ones. But almost none of the
suggestions are followed or new techniques used. There is an incentive
mismatch between conducting high-quality evaluations and publishing new
models or modeling techniques. While general-purpose evaluation techniques
could lower the barrier of entry for incorporating evaluation advances into
model development, their development requires resources that are hard to
come by, including model outputs on validation and test sets or large
quantities of human assessments of such outputs. Moreover, some issues, like
the refinement of datasets, require iterative processes where many
researchers collaborate. All this leads to a circular dependency where
evaluations of generation models can be improved only if generation models
use better evaluations. We find that there is a systemic difference between
selecting the best model and characterizing how good this model really is.
Current evaluation techniques focus on the first, while the second is
required to detect crucial issues. More emphasis needs to be put on
measuring and reporting model limitations, rather than focusing on producing
the highest performance numbers. To that end, this paper surveys analyses
and critiques of evaluation approaches (sections 3 and 4) and of commonly
used NLG datasets (section 5). Drawing on their insights, we describe how
researchers developing modeling techniques can help to improve and
subsequently benefit from better evaluations with methods available today
(section 6). Expanding on existing work on model documentation and formal
evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
propose releasing evaluation reports which focus on demonstrating NLG model
shortcomings using evaluation suites. These reports should apply a
complementary set of automatic metrics, include rigorous human evaluations,
and be accompanied by data releases that allow for re-analysis with improved
metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
dimensions related to our suggestions (section 7), we find that the first
steps toward an improved evaluation are already frequently taken at an
average rate of 27%. The analysis uncovers the dimensions that require more
drastic changes in the NLG community. For example, 84% of papers already
report results on multiple datasets and more than 28% point out issues in
them, but we found only a single paper that contributed to the dataset
documentation, leaving future researchers to re-identify those issues. We
further highlight typical unsupported claims and a need for more consistent
data release practices. Following the suggestions and results, we discuss
how incorporating the suggestions can improve evaluation research, how the
suggestions differ from similar ones made for NLU, and how better metrics
can benefit model development itself (section 8).
Question, Answer:
example_title: NLG-Eval (2202.06935)
datasets:
- UNIST-Eunchan/NLP-Paper-to-QA-Generation
language:
- en
pipeline_tag: text2text-generation
---
# FLAN-T5-NLP-Paper-to-Question-Generation
This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the [NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset, which is derived from [allenai/QASPER](https://huggingface.co/datasets/allenai/qasper), a dataset for question answering on scientific research papers.
## Target Task
- NLP Paper's Abstract + Introduction --> {Question} [SEP] {Answer}
- Question-based Summarization
- Long Document Summarization
- Scientific Paper Summarization
## (1) How to use: Inference on CPU (Code Snippets)
- Inference on CPU can be slow; see section (2) for a faster GPU setup.
### Load model directly
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
```
### Prompting Input
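The snippet below assumes a `text` dict holding the paper's Abstract and Introduction. A minimal, hypothetical example (shortened placeholder strings, not part of this repository):

```python
# Hypothetical input: replace the placeholders with the full Abstract and Introduction of your paper.
text = {
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks ...",
    "introduction": "Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular ...",
}
```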
```python
# `text` is a dict holding the paper's sections: text["abstract"] and text["introduction"]
txt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    f"[Abstract] {text['abstract']} [Introduction] {text['introduction']} "
    "Question, Answer: "
)
inputs = tokenizer(txt, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")
```
### For Multiple Question Generation (👍)
```python
num_generate_sequence = 4  # e.g. 1, 2, 8, or 16
summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
```
### For Single Question Generation
```python
summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95)
```
```python
decoded_summaries = [tokenizer.decode(s, skip_special_tokens=False, clean_up_tokenization_spaces=True) for s in summaries]
decoded_summaries = [d.replace("<n>", " ").replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "") for d in decoded_summaries]
```
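Because the model is trained to emit `{Question} [SEP] {Answer}`, each decoded string can be split on the `[SEP]` marker. A minimal sketch using `decoded_summaries` from the snippet above:

```python
# Split each generated string into a (question, answer) pair on the "[SEP]" marker.
qa_pairs = []
for s in decoded_summaries:
    if "[SEP]" in s:
        question, answer = s.split("[SEP]", 1)
        qa_pairs.append((question.strip(), answer.strip()))
    else:
        qa_pairs.append((s.strip(), ""))  # no separator found: keep the raw text

for q, a in qa_pairs:
    print("Q:", q)
    print("A:", a)
```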
## (2) Faster Inference on GPU
- About 60x faster than (1) (CPU vs. Colab T4 GPU).
### Additional Installation
```python
!pip install accelerate -q
!pip install bitsandbytes -q
!pip install optimum -q
```
### Load model directly
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
from optimum.bettertransformer import BetterTransformer

# Load the model in 4-bit with bfloat16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation",
    quantization_config=quantization_config
)
model = BetterTransformer.transform(model)
```
### For Multiple Question Generation (👍)
```python
# Reuse `inputs` from section (1) and move the tensors to the model's device.
device = model.device
num_generate_sequence = 16  # about 20 sec on a Colab T4 GPU
summaries = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
```
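Decoding the GPU-generated sequences works exactly as in section (1). If you want to verify the timing figure above, a simple (hypothetical) check is to wrap the call with `time.perf_counter()`:

```python
import time

start = time.perf_counter()
summaries = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
print(f"Generated {num_generate_sequence} sequences in {time.perf_counter() - start:.1f} s")
```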
### Training results
It achieves the following results on the evaluation set:
- Loss: 0.4504
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log | 0.99 | 46 | 34.6109 |
| 29.7732 | 1.99 | 92 | 16.5236 |
| 29.7732 | 2.98 | 138 | 4.6887 |
| 7.9911 | 3.97 | 184 | 0.5679 |
| 7.9911 | 4.97 | 230 | 0.4795 |
| 0.6152 | 5.96 | 276 | 0.4577 |
| 0.6152 | 6.95 | 322 | 0.4523 |
| 0.4811 | 7.95 | 368 | 0.4509 |
| 0.4811 | 8.94 | 414 | 0.4505 |
| 0.4721 | 9.93 | 460 | 0.4504 |
## Model description
- FLAN-T5-Large (783M)
### Generated Output Example
- Our model generates 16 different Q-A pairs with top-p sampling (`num_return_sequences=16`).
```python
input: r"""
Generate Question, Answer pair correspond to the following research paper.
[Abstract] In this work, we explore prompt tuning, a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method closes the gap and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed prefix tuning of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning. [Introduction] With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or fine-tuning), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).More recently, Brown et al. (2020) showed that prompt design (or priming) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to freezing pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks. Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B fewshot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning. Li and Liang (2021) propose prefix tuning and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. 
(2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classifications tasks. In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This soft prompt is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2). While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale. We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the generalist parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that prompt ensembling, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models. 2. Ablating many design choices, and showing quality and robustness improve with scale. 3. Showing prompt tuning outperforms model tuning on domain shift problems. 4. Proposing prompt ensembling and showing its effectiveness.
Question, Answer:
""".replace("\n", "")
output= [' What was the size of each untrained model?[SEP] The size of the model can be a combination of the size of all the parameters in a model',
' What are the benefits of using soft prompts?[SEP] They reduce the need to use manual prompt design and conserve machine training data',
' What is the sample size of dataset?[SEP] 22840',
' How does the method outperform some of the pre-trained models?[SEP] They successfully tune their model for two tasks, one for a few shot and the other for several downstream tasks.',
' What is the sample size of the experiments?[SEP]135 for a simple task?[SEP]32 for a more complicated task',
' What is the baseline model they tested? [SEP] GPT-3 model, with four state-of-the-art examples in a masked language model',
' What task accuracy is given by prompts?[SEP]Mixed task efficiency was 93% and accuracy 85% compared to normal noise level',
' What metrics do they use?[SEP] EMO score, VSD, and SVM scores',
' What metrics are used to assess the performance of the soft prompt training?[SEP] quality of translation, accuracy of text-to-text, robustness of domain transfer, error rate.',
' How much do they experiment with the T5 baseline?[SEP] The baseline is used for simulated benchmarks.',
' Which task are they applying their method to?[SEP]They test their approach on classifications tasks',
" Why do they show that their approach outperforms GPT-3's few-shot? [SEP] This is a large project that uses a multi-task approach to train GPT-3 models. In this paper, they demonstrate that the current method outperforms both the GPT-3 few-shot and the Li and Liang prefix tuning. They also show that the prefix tuning performed much better than the model tuning. What is the difference between their experiments",
' How do they compare with other techniques? [SEP] They provide a comparison for each approach.',
' Which task is the GPT-3 model most applicable to?[SEP]Classification tasks. For which tasks does the model need a subnetwork?[SEP]Classification tasks for GPT-3',
' What is the baseline test case used for this experiment?[SEP]Pompets for a variety of tasks are trained using the same method. This is the baseline, and the baseline is used for all applications.',
' What was the size of their model?[SEP] They experimented with 0.5 m.m and 0.5 m.m respectively.']
```
## Training and evaluation data
- Used Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset.
- Train: dataset['train'] + dataset['test']
- Evaluation: dataset['validation']
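A minimal sketch of reproducing this split with the `datasets` library (an assumption about the loading code, not the original training script):

```python
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("UNIST-Eunchan/NLP-Paper-to-QA-Generation")

# Train on 'train' + 'test', evaluate on 'validation', as described above.
train_data = concatenate_datasets([dataset["train"], dataset["test"]])
eval_data = dataset["validation"]
```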
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 184
- num_epochs: 10
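
For reference, a sketch of how these hyperparameters might map onto `Seq2SeqTrainingArguments`; the original training script is not part of this card, so treat this as an approximation (the `output_dir` is hypothetical, and the Adam betas/epsilon listed above are the Trainer defaults):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameters listed above, not the original training script.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-nlp-paper-to-question-generation",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,  # effective train batch size: 16
    lr_scheduler_type="linear",
    warmup_steps=184,
    num_train_epochs=10,
    seed=42,
)
```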