metadata
license: apache-2.0
base_model: google/flan-t5-large
tags:
  - generated_from_trainer
  - NLPPaper_to_Question_Generation
  - Summarization
  - Long Document Summarization
model-index:
  - name: FLAN-T5-NLP-Paper-to-Question-Generation
    results: []
widget:
  - text: >-
      Generate Question, Answer pair correspond to the following research paper.
      [Abstract] The dominant sequence transduction models are based on complex
      recurrent or convolutional neural networks in an encoder-decoder
      configuration. The best performing models also connect the encoder and
      decoder through an attention mechanism. We propose a new simple network
      architecture, the Transformer, based solely on attention mechanisms,
      dispensing with recurrence and convolutions entirely. Experiments on two
      machine translation tasks show these models to be superior in quality
      while being more parallelizable and requiring significantly less time to
      train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German
      translation task, improving over the existing best results, including
      ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation
      task, our model establishes a new single-model state-of-the-art BLEU score
      of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the
      training costs of the best models from the literature. We show that the
      Transformer generalizes well to other tasks by applying it successfully to
      English constituency parsing both with large and limited training data.
      [Introduction] Recurrent neural networks, long short-term memory [13] and
      gated recurrent [7] neural networks in particular, have been firmly
      established as state of the art approaches in sequence modeling and
      transduction problems such as language modeling and machine translation
      [35, 2, 5]. Numerous efforts have since continued to push the boundaries
      of recurrent language models and encoder-decoder architectures [38, 24,
      15]. Recurrent models typically factor computation along the symbol
      positions of the input and output sequences. Aligning the positions to
      steps in computation time, they generate a sequence of hidden states ht,
      as a function of the previous hidden state ht−1 and the input for position
      t. This inherently sequential nature precludes parallelization within
      training examples, which becomes critical at longer sequence lengths, as
      memory constraints limit batching across examples. Recent work has
      achieved significant improvements in computational efficiency through
      factorization tricks [21] and conditional computation [32], while also
      improving model performance in case of the latter. The fundamental
      constraint of sequential computation, however, remains. Attention
      mechanisms have become an integral part of compelling sequence modeling
      and transduction models in various tasks, allowing modeling of
      dependencies without regard to their distance in the input or output
      sequences [2, 19]. In all but a few cases [27], however, such attention
      mechanisms are used in conjunction with a recurrent network. In this work
      we propose the Transformer, a model architecture eschewing recurrence and
      instead relying entirely on an attention mechanism to draw global
      dependencies between input and output. The Transformer allows for
      significantly more parallelization and can reach a new state of the art in
      translation quality after being trained for as little as twelve hours on
      eight P100 GPUs. 
       Question, Answer:
    example_title: Attention Is All You Need
  - text: >-
      Generate Question, Answer pair correspond to the following research paper.
      [Abstract] In this work, we explore prompt tuning, a simple yet effective
      mechanism for learning soft prompts to condition frozen language models to
      perform specific downstream tasks. Unlike the discrete text prompts used
      by GPT-3, soft prompts are learned through backpropagation and can be
      tuned to incorporate signal from any number of labeled examples. Our
      end-to-end learned approach outperforms GPT-3's few-shot learning by a
      large margin. More remarkably, through ablations on model size using T5,
      we show that prompt tuning becomes more competitive with scale: as models
      exceed billions of parameters, our method closes the gap and matches the
      strong performance of model tuning (where all model weights are tuned).
      This finding is especially relevant in that large models are costly to
      share and serve, and the ability to reuse one frozen model for multiple
      downstream tasks can ease this burden. Our method can be seen as a
      simplification of the recently proposed prefix tuning of Li and Liang
      (2021), and we provide a comparison to this and other similar approaches.
      Finally, we show that conditioning a frozen model with soft prompts
      confers benefits in robustness to domain transfer, as compared to full
      model tuning. [Introduction] With the wide success of pre-trained large
      language models, a range of techniques has arisen to adapt these
      general-purpose models to downstream tasks. ELMo (Peters et al., 2018)
      proposed freezing the pre-trained model and learning a task-specific
      weighting of its per-layer representations. However, since GPT (Radford et
      al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation
      technique has been model tuning (or fine-tuning), where all model
      parameters are tuned during adaptation, as proposed by Howard and Ruder
      (2018).More recently, Brown et al. (2020) showed that prompt design (or
      priming) is surprisingly effective at modulating a frozen GPT-3 model’s
      behavior through text prompts. Prompts are typically composed of a task
      description and/or several canonical examples. This return to freezing
      pre-trained models is appealing, especially as model size continues to
      increase. Rather than requiring a separate copy of the model for each
      downstream task, a single generalist model can simultaneously serve many
      different tasks. Unfortunately, prompt-based adaptation has several key
      drawbacks. Task description is error-prone and requires human involvement,
      and the effectiveness of a prompt is limited by how much conditioning text
      can fit into the model’s input. As a result, downstream task quality still
      lags far behind that of tuned models. For instance, GPT-3 175B fewshot
      performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
      al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
      efforts to automate prompt design have been recently proposed. Shin et al.
      (2020) propose a search algorithm over the discrete space of words, guided
      by the downstream application training data. While this technique
      outperforms manual prompt design, there is still a gap relative to model
      tuning. Li and Liang (2021) propose prefix tuning and show strong results
      on generative tasks. This method freezes the model parameters and
      backpropagates the error during tuning to prefix activations prepended to
      each layer in the encoder stack, including the input layer. Hambardzumyan
      et al. (2021) simplify this recipe by restricting the trainable parameters
      to the input and output subnetworks of a masked language model, and show
      reasonable results on classifications tasks. In this paper, we propose
      prompt tuning as a further simplification for adapting language models. We
      freeze the entire pre-trained model and only allow an additional k tunable
      tokens per downstream task to be prepended to the input text. This soft
      prompt is trained end-to-end and can condense the signal from a full
      labeled dataset, allowing our method to outperform few-shot prompts and
      close the quality gap with model tuning (Figure 1). At the same time,
      since a single pre-trained model is recycled for all downstream tasks, we
      retain the efficient serving benefits of frozen models (Figure 2). While
      we developed our method concurrently with Li and Liang (2021) and
      Hambardzumyan et al. (2021), we are the first to show that prompt tuning
      alone (with no intermediate-layer prefixes or task-specific output layers)
      is sufficient to be competitive with model tuning. Through detailed
      experiments in sections 2–3, we demonstrate that language model capacity
      is a key ingredient for these approaches to succeed. As Figure 1 shows,
      prompt tuning becomes more competitive with scale. We compare with similar
      approaches in Section 4. Explicitly separating task-specific parameters
      from the generalist parameters needed for general language-understanding
      has a range of additional benefits. We show in Section 5 that by capturing
      the task definition in the prompt while keeping the generalist parameters
      fixed, we are able to achieve better resilience to domain shifts. In
      Section 6, we show that prompt ensembling, learning multiple prompts for
      the same task, can boost quality and is more efficient than classic model
      ensembling. Finally, in Section 7, we investigate the interpretability of
      our learned soft prompts. In sum, our key contributions are: 1. Proposing
      prompt tuning and showing its competitiveness with model tuning in the
      regime of large language models. 2. Ablating many design choices, and
      showing quality and robustness improve with scale. 3. Showing prompt
      tuning outperforms model tuning on domain shift problems. 4. Proposing
      prompt ensembling and showing its effectiveness. 
       Question, Answer:
    example_title: PEFT (2104.08691)
  - text: >-
      Generate Question, Answer pair correspond to the following research paper.
      [Abstract] For the first time in the world, we succeeded in synthesizing
      the room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
      pressure with a modified lead-apatite (LK-99) structure. The
      superconductivity of LK-99 is proved with the Critical temperature (Tc),
      Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
      the Meissner effect. The superconductivity of LK-99 originates from minute
      structural distortion by a slight volume shrinkage (0.48 %), not by
      external factors such as temperature and pressure. The shrinkage is caused
      by Cu2+ substitution of Pb2+(2) ions in the insulating network of
      Pb(2)-phosphate and it generates the stress. It concurrently transfers to
      Pb(1) of the cylindrical column resulting in distortion of the cylindrical
      column interface, which creates superconducting quantum wells (SQWs) in
      the interface. The heat capacity results indicated that the new model is
      suitable for explaining the superconductivity of LK-99. The unique
      structure of LK-99 that allows the minute distorted structure to be
      maintained in the interfaces is the most important factor that LK-99
      maintains and exhibits superconductivity at room temperatures and ambient
      pressure. [Introduction]  Since the discovery of the first
      superconductor(1), many efforts to search for new roomtemperature
      superconductors have been carried out worldwide(2, 3) through their
      experimental clarity or/and theoretical perspectives(4-8). The recent
      success of developing room-temperature superconductors with hydrogen
      sulfide(9) and yttrium super-hydride(10) has great attention worldwide,
      which is expected by strong electron-phonon coupling theory with
      high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
      apply them to actual application devices in daily life because of the
      tremendously high pressure, and more efforts are being made to overcome
      the high-pressure problem(13). For the first time in the world, we report
      the success in synthesizing a room-temperature and ambient-pressure
      superconductor with a chemical approach to solve the temperature and
      pressure problem. We named the first room temperature and ambient pressure
      superconductor LK-99. The superconductivity of LK-99 proved with the
      Critical temperature (Tc), Zero-resistivity, Critical current (Ic),
      Critical magnetic field (Hc), and Meissner effect(14, 15). Several data
      were collected and analyzed in detail to figure out the puzzle of
      superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
      spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR),
      Heat Capacity, and Superconducting quantum interference device (SQUID)
      data. Henceforth in this paper, we will report and discuss our new
      findings including superconducting quantum wells associated with the
      superconductivity of LK-99.
       Question, Answer:
    example_title: LK-99 (Not NLP)
  - text: >-
      Generate Question, Answer pair correspond to the following research paper.
      [Abstract] Abstract Evaluation practices in natural language generation
      (NLG) have many known flaws, but improved evaluation approaches are rarely
      widely adopted. This issue has become more urgent, since neural NLG models
      have improved to the point where they can often no longer be distinguished
      based on the surfacelevel features that older metrics rely on. This paper
      surveys the issues with human and automatic model evaluations and with
      commonly used datasets in NLG that have been pointed out over the past 20
      years. We summarize, categorize, and discuss how researchers have been
      addressing these issues and what their findings mean for the current state
      of model evaluations. Building on those insights, we lay out a long-term
      vision for NLG evaluation and propose concrete steps for researchers to
      improve their evaluation processes. Finally, we analyze 66 NLG papers from
      recent NLP conferences in how well they already follow these suggestions
      and identify which areas require more drastic changes to the status quo.
      [Introduction] There are many issues with the evaluation of models that
      generate natural language. For example, datasets are often constructed in
      a way that prevents measuring tail effects of robustness, and they almost
      exclusively cover English. Most automated metrics measure only similarity
      between model output and references instead of fine-grained quality
      aspects (and even that poorly). Human evaluations have a high variance
      and, due to insufficient documentation, rarely produce replicable results.
      These issues have become more urgent as the nature of models that generate
      language has changed without significant changes to how they are being
      evaluated. While evaluation methods can capture surface-level improvements
      in text generated by state-of-the-art models (such as increased fluency)
      to some extent, they are ill-suited to detect issues with the content of
      model outputs, for example if they are not attributable to input
      information. These ineffective evaluations lead to overestimates of model
      capabilities. Deeper analyses uncover that popular models fail even at
      simple tasks by taking shortcuts, overfitting, hallucinating, and not
      being in accordance with their communicative goals. Identifying these
      shortcomings, many recent papers critique evaluation techniques or propose
      new ones. But almost none of the suggestions are followed or new
      techniques used. There is an incentive mismatch between conducting
      high-quality evaluations and publishing new models or modeling techniques.
      While general-purpose evaluation techniques could lower the barrier of
      entry for incorporating evaluation advances into model development, their
      development requires resources that are hard to come by, including model
      outputs on validation and test sets or large quantities of human
      assessments of such outputs. Moreover, some issues, like the refinement of
      datasets, require iterative processes where many researchers collaborate.
      All this leads to a circular dependency where evaluations of generation
      models can be improved only if generation models use better evaluations.
      We find that there is a systemic difference between selecting the best
      model and characterizing how good this model really is. Current evaluation
      techniques focus on the first, while the second is required to detect
      crucial issues. More emphasis needs to be put on measuring and reporting
      model limitations, rather than focusing on producing the highest
      performance numbers. To that end, this paper surveys analyses and
      critiques of evaluation approaches (sections 3 and 4) and of commonly used
      NLG datasets (section 5). Drawing on their insights, we describe how
      researchers developing modeling techniques can help to improve and
      subsequently benefit from better evaluations with methods available today
      (section 6). Expanding on existing work on model documentation and formal
      evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
      propose releasing evaluation reports which focus on demonstrating NLG
      model shortcomings using evaluation suites. These reports should apply a
      complementary set of automatic metrics, include rigorous human
      evaluations, and be accompanied by data releases that allow for
      re-analysis with improved metrics. In an analysis of 66 recent EMNLP,
      INLG, and ACL papers along 29 dimensions related to our suggestions
      (section 7), we find that the first steps toward an improved evaluation
      are already frequently taken at an average rate of 27%. The analysis
      uncovers the dimensions that require more drastic changes in the NLG
      community. For example, 84% of papers already report results on multiple
      datasets and more than 28% point out issues in them, but we found only a
      single paper that contributed to the dataset documentation, leaving future
      researchers to re-identify those issues. We further highlight typical
      unsupported claims and a need for more consistent data release practices.
      Following the suggestions and results, we discuss how incorporating the
      suggestions can improve evaluation research, how the suggestions differ
      from similar ones made for NLU, and how better metrics can benefit model
      development itself (section 8). 
       Question, Answer:
    example_title: NLG-Eval (2202.06935)
  - text: >-
      Generate Question, Answer pair correspond to the following research paper.
      [Abstract] Humans have harbored a longstanding desire to acquire
      additional abilities through absorption. Super Mario serves as an
      embodiment of this human dream, which can collect items to gain extra
      skills such as throwing fireballs and being temporarily invincible. In
      this paper, we uncover that Language Models (LMs), either encoderor
      decoder-based, can obtain new capabilities by assimilating the parameters
      of homologous models without the need for retraining or GPUs. Typically,
      new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT),
      reflected in the disparity between fine-tuned and pre-trained parameters
      (i.e., delta parameters). We initially observe that by introducing a novel
      operation called DARE (Drop And REscale), most of the delta parameters can
      be directly set to zeros without affecting the capabilities of SFT LMs and
      larger models can tolerate a higher proportion of discarded parameters.
      Based on this observation, we further sparsify delta parameters of
      multiple SFT homologous models with DARE and subsequently merge them into
      a single model by parameter averaging. We conduct experiments on eight
      datasets from the GLUE benchmark with BERT and RoBERTa. We also merge
      WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental
      results show that: (1) The delta parameter value ranges for SFT models are
      typically small, often within 0.005, and DARE can eliminate 99% of them
      effortlessly. However, once the models are continuously pre-trained, the
      value ranges can grow to around 0.03, making DARE impractical. We have
      also tried to remove fine-tuned instead of delta parameters and find that
      a 10% reduction can lead to drastically decreased performance (even to
      0.0). This highlights that SFT merely stimulates the abilities via delta
      parameters rather than injecting new abilities into LMs; (2) DARE can
      merge multiple task-specific LMs into one LM with diverse abilities. For
      instance, the merger of WizardLM and WizardMath increases the GSM8K
      zeroshot accuracy of WizardLM from 2.2 to 66.3, retaining its
      instruction-following ability while surpassing WizardMath’s original 64.2
      performance. All resources are available at
      https://github.com/yule-BUAA/MergeLM. [Introduction] Human beings have
      always expressed their ambition to acquire additional abilities through
      various ways such as movies and games. For example, in X-Men’s Apocalypse,
      the character can absorb the powers of other mutants to strengthen
      himself. Likewise, the protagonist in the Super Mario games can gain
      superpowers like throwing fireballs by absorbing in-game items. Large
      Language Models (LLMs), such as GPT-4 [45], can reasonably be considered
      as early iterations of artificial general intelligence systems, given
      their performance is remarkably close to human-level capabilities. In this
      paper, we astonishingly find that LMs, similar to Apocalypse and Super
      Mario, can enhance their capabilities by absorbing other models without
      the need for training or GPUs. Formally, Supervised Fine-Tuning (SFT) is
      the most widely adopted strategy for assigning taskspecific capabilities
      to LMs by optimizing their parameters [13, 67]. The effectiveness of SFT
      is fully evident in the alteration of the model parameters before and
      after SFT, referred to as delta parameters [12]. We initially demonstrate
      that SFT LM (either encoder- or decoder-based) always tends to acquire
      excessively redundant delta parameters. To be specific, we present DARE,
      which randomly resets some delta parameters to zeros based on a drop rate
      p and subsequently scales the remaining parameters by a factor of 1/(1 −
      p). Despite its simplicity, with the assistance of DARE, when the LM model
      parameters reach 70 billion, we can eliminate up to 99% delta parameters
      with minimal impact on model performance (see Figure 1(a)). The more
      parameters the LM has, the larger p it can tolerate. This discovery
      suggests that SFT LM indeed learns a multitude of low-rank structures akin
      to LoRA [25]. Thus, even when most of these structures are removed,
      resulting in a low-rank and extremely sparse delta parameter set, the LM
      can still retain its capabilities. Based on this observation, we can
      confidently merge multiple homologous SFT LMs (pre-trained from the same
      backbone) without significant concerns about the decrease in their
      capabilities. As long as a small portion of the delta parameters remains
      unaffected in the merging process, the abilities of LMs unlocked by SFT
      can still be preserved. We first employ DARE to eliminate redundant delta
      parameters in each model before merging, which can potentially mitigate
      the interference of parameters among multiple models [62]. Then, we apply
      established model merging techniques [59, 26, 44, 27, 62] to the
      parameters with reduced redundancy to create a single model with diverse
      capabilities. We conduct extensive experiments on encoder-based LMs on
      eight datasets from the GLUE benchmark, and decoder-based Llama 2 with
      three distinct abilities: instruction-following, mathematical reasoning,
      and code-generating. We observe that: (1) SFT LMs exhibit a substantial
      number of redundant delta parameters whether they are based on BERT,
      RoBERTa, or Llama 2. DARE allows the removal of approximately 90% or even
      99% delta parameters without significantly affecting the performance of
      downstream tasks. The rescale operation in DARE is a crucial component to
      guarantee effective ablations of delta parameters. Without rescaling,
      removing only 10% delta parameters would noticeably affect performance. We
      attribute this phenomenon to the fact that rescaling helps preserve the
      connectivity of model parameters [46]. (2) DARE is able to enhance the
      performance of most existing model merging methods when merging
      encoder-based LMs on the eight datasets from GLUE. When it comes to larger
      LMs based on Llama 2, the simple parameter averaging method can already
      produce surprisingly good results. As shown in Figure 1(b), we merge
      WizardLM and WizardMath by combining DARE and parameter averaging, leading
      to a significant improvement of WizardLM’s mathematical reasoning ability
      from 2.2 to 64.2 accuracy on GSM8K, while also modestly enhancing its
      instruction-following ability with win rate from 67.2 to 67.5 on
      AlpacaEval. It is worth noticing that all these benefits are achieved by
      solely using CPUs without further training. Similar improvements can also
      be observed when merging code-generating models. (3) DARE is applicable to
      SFT delta parameters whose value ranges are relatively small. Different
      from the observations of delta parameters, dropping only 10% fine-tuned
      parameters would lead to a catastrophic decrease in performance, even
      approaching zero. We also find that the delta parameters of SFT LMs
      usually stay within a range of 0.005 or less, indicating minimal
      modifications to the pre-trained LM. However, once we continue
      pre-training, the delta parameters can rapidly reach around 0.03, making
      DARE infeasible. This further confirms that SFT primarily unlocks the
      abilities of the pre-trained LM, rather than introducing additional
      abilities. Last but not least, we have implemented an open-sourced
      codebase at https://github.com/ yule-BUAA/MergeLM, which integrates
      existing popular model merging methods and supports both encoder- and
      decoder-based language models. We hope this work can advance the
      understanding of how alignment works from the perspective of parameters.

       Question, Answer:
    example_title: LM-SuperMario (2311.03099)
datasets:
  - UNIST-Eunchan/NLP-Paper-to-QA-Generation
language:
  - en
pipeline_tag: text2text-generation

FLAN-T5-NLP-Paper-to-Question-Generation

This model is a fine-tuned version of google/flan-t5-large on the UNIST-Eunchan/NLP-Paper-to-QA-Generation dataset, which is derived from allenai/QASPER, a dataset for question answering on scientific research papers.
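
To inspect the training data, the dataset can be loaded directly from the Hugging Face Hub (a minimal sketch using the datasets library; the split and column names in your environment may differ):

from datasets import load_dataset

# QASPER-derived paper-to-QA-generation dataset used for fine-tuning
dataset = load_dataset("UNIST-Eunchan/NLP-Paper-to-QA-Generation")
print(dataset)  # prints the available splits and their columns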

Target Task

  • NLP Paper's Abstract + Introduction --> {Question} [SEP] {Answer}
  • Question-based Summarization
  • Long Document Summarization
  • Scientific Paper Summarization

(1) How to use: Inference on CPU (Code Snippets)

  • Inference can be slow on CPU

Load model directly

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")

Prompting Input

# text is a dict holding the paper sections, e.g. text = {"abstract": "...", "introduction": "..."}
txt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    f"[Abstract] {text['abstract']} [Introduction] {text['introduction']} "
    "Question, Answer:"
)

inputs = tokenizer(txt, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")

For Multiple Question Generation (👍)

num_generate_sequence = 4  # or 1, 2, 8, 16
summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)

For Single Question Generation

summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95)
decoded_summaries = [tokenizer.decode(s, skip_special_tokens=False, clean_up_tokenization_spaces=True) for s in summaries]
decoded_summaries = [d.replace("<n>", " ").replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "") for d in decoded_summaries]
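
Each decoded sequence follows the {Question} [SEP] {Answer} format described in Target Task. A minimal parsing sketch (assuming at most one [SEP] per generated sequence):

qa_pairs = []
for d in decoded_summaries:
    # split on the first [SEP]; if it is missing, the answer part stays empty
    question, _, answer = d.partition("[SEP]")
    qa_pairs.append((question.strip(), answer.strip()))
print(qa_pairs[0])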

(2) Faster Inference on GPU

  • about 60x faster than (1) (CPU vs. Colab T4 GPU)

Additional Installation

!pip install accelerate -q
!pip install bitsandbytes -q
!pip install optimum -q

Load model directly

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
from optimum.bettertransformer import BetterTransformer

# load model in 4-bit
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation", quantization_config=quantization_config)
model = BetterTransformer.transform(model)

For Multiple Question Generation (👍)

# reuse the tokenized `inputs` from the Prompting Input step in (1) and move them to the model's device
device = model.device

num_generate_sequence = 16  # about 20 sec on a Colab T4 GPU
summaries = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
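
Decoding is the same as in section (1):

decoded_summaries = [tokenizer.decode(s, skip_special_tokens=False, clean_up_tokenization_spaces=True) for s in summaries]
decoded_summaries = [d.replace("<n>", " ").replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "") for d in decoded_summaries]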

Training results

It achieves the following results on the evaluation set:

  • Loss: 0.4504
Training Loss   Epoch   Step   Validation Loss
No log          0.99     46    34.6109
29.7732         1.99     92    16.5236
29.7732         2.98    138     4.6887
7.9911          3.97    184     0.5679
7.9911          4.97    230     0.4795
0.6152          5.96    276     0.4577
0.6152          6.95    322     0.4523
0.4811          7.95    368     0.4509
0.4811          8.94    414     0.4505
0.4721          9.93    460     0.4504

Model description

  • FLAN-T5-Large (783M parameters); a quick parameter-count check is sketched below
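
As a sanity check against the 783M figure, once the (non-quantized) model from section (1) is loaded:

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")  # roughly 783M for FLAN-T5-Large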

Generated Output Example

  • Our model generates 16 different Q-A pairs with top-p sampling.
input: r""" 
Generate Question, Answer pair correspond to the following research paper. 
[Abstract] In this work, we explore prompt tuning, a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method closes the gap and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed prefix tuning of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning. [Introduction] With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or fine-tuning), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).More recently, Brown et al. (2020) showed that prompt design (or priming) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to freezing pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks. Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B fewshot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning. Li and Liang (2021) propose prefix tuning and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. 
(2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classifications tasks. In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This soft prompt is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2). While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale. We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the generalist parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that prompt ensembling, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models. 2. Ablating many design choices, and showing quality and robustness improve with scale. 3. Showing prompt tuning outperforms model tuning on domain shift problems. 4. Proposing prompt ensembling and showing its effectiveness. 
Question, Answer:
""".replace("\n", "")

output= [' What was the size of each untrained model?[SEP] The size of the model can be a combination of the size of all the parameters in a model',
 ' What are the benefits of using soft prompts?[SEP] They reduce the need to use manual prompt design and conserve machine training data',
 ' What is the sample size of dataset?[SEP] 22840',
 ' How does the method outperform some of the pre-trained models?[SEP] They successfully tune their model for two tasks, one for a few shot and the other for several downstream tasks.',
 ' What is the sample size of the experiments?[SEP]135 for a simple task?[SEP]32 for a more complicated task',
 ' What is the baseline model they tested? [SEP] GPT-3 model, with four state-of-the-art examples in a masked language model',
 ' What task accuracy is given by prompts?[SEP]Mixed task efficiency was 93% and accuracy 85% compared to normal noise level',
 ' What metrics do they use?[SEP] EMO score, VSD, and SVM scores',
 ' What metrics are used to assess the performance of the soft prompt training?[SEP] quality of translation, accuracy of text-to-text, robustness of domain transfer, error rate.',
 ' How much do they experiment with the T5 baseline?[SEP] The baseline is used for simulated benchmarks.',
 ' Which task are they applying their method to?[SEP]They test their approach on classifications tasks',
 " Why do they show that their approach outperforms GPT-3's few-shot? [SEP] This is a large project that uses a multi-task approach to train GPT-3 models. In this paper, they demonstrate that the current method outperforms both the GPT-3 few-shot and the Li and Liang prefix tuning. They also show that the prefix tuning performed much better than the model tuning. What is the difference between their experiments",
 ' How do they compare with other techniques? [SEP] They provide a comparison for each approach.',
 ' Which task is the GPT-3 model most applicable to?[SEP]Classification tasks. For which tasks does the model need a subnetwork?[SEP]Classification tasks for GPT-3',
 ' What is the baseline test case used for this experiment?[SEP]Pompets for a variety of tasks are trained using the same method. This is the baseline, and the baseline is used for all applications.',
 ' What was the size of their model?[SEP] They experimented with 0.5 m.m and 0.5 m.m respectively.']

Inference Examples

If the Inference API produces poor output, you can call model.generate() in your own code for better results.

Training and evaluation data

Training hyperparameters

The following hyperparameters were used during training; a corresponding Seq2SeqTrainingArguments sketch follows the list:

  • learning_rate: 0.0001
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 184
  • num_epochs: 10
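
These settings correspond roughly to the following Seq2SeqTrainingArguments (a sketch for reference only; the actual training script is not part of this card, and output_dir is a placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-nlp-paper-to-question-generation",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=16,  # effective (total) train batch size 16
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=184,
    num_train_epochs=10,
)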