---
base_model: google/pegasus-x-base
tags:
- generated_from_trainer
datasets:
- arxiv-summarization

widget:
- text: >-
    
    [Abstract] The dominant sequence transduction models are based on complex
    recurrent or convolutional neural networks in an encoder-decoder
    configuration. The best performing models also connect the encoder and
    decoder through an attention mechanism. We propose a new simple network
    architecture, the Transformer, based solely on attention mechanisms,
    dispensing with recurrence and convolutions entirely. Experiments on two
    machine translation tasks show these models to be superior in quality while
    being more parallelizable and requiring significantly less time to train.
    Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
    task, improving over the existing best results, including ensembles by over
    2 BLEU. On the WMT 2014 English-to-French translation task, our model
    establishes a new single-model state-of-the-art BLEU score of 41.8 after
    training for 3.5 days on eight GPUs, a small fraction of the training costs
    of the best models from the literature. We show that the Transformer
    generalizes well to other tasks by applying it successfully to English
    constituency parsing both with large and limited training data.
    [Introduction] Recurrent neural networks, long short-term memory [13] and
    gated recurrent [7] neural networks in particular, have been firmly
    established as state of the art approaches in sequence modeling and
    transduction problems such as language modeling and machine translation [35,
    2, 5]. Numerous efforts have since continued to push the boundaries of
    recurrent language models and encoder-decoder architectures [38, 24, 15].
    Recurrent models typically factor computation along the symbol positions of
    the input and output sequences. Aligning the positions to steps in
    computation time, they generate a sequence of hidden states ht, as a
    function of the previous hidden state ht−1 and the input for position t.
    This inherently sequential nature precludes parallelization within training
    examples, which becomes critical at longer sequence lengths, as memory
    constraints limit batching across examples. Recent work has achieved
    significant improvements in computational efficiency through factorization
    tricks [21] and conditional computation [32], while also improving model
    performance in case of the latter. The fundamental constraint of sequential
    computation, however, remains. Attention mechanisms have become an integral
    part of compelling sequence modeling and transduction models in various
    tasks, allowing modeling of dependencies without regard to their distance in
    the input or output sequences [2, 19]. In all but a few cases [27], however,
    such attention mechanisms are used in conjunction with a recurrent network.
    In this work we propose the Transformer, a model architecture eschewing
    recurrence and instead relying entirely on an attention mechanism to draw
    global dependencies between input and output. The Transformer allows for
    significantly more parallelization and can reach a new state of the art in
    translation quality after being trained for as little as twelve hours on
    eight P100 GPUs. 
  example_title: Attention Is All You Need
- text: >-
    [Abstract] In this work, we explore prompt tuning, a simple yet effective
    mechanism for learning soft prompts to condition frozen language models to
    perform specific downstream tasks. Unlike the discrete text prompts used by
    GPT-3, soft prompts are learned through backpropagation and can be tuned to
    incorporate signal from any number of labeled examples. Our end-to-end
    learned approach outperforms GPT-3's few-shot learning by a large margin.
    More remarkably, through ablations on model size using T5, we show that
    prompt tuning becomes more competitive with scale: as models exceed billions
    of parameters, our method closes the gap and matches the strong performance
    of model tuning (where all model weights are tuned). This finding is
    especially relevant in that large models are costly to share and serve, and
    the ability to reuse one frozen model for multiple downstream tasks can ease
    this burden. Our method can be seen as a simplification of the recently
    proposed prefix tuning of Li and Liang (2021), and we provide a comparison
    to this and other similar approaches. Finally, we show that conditioning a
    frozen model with soft prompts confers benefits in robustness to domain
    transfer, as compared to full model tuning. [Introduction] With the wide
    success of pre-trained large language models, a range of techniques has
    arisen to adapt these general-purpose models to downstream tasks. ELMo
    (Peters et al., 2018) proposed freezing the pre-trained model and learning a
    task-specific weighting of its per-layer representations. However, since GPT
    (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
    adaptation technique has been model tuning (or fine-tuning), where all model
    parameters are tuned during adaptation, as proposed by Howard and Ruder
    (2018). More recently, Brown et al. (2020) showed that prompt design (or
    priming) is surprisingly effective at modulating a frozen GPT-3 model’s
    behavior through text prompts. Prompts are typically composed of a task
    description and/or several canonical examples. This return to freezing
    pre-trained models is appealing, especially as model size continues to
    increase. Rather than requiring a separate copy of the model for each
    downstream task, a single generalist model can simultaneously serve many
    different tasks. Unfortunately, prompt-based adaptation has several key
    drawbacks. Task description is error-prone and requires human involvement,
    and the effectiveness of a prompt is limited by how much conditioning text
    can fit into the model’s input. As a result, downstream task quality still
    lags far behind that of tuned models. For instance, GPT-3 175B few-shot
    performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
    al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
    efforts to automate prompt design have been recently proposed. Shin et al.
    (2020) propose a search algorithm over the discrete space of words, guided
    by the downstream application training data. While this technique
    outperforms manual prompt design, there is still a gap relative to model
    tuning. Li and Liang (2021) propose prefix tuning and show strong results on
    generative tasks. This method freezes the model parameters and
    backpropagates the error during tuning to prefix activations prepended to
    each layer in the encoder stack, including the input layer. Hambardzumyan et
    al. (2021) simplify this recipe by restricting the trainable parameters to
    the input and output subnetworks of a masked language model, and show
    reasonable results on classifications tasks. In this paper, we propose
    prompt tuning as a further simplification for adapting language models. We
    freeze the entire pre-trained model and only allow an additional k tunable
    tokens per downstream task to be prepended to the input text. This soft
    prompt is trained end-to-end and can condense the signal from a full labeled
    dataset, allowing our method to outperform few-shot prompts and close the
    quality gap with model tuning (Figure 1). At the same time, since a single
    pre-trained model is recycled for all downstream tasks, we retain the
    efficient serving benefits of frozen models (Figure 2). While we developed
    our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
    (2021), we are the first to show that prompt tuning alone (with no
    intermediate-layer prefixes or task-specific output layers) is sufficient to
    be competitive with model tuning. Through detailed experiments in sections
    2–3, we demonstrate that language model capacity is a key ingredient for
    these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
    competitive with scale. We compare with similar approaches in Section 4.
    Explicitly separating task-specific parameters from the generalist
    parameters needed for general language-understanding has a range of
    additional benefits. We show in Section 5 that by capturing the task
    definition in the prompt while keeping the generalist parameters fixed, we
    are able to achieve better resilience to domain shifts. In Section 6, we
    show that prompt ensembling, learning multiple prompts for the same task,
    can boost quality and is more efficient than classic model ensembling.
    Finally, in Section 7, we investigate the interpretability of our learned
    soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
    and showing its competitiveness with model tuning in the regime of large
    language models. 2. Ablating many design choices, and showing quality and
    robustness improve with scale. 3. Showing prompt tuning outperforms model
    tuning on domain shift problems. 4. Proposing prompt ensembling and showing
    its effectiveness. 
  example_title: PEFT (2104.08691)
- text: >-
    [Abstract] For the first time in the world, we succeeded in synthesizing the
    room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
    pressure with a modified lead-apatite (LK-99) structure. The
    superconductivity of LK-99 is proved with the Critical temperature (Tc),
    Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
    the Meissner effect. The superconductivity of LK-99 originates from minute
    structural distortion by a slight volume shrinkage (0.48 %), not by external
    factors such as temperature and pressure. The shrinkage is caused by Cu2+
    substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
    and it generates the stress. It concurrently transfers to Pb(1) of the
    cylindrical column resulting in distortion of the cylindrical column
    interface, which creates superconducting quantum wells (SQWs) in the
    interface. The heat capacity results indicated that the new model is
    suitable for explaining the superconductivity of LK-99. The unique structure
    of LK-99 that allows the minute distorted structure to be maintained in the
    interfaces is the most important factor that LK-99 maintains and exhibits
    superconductivity at room temperatures and ambient pressure. [Introduction] 
    Since the discovery of the first superconductor(1), many efforts to search
    for new room-temperature superconductors have been carried out worldwide(2,
    3) through their experimental clarity or/and theoretical perspectives(4-8).
    The recent success of developing room-temperature superconductors with
    hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
    worldwide, which is expected by strong electron-phonon coupling theory with
    high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
    apply them to actual application devices in daily life because of the
    tremendously high pressure, and more efforts are being made to overcome the
    high-pressure problem(13). For the first time in the world, we report the
    success in synthesizing a room-temperature and ambient-pressure
    superconductor with a chemical approach to solve the temperature and
    pressure problem. We named the first room temperature and ambient pressure
    superconductor LK-99. The superconductivity of LK-99 proved with the
    Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
    magnetic field (Hc), and Meissner effect(14, 15). Several data were
    collected and analyzed in detail to figure out the puzzle of
    superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
    spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
    Capacity, and Superconducting quantum interference device (SQUID) data.
    Henceforth in this paper, we will report and discuss our new findings
    including superconducting quantum wells associated with the
    superconductivity of LK-99.
  example_title: LK-99 (Not NLP)
- text: >-
    [Abstract] Evaluation practices in natural language generation
    (NLG) have many known flaws, but improved evaluation approaches are rarely
    widely adopted. This issue has become more urgent, since neural NLG models
    have improved to the point where they can often no longer be distinguished
    based on the surface-level features that older metrics rely on. This paper
    surveys the issues with human and automatic model evaluations and with
    commonly used datasets in NLG that have been pointed out over the past 20
    years. We summarize, categorize, and discuss how researchers have been
    addressing these issues and what their findings mean for the current state
    of model evaluations. Building on those insights, we lay out a long-term
    vision for NLG evaluation and propose concrete steps for researchers to
    improve their evaluation processes. Finally, we analyze 66 NLG papers from
    recent NLP conferences in how well they already follow these suggestions and
    identify which areas require more drastic changes to the status quo.
    [Introduction] There are many issues with the evaluation of models that
    generate natural language. For example, datasets are often constructed in a
    way that prevents measuring tail effects of robustness, and they almost
    exclusively cover English. Most automated metrics measure only similarity
    between model output and references instead of fine-grained quality aspects
    (and even that poorly). Human evaluations have a high variance and, due to
    insufficient documentation, rarely produce replicable results. These issues
    have become more urgent as the nature of models that generate language has
    changed without significant changes to how they are being evaluated. While
    evaluation methods can capture surface-level improvements in text generated
    by state-of-the-art models (such as increased fluency) to some extent, they
    are ill-suited to detect issues with the content of model outputs, for
    example if they are not attributable to input information. These ineffective
    evaluations lead to overestimates of model capabilities. Deeper analyses
    uncover that popular models fail even at simple tasks by taking shortcuts,
    overfitting, hallucinating, and not being in accordance with their
    communicative goals. Identifying these shortcomings, many recent papers
    critique evaluation techniques or propose new ones. But almost none of the
    suggestions are followed or new techniques used. There is an incentive
    mismatch between conducting high-quality evaluations and publishing new
    models or modeling techniques. While general-purpose evaluation techniques
    could lower the barrier of entry for incorporating evaluation advances into
    model development, their development requires resources that are hard to
    come by, including model outputs on validation and test sets or large
    quantities of human assessments of such outputs. Moreover, some issues, like
    the refinement of datasets, require iterative processes where many
    researchers collaborate. All this leads to a circular dependency where
    evaluations of generation models can be improved only if generation models
    use better evaluations. We find that there is a systemic difference between
    selecting the best model and characterizing how good this model really is.
    Current evaluation techniques focus on the first, while the second is
    required to detect crucial issues. More emphasis needs to be put on
    measuring and reporting model limitations, rather than focusing on producing
    the highest performance numbers. To that end, this paper surveys analyses
    and critiques of evaluation approaches (sections 3 and 4) and of commonly
    used NLG datasets (section 5). Drawing on their insights, we describe how
    researchers developing modeling techniques can help to improve and
    subsequently benefit from better evaluations with methods available today
    (section 6). Expanding on existing work on model documentation and formal
    evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
    propose releasing evaluation reports which focus on demonstrating NLG model
    shortcomings using evaluation suites. These reports should apply a
    complementary set of automatic metrics, include rigorous human evaluations,
    and be accompanied by data releases that allow for re-analysis with improved
    metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
    dimensions related to our suggestions (section 7), we find that the first
    steps toward an improved evaluation are already frequently taken at an
    average rate of 27%. The analysis uncovers the dimensions that require more
    drastic changes in the NLG community. For example, 84% of papers already
    report results on multiple datasets and more than 28% point out issues in
    them, but we found only a single paper that contributed to the dataset
    documentation, leaving future researchers to re-identify those issues. We
    further highlight typical unsupported claims and a need for more consistent
    data release practices. Following the suggestions and results, we discuss
    how incorporating the suggestions can improve evaluation research, how the
    suggestions differ from similar ones made for NLU, and how better metrics
    can benefit model development itself (section 8). 
  example_title: NLG-Eval (2202.06935)
model-index:
- name: Long-paper-summarization-pegasus-x-b
  results:
  - task:
      name: Summarization
      type: summarization
    dataset:
      name: ccdv/arxiv-summarization
      type: ccdv/arxiv-summarization
      config: section
      split: test
      args: section
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 35.6639
    - name: ROUGE-2
      type: rouge
      value: 9.81362
    - name: ROUGE-L
      type: rouge
      value: 19.9013
    - name: ROUGE-LSum
      type: rouge
      value: 28.1444
    
license: mit
language:
- en
metrics:
- rouge

---



# Long-paper-summarization-pegasus-x-b

This model is a fine-tuned version of [google/pegasus-x-base](https://huggingface.co/google/pegasus-x-base) on the arxiv-summarization dataset.
It achieves the following results on the evaluation set:
- Loss: 2.7262


## Model Description / Training and evaluation data


**Base Model**: [PEGASUS-X base](https://huggingface.co/google/pegasus-x-base), a model designed for long-input summarization

**Finetuning Dataset**: 
- Fine-tuned on the `train[25000:100000]` slice (75,000 examples) of the **arXiv dataset (Cohan et al., 2018, NAACL-HLT 2018)** [[PDF]](https://arxiv.org/abs/1804.05685).
- The full training split has more than 200,000 examples; a model trained on the full split will be uploaded soon.
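
The slice above could be selected with the `datasets` library roughly as follows. This is a sketch, not the original training script: the dataset id `ccdv/arxiv-summarization` and the `section` config are taken from this card's metadata, the use of split-slicing syntax is an assumption, and `load_training_slice` is a hypothetical helper name.

```python
# Range of training examples stated in this card.
START, STOP = 25_000, 100_000
SPLIT = f"train[{START}:{STOP}]"
NUM_EXAMPLES = STOP - START  # number of examples in the slice


def load_training_slice():
    """Download the slice from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the constants above can be inspected
    # without the `datasets` package installed.
    from datasets import load_dataset

    # Dataset id and config come from the card metadata; that the original
    # run loaded its data exactly this way is an assumption.
    return load_dataset("ccdv/arxiv-summarization", "section", split=SPLIT)


if __name__ == "__main__":
    print(SPLIT, NUM_EXAMPLES)
```

With a total batch size of 64, a slice of 75,000 examples corresponds to roughly 1,170 optimizer steps per epoch, which matches the step counts in the training results.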

**GPU**: 1 × RTX A6000

**Train time**: About 24 hours for 3 epochs

**Test time**: About 8 hours for the test set


## Intended uses & limitations

- **Research Paper Summarization**
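
A minimal usage sketch with `transformers` is shown below. Several things in it are assumptions rather than documented behavior of this model: the `[Abstract] ... [Introduction] ...` input layout is copied from the widget examples in this card, `build_input` and `summarize` are hypothetical helper names, the generation settings are illustrative defaults, and the Hub repository id must be supplied by the reader because this card does not state it.

```python
def build_input(abstract: str, introduction: str) -> str:
    """Join two paper sections using the markers seen in the widget examples."""
    return f"[Abstract] {abstract.strip()} [Introduction] {introduction.strip()}"


def summarize(text: str, model_id: str, max_input_tokens: int = 4096,
              max_new_tokens: int = 256) -> str:
    """Generate a summary; downloads the checkpoint named by `model_id`."""
    # Lazy import keeps build_input usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    # Truncate long papers to a fixed token budget (illustrative value).
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_input_tokens)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    text = build_input("We propose ...", "Recurrent networks ...")
    print(text)
    # To run the model, pass the Hub id of this checkpoint:
    # print(summarize(text, model_id="<namespace>/Long-paper-summarization-pegasus-x-b"))
```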


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 64
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 390
- **num_epochs: 3 (takes about 24 hours)**
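
To make the relationship between these values concrete, the sketch below recomputes the total batch size and evaluates a linear warmup/decay learning-rate curve in plain Python. It mirrors the general shape of a linear schedule with warmup and is not the Trainer's own code; `TOTAL_STEPS = 3510` is an assumption taken from the last step logged in the training results, and the function names are hypothetical.

```python
LEARNING_RATE = 1e-5
TRAIN_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 64
WARMUP_STEPS = 390
TOTAL_STEPS = 3510  # assumption: last step logged in the results table


def effective_batch_size(per_device: int, accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * accum * num_devices


def linear_schedule_lr(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    for step in (0, 195, 390, 1950, 3510):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 1e-05 at step 390 and reaches zero at the final step.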

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 3.401         | 0.33  | 390  | 2.3985          |
| 2.5444        | 0.67  | 780  | 2.2461          |
| 2.4849        | 1.0   | 1170 | 2.2690          |
| 2.5735        | 1.33  | 1560 | 2.3334          |
| 2.7045        | 1.66  | 1950 | 2.4330          |
| 2.8939        | 2.0   | 2340 | 2.5461          |
| 3.0773        | 2.33  | 2730 | 2.6502          |
| 3.2149        | 2.66  | 3120 | 2.7039          |
| 3.2844        | 3.0   | 3510 | 2.7262          |
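
The table can be read programmatically to locate the checkpoint with the lowest validation loss. The snippet below only re-analyzes the numbers above; it assumes nothing about which checkpoint was actually released, although the evaluation loss reported at the top of this card (2.7262) equals the final row.

```python
# (step, training_loss, validation_loss) rows copied from the table above.
ROWS = [
    (390, 3.401, 2.3985),
    (780, 2.5444, 2.2461),
    (1170, 2.4849, 2.2690),
    (1560, 2.5735, 2.3334),
    (1950, 2.7045, 2.4330),
    (2340, 2.8939, 2.5461),
    (2730, 3.0773, 2.6502),
    (3120, 3.2149, 2.7039),
    (3510, 3.2844, 2.7262),
]

# Row with the smallest validation loss.
best_step, _, best_val = min(ROWS, key=lambda row: row[2])
final_step, _, final_val = ROWS[-1]

if __name__ == "__main__":
    print(f"best:  step {best_step}, validation loss {best_val}")
    print(f"final: step {final_step}, validation loss {final_val}")
    print(f"gap:   {final_val - best_val:.4f}")
```

Validation loss is lowest at step 780 and rises steadily afterwards, alongside the training loss, so an earlier checkpoint may be worth comparing against the final one.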


### Framework versions

- Transformers 4.32.1
- Pytorch 2.0.1
- Datasets 2.12.0
- Tokenizers 0.13.2