---
base_model: google/t5-v1_1-base
tags:
  - datadreamer
  - datadreamer-0.1.0
  - synthetic
  - gpt-4
  - text2text-generation
widget:
  - text: >-
      An important paradigm of natural language processing consists of
      large-scale pre-training on general domain data and adaptation to
      particular tasks or domains. As we pre-train larger models, full
      fine-tuning, which retrains all model parameters, becomes less feasible.
      Using GPT-3 175B as an example -- deploying independent instances of
      fine-tuned models, each with 175B parameters, is prohibitively expensive.
      We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained
      model weights and injects trainable rank decomposition matrices into each
      layer of the Transformer architecture, greatly reducing the number of
      trainable parameters for downstream tasks. Compared to GPT-3 175B
      fine-tuned with Adam, LoRA can reduce the number of trainable parameters
      by 10,000 times and the GPU memory requirement by 3 times. LoRA performs
      on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa,
      GPT-2, and GPT-3, despite having fewer trainable parameters, a higher
      training throughput, and, unlike adapters, no additional inference
      latency. We also provide an empirical investigation into rank-deficiency
      in language model adaptation, which sheds light on the efficacy of LoRA.
      We release a package that facilitates the integration of LoRA with PyTorch
      models and provide our implementations and model checkpoints for RoBERTa,
      DeBERTa, and GPT-2 at this https URL.
    example_title: LoRA Abstract
  - text: >-
      Making language models bigger does not inherently make them better at
      following a user's intent. For example, large language models can generate
      outputs that are untruthful, toxic, or simply not helpful to the user. In
      other words, these models are not aligned with their users. In this paper,
      we show an avenue for aligning language models with user intent on a wide
      range of tasks by fine-tuning with human feedback. Starting with a set of
      labeler-written prompts and prompts submitted through the OpenAI API, we
      collect a dataset of labeler demonstrations of the desired model behavior,
      which we use to fine-tune GPT-3 using supervised learning. We then collect
      a dataset of rankings of model outputs, which we use to further fine-tune
      this supervised model using reinforcement learning from human feedback. We
      call the resulting models InstructGPT. In human evaluations on our prompt
      distribution, outputs from the 1.3B parameter InstructGPT model are
      preferred to outputs from the 175B GPT-3, despite having 100x fewer
      parameters. Moreover, InstructGPT models show improvements in truthfulness
      and reductions in toxic output generation while having minimal performance
      regressions on public NLP datasets. Even though InstructGPT still makes
      simple mistakes, our results show that fine-tuning with human feedback is
      a promising direction for aligning language models with human intent.
    example_title: InstructGPT Abstract
  - text: >-
      In deep learning, models typically reuse the same parameters for all
      inputs. Mixture of Experts (MoE) defies this and instead selects different
      parameters for each incoming example. The result is a sparsely-activated
      model -- with outrageous numbers of parameters -- but a constant
      computational cost. However, despite several notable successes of MoE,
      widespread adoption has been hindered by complexity, communication costs
      and training instability -- we address these with the Switch Transformer.
      We simplify the MoE routing algorithm and design intuitive improved models
      with reduced communication and computational costs. Our proposed training
      techniques help wrangle the instabilities and we show large sparse models
      may be trained, for the first time, with lower precision (bfloat16)
      formats. We design models based off T5-Base and T5-Large to obtain up to
      7x increases in pre-training speed with the same computational resources.
      These improvements extend into multilingual settings where we measure
      gains over the mT5-Base version across all 101 languages. Finally, we
      advance the current scale of language models by pre-training up to
      trillion parameter models on the 'Colossal Clean Crawled Corpus' and
      achieve a 4x speedup over the T5-XXL model.
    example_title: Switch Transformers Abstract
pipeline_tag: text2text-generation
datasets:
  - datadreamer-dev/abstracts_and_tweets
---

# Model Card

This is an "Abstract to Tweet" model that crafts a tweet summarizing a research paper abstract. It was trained on a synthetic dataset of arXiv abstracts and tweets and is used as a demonstration of the DataDreamer 🤖💤 library.

## Example Usage

```python
from transformers import pipeline

# Load model
pipe = pipeline('text2text-generation', 'datadreamer-dev/abstracts_to_tweet_model')

# Generate a tweet from the abstract of the LoRA paper
abstract = "An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL."
generated_tweet = pipe(abstract)[0]['generated_text']

# Print the generated tweet
print(generated_tweet)

# Output:
# "Exciting news in #NLP! We've developed Low-Rank Adaptation, or LoRA, to reduce the number of trainable parameters for downstream tasks. It reduces model weights by 10,000 times and GPU memory by 3 times. #AI #MachineLearning"
```

This model was trained on a synthetic dataset generated with DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.
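
If you want to train a similar model from the published synthetic dataset without DataDreamer, the sketch below fine-tunes `google/t5-v1_1-base` on `datadreamer-dev/abstracts_and_tweets` with plain `transformers`. This is not the pipeline used to produce this checkpoint: the split and column names (`train`, `abstracts`, `tweets`) and the hyperparameters are assumptions to be checked against the dataset card and the linked training arguments.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the synthetic dataset (split and column names are assumed; verify on the dataset card)
dataset = load_dataset('datadreamer-dev/abstracts_and_tweets')

tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-base')
model = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-base')

def preprocess(batch):
    # Abstracts are the inputs, tweets are the targets
    model_inputs = tokenizer(batch['abstracts'], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch['tweets'], max_length=64, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset['train'].column_names)

# Illustrative hyperparameters, not the values used to train this checkpoint
training_args = Seq2SeqTrainingArguments(
    output_dir='./abstracts_to_tweet_model',
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```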