AjayP13 committed on
Commit
9fc7967
1 Parent(s): 7256363

Update README.md

Files changed (1)
  1. README.md +4 -38
README.md CHANGED
@@ -12,45 +12,11 @@ widget:
  An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL.
  example_title: LoRA Abstract
  - text: >-
- In this research paper, we propose a novel approach to Natural Language
- Processing (NLP) that addresses several limitations of existing methods. By
- integrating deep learning architectures with traditional NLP techniques, we
- have developed a model that shows significant improvements in performance
- across several NLP tasks including sentiment analysis, text summarization,
- and machine translation. We treat language processing not as a linear task
- but rather an interconnected web of sub-tasks, each benefiting from mutual
- feedback. The conceptual breakthrough of this approach is the shared
- representation of linguistic features across these sub-tasks that allow for
- robust understanding and language inference. We demonstrated the
- effectiveness of our model in extensive empirical evaluations on several
- benchmark datasets, where our method consistently outperforms
- state-of-the-art solutions. We also discuss the theoretical justification of
- our model. Overall, this paper extends the frontiers of NLP by broadening
- the commonly used methods and setting BPM (Benchmarks Per Minute) records in
- five major tasks. We hope this work encourages future researchers to adopt
- an integrated perspective when building NLP models.
- example_title: Example 2
  - text: >-
- In recent years, we have seen a significative progression in Natural
- Language Processing (NLP) capabilities, primarily driven by advancements in
- deep learning. However, creating accurate models capable of understanding
- context, tone, and semantic meanings remains a significant challenge.
- Several models struggle to maintain stable performance when presented with
- different kinds of texts. In this paper, we address the problem of
- language-context detection in diversely written text. We introduce new
- approaches utilising transformer-based models combined with Domain-Adaptive
- Fine Tuning, a technique that allows capturing various linguistic details
- for enhanced comprehension of text. Extensive experiments on several
- datasets reveal that it is not just the large scales of these models that
- matter, but a proper, task-specific tuning, can significantly bring
- reductions in model complexity, resource demands, and increase the
- prediction performance, challenging the commonly held belief in "bigger is
- better". We further suggest that our innovations will directly lead to
- significant improvements in performance and the wide adoption of the NLP
- models within real-world scenarios. AI model's ability to scale will see a
- vital performance curve particularly under low-data regime conditions which
- are prevalent in the commercial sector.
- example_title: Example 3
  pipeline_tag: text2text-generation
  datasets:
  - datadreamer-dev/abstracts_and_tweets
 
  An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL.
  example_title: LoRA Abstract
  - text: >-
+ Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
+ example_title: InstructGPT Abstract
  - text: >-
+ In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the 'Colossal Clean Crawled Corpus' and achieve a 4x speedup over the T5-XXL model.
+ example_title: Switch Transformers Abstract
  pipeline_tag: text2text-generation
  datasets:
  - datadreamer-dev/abstracts_and_tweets