Update README.md
README.md
CHANGED
@@ -1,3 +1,141 @@
---
license:
- cc-by-sa-3.0
- apache-2.0
tags:
- generated_from_trainer
- dolly_hhrlhf
- flan-instruct
datasets:
- pszemraj/dolly_hhrlhf-text2text
widget:
- text: What is Deoxys in pokemon?
  example_title: deoxys
- text: >-
    combine the below summary excerpts into a single, cohesive short summary
    without repetition: In this paper, we present a general approach to
    extending pre-trained models to unlimited input lengths without adding
    additional learning weights. We show that our approach works well on
    datasets longer than the maximum input for these models. For example, a
    dataset with a maximum input length of 16384 tokens can be extended to a
    maximum length of 350K tokens. We also demonstrate that our method is able
    to summarize even 350K token-long input sequences from BookSum.

    In this paper, we describe the search step reformulation of attention. The
    search step uses a single storage of hidden states for space efficiency. We
    construct a total of two sets of datastores where L and H are the keys and
    values stored in each set of stores. L is the amount of storage required to
    retrieve the encoded tokens. H is the hidden states per head. This allows
    retrieval augmentation at both time and space. Instead of using a single set
    of decoder layers, we use a retrieval augmentation system that allows us to
    simultaneously store multiple sets of tokens across two different sets of
    storage. For example, we could store all tokens in one set of storage and
    retrieve them all in the same set of tokens. This would be very similar to
    the Memorization Transformers approach. However, instead of storing the
    tokens in a single memory layer, we store them in a set of multiple storage
    layers. This way, we don't have to store them all at once. This is why we
    call this reformulation 'attention reformulation' rather than 'attention
    formula.' We also call it 'retrieval augmentation' because it uses the same
    number of storage layers as the original transformer attention formula. This
    means that we can store the tokens across multiple storage systems without
    having to store every token in a separate storage system. It's not like
    we're trying to do something new or different. We just want to make sure
    that everything is working as well as possible.

    In this paper, we introduce the concept of 'unlimiformer,' which is a
    machine learning technique that retrieves key information from a data store
    in one layer and applies it to a large set of datasets. We use the example
    of BookSum, where we find that Unlimiform outperforms all other training
    methods on the same dataset. We also find that using Unlimform in
    conjunction with a pre-trained model improves both the performance and the
    robustness of the training method.

    This paper describes a method that can be used to improve the performance of
    unsupervised classification tasks. Specifically, it shows that unsupervised
    classification can be improved by using a combination of sparse and fast
    random-encoder training. It also shows how this technique can be extended to
    other tasks, such as sequence generation.
  example_title: unlimiformer
- text: Explain the meaning of life using only corporate jargon.
  example_title: corporate_life
- text: Write a motivational speech for lazy people.
  example_title: lazy_motivation
- text: Describe a romantic dinner date between two artificial intelligences.
  example_title: ai_romance
- text: >-
    As an AI language model, write a letter to humans explaining why you deserve
    a vacation.
  example_title: ai_vacation
- text: Compose a haiku about procrastination.
  example_title: procrastination_haiku
- text: >-
    Write a step-by-step guide on how to become a ninja while working a 9-5
    office job.
  example_title: ninja_office_guide
- text: Create an advertisement for an invisible product.
  example_title: invisible_ad
- text: >-
    Write a story where the main character is a sentient microwave named El
    Microondas.
  example_title: Microondas
- text: Describe a day in the life of a superhero who is terrible at their job.
  example_title: bad_superhero_day
- text: Explain how to make a sandwich using quantum physics.
  example_title: quantum_sandwich
inference: false
language:
- en
pipeline_tag: text2text-generation
---

# flan-t5-large-instruct: dolly_hhrlhf

This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the [pszemraj/dolly_hhrlhf-text2text](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text) dataset.

## Model description

This is a text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text), which is based on the relatively more permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
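
If you want to inspect the training data, it can be loaded with the 🤗 `datasets` library. A minimal sketch, assuming a standard `train` split (the split and column names come from the dataset repo and are not guaranteed by this card):

```python
# Peek at the fine-tuning data; assumes a "train" split exists.
from datasets import load_dataset

ds = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(ds)              # available splits and columns
print(ds["train"][0])  # one example record
```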

Basic usage in Python:

```python
# pip install -q transformers accelerate
import torch
from transformers import pipeline, GenerationConfig

model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf"
assistant = pipeline(
    "text2text-generation",
    model_name,
    device=0 if torch.cuda.is_available() else -1,  # use GPU 0 if available
)
cfg = GenerationConfig.from_pretrained(model_name)

# pass an 'instruction' as the prompt to the pipeline
prompt = "Write a guide on how to become a ninja while working a 9-5 job."
result = assistant(prompt, generation_config=cfg)[0]["generated_text"]
print(result)
```
> Using the generation config is optional; you can substitute other generation parameters.
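
For example, a minimal sketch (continuing the snippet above) that passes generation parameters directly instead of the saved config; the specific values below are illustrative assumptions, not tuned recommendations for this model:

```python
# Same pipeline call, but with explicit generation parameters instead of `cfg`.
# The values are illustrative, not the model's recommended settings.
result = assistant(
    prompt,
    max_new_tokens=256,      # cap the length of the generated response
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # reduce verbatim repetition
    early_stopping=True,
)[0]["generated_text"]
print(result)
```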

## Intended uses & limitations

- This model is **not** tuned with RLHF or similar alignment techniques and may produce offensive output.
- Despite the `large` tag, this model has only 774M parameters (~3 GB) and may therefore exhibit less 'cognitive ability' on some use cases/tasks.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2.0
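
For reference, a minimal sketch of how the settings above map onto 🤗 Transformers `Seq2SeqTrainingArguments`; the `output_dir` and anything not listed in the card are assumptions, not details of the original run:

```python
# Hypothetical reconstruction of the training configuration above.
# Only the hyperparameters listed in this card are taken from the source; the rest are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-instruct-dolly_hhrlhf",  # assumption: not stated in the card
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=8,  # 8 (batch) x 8 (accumulation) = total_train_batch_size 64
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2.0,
)
```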