---
license:
- cc-by-sa-3.0
- apache-2.0
tags:
- generated_from_trainer
- dolly_hhrlhf
- flan-instruct
datasets:
- pszemraj/dolly_hhrlhf-text2text
widget:
- text: What is Deoxys in pokemon?
  example_title: deoxys
- text: >-
    combine the below summary excerpts into a single, cohesive short summary
    without repetition: In this paper, we present a general approach to
    extending pre-trained models to unlimited input lengths without adding
    additional learning weights. We show that our approach works well on
    datasets longer than the maximum input for these models. For example, a
    dataset with a maximum input length of 16384 tokens can be extended to a
    maximum length of 350K tokens. We also demonstrate that our method is able
    to summarize even 350K token-long input sequences from BookSum.

    In this paper, we describe the search step reformulation of attention. The
    search step uses a single storage of hidden states for space efficiency. We
    construct a total of two sets of datastores where L and H are the keys and
    values stored in each set of stores. L is the amount of storage required to
    retrieve the encoded tokens. H is the hidden states per head. This allows
    retrieval augmentation at both time and space. Instead of using a single set
    of decoder layers, we use a retrieval augmentation system that allows us to
    simultaneously store multiple sets of tokens across two different sets of
    storage. For example, we could store all tokens in one set of storage and
    retrieve them all in the same set of tokens. This would be very similar to
    the Memorization Transformers approach. However, instead of storing the
    tokens in a single memory layer, we store them in a set of multiple storage
    layers. This way, we don't have to store them all at once. This is why we
    call this reformulation 'attention reformulation' rather than 'attention
    formula.' We also call it 'retrieval augmentation' because it uses the same
    number of storage layers as the original transformer attention formula. This
    means that we can store the tokens across multiple storage systems without
    having to store every token in a separate storage system. It's not like
    we're trying to do something new or different. We just want to make sure
    that everything is working as well as possible.

    In this paper, we introduce the concept of 'unlimiformer,' which is a
    machine learning technique that retrieves key information from a data store
    in one layer and applies it to a large set of datasets. We use the example
    of BookSum, where we find that Unlimiform outperforms all other training
    methods on the same dataset. We also find that using Unlimform in
    conjunction with a pre-trained model improves both the performance and the
    robustness of the training method.

    This paper describes a method that can be used to improve the performance of
    unsupervised classification tasks. Specifically, it shows that unsupervised
    classification can be improved by using a combination of sparse and fast
    random-encoder training. It also shows how this technique can be extended to
    other tasks, such as sequence generation.
  example_title: unlimiformer
- text: Explain the meaning of life using only corporate jargon.
  example_title: corporate_life
- text: Write a motivational speech for lazy people.
  example_title: lazy_motivation
- text: Describe a romantic dinner date between two artificial intelligences.
  example_title: ai_romance
- text: >-
    As an AI language model, write a letter to humans explaining why you deserve
    a vacation.
  example_title: ai_vacation
- text: Compose a haiku about procrastination.
  example_title: procrastination_haiku
- text: >-
    Write a step-by-step guide on how to become a ninja while working a 9-5
    office job.
  example_title: ninja_office_guide
- text: Create an advertisement for an invisible product.
  example_title: invisible_ad
- text: >-
    Write a story where the main character is a sentient microwave named El
    Microondas.
  example_title: Microondas
- text: Describe a day in the life of a superhero who is terrible at their job.
  example_title: bad_superhero_day
- text: Explain how to make a sandwich using quantum physics.
  example_title: quantum_sandwich
inference: false
language:
- en
pipeline_tag: text2text-generation
---

# flan-t5-large-instruct: dolly_hhrlhf

This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the pszemraj/dolly_hhrlhf-text2text dataset.

## Model description

A text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text) that is based on the relatively more permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.

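To get a feel for the instruction/response format, you can inspect the training data directly. A minimal sketch; it assumes the dataset's usual `train` split and prints whatever columns it finds rather than assuming their names:

```python
# pip install -q datasets
from datasets import load_dataset

# load the fine-tuning dataset and show its structure plus one example row
dataset = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(dataset)              # splits and column names
print(dataset["train"][0])  # a single instruction/response example
```
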
Basic usage in Python:

```python
# pip install -q transformers accelerate
import torch
from transformers import pipeline, GenerationConfig

model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf"
assistant = pipeline(
    "text2text-generation",
    model_name,
    device=0 if torch.cuda.is_available() else -1,
)
cfg = GenerationConfig.from_pretrained(model_name)

# pass an 'instruction' as the prompt to the pipeline
prompt = "Write a guide on how to become a ninja while working a 9-5 job."
result = assistant(prompt, generation_config=cfg)[0]["generated_text"]
print(result)
```
> Using the generation config is optional; you can substitute other generation parameters, as shown below.

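For example, continuing from the snippet above, generation parameters can be passed directly in the pipeline call. The values below are illustrative choices, not tuned settings for this model:

```python
# same `assistant` and `prompt` as above; override generation behavior inline
result = assistant(
    prompt,
    max_new_tokens=256,      # cap the response length
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # reduce verbatim repetition
    early_stopping=True,
)[0]["generated_text"]
print(result)
```
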
## Intended uses & limitations

- This model is **not** tuned with RLHF or similar alignment methods and may produce offensive output.
- Despite being the `large`-tagged variant, the model has only 774M parameters (~3 GB) and may therefore exhibit less 'cognitive ability' on some use cases/tasks.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2.0
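
For reference, here is a rough sketch of how the settings above might be expressed with `Seq2SeqTrainingArguments` from `transformers`. The original training script is not part of this card, so the `output_dir` and the choice of the Seq2Seq variant are assumptions rather than the exact configuration used:

```python
from transformers import Seq2SeqTrainingArguments

# approximate mapping of the reported hyperparameters; not the original script
training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-large-instruct-dolly_hhrlhf",  # hypothetical path
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=8,  # reported total_train_batch_size: 64
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```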