---
license: mit
---
# Overview
This model has been fine-tuned for text summarization. It was created by using LoRA to fine-tune the flan-t5-large model on the [SAMsum training dataset](https://huggingface.co/datasets/samsum).

This document explains how the model was fine-tuned, saved to disk, and added to Hugging Face, and then demonstrates how it is used.

Google Colab was used to train the model on a single T4 GPU. Training can take up to 6 hours, and on the free tier of Google Colab you are disconnected, and training is lost, after 90 minutes of inactivity. For this reason, I upgraded to the $10 per month plan. The model cost about $3 to create, which is pretty exciting considering the T4 is currently a $16k GPU.

## SAMsum
SAMsum is a corpus of about 16k dialogues and their corresponding summaries.

Example entry:
- Dialogue - "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"
- Summary - "Amanda baked cookies and will bring Jerry some tomorrow."

## LoRA
[LoRA](https://github.com/microsoft/LoRA) is an efficient mechanism for fine-tuning models to become better at specific tasks.

> An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
> LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.

In this case we are training flan-t5-large on the SAMsum dataset in order to create a model that is better at dialogue summarization.

## Flan T5
Finetuned LAnguage Net Text-to-Text Transfer Transformer (Flan T5) is an LLM published by Google in 2022. It improves on the zero-shot abilities of T5.
The [flan-t5 model](https://huggingface.co/google/flan-t5-large) is open and free for commercial use.

Flan T5 capabilities include:
- Translate between several languages (more than 60 languages).
- Provide summaries of text.
- Answer general questions: “how many minutes should I cook my egg?”
- Answer historical questions, and questions related to the future.
- Solve math problems when asked to give the reasoning.

> T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input sequence is fed to the model using input_ids. The target sequence is shifted to the right, i.e., prepended by a start-sequence token and fed to the decoder using the decoder_input_ids. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the labels. The PAD token is hereby used as the start-sequence token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.

# Code to Create The SAMsum LoRA Adapter
## Notebook Source
[Notebook used to create LoRA adapter](https://colab.research.google.com/drive/1z_mZL6CIRRA4AeF6GXe-zpfEGqqdMk-f?usp=sharing)

## Load the SAMsum dataset used to fine-tune the flan-t5-large model
```
from datasets import load_dataset
dataset = load_dataset("samsum")
```

## Prepare the dataset
```
...
see notebook
# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
```

## Load the flan-t5-large model
Loading in 8-bit greatly reduces the amount of GPU memory required.
When combined with the accelerate library, `device_map="auto"` will use all available GPUs for training.
```
import torch  # needed for torch.float16 below
from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"
# load_in_8bit relies on the bitsandbytes package being installed
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto", torch_dtype=torch.float16
)
```

## Define LoRA config and prepare the model for training
```
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    target_modules=["q", "v"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare int-8 model for training
# (newer PEFT releases rename this helper to prepare_model_for_kbit_training)
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

## Create data collator
Data collators are objects that form a batch from a list of dataset elements.
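
As a rough illustration of what such a collator does, here is a plain-Python sketch. It is not the `transformers` implementation: the helper names (`pad_to`, `round_up`, `collate`) and the token ids are made up for the example. It only assumes that T5's pad token id is 0 and that label positions set to -100 are skipped by the loss. A real seq2seq collator also returns tensors and can build `decoder_input_ids`, which this sketch leaves out.

```python
# Illustrative sketch only: a hand-rolled stand-in for a seq2seq data collator.
# All helper names and sample token ids below are hypothetical.
PAD_TOKEN_ID = 0           # T5 uses id 0 for its pad token
LABEL_PAD_TOKEN_ID = -100  # label positions with this value are ignored by the loss


def pad_to(seq, length, pad_value):
    """Right-pad a list of ints to the requested length."""
    return seq + [pad_value] * (length - len(seq))


def round_up(n, multiple):
    """Round n up to the nearest multiple (e.g. 5 -> 8 when multiple is 8)."""
    return ((n + multiple - 1) // multiple) * multiple


def collate(features, pad_to_multiple_of=8):
    """Turn a list of tokenized examples into one padded batch."""
    input_len = round_up(max(len(f["input_ids"]) for f in features), pad_to_multiple_of)
    label_len = round_up(max(len(f["labels"]) for f in features), pad_to_multiple_of)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        real_tokens = len(f["input_ids"])
        batch["input_ids"].append(pad_to(f["input_ids"], input_len, PAD_TOKEN_ID))
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append(pad_to([1] * real_tokens, input_len, 0))
        batch["labels"].append(pad_to(f["labels"], label_len, LABEL_PAD_TOKEN_ID))
    return batch


features = [
    {"input_ids": [21, 22, 23], "labels": [31, 32]},
    {"input_ids": [41, 42, 43, 44, 45], "labels": [51]},
]
batch = collate(features)
print(batch["input_ids"])
print(batch["labels"])
```

Both examples end up with the same length, and the padded label positions carry -100 so they do not contribute to the training loss. The notebook itself uses `DataCollatorForSeq2Seq` instead of hand-written padding.
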
```
from transformers import DataCollatorForSeq2Seq

# `tokenizer` is assumed to come from the dataset preparation step (see notebook)
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
```

## Create the training arguments and trainer
```
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-t5-large"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # re-enable for inference!
```

## Train the model!
This will take about 5-6 hours on a single T4 GPU.
```
trainer.train()
```

| Step | Training Loss |
|------|---------------|
| 500  | 1.302200      |
| 1000 | 1.306300      |
| 1500 | 1.341500      |
| 2000 | 1.278500      |
| 2500 | 1.237000      |
| 3000 | 1.239200      |
| 3500 | 1.250900      |
| 4000 | 1.202100      |
| 4500 | 1.165300      |
| 5000 | 1.178900      |
| 5500 | 1.181700      |
| 6000 | 1.100600      |
| 6500 | 1.119800      |
| 7000 | 1.105700      |
| 7500 | 1.097900      |
| 8000 | 1.059500      |
| 8500 | 1.047400      |
| 9000 | 1.046100      |

TrainOutput(global_step=9210, training_loss=1.1780610539108094, metrics={'train_runtime': 19217.7668, 'train_samples_per_second': 3.833, 'train_steps_per_second': 0.479, 'total_flos': 8.541847343333376e+16, 'train_loss': 1.1780610539108094, 'epoch': 5.0})

## Save the model to disk, zip, and download
```
peft_model_id="flan-t5-large-samsum"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
trainer.model.base_model.save_pretrained(peft_model_id)
!zip -r /content/flan-t5-large-samsum.zip \
/content/flan-t5-large-samsum
from google.colab import files
files.download("/content/flan-t5-large-samsum.zip")
```
Upload the contents of that zip file to Hugging Face.

# Code to utilize the fine-tuned model
## Notebook Source
[Notebook using the Hugging Face hosted model](https://colab.research.google.com/drive/1kqADOA9vaTsdecx4u-7XWJJia62WV0cY?pli=1#scrollTo=KMs70mdIxaam)

## Load the model, tokenizer, and LoRA adapter (PEFT)
```
# Load the jasonmcaffee/flan-t5-large-samsum model and tokenizer
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "jasonmcaffee/flan-t5-large-samsum"
# The adapter config records which base model the adapter was trained on
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto")
model.eval()
```

## Have the model summarize text!
Finally, we now have a model that is capable of summarizing text for us. Summarization takes ~30 seconds.
```
dialogue = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
""" input_ids = tokenizer(dialogue, return_tensors="pt", truncation=True).input_ids.cuda() # with torch.inference_mode(): outputs = model.generate( input_ids=input_ids, min_length=20, max_new_tokens=100, length_penalty=1.9, #Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences. num_beams=4, temperature=0.9, top_k=150, # default 50 repetition_penalty=2.1, # do_sample=True, top_p=0.9, ) print(f"input sentence: {dialogue}\n{'---'* 20}") summarization = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0] print(f"summary:\n{summarization}") ``` For an initial dialogue of: > The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct. This model will create a summary of: > The Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct. 

The notebook also loads flan-t5 with no SAMsum training, which produces a summary of:
> The Eiffel Tower is the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
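
The `length_penalty` comment in the generation code describes how beam search scores a candidate: the summed log likelihood is divided by the sequence length raised to `length_penalty`. The sketch below works through that arithmetic with made-up numbers. `beam_score` is a hypothetical helper, not a `transformers` function, and the exact length the library counts may differ between versions.

```python
# Illustrative arithmetic only: how a length penalty changes which beam wins.
# `beam_score` and the candidate numbers are hypothetical.

def beam_score(sum_logprobs, length, length_penalty):
    """Summed log likelihood divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up candidates: a short summary and a longer one.
# Log likelihoods are negative, and longer sequences accumulate more of them.
short = {"sum_logprobs": -6.0, "length": 10}
long = {"sum_logprobs": -9.0, "length": 20}

for penalty in (0.0, 1.0, 1.9):
    short_score = beam_score(short["sum_logprobs"], short["length"], penalty)
    long_score = beam_score(long["sum_logprobs"], long["length"], penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.4f} long={long_score:.4f} -> {winner}")
```

With these numbers the shorter candidate wins when `length_penalty` is 0.0, while the longer one wins at 1.0 and at the 1.9 used above, which matches the comment's statement that positive values promote longer sequences.
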