---
license: mit
datasets:
  - sahil2801/CodeAlpaca-20k
  - yahma/alpaca-cleaned
  - databricks/databricks-dolly-15k
  - OpenAssistant/oasst1
  - jeffwan/sharegpt_vicuna
  - qwedsacf/grade-school-math-instructions
  - vicgalle/alpaca-gpt4
language:
  - en
tags:
  - sft
pipeline_tag: text-generation
widget:
  - text: >-
      <|prompter|>What is a meme, and what's the history behind this word?<|assistant|>
  - text: <|prompter|>What's the Earth total population<|assistant|>
  - text: <|prompter|>Write a story about future of AI development<|assistant|>
---

# LoRA Adapter for LLaMA 7B trained on more datasets than tloen/alpaca-lora-7b

This repo contains a low-rank adapter for **LLaMA-7B**, fine-tuned on datasets that are part of the OpenAssistant project.

You can see sampling results [here](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-03-18_llama_30b_oasst_latcyr_400_sampling_noprefix_lottery.json%0Ahttps%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2F8e90ce6504c159d4046991bf37757c108aed913f%2Fsampling_reports%2Foasst-sft%2Freport_file_jordiclive_alpaca_gpt4-dolly_15k-vicuna-lora-7b_full_lottery_no_prefix.json). Note that the sampling parameters are not necessarily optimal; they are the OpenAssistant defaults used for comparing models.

This version of the weights was trained with the following hyperparameters:

- Epochs: 8
- Batch size: 128
- Max Length: 2048
- Learning rate: 8e-6
- Lora _r_: 16
- Lora Alpha: 32
- Lora target modules: q_proj, k_proj, v_proj, o_proj

The model was trained with flash attention and gradient checkpointing. (A sketch of how these settings map onto a `peft.LoraConfig` is given at the end of this card.)

## Dataset Details

- dolly15k:
  - val_split: 0.05
  - max_val_set: 300
- oasst_export:
  - lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
  - input_file_path: 2023-04-12_oasst_release_ready_synth.jsonl.gz
  - val_split: 0.05
- vicuna:
  - val_split: 0.05
  - max_val_set: 800
  - fraction: 0.8
- grade_school_math_instructions:
  - val_split: 0.05
- code_alpaca:
  - val_split: 0.05
  - max_val_set: 250
- alpaca_gpt4:
  - val_split: 0.02
  - max_val_set: 250

## Model Details

- **Developed** as part of the OpenAssistant Project
- **Model type:** PEFT Adapter for frozen LLaMA
- **Language:** English

## Prompting

Two special tokens are used to mark the beginning of user and assistant turns: `<|prompter|>` and `<|assistant|>`. Each turn ends with a `<|endoftext|>` token.

Input prompt example:

```
<|prompter|>What is a meme, and what's the history behind this word?<|assistant|>
```

The input ends with the `<|assistant|>` token to signal that the model should start generating the assistant reply.

# Example Inference Code

Note that several special-token embeddings need to be loaded alongside the LoRA weights. The example below assumes a GPU and `torch.float16`:

```python
import torch
import transformers
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import GenerationConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = transformers.AutoTokenizer.from_pretrained("jordiclive/alpaca_gpt4-dolly_15k-vicuna-lora-7b")

# Load the base model
model = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)

# This model repo also contains several embeddings for special tokens that need to be loaded,
# so the embedding matrix must be resized to match the tokenizer.
model.resize_token_embeddings(len(tokenizer))
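# LLaMA's base vocabulary has 32,000 tokens, so the rows added above (token ids 32000 and up)
# correspond to the added special tokens such as <|prompter|> and <|assistant|>. Their trained
# embeddings are shipped separately in this repo as extra_embeddings.pt and are copied into the
# embedding matrix below.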
model.config.eos_token_id = tokenizer.eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Load the LoRA model
lora_weights = "jordiclive/alpaca_gpt4-dolly_15k-vicuna-lora-7b"
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    torch_dtype=torch.float16,
)
model.eos_token_id = tokenizer.eos_token_id

# Load embeddings for special tokens
filename = hf_hub_download("jordiclive/alpaca_gpt4-dolly_15k-vicuna-lora-7b", "extra_embeddings.pt")
embed_weights = torch.load(
    filename, map_location=torch.device("cuda" if torch.cuda.is_available() else "cpu")
)

# Add special token embeddings
model.base_model.model.model.embed_tokens.weight[32000:, :] = embed_weights.to(
    model.base_model.model.model.embed_tokens.weight.dtype
).to(device)

model = model.half().to(device)

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)


def format_system_prompt(prompt, eos_token=""):
    return "{}{}{}{}".format("<|prompter|>", prompt, eos_token, "<|assistant|>")


def generate(prompt, generation_config=generation_config, max_new_tokens=2048, device=device):
    prompt = format_system_prompt(prompt)  # OpenAssistant Prompt Format expected
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            eos_token_id=2,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    print("Text generated:")
    print(output)
    return output


generate("What is a meme, and what's the history behind this word?")
generate("What's the Earth total population")
generate("Write a story about future of AI development")
```
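For reference, the LoRA hyperparameters listed above map onto a `peft.LoraConfig` roughly as follows. This is a minimal sketch for readers who want to train a similar adapter, not the exact training configuration used for this repo; anything not given in the hyperparameter list (dropout, bias handling, optimizer settings, etc.) is left at the library defaults.

```python
import torch
import transformers
from peft import LoraConfig, TaskType, get_peft_model

# Minimal sketch of a LoraConfig matching the hyperparameters listed above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # Lora r
    lora_alpha=32,  # Lora Alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Lora target modules
)

base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)
peft_model = get_peft_model(base_model, lora_config)
# Only the adapter weights are trainable; the base model stays frozen.
peft_model.print_trainable_parameters()
```

Targeting q_proj, k_proj, v_proj and o_proj adapts all four attention projections, a somewhat broader setup than the q_proj/v_proj-only configuration used in some alpaca-lora recipes.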