---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 13B

QLoRA adapter for the Llama-2 13B (`meta-llama/Llama-2-13b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```
pip install -U adapters
```

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"
adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```

### Inference

Inference can be done via standard methods built in to the Transformers library.
We add some helper code to properly prompt the model first:

```python
from transformers import StoppingCriteria

# stop if model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence = [12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # compare the most recently generated token ids against the stop sequence
        last_ids = input_ids[:,-len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    # strip the trailing "### Human:" marker (10 characters) if generation stopped on it
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```

### Weight merging

To decrease inference latency, the LoRA weights can be merged with the base model:

```python
model.merge_adapter(adapter_name)
```

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
- `r=64`, `alpha=16`
- LoRA modules added in output, intermediate and all (Q, K, V) self-attention linear layers

The adapter is trained similarly to the Guanaco models proposed in the paper:
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
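
As a rough sanity check, the hyperparameters listed above can be turned into two derived quantities: the LoRA scaling factor (`alpha / r` in the standard LoRA formulation) and the number of sequences processed during training. The sketch below is illustrative only; the helper names are hypothetical, and it assumes that the batch size of 16 is the effective batch size per optimizer step and that every sequence is padded or truncated to the full sequence length, so the token count is an upper bound.

```python
# Values copied from the hyperparameter lists above.
LORA_RANK = 64
LORA_ALPHA = 16
BATCH_SIZE = 16
MAX_STEPS = 1875
MAX_SEQ_LEN = 512


def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling applied to the low-rank update in the standard LoRA formulation."""
    return alpha / rank


def training_budget(batch_size: int, max_steps: int, max_seq_len: int) -> dict:
    """Sequences processed and an upper bound on tokens processed.

    Assumption: `batch_size` is the effective batch size per optimizer step
    and no sequence exceeds `max_seq_len` tokens.
    """
    sequences = batch_size * max_steps
    return {"sequences": sequences, "max_tokens": sequences * max_seq_len}


scaling = lora_scaling(LORA_ALPHA, LORA_RANK)
budget = training_budget(BATCH_SIZE, MAX_STEPS, MAX_SEQ_LEN)

print(f"LoRA scaling (alpha / r): {scaling}")            # 0.25
print(f"Sequences processed: {budget['sequences']}")      # 30000
print(f"Upper bound on tokens: {budget['max_tokens']}")   # 15360000
```

With `r=64` and `alpha=16`, the low-rank update is scaled by 0.25 before it is added to the frozen weights, and 1875 steps at batch size 16 correspond to 30,000 training sequences.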