X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models
X-LoRA works by learning scaling values for LoRA adapters. These learned scaling values gate the LoRA experts in a dense fashion. All LoRA adapters and the base model are frozen, so fine-tuning is efficient thanks to the low trainable-parameter count.
X-LoRA is easily applied to any HuggingFace Transformers model.
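Conceptually, a small classifier head predicts one scaling per adapter for each token, and the frozen adapter outputs are mixed densely with those weights. A minimal sketch of that gating step in plain PyTorch (illustrative only, not the library's internals; `dense_gate` and the shapes here are hypothetical):

```python
import torch

def dense_gate(base_out, expert_outs, scalings):
    """Mix frozen LoRA expert outputs using learned, token-wise scalings.

    base_out:    (batch, seq, d) output of the frozen base layer
    expert_outs: list of n_experts tensors, each (batch, seq, d)
    scalings:    (batch, seq, n_experts) predicted by the X-LoRA classifier
    """
    stacked = torch.stack(expert_outs, dim=-1)             # (batch, seq, d, n_experts)
    mixed = (stacked * scalings.unsqueeze(2)).sum(dim=-1)  # weight and sum experts
    return base_out + mixed

# Toy usage: 2 experts, batch=1, seq=4, hidden=8
base = torch.zeros(1, 4, 8)
experts = [torch.randn(1, 4, 8) for _ in range(2)]
scalings = torch.softmax(torch.randn(1, 4, 2), dim=-1)
print(dense_gate(base, experts, scalings).shape)  # torch.Size([1, 4, 8])
```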
Features
- Effective: Dense gating of experts allows effective mixing
- Efficient fine-tuning: low trainable parameter count
- Hierarchical encapsulated strategy: re-use existing trained models or model components and combine them to address complex tasks that cut across experts, following a bio-inspired strategy
- Easy-to-use API: `add_xlora_to_model`, with broad compatibility
- Dynamically mix LoRA adapters: deep, layer-wise combinations of adapters
X-LoRA source code
Install directly from source
```bash
pip install git+https://github.com/EricLBuehler/xlora.git -U
```
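A quick import check confirms the package is available:

```bash
python -c "import xlora; print('xlora import OK')"
```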
Further details on installation, the source code, API documentation, and more examples: https://github.com/EricLBuehler/xlora
Converting and loading a model
Example of model conversion:
```python
import torch
import xlora
from transformers import AutoConfig, AutoModelForCausalLM  # type: ignore

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    use_flash_attention_2=False,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)

config = AutoConfig.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    use_flash_attention_2=False,
    device_map="auto",
)

# Convert the model to X-LoRA
model_created = xlora.add_xlora_to_model(
    model=model,
    xlora_config=xlora.xLoRAConfig(
        config.hidden_size, xlora_depth=8, device=torch.device("cuda")
    ),
    verbose=True,
    adapters={
        "adapter_1": "./path/to/the/checkpoint_adapter_1/",
        "adapter_2": "./path/to/the/checkpoint_adapter_2/",
        "adapter_n": "./path/to/the/checkpoint_adapter_n/",
    },
)
```
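Because the base model and all adapters stay frozen, only the X-LoRA scaling head should report as trainable after conversion. A quick sanity check in plain PyTorch (no xlora-specific API assumed):

```python
# Count trainable vs. total parameters of the converted model.
trainable = sum(p.numel() for p in model_created.parameters() if p.requires_grad)
total = sum(p.numel() for p in model_created.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```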
Loading a trained X-LoRA model from scratch
```python
import torch
import xlora
from transformers import AutoConfig, AutoModelForCausalLM  # type: ignore

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    use_flash_attention_2=False,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)

config = AutoConfig.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    use_flash_attention_2=False,
    device_map="auto",
)

model = xlora.from_pretrained(
    "./path/to/saved/model",
    model,
    {
        "adapter_1": "./path/to/the/checkpoint/",
        "adapter_2": "./path/to/the/checkpoint/",
        "adapter_n": "./path/to/the/checkpoint/",
    },
    "cuda",
)
```
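Both loading paths above leave tokenizer creation to the standard Transformers API; the inference example below assumes one has been created, e.g.:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
```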
Loading a pre-trained X-LoRA model directly from the Hugging Face Hub
```python
import torch
from xlora.xlora_utils import load_model

XLoRA_model_name = "lamm-mit/x-lora"

model, tokenizer = load_model(
    model_name=XLoRA_model_name,
    device="cuda:0",
    use_flash_attention_2=True,
    dtype=torch.bfloat16,
)
```
Inference:
```python
def generate_response(
    model,
    tokenizer,
    text_input="What is the best biomaterial for superior strength?",
    num_return_sequences=1,
    temperature=0.75,
    max_new_tokens=127,
    num_beams=1,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.0,
    eos_token_id=2,
    add_special_tokens=True,
):
    # Tokenize and move inputs to the model's device; return_tensors="pt"
    # is required so that generate() receives tensors.
    inputs = tokenizer(
        text_input, add_special_tokens=add_special_tokens, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            num_beams=num_beams,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences,
            eos_token_id=eos_token_id,
            pad_token_id=eos_token_id,
            do_sample=True,
            repetition_penalty=repetition_penalty,
        )
    # Strip the prompt tokens and decode only the newly generated text.
    return tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```
```python
txt = "What is the best biomaterial for superior strength?"  # example prompt

output_text = generate_response(
    model,
    tokenizer,
    text_input=txt,
    eos_token_id=tokenizer.eos_token_id,
    num_return_sequences=1,
    repetition_penalty=1.1,
    top_p=0.9,
    top_k=512,
    temperature=0.5,
    max_new_tokens=256,
)
print(output_text[0])
```
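The xlora repository also documents utilities for logging the learned scalings during forward passes; a minimal sketch, assuming the model exposes `enable_scalings_logging`, `get_scalings_log`, and `disable_scalings_logging` as described there:

```python
# Record the per-token scalings produced by the X-LoRA classifier
# while generating (API per the xlora repository; treat as a sketch).
model.enable_scalings_logging()
_ = generate_response(model, tokenizer, text_input=txt)
scalings_log = model.get_scalings_log()  # one entry per forward pass
model.disable_scalings_logging()
print(len(scalings_log), scalings_log[0].shape)
```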
Dataset
See lamm-mit/x-lora-dataset for the dataset used to train the X-LoRA model. Details on the datasets used to train the original adapters are included in the paper (see reference below).
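The dataset can be pulled with the standard Hugging Face `datasets` API (assuming the default split layout on the Hub):

```python
from datasets import load_dataset

dataset = load_dataset("lamm-mit/x-lora-dataset")
print(dataset)
```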
Acknowledgements
This work is built on the Hugging Face PEFT library and other components in the Hugging Face ecosystem. We acknowledge the authors of this excellent library and related methods.
Original paper and citation
Cite this work as:
```bibtex
@article{Buehler_XLoRA_2024,
    title   = {X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Design},
    author  = {E.L. Buehler and M.J. Buehler},
    journal = {},
    year    = {2024},
    volume  = {},
    pages   = {},
    url     = {https://arxiv.org/abs/2402.07148}
}
```