# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization - **Fork by [crumb](https://hf.co/crumbly) for GPT2-linear-XL**

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration in `transformers`. The integration builds on work introducing 4-bit quantization techniques with no performance degradation, with the goal of democratizing LLM inference and training.

In this notebook, we will learn together how to load a large model in 4-bit ~~(`gpt-neo-x-20b`)~~ (`gpt2-xl`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q wandb



First, let's load the model we are going to use: GPT2-XL.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# we'll use the bf16 version because it takes up 1/2 the space
# and is quicker to download
model_id = "crumbly/gpt2-linear-xl-sharded-bf16"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"":0}, quantization_config=bnb_config, trust_remote_code=True)

A new version of the following files was downloaded from https://huggingface.co/crumbly/gpt2-linear-xl:
- configuration_gpt2l.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/crumbly/gpt2-linear-xl:
- modeling_gpt2l.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
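To get a feel for what the `BitsAndBytesConfig` above buys in memory, here is a rough back-of-the-envelope sketch. It assumes the block sizes described in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block) and uses a round, illustrative figure of 1.5 billion quantized weights; neither number is read from the loaded model, and the helper names are hypothetical. It also ignores layers that stay in higher precision, such as embeddings.

```python
def bits_per_weight(double_quant=True, block_size=64, second_block_size=256):
    """Approximate storage cost of one 4-bit quantized weight, in bits.

    Assumed scheme (from the QLoRA paper): every `block_size` weights share
    one scaling constant. Without double quantization the constant is fp32;
    with it, the constant is stored in 8 bits and a second fp32 constant is
    shared across `second_block_size` first-level constants.
    """
    weight_bits = 4.0
    if double_quant:
        overhead = 8.0 / block_size + 32.0 / (block_size * second_block_size)
    else:
        overhead = 32.0 / block_size
    return weight_bits + overhead


def footprint_gib(n_weights, bits):
    """Convert a weight count and a per-weight bit cost to GiB."""
    return n_weights * bits / 8 / 2**30


N_WEIGHTS = 1.5e9  # illustrative assumption, roughly GPT2-XL scale

for label, bits in [
    ("bf16", 16.0),
    ("4-bit", bits_per_weight(double_quant=False)),
    ("4-bit + double quant", bits_per_weight(double_quant=True)),
]:
    print(f"{label:>22}: {bits:.3f} bits/weight, {footprint_gib(N_WEIGHTS, bits):.2f} GiB")
```

Under these assumptions, double quantization trims the per-weight overhead of the scaling constants from 0.5 bits to about 0.127 bits, which is a small but free saving on top of the 4-bit weights themselves.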

In [3]:
# generate just to verify that the model works and was loaded correctly
inputs = {k:v.cuda() for k,v in tokenizer("Once upon a time,", return_tensors='pt').items()}
outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.7, do_sample=True)
tokenizer.decode(outputs[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Once upon a time, it was said that the best way to predict the future was to take the actions of past generations and predict the future. Unfortunately, this is no longer true.'

In [4]:
# this isn't supported yet with the GPT2 model we use, but for other models:
# uncomment these lines and run them
# from peft import prepare_model_for_kbit_training
# model.gradient_checkpointing_enable()
# model = prepare_model_for_kbit_training(model)

In [5]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [6]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    # ReLoRA uses r=128 by default in their code, but r=1 will even work to a degree
    r=8,
    lora_alpha=32,
    # c_attn is our qkv
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2457600 || all params: 822788800 || trainable%: 0.2986914746530337
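The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the standard GPT2-XL dimensions (hidden size 1600, 48 layers, and a fused `c_attn` projection of width 3 × 1600); these values are assumptions about the architecture rather than something read from the loaded model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, in_features, out_features, n_layers):
    """Parameters added by LoRA on one linear layer per transformer block.

    Each adapted layer gets two low-rank matrices:
    A with shape (r, in_features) and B with shape (out_features, r).
    """
    return r * (in_features + out_features) * n_layers


HIDDEN = 1600          # assumed GPT2-XL hidden size
N_LAYERS = 48          # assumed GPT2-XL depth
QKV_OUT = 3 * HIDDEN   # c_attn produces query, key and value in one projection

trainable = lora_param_count(r=8, in_features=HIDDEN, out_features=QKV_OUT, n_layers=N_LAYERS)
print(trainable)  # matches the "trainable params" figure reported by the cell above

# PEFT scales the low-rank update by lora_alpha / r
print(32 / 8)
```

Because the count is linear in `r`, doubling the rank doubles the number of trainable parameters while leaving the frozen base model untouched.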


Let's load a dataset, Open Orca, to fine-tune our model on instruction data. For simplicity, we'll use newlines to separate the system prompt, question, and response.

In [7]:
from datasets import load_dataset

# we'll use streaming=True so we stream examples over the internet
# rather than downloading the entire dataset to process
data = load_dataset("Open-Orca/OpenOrca", streaming=True)

def strip(batch):
    # to remove trailing spaces or newlines from our prompts
    return [
        i.strip() for i in list(batch)
    ]

def process(batch):
    systems = [i for i in strip(batch['system_prompt'])]
    questions = [i for i in strip(batch['question'])]
    responses = [i for i in strip(batch['response'])]
    prompts = zip(systems, questions, responses)
    prompts = ["\n".join(i) for i in prompts]
    prompts = strip(prompts)
    return prompts

# we'll also set the max length to something lower than normal, so we don't go out-of-memory.
tokenizer.model_max_length = 768
data = data.map(lambda samples: tokenizer(process(samples), truncation=True), batched=True)
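To see what the preprocessing produces before tokenization, the following sketch applies the same strip-and-join logic to a tiny hand-written batch. The batch contents and the helper names `strip_all` and `build_prompts` are made up for illustration; only the three column names come from the cell above.

```python
def strip_all(texts):
    """Remove leading and trailing whitespace from every string."""
    return [t.strip() for t in texts]


def build_prompts(batch):
    """Join system prompt, question and response with newlines, one prompt per row."""
    rows = zip(
        strip_all(batch["system_prompt"]),
        strip_all(batch["question"]),
        strip_all(batch["response"]),
    )
    # The final strip drops the leading newline left behind by an empty system prompt.
    return strip_all("\n".join(row) for row in rows)


# Hypothetical two-row batch in the column layout used above
toy_batch = {
    "system_prompt": ["You are a helpful assistant. ", ""],
    "question": [" What is 2 + 2?", "Name a primary color.\n"],
    "response": ["4\n", " Red"],
}

for prompt in build_prompts(toy_batch):
    print(repr(prompt))
```

Note that a row with an empty system prompt simply starts at the question, so the model sees both two-part and three-part prompts during training.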

Run the cell below to start training! For the sake of the demo, we ran it for only a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

In [8]:
import transformers

# the GPT-2 tokenizer has no pad token by default, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        # your 'effective batch size' is the product of these two numbers
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,

        # you can count the examples you're going to train on by
        # multiplying max_steps by your effective batch size
        # here that is 64 steps * 8 = 512 examples
        max_steps=64,
        warmup_steps=16,

        learning_rate=2e-4,
        fp16=True,
        logging_steps=4,
        output_dir="outputs",
        optim="paged_adamw_8bit",

        # if you want to log the loss graph to your wandb, change "none" to "wandb"
        report_to="none"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 4    | 2.8493        |
| 8    | 2.5079        |
| 12   | 2.7443        |
| 16   | 2.5377        |
| 20   | 2.8088        |
| 24   | 2.6194        |
| 28   | 2.521         |
| 32   | 2.5435        |
| 36   | 2.4396        |
| 40   | 2.3699        |


TrainOutput(global_step=64, training_loss=2.5019835233688354, metrics={'train_runtime': 303.3326, 'train_samples_per_second': 1.688, 'train_steps_per_second': 0.211, 'total_flos': 802220553600000.0, 'train_loss': 2.5019835233688354, 'epoch': 1.0})
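The training budget and learning-rate curve implied by the arguments above can be worked out without running anything. The sketch below assumes the `Trainer` default of a linear schedule (warmup from zero to the peak, then linear decay to zero), since the notebook does not set `lr_scheduler_type`; `linear_warmup_decay` is a hypothetical helper that mirrors that shape, not a function from `transformers`.

```python
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 8
MAX_STEPS = 64
WARMUP_STEPS = 16
PEAK_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
examples_seen = MAX_STEPS * effective_batch
print(effective_batch, examples_seen)


def linear_warmup_decay(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS, max_steps=MAX_STEPS):
    """Learning rate at an optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return peak_lr * remaining


for step in (0, 8, 16, 40, 64):
    print(step, linear_warmup_decay(step))
```

As a sanity check, the reported throughput of about 1.688 samples per second over a runtime of about 303 seconds also works out to roughly 512 examples.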

In [16]:
inputs = {k:v.cuda() for k,v in tokenizer("""
You are an AI assistant. You will be given a question. You must generate a short and factual answer.
What is the capital city of France?
""", return_tensors='pt').items()}
outputs = model.generate(**inputs, max_new_tokens=16, temperature=0.5, do_sample=True)
print(tokenizer.decode(outputs[0]), "...")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



You are an AI assistant. You will be given a question. You must generate a short and factual answer.
What is the capital city of France?


Paris

Paris is the capital of France. The city is located ...


To save your adapters, you can either use

```python
model.save_pretrained("local_folder")
```

or push them to the hub with

```python
model.push_to_hub("myusername/my_repo")
```

If you would like to merge the adapters into your model, you'll have to load the base model again without quantization and merge them like this:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

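# trust_remote_code=True is needed because this checkpoint ships custom modeling code
model = AutoModelForCausalLM.from_pretrained("crumbly/gpt2-linear-xl-sharded-bf16", trust_remote_code=True)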
model = PeftModel.from_pretrained(model, "myusername/my_repo")
model = model.merge_and_unload()
```

You can then push that to the hub or save it to a local folder like before, but including all of the weights.
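Conceptually, merging folds the low-rank update into the base weight so that no adapter is needed at inference time: `W_merged = W + (lora_alpha / r) * B @ A`. The toy sketch below illustrates that arithmetic on a 2 × 3 matrix in plain Python. The matrices and the `merge_lora` helper are made up for illustration, and this is not how PEFT implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical toy layer: out_features=2, in_features=3, rank r=1
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, -1.0, 0.0]]  # shape (r, in)
lora_b = [[2.0], [1.0]]      # shape (out, r)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=4, r=1)
print(merged)

# The merged layer gives the same output as base layer plus scaled adapter path
x = [[1.0], [2.0], [3.0]]  # one input column vector
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
combined = [[b[0] + 4.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == combined)
```

This addition needs the base weights in an ordinary floating-point format, which is one way to see why the base model is reloaded without quantization before merging.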