How to Fine-tune Jamba on Google Colab?

#26
by Ateeqq - opened

On 1 GPU 🤗

Done: https://exnrt.com/blog/ai/finetune-jamba-v01/
Thanks to @alvations for the help.

I've tried an A100 on Colab, but it looks like there are still some bugs in the accelerate auto device mapping: https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing

When initializing the SFTTrainer, it throws the error:

/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-7-b027ad8b6132> in <cell line: 1>()
----> 1 trainer = SFTTrainer(
      2     model=model,
      3     tokenizer=tokenizer,
      4     args=training_args,
      5     peft_config=lora_config,

12 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
   1148                 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                             non_blocking, memory_format=convert_to_format)
-> 1150             return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
   1151 
   1152         return self._apply(convert)

NotImplementedError: Cannot copy out of meta tensor; no data!

And I think it's also complaining about moving the model when accelerate has offloaded some parameters:

/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-5df240c1e5f7> in <cell line: 3>()
      1 import torch
      2 
----> 3 trainer = SFTTrainer(
      4     model=model,
      5     train_dataset=valid_dataset,

3 frames
/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py in wrapper(*args, **kwargs)
    451                 for param in model.parameters():
    452                     if param.device == torch.device("meta"):
--> 453                         raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
    454                 return fn(*args, **kwargs)
    455 

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

E.g. https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing
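Both tracebacks are consistent with part of the model having been offloaded: with device_map='auto', accelerate keeps whatever does not fit on the GPU on CPU or disk, and the trainer then tries to move the model. A minimal sketch for checking this before building the trainer, assuming the model was loaded with a device_map so that transformers set `model.hf_device_map`. The helper name `find_offloaded_modules` and the sample map are hypothetical, for illustration only:

```python
def find_offloaded_modules(device_map):
    """Return the module names that were placed on CPU or disk."""
    return sorted(
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


# Hypothetical device map, shaped like model.hf_device_map after loading
# with device_map="auto" on a GPU that is too small for the model.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "model.layers.2": "disk",
    "lm_head": "disk",
}

offloaded = find_offloaded_modules(example_map)
print(offloaded)
```

With a real model, pass `model.hf_device_map` instead of the sample dict. A non-empty result means some modules are not on the GPU, which is the situation the RuntimeError above describes.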

After some tinkering and switching to 4-bit quantization, as per https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py, it runs!

Example: https://colab.research.google.com/drive/1EK-PeLXfO1oOxSY5zlRmVvOzBPrYnp-d?usp=sharing

Installs

! pip install -U pip
! pip install -U transformers==4.39.2
! pip install causal-conv1d mamba-ssm
! pip install accelerate peft bitsandbytes trl
! pip install -U datasets sacrebleu evaluate 
! pip install -U flash_attn

Code

import torch
import mamba_ssm  # not used directly; importing it fails early if the Mamba kernels are missing
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer


# Load the weights in 4-bit so the model fits on a single A100.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int4_skip_modules=["mamba"],  # Maybe not necessary (per axolotl), but kept to test.
)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")


model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    trust_remote_code=True, 
    device_map='auto',
    attn_implementation="flash_attention_2", 
    quantization_config=quantization_config, 
    use_mamba_kernels=True
    )


# Use the FLORES English-German dev split as a small training set.

valid_data = load_dataset("facebook/flores", "eng_Latn-deu_Latn", streaming=False, split="dev")

# From https://stackoverflow.com/q/78156752/610569
def preprocess_func(row):
    # Build one training string per sentence pair.
    return {
        'text': "Translate from English to German: <s>[INST] " + row['sentence_eng_Latn']
                + " [INST] " + row['sentence_deu_Latn'] + " </s>"
    }

valid_dataset = valid_data.map(preprocess_func)

valid_dataset['text'][-5:]

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    max_grad_norm=0.3,
    weight_decay=0.001,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    logging_dir='./logs',
    logging_steps=1,
    max_steps=50,  # takes precedence over num_train_epochs
    group_by_length=True,
    lr_scheduler_type="linear",
    learning_rate=2e-3,
)

# LoRA adapters on the embeddings and the Mamba projection layers.
lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    init_lora_weights=False,
    r=8,
    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
)


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=valid_dataset,
    max_seq_length=256,
    dataset_text_field="text",
)


trainer.train()
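To make the training budget of the script above concrete, here is a small arithmetic sketch using the same values. It assumes a single GPU, and it assumes the Trainer derives warmup steps as the ceiling of max_steps * warmup_ratio, which is my understanding of the transformers behaviour rather than something stated in this thread:

```python
import math

# Values copied from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 50
warmup_ratio = 0.03
num_gpus = 1  # assumption: a single Colab GPU

# Examples that contribute to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Total examples consumed before training stops at max_steps.
examples_seen = effective_batch * max_steps

# Assumed warmup rule: ceil(total steps * warmup_ratio).
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch, examples_seen, warmup_steps)
```

So this run is a short smoke test: 50 optimizer updates over 200 examples, not a full pass over the data.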

Can you please share the specification of the GPU device on which you were able to run the above fine-tuning script? I am having problems loading the model into memory even when using an AWS SageMaker g5.16xlarge.

It's an A100 instance on Colab, so you'll need a p4/p5 instance on AWS.
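A rough, weights-only estimate shows why the smaller instance struggles. This sketch assumes the 52B total parameter count given on the Jamba-v0.1 model card, a 24 GB A10G on the g5.16xlarge, and a 40 GB A100 on Colab. It ignores activations, LoRA weights, optimizer state, and the Mamba modules that the script skips during quantization, so real usage will be higher. The function name `weight_memory_gb` is hypothetical:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 52e9  # assumption: total parameters per the model card

estimates = {
    label: weight_memory_gb(TOTAL_PARAMS, bits)
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]
}

for label, gb in estimates.items():
    print(f"{label}: ~{gb:.0f} GB")
```

Under these assumptions even the 4-bit weights (about 26 GB) exceed a 24 GB card, while they fit within 40 GB with some room left for training.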

@alvations any specific reason you set max_grad_norm as 0.3?

I've followed https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py

But I'm seeing the loss zero out really fast after 200+ steps, so there's definitely a lot of room for "student gradient descent", i.e. hyperparameter search.
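A minimal sketch of what such a search could look like, as a plain grid over the two settings discussed here. The candidate values are hypothetical apart from learning_rate=2e-3 and max_grad_norm=0.3 (both from the script above) and max_grad_norm=1.0 (the TrainingArguments default); `build_grid` is an illustrative helper, not part of any library:

```python
from itertools import product

# 2e-3 is the value used above; the lower rates are hypothetical candidates.
learning_rates = [2e-3, 2e-4, 2e-5]
# 0.3 comes from the script; 1.0 is the TrainingArguments default.
max_grad_norms = [0.3, 1.0]


def build_grid(lrs, norms):
    """Return one override dict per (learning_rate, max_grad_norm) pair."""
    return [
        {"learning_rate": lr, "max_grad_norm": norm}
        for lr, norm in product(lrs, norms)
    ]


grid = build_grid(learning_rates, max_grad_norms)
for run_id, overrides in enumerate(grid):
    print(run_id, overrides)
```

Each dict can be merged into the keyword arguments passed to TrainingArguments for one short run, and the loss curves compared across runs.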
