How to Fine-tune Jamba on Google Colab?

#26

by Ateeqq - opened Apr 1

Discussion

Ateeqq

Apr 1

•

edited Apr 3

On 1 GPU 🤗

Done: https://exnrt.com/blog/ai/finetune-jamba-v01/
Thanks to @alvations for the help.

alvations

Apr 1

•

edited Apr 1

I've tried an A100 on Colab, but it looks like there are still some bugs in accelerate's auto mappings: https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing

When initializing the SFTTrainer, it throws the error:

/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-7-b027ad8b6132> in <cell line: 1>()
----> 1 trainer = SFTTrainer(
      2     model=model,
      3     tokenizer=tokenizer,
      4     args=training_args,
      5     peft_config=lora_config,

12 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
   1148                 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                             non_blocking, memory_format=convert_to_format)
-> 1150             return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
   1151 
   1152         return self._apply(convert)

NotImplementedError: Cannot copy out of meta tensor; no data!

alvations

Apr 1

And I think it's also complaining about moving the model when accelerate has offloaded some parameters:

/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-5df240c1e5f7> in <cell line: 3>()
      1 import torch
      2 
----> 3 trainer = SFTTrainer(
      4     model=model,
      5     train_dataset=valid_dataset,

3 frames
/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py in wrapper(*args, **kwargs)
    451                 for param in model.parameters():
    452                     if param.device == torch.device("meta"):
--> 453                         raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
    454                 return fn(*args, **kwargs)
    455 

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

E.g. https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing
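
Both tracebacks appear to point at parameters that accelerate left on the meta device, which is what happens when device_map='auto' places some modules on CPU or disk because they do not fit on the GPU. A minimal sketch for checking this before building the trainer, assuming the model was loaded with a device_map so that it exposes the hf_device_map dictionary (module name to device). The helper name offloaded_modules and the sample map are hypothetical:

```python
def offloaded_modules(device_map):
    """Return the module names that were placed on CPU or disk instead of a GPU."""
    return sorted(
        name
        for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


# Hypothetical device map, shaped like model.hf_device_map after device_map='auto'.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.30": "cpu",
    "model.layers.31": "disk",
    "lm_head": "disk",
}

print(offloaded_modules(example_map))

# With a real model (assumption: it was loaded with device_map='auto'):
# print(offloaded_modules(model.hf_device_map))
```

If the list is non-empty, the model did not fit on the GPU as loaded, so a smaller footprint (for example quantized loading) is likely needed before the trainer can move or wrap it.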

alvations

Apr 1

•

edited Apr 1

After some tinkering and switching to 4-bit quantization as per https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py, it runs!!

Example: https://colab.research.google.com/drive/1EK-PeLXfO1oOxSY5zlRmVvOzBPrYnp-d?usp=sharing

Installs

! pip install -U pip
! pip install -U transformers==4.39.2
! pip install causal-conv1d mamba-ssm
! pip install accelerate peft bitsandbytes trl
! pip install -U datasets sacrebleu evaluate 
! pip install -U flash_attn

Code

from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments,  BitsAndBytesConfig
import mamba_ssm


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_skip_modules=["mamba"],  # BitsAndBytesConfig's option name; it also applies to 4-bit loading. Maybe not necessary (per axolotl), but worth testing.
)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")


model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    trust_remote_code=True, 
    device_map='auto',
    attn_implementation="flash_attention_2", 
    quantization_config=quantization_config, 
    use_mamba_kernels=True
    )


valid_data = load_dataset("facebook/flores", "eng_Latn-deu_Latn", streaming=False, split="dev")

# From https://stackoverflow.com/q/78156752/610569
def preprocess_func(row):
  return {'text': "Translate from English to German: <s>[INST] " + row['sentence_eng_Latn'] + " [/INST] " + row['sentence_deu_Latn'] + " </s>"}

valid_dataset = valid_data.map(preprocess_func)

valid_dataset['text'][-5:]

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim = "adamw_8bit",
    max_grad_norm = 0.3,
    weight_decay = 0.001,
    warmup_ratio = 0.03,
    gradient_checkpointing=True,
    logging_dir='./logs',
    logging_steps=1,
    max_steps=50,
    group_by_length=True,
    lr_scheduler_type = "linear",
    learning_rate=2e-3
)
lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    init_lora_weights=False,
    r=8,
    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none"
)


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=valid_dataset,
    max_seq_length = 256,
    dataset_text_field="text",
)


trainer.train()
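
For orientation, the arguments above fix how much data the run actually sees: in the Hugging Face Trainer a positive max_steps overrides num_train_epochs, and each optimizer step consumes per_device_train_batch_size times gradient_accumulation_steps examples per device. A small sketch of that arithmetic, assuming a single GPU. The dataset size of 997 rows is an assumption for the FLORES dev split, so check len(valid_dataset) in your own run:

```python
def examples_seen(per_device_batch_size, grad_accum_steps, max_steps, num_devices=1):
    """Return (effective batch size, total examples consumed by the run)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, effective_batch * max_steps


# Values taken from the TrainingArguments above.
effective_batch, seen = examples_seen(
    per_device_batch_size=1, grad_accum_steps=4, max_steps=50
)

dataset_rows = 997  # assumption: size of the FLORES dev split
fraction_of_epoch = seen / dataset_rows

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.2f}")
```

Under these assumptions the run stops well short of a full pass over the data, which is fine for a smoke test but worth keeping in mind when reading the loss curve.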

Alex18

Apr 2

Can you please share the specification of the GPU device on which you were able to run the above fine-tuning script? I am having problems loading the model into memory, even when using an AWS SageMaker g5.16xlarge.

alvations

Apr 2

•

edited Apr 2

It's an A100 instance on Colab, so you'll need a p4/p5 instance on AWS.
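
A rough way to see why a smaller GPU falls short is a back-of-the-envelope estimate of the weight memory alone. This sketch assumes roughly 52B total parameters for Jamba-v0.1 (the figure AI21 reports) and ignores activations, LoRA weights, optimizer state, and any modules left unquantized. The GPU sizes mentioned afterwards (24 GB for the A10G in a g5 instance, 40 GB for a Colab A100) are likewise assumptions to check against your own hardware:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 52e9  # assumption: ~52B total parameters for Jamba-v0.1

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions the 4-bit weights alone come to about 26 GB, which already exceeds a 24 GB card before any training overhead but leaves some headroom on a 40 GB one.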

zhoutongfu

Apr 2

@alvations any specific reason you set max_grad_norm as 0.3?

alvations

Apr 2

I've followed https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py

But I'm seeing the loss zero out really fast after 200+ steps, so there's definitely a lot of room for "student gradient descent", i.e. hyperparameter search.
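
A minimal sketch of what such a search could look like: enumerate a small grid over the arguments most likely to matter here and pass each combination to TrainingArguments. The candidate values are hypothetical starting points, not tuned recommendations, and build_grid is an illustrative helper rather than part of any library:

```python
from itertools import product


def build_grid(search_space):
    """Expand a dict of candidate lists into a list of override dicts."""
    keys = list(search_space)
    combos = product(*(search_space[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


# Hypothetical candidates; the first value of each list is the one used in this thread.
search_space = {
    "learning_rate": [2e-3, 2e-4, 2e-5],
    "max_grad_norm": [0.3, 1.0],
    "warmup_ratio": [0.03, 0.1],
}

grid = build_grid(search_space)
print(f"{len(grid)} configurations")
for overrides in grid[:3]:
    print(overrides)

# Each dict can then be unpacked into the run's arguments, e.g.
# TrainingArguments(output_dir="./results", max_steps=50, **overrides)
```

Running each configuration for the same small number of steps and comparing the loss curves on a held-out split is one way to tell a genuinely better setting from one that simply overfits faster.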
