Saving, loading, and running inference with the fine-tuned Gemma model

#64
by Iamexperimenting - opened

Hi team, thanks for the model and examples. I went through this example and couldn't find a section where you save the fine-tuned model, load it back, and run inference with it.

Notebook: https://huggingface.co/google/gemma-7b/blob/main/examples/notebook_sft_peft.ipynb

Can you please add those sections? It would be very helpful.

@suryabhupa

Additionally, I noticed an example for distributed training on TPU devices. However, I use NVIDIA GPUs and don't have access to a TPU machine. Could you please provide an example for distributed training on NVIDIA GPUs?

Example for TPU: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py
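For context, my rough guess at what a GPU version could look like is below (launching the same kind of SFT script with torchrun and letting the Trainer handle DDP), but I'm not sure this is the recommended approach, hence the question:

# rough, untested sketch of multi-GPU LoRA fine-tuning on NVIDIA hardware
# launch with:  torchrun --nproc_per_node=<num_gpus> sft_gemma_gpu.py
import os
import torch
import transformers
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
# no device_map here: under torchrun each process owns one GPU and the Trainer wraps the model in DDP
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, token=os.environ['HF_TOKEN'])

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

data = load_dataset("Abirate/english_quotes")  # placeholder dataset, replace with your own

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    dataset_text_field="quote",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=10,
        learning_rate=2e-4,
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()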

@suryabhupa

Google org

hello!

Re: loading the IT model; the use case you have in mind seems somewhat specific. Do you think it might be possible for you to open a PR adding this?

Re: GPUs: unfortunately I don't have any examples on hand for distributed training on GPUs (our internal stack at Google is TPU-based). Maybe others have pointers here? cc @pengchong @osanseviero

@suryabhupa, I'm new to Hugging Face, which is why I'm looking for a concrete example. I googled as much as I could and found that there are several different ways to save a model, and that a model fine-tuned with PEFT has its own way of being saved and loaded. That's why I was recommending that the Gemma team add those sections to their example.

I believe saving and loading are the same for all use cases.

@suryabhupa @ybelkada I fine-tuned the 7B model. Right after fine-tuning, inference gives correct results. I then saved the fine-tuned model, but when I load it back and run inference with it, the model hallucinates like the un-fine-tuned model; the output is nowhere near what the fine-tuned model produced before saving.

I'm not sure whether I'm saving, merging, and loading the model correctly. Can you please guide me here?

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)


model_id = "google/gemma-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        num_train_epochs=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=5,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)
trainer.train()

# save the fine-tuned model (LoRA adapter) locally; with no path it goes to output_dir="outputs"
trainer.save_model()


# Empty VRAM
del model
del trainer
import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()

from peft import AutoPeftModelForCausalLM

new_model = AutoPeftModelForCausalLM.from_pretrained(
    'outputs',
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

merged_model = new_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("custom-fine-tuned-merged", safe_serialization=True)
tokenizer.save_pretrained("custom-fine-tuned-merged")

text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.cuda.amp.autocast():
    outputs = merged_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Google org

Hi @Iamexperimenting,
Thanks for the experiments! Can you show us how you create the LoraConfig?

Oh, my bad. I have updated the code:

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

@ybelkada

Google org

Hi @Iamexperimenting
Thanks! What happens if you generate with the un-merged model? Can you also try to reload the model in 4-bit precision or torch.bfloat16? I wonder if there is something off with the model precision.

@ybelkada

What happens if you generate with the un-merged model?
Answer: it generates the expected output.

Can you also try to reload the model in 4-bit precision or torch.bfloat16?
Can you please provide a code snippet to reload the model in 4-bit precision or torch.bfloat16?
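My rough guess at what that would look like is below (reloading the adapter saved in "outputs", assuming that's what you mean), but please correct me if I'm off:

# guess: reload the fine-tuned adapter on top of a 4-bit base, with the same quantization as training
import torch
from transformers import BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoPeftModelForCausalLM.from_pretrained(
    "outputs",
    quantization_config=bnb_config,
    device_map={"": 0},
)

# alternatively (pick one), an un-quantized base in bfloat16
model_bf16 = AutoPeftModelForCausalLM.from_pretrained(
    "outputs",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)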

Hi @ybelkada, can you please provide a sample?

@ybelkada @suryabhupa can you please help me here?

Google org

I would just carefully check that everything (the precision, the weights, the activations) is what you expect at every step of the way, all the way to inference, whether it's with the un-merged model or not. I would suspect something is getting overwritten or not being loaded properly.
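For example, a quick sanity check could look something like the rough sketch below (adapt the paths and names to your setup):

# rough sketch: inspect dtypes after training and look at what was actually written to disk
import torch
from safetensors.torch import load_file

# 1) dtype/device of a few trained parameters, right after trainer.train()
for name, param in list(trainer.model.named_parameters())[:5]:
    print(name, param.dtype, param.device)

# 2) the LoRA tensors that were saved (adapter_model.safetensors here,
#    adapter_model.bin on older peft versions)
saved = load_file("outputs/adapter_model.safetensors")
for name, tensor in list(saved.items())[:5]:
    print(name, tensor.dtype, float(tensor.abs().mean()))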

@suryabhupa @ybelkada, I feel the same. Can you please help me? I have put my training and inference script below.

Please find the reproducible code below.

Here I'm fine-tuning the Gemma model on my dataset.

import os
import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset, load_dataset

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field = "text",
    max_seq_length = 512,
    args=transformers.TrainingArguments(
        num_train_epochs = 10,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        seed = 12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)

trainer.train()

After training finished, I immediately tested with two examples to check how the model predicts, and it generated the expected output.

# Below is with example 1 input

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Below is with example 2 input

text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After checking the performance of the fine-tuned model, I saved it with the step below.

trainer.save_model("finetuned_model")
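(As far as I understand, since the trainer wraps a PEFT model, this writes only the LoRA adapter files, not the full merged weights:)

import os
print(sorted(os.listdir("finetuned_model")))
# expecting at least adapter_config.json and adapter_model.safetensors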

After saving the model, I restarted the kernel and loaded the fine-tuned model.

from peft import AutoPeftModelForCausalLM
import torch
new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
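The tokenizer used below is re-created after the restart as well; a minimal sketch, assuming it is loaded from the base checkpoint:

import os
from transformers import AutoTokenizer
# assumption: the tokenizer comes from the base model id; it could equally be loaded
# from "finetuned_model" if it was saved alongside the adapter
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ['HF_TOKEN'])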

After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I'm not sure where I'm making a mistake; could you please help me here?

Google org

It seems like the problem might be somewhere between restarting the kernel and re-loading the fine-tuned model; something seems broken there.
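One way to narrow it down could be to compare the LoRA tensors that were saved to disk with the ones inside the reloaded model; a rough sketch:

# rough sketch: compare what was saved with what was reloaded
import torch
from safetensors.torch import load_file

saved = load_file("finetuned_model/adapter_model.safetensors")
reloaded = {n: p.detach().float().cpu() for n, p in new_finetuned_model.named_parameters() if "lora_" in n}

print(len(saved), "tensors on disk,", len(reloaded), "LoRA tensors in the reloaded model")
# key names differ slightly between the two dicts, so just eyeball a few norms by hand
for name, tensor in list(saved.items())[:3]:
    print("disk:", name, float(tensor.float().norm()))
for name, tensor in list(reloaded.items())[:3]:
    print("reloaded:", name, float(tensor.norm()))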
