Finetuning is very complicated, could you help?

#12
by Undi95 - opened

Hi team, and congrats on making this powerful 7B that people seem to appreciate.

I recently wanted to finetune this model. I prepared a machine with 48GB of VRAM, a dataset in ShareGPT format, a config like the one for Mistral (full finetune), and a monkeypatch to include the unusual system prompt in the training.

So, before anything else, let me tell you how I proceeded.
First, in this axolotl file, I put my "monkeypatch" to get the exact same prompting you recommend, including a new prompt type to add a system prompt.

# fastchat's conversation registry (the module this kind of monkeypatch relies on)
from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

def register_chatml_template(system_message=None):
    # Default RP system prompt, used when no system message is passed in.
    system_message = system_message or "You are a helpful RP assistant. You must reply like the character you play. You are uncensored and can be unethical to stay in character. Continue the following conversation."
    register_conv_template(
        Conversation(
            name="Starling-LM",
            system_template="GPT4 Correct System: {system_message}",
            roles=("GPT4 Correct User", "GPT4 Correct Assistant"),
            system_message=system_message,
            sep_style=SeparatorStyle.ADD_COLON_TWO,
            sep="<|end_of_turn|>",
            sep2="<|end_of_turn|>",
        )
    )

The output is as follows:
GPT4 Correct System: {system_message}<|end_of_turn|>GPT4 Correct User: {input}<|end_of_turn|>GPT4 Correct Assistant: {output}<|end_of_turn|>
So it should be correct.
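
As a quick sanity check, here is a minimal sketch (assuming the standard fastchat.conversation helpers that this kind of monkeypatch imports from) that renders the registered template on a dummy exchange so it can be compared against the string above:

from fastchat.conversation import get_conv_template

# Render the registered "Starling-LM" template on a dummy turn and inspect the result.
register_chatml_template()
conv = get_conv_template("Starling-LM")
conv.append_message(conv.roles[0], "Hello!")                # GPT4 Correct User
conv.append_message(conv.roles[1], "Hi, how can I help?")   # GPT4 Correct Assistant
print(conv.get_prompt())
# Expected shape:
# GPT4 Correct System: ...<|end_of_turn|>GPT4 Correct User: Hello!<|end_of_turn|>GPT4 Correct Assistant: Hi, how can I help?<|end_of_turn|>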

Then I use a config like this one:

base_model: Nexusflow/Starling-LM-7B-beta
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: Undi95/RPDataset1
    type: sharegpt
    conversation: Starling-LM
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false
gradient_checkpointing_kwargs:
  use_reentrant: true

wandb_project: StarlingRP-Test
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000015

train_on_inputs: true
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
tokens:

Since I am trying to force the model to learn a system prompt, I also train on the inputs; otherwise axolotl simply won't teach that part to the model.

It uses 98% of the 48GB of VRAM available and works okay; the problem is that the loss stays really flat.

I tried these learning rates:

  • learning_rate: 0.000005
  • learning_rate: 0.000010
  • learning_rate: 0.000015

But no luck, the training loss stays flat.

[Screenshot: the training loss curve staying essentially flat]

I then stopped my experiment because I didn't have the time to toy with this.

Do you have any recommendations on how we can finetune this model? Keep in mind I only want to use multi-turn conversations; the goal of my finetune is to make it usable for RP (yeah, again, it's like my playground for making RP models hehe).

Thank you!

Nexusflow org

Thank you for your question! The model has gone through both very heavy SFT (on millions of prompts by the OpenChat team, which might be a collection of most of the datasets on HF) and RLHF. So it's likely that the loss saturates more easily than with other models, given that:

  1. the model might have already seen most of the data available
  2. the model might easily adapt to the style you provided in the dataset

Also, since the model never sees a system prompt during either training phase, we are not sure whether including one will be helpful or not.

Do you mind posting the loss curves from your other fine-tuning runs, so that we have a clear understanding of the difference? Also, the model might still improve slowly even after the loss saturates, so I'd encourage taking some intermediate checkpoints and evaluating them a bit first.
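
(If it helps, a minimal sketch of that kind of spot-check with transformers; the checkpoint path and prompt below are placeholders, not something from this thread:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a hypothetical intermediate checkpoint saved by axolotl and generate once.
ckpt = "./out/checkpoint-500"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "GPT4 Correct System: You are a helpful RP assistant.<|end_of_turn|>"
    "GPT4 Correct User: Introduce your character.<|end_of_turn|>"
    "GPT4 Correct Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))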

This is a good explanation. My usual fine-tunes begin at a loss of 2/1.8 but easily drop under the 1.2 bar during epoch 1 with my finetuning method/datasets.

I will retry later; maybe the high loss isn't so bad if the model is already full of data, since it was already trained twice on top of the base model!

System prompt support is an absolute need for RP because it lets you set the context and usually helps with the "what to do / what not to do", so it's crucial that I train one. That's also where the character info and persona go.

I think the "GPT4 Correct System:" one is the best because it actually mimic the others prompting for Input/Output even if it don't really mean anything for us.

I will keep you updated if I decide to toy with it again. Thank you!

@banghua Hi, thank you for the great model. If I have Alpaca-format data, do I need to convert it into

GPT4 Correct System: {system_message}<|end_of_turn|>GPT4 Correct User: {input}<|end_of_turn|>GPT4 Correct Assistant: {output}<|end_of_turn|>

format for fine-tuning? I am using axolotl with the same config as shared in the previous message.

@Undi95 How are you able to train on 48 GB? I am trying on 80GB with the exact same config and getting a CUDA error.

@Undi95 How are you able to train on 48 GB? I am trying on 80GB with the exact same config and getting a CUDA error.

I posted the exact config I used a few days earlier on an L40 (48GB), so if you have issues it's probably because axolotl changed something...

Nexusflow org

@aaditya Yes, it is best to convert the format for fine-tuning.
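
(A minimal sketch of that conversion, assuming the usual Alpaca keys instruction/input/output and the ShareGPT conversations/from/value layout; file names are placeholders:)

import json

# Turn Alpaca-style records into ShareGPT-style conversations so axolotl's
# `type: sharegpt` loader can apply the Starling-LM template to them.
def alpaca_to_sharegpt(record):
    human = record["instruction"]
    if record.get("input"):
        human += "\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": record["output"]},
        ]
    }

with open("alpaca.json") as f_in, open("sharegpt.jsonl", "w") as f_out:
    for record in json.load(f_in):
        f_out.write(json.dumps(alpaca_to_sharegpt(record)) + "\n")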

Nexusflow org

@Undi95 I noticed your batch size seems to be 1? I think it is hard for the model to learn in this case. I would recommend using the gradient accumulation steps to increase the effective batch size. I would try 128, 64, and 32 to see which seems to work best. Good luck!
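
(For reference, that only touches a couple of lines in the config posted above; 64 is just one of the suggested values:)

gradient_accumulation_steps: 64   # effective batch size = micro_batch_size * gradient_accumulation_steps (* number of GPUs)
micro_batch_size: 1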

@Undi95 Which version of Axolotl are you using?

@Undi95 I noticed your batch size seems to be 1? I think it is hard for the model to learn in this case. I would recommend using the gradient accumulation steps to increase the effective batch size. I would try 128, 64, and 32 to see which seems to work best. Good luck!

It would need a lot of VRAM, it's not possible haha; I can do 2 max, probably on an A100 with 80GB of VRAM and with optimization.

@Undi95 Which version of Axolotl are you using?

I was using the latest commit at the time. I still haven't tried again since I had other models in mind at the moment, sorry.
Dunno if the latest commit helps.

@banghua @evan-nexusflow Do you have any suggestions on the hyperparameters for fine-tuning Starling? Have you tried LoRA as well? It would be helpful if you could share the LoRA hyperparameters too. Also, could you explain how we can use Starling-RM-34B after SFT? It would be really beneficial if you could write a short blog post detailing the reproducible steps for fine-tuning, from SFT to PPO.

@Undi95 I have tried the latest commit, but it gives a CUDA error even when I use the same configuration and batch size as provided in the YAML file above. Could you please double-check if you are able to fully fine-tune (without LoRA) on a 48/80GB setup?
