Question

#1 by lmg-anon - opened

Did you fine-tune the model with or without BOS? Apparently this model works much better without BOS.
https://github.com/01-ai/Yi/discussions/5#discussioncomment-7484547

It was trained using the llama-tokenizer branch of the Yi Llama repo as a base, using the raw text completion format in axolotl with this config YAML:

base_model: ./models/yi-llama-34b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: train-all-yi-4k.jsonl
    type: completion
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./limarp-lora-out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: 34b-qlora
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00015

train_on_inputs: true
group_by_length: false
bf16: true
fp16: false
tf32: true

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 20
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<|startoftext|>"
  eos_token: "<|endoftext|>"
  unk_token: "<unk>"

I'd assume that means it's using token ID 1 as BOS and token ID 2 as EOS, as specified in the config. I haven't had a chance to test it out yet - I'm still running the LoRA merge to base and the upload on another system - but I'm hoping to see results once it's merged into other finetunes when we have them.
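
A quick way to sanity-check that assumption is to load the tokenizer and print its special tokens (a minimal sketch; the path is the one from the config above, so point it at wherever your local copy lives):

from transformers import AutoTokenizer

# Path taken from the config above - adjust to your local tokenizer.
tok = AutoTokenizer.from_pretrained("./models/yi-llama-34b")

# Should print <|startoftext|> 1 and <|endoftext|> 2 if the assumption holds.
print(tok.bos_token, tok.bos_token_id)
print(tok.eos_token, tok.eos_token_id)

# Whether BOS actually gets prepended when tokenizing is a separate question,
# governed by add_bos_token in tokenizer_config.json:
ids = tok("Hello").input_ids
print(ids[0] == tok.bos_token_id)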

EDIT: Hmm, looking at the tokenizer_config.json itself, it specifies "add_bos_token": false, but I'm not sure what context that refers to.

> but I'm not sure what context that refers to

It means that inference shouldn't have BOS as the first token. So if the dataset contains BOS, this could cause problems... well, this is something to be aware of if the model doesn't perform as expected with this fine-tune. I'm still trying to merge it myself too.
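
If you want to see the flag's effect directly, the Llama-style tokenizers in transformers expose add_bos_token as an attribute you can toggle (a rough sketch, reusing the same placeholder path as above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./models/yi-llama-34b")  # placeholder path

tok.add_bos_token = True
print(tok("Hello").input_ids)   # first ID is tok.bos_token_id

tok.add_bos_token = False
print(tok("Hello").input_ids)   # no BOS prepended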

I'm not sure if there's an arg in axolotl to specify whether or not to use a BOS token during training. I suppose you could set bos_token: "" in the yaml?

I have absolutely no clue either, but if the BOS isn't in the dataset, then maybe it's already not using it, since tokenizer_config.json specifies "add_bos_token": false, as you mentioned before.

OK, I found that using BOS with this LoRA causes the model to get the format completely wrong, while without BOS everything seems to go smoothly. So I guess everything was right after all πŸ‘
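
For anyone who wants to replicate the comparison: you can force the no-BOS behavior at inference, regardless of what tokenizer_config.json says, by tokenizing with add_special_tokens=False (a sketch; the merged-model path and prompt are placeholders, not the actual repo):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./limarp-lora-out-merged"  # placeholder for the merged model
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Your completion-format prompt here"
# add_special_tokens=False guarantees no BOS, whatever add_bos_token is set to.
inputs = tok(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))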

lmg-anon changed discussion status to closed
