Two repeated errors in model output

#102
by virilo - opened

I'm training Falcon 7B as a Python assistant on a custom Q&A dataset.

To start with, it would be great if I were able to overfit the questions and make the model memorize the answers.

I attached the training code at the end of this post.

This training code also serves as the answer (desired output) to a question in the dataset. Indeed, my Q&A_reduced_dataset.csv contains 100 different questions, each using this code as the answer, as a method of prompt augmentation.

Using this ground-truth training code as a reference, I performed 400 inferences on the same input prompt, one for each training-epoch checkpoint.
Inference was performed twice, with temperatures of 0.20 and 0.50, achieving practically the same results.
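
A minimal sketch of such a per-checkpoint sampling loop, assuming each epoch checkpoint in the output directory contains the saved LoRA adapter and loading it with peft; the prompt string and generation length are illustrative placeholders, not my exact script:

    # Sketch: sample one completion per epoch checkpoint (paths and prompt are placeholders)
    import glob
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    prompt = "<Human>: ...question from the dataset...\n<AI>:"  # hypothetical test prompt

    checkpoints = sorted(
        glob.glob("falcon_trained_model_v4d_script/checkpoint-*"),
        key=lambda p: int(p.rsplit("-", 1)[1]),  # sort by step, not lexicographically
    )
    for ckpt in checkpoints:
        # load the base model plus the LoRA adapter saved at this epoch
        model = AutoPeftModelForCausalLM.from_pretrained(
            ckpt, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.20)
        print(ckpt, tokenizer.decode(out[0], skip_special_tokens=False))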

The model has memorized the answer at almost 99%, but I wasn't able to make Falcon 7B memorize it completely, whereas, for reference, I did manage to do so with other models.

I constantly face the following two issues:

1️⃣ Term fixation

The model hardly ever outputs the term "bfloat16" and seems to be fixated on the term "float16" instead.

The 400 inference samples gave me the following outputs:

    - only 3 inferences containing the term "bfloat16"

    - 183 inferences containing "float16"

    - the remaining 214 inferences contained neither "bfloat16" nor "float16"

This is only one example; I've found more cases of this kind of fixation in PHP code and plain-text composition as well.
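
A simple way to obtain such counts, assuming each of the 400 generations was saved to its own text file (the directory name is a placeholder):

    # Sketch: classify each saved generation by which dtype term it contains
    import glob

    bf16 = f16_only = neither = 0
    for path in glob.glob("inference_outputs/*.txt"):  # hypothetical output directory
        text = open(path).read()
        if "bfloat16" in text:
            bf16 += 1
        elif "float16" in text:  # reached only when "bfloat16" is absent
            f16_only += 1
        else:
            neither += 1

    print(f"bfloat16: {bf16}, float16 only: {f16_only}, neither: {neither}")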

2️⃣ Code hallucinations at the end of the message

As an example, the training code ends with the following two lines:

    trainer.train()

    model.save_pretrained(out_dir)

But in this example, the generated output appends a hallucination after them:

    trainer.train()

    model.save_pretrained(out_dir)

    auto_find_batch_size(
        pretrained_model=True,
        auto_find_batch_size_min=10,
        max_batch_size=125,
    )

I illustrated it with a short example, but it's quite common for the hallucinations to consist of repeating one or two lines a large number of times.

I could find just one successful ending out of the 400 inferences. By "success" I mean properly ending with "model.save_pretrained(out_dir)" followed by the EOS token.

Statistics on the number of hallucinated lines at the end of the output:

    mean      69.94 lines
    std       73.83 lines
    min        0    lines
    max      968    lines
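
A sketch of how such counts can be computed, again assuming one saved text file per generation: everything after the last occurrence of the expected final line is counted as hallucinated, and a generation that never produces that line counts all of its lines:

    # Sketch: count lines emitted after the expected final line of the answer
    import glob
    import statistics

    expected_end = "model.save_pretrained(out_dir)"
    extra = []
    for path in glob.glob("inference_outputs/*.txt"):  # hypothetical output directory
        lines = open(path).read().splitlines()
        ends = [i for i, line in enumerate(lines) if expected_end in line]
        extra.append(len(lines) - ends[-1] - 1 if ends else len(lines))

    print(f"mean {statistics.mean(extra):.2f}  std {statistics.stdev(extra):.2f}  "
          f"min {min(extra)}  max {max(extra)}")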

I'm using a cosine LR scheduler; I attached the LR graph below (51 steps per epoch).

[Image: LR schedule graph (Screenshot from 2024-04-13 10-32-47.png)]
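
For reference, the warmup plus cosine decay shown in the graph can be reproduced offline with transformers' scheduler helper and a dummy optimizer; the 2e-4 peak LR and 0.05 warmup ratio come from the training script attached below, and the 51 steps per epoch over 400 epochs from the numbers above:

    # Sketch: rebuild the LR curve outside of training
    import torch
    from transformers import get_cosine_schedule_with_warmup

    steps_per_epoch, num_epochs, peak_lr = 51, 400, 2e-4
    total_steps = steps_per_epoch * num_epochs
    optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=peak_lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),  # warmup_ratio = 0.05
        num_training_steps=total_steps,
    )

    lrs = []
    for _ in range(total_steps):
        lrs.append(scheduler.get_last_lr()[0])
        optimizer.step()
        scheduler.step()
    print(f"peak LR {max(lrs):.2e}, final LR {lrs[-1]:.2e}")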

Any help will be very welcome. Thanks!

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
sys_argv=['sample_reduced_train_Falcon.py', 
          'tiiuae/falcon-7b', 
          '400',
          'falcon_trained_model_v4d_script', 
          'python_huggingface_keras-nlp_QnA_dataset_demo_reduced_v2.csv', 'Question', 'Answer'] 

### this line and the ones above were introduced to show how I trained the model
### the following lines are used as an answer sample in the Q&A dataset

model_id, num_epochs, out_dir, dataset_path, input_col, output_col = sys_argv[1:7]

import transformers, torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def generate_prompt(data_point):
  return f"""
<Human>: {data_point[input_col]}
<AI>: {data_point[output_col]}
  """.strip()

def generate_and_tokenize_prompt(data_point):
  full_prompt = generate_prompt(data_point)
  tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
  return tokenized_full_prompt

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.use_cache = False

training_args = transformers.TrainingArguments(
    auto_find_batch_size=True,
    num_train_epochs=int(num_epochs),
    learning_rate=2e-4,
    bf16=True, 
    save_total_limit=500,
    output_dir=out_dir,
    save_strategy='epoch',
    optim="paged_adamw_8bit",
    lr_scheduler_type='cosine',
    warmup_ratio=0.05,
    gradient_checkpointing_kwargs={'use_reentrant':True}
)

dataset = load_dataset('csv', data_files=dataset_path, split="train")
dataset = dataset.shuffle().map(generate_and_tokenize_prompt)

trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

model.save_pretrained(out_dir)

BTW, regarding this code, why should I set a padding token at all? Won't the padding be masked out before being fed into the multi-head attention layers?

Using the EOS token as the pad token was causing Hugging Face to mask out the EOS token during training (pad positions are excluded from the loss), so the model never learns to emit it. As a workaround, I tried using a different token for padding. You can read more about it here: https://stackoverflow.com/questions/76633368/how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoi/77118714#77118714

However, it is still hallucinating at the end of the messages, so it doesn't seem to be a solution.
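
The masking effect itself is easy to check: DataCollatorForLanguageModeling (with mlm=False) sets the label of every pad-token position to -100, so when the pad token and the EOS token share the same id, the final EOS is excluded from the loss. A small sketch:

    # Sketch: with pad_token == eos_token, the collator masks the EOS label
    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    tokenizer.pad_token = tokenizer.eos_token

    enc = tokenizer("model.save_pretrained(out_dir)")
    enc["input_ids"].append(tokenizer.eos_token_id)  # append the EOS explicitly
    enc["attention_mask"].append(1)

    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    batch = collator([enc])
    print(batch["labels"][0][-1])  # tensor(-100): the EOS position is dropped from the loss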
