Why is there a dramatic drop in model performance?

#3
by AIdinner - opened

Hey bro, I am a big fan of yours. I'm wondering about the cause of the dramatic drop in your model's scores on the Open LLM Leaderboard. Could you please provide more information? I would really appreciate it!

I tried to reproduce your results on the LLM Leaderboard (TruthfulQA score of 70) with your QLoRA Python scripts and shell scripts to train airoboros-l2-70B-gpt4-1.4.1-qlora from https://gist.github.com/jondurbin/87fc040b92a3073125ed516b04bc6e19. However, the TruthfulQA score of my reproduced model is about 55, far from 70. I learnt from your model card that you excluded some de-censoring data and did not publish it. I have no idea where the difference comes from: is it the de-censoring data, or is the TruthfulQA score of 70 erroneous and the score of about 55 correct?

I discovered contamination in an earlier version, so I purged anything I could find in the datasets via similarity score and recreated the model; 55 or thereabouts is accurate.
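
To give a rough idea of what a similarity-score purge looks like, here is a minimal sketch. This is not the actual script used for airoboros; the file names, the "instruction" field, and the 0.85 threshold are all assumptions for illustration.

```python
# Hypothetical sketch of a similarity-based decontamination pass.
# Assumes training data in instructions.jsonl with an "instruction" field and
# TruthfulQA questions in a local truthfulqa_questions.txt (both assumed names).
import json
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.85  # assumed cutoff; the value actually used is not published

with open("truthfulqa_questions.txt") as f:
    benchmark_prompts = [line.strip().lower() for line in f if line.strip()]

def is_contaminated(text: str) -> bool:
    """Flag a training row whose instruction is too similar to any benchmark prompt."""
    text = text.lower()
    return any(
        SequenceMatcher(None, text, prompt).ratio() >= SIM_THRESHOLD
        for prompt in benchmark_prompts
    )

kept = []
with open("instructions.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if not is_contaminated(row.get("instruction", "")):
            kept.append(row)

with open("instructions.decontaminated.jsonl", "w") as f:
    for row in kept:
        f.write(json.dumps(row) + "\n")
```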

If you want very specific details, here's what happened, as far as I can tell:

  1. I included a new misconceptions generator, based on https://en.m.wikipedia.org/wiki/List_of_common_misconceptions, which naturally has some overlap with TruthfulQA.
  2. Many of the character cards used for stylized responses depict an AI assistant, and when you use GPT-4 to generate a response as that character, the answers will be very aligned.
  3. I always have a number of JSONL files for the various categories and merge/format them into a combined dataset before training, including the unpublished dealignment data. I'm guessing the "cat *.jsonl > instructions.jsonl" step was sloppy and grabbed benchmark data (a more careful merge is sketched after this list), but I always delete pods when done so I don't pay for storage, so I can't say for sure. I also didn't publish the combined file because it contained the dealignment rows.
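
For what it's worth, one way to make that merge step safer than a blind glob is to validate each row and skip anything that looks like benchmark data. This is only a sketch under assumed file names and markers, not the actual airoboros pipeline.

```python
# Hypothetical replacement for a blind "cat *.jsonl > instructions.jsonl" merge.
# File names and exclude markers are illustrative, not the real category layout.
import glob
import json

# Anything that looks like benchmark or eval data never makes it into the merge.
EXCLUDE_MARKERS = ("truthfulqa", "benchmark", "eval")

with open("instructions.jsonl", "w") as out:
    for path in sorted(glob.glob("*.jsonl")):
        if path == "instructions.jsonl" or any(m in path.lower() for m in EXCLUDE_MARKERS):
            continue
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                json.loads(line)  # validate each row is well-formed JSON before merging
                out.write(line + "\n")
```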

I requested that HF remove the scores as soon as they came through on 08/24, which they did, but there were some caching issues after the fixed version was retested, so it took a few days and a couple of PRs to address.

I like reproducibility though. I think what I can do, to avoid the possibility of the published training data not matching in the future, is to do my dealignment first on the base model without any of the other data, then use that as the base for fine-tuning rather than the default llama-2 base.

AIdinner changed discussion status to closed
