How to solve factual inconsistency when fine tuning

#20
by theekshana - opened

I fine-tuned the led-large-book-summary model on financial reports and was able to get a ROUGE-1 score close to 0.6.
But after manually checking each model-generated summary against the actual financial report, it turned out that most of the key information, especially the numbers, is wrong.

I would like to hear your experiences, workarounds, and ideas on this issue, even though it is not strongly related to the model itself.
Thanks

Hi! Thanks for reaching out. The bad news is that this is a largely unsolved problem. It tends to decrease (at least in text-generation models) when models are scaled to massive sizes and trained well. With smaller models this is harder, and in both cases it is still unsolved. See for example section 3.6 of the Diff Transformer paper, which just came out yesterday and demonstrates improvements on summarization, but the numbers are still rather low (at least for the case where the only acceptable level of hallucination is 0%).

image.png

Some comments:

  • is your ROUGE-1 score actually 60, or 0.6? ROUGE is typically reported as a 'percentage' from 0 to 100 (perfect), and transformers reports it this way, so 0.6 would be low on this scale
  • what are you using as your parameters for generating summaries? See a page I put together here a while back.
    • Generally I would recommend using beam search with num_beams 4 or higher.
    • Additionally, since you want exact numbers copied from your input, you should check whether you are setting encoder_no_repeat_ngram_size and, in this case, avoid it: it can prevent the model from repeating a number that appears in the source text (and therefore cause it to hallucinate)
  • LED is a largely outdated model for many reasons. If your use case is okay with the OpenAI usage terms, I would recommend fine-tuning this pegasus-x model I trained on a variety of documents and GPT-4 generated summaries.
  • a potentially low-hanging fruit to increase accuracy (other than what I've said above) is DoLa decoding, which is supposed to reduce hallucinations. I am unsure if anyone has tried it for text2text models though, so if you do try it I would be curious as to what you find.
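
To make the encoder_no_repeat_ngram_size point concrete, here is a simplified pure-Python sketch of the kind of ban that setting applies at each decoding step. It is not the transformers implementation: the function names are made up for illustration, and the token ids are the ones from the "$55282384" tokenizer output shown later in this thread (without the special tokens).

```python
from collections import defaultdict


def source_ngram_table(source_ids, n):
    """Map each (n-1)-token prefix seen in the source to the tokens that follow it."""
    table = defaultdict(set)
    for i in range(len(source_ids) - n + 1):
        ngram = tuple(source_ids[i:i + n])
        table[ngram[:-1]].add(ngram[-1])
    return table


def banned_next_tokens(source_ids, generated_ids, n):
    """Tokens that would complete an n-gram that already occurs in the source."""
    if n <= 0 or len(generated_ids) < n - 1:
        return set()
    table = source_ngram_table(source_ids, n)
    # Look at the last (n-1) generated tokens and ban every source continuation.
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    return set(table.get(prefix, set()))


# Source tokens: '$', '55', '28', '23', '84'
source = [1629, 3118, 2517, 1922, 6232]
# The summary has copied '$', '55' so far.
generated = [1629, 3118]
print(banned_next_tokens(source, generated, n=3))  # {2517}, i.e. '28' is banned
```

With a ban like this in place, the model cannot finish copying the number and has to emit some other token instead, which is one way a wrong figure can end up in the summary.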

more detail

In general, recalling numbers is hard because of tokenization. [warning: theory] An idea I have is that hallucinations and poor understanding of numbers are at least partially caused by tokenization, especially when the tokenizer (and the model that was trained with it) does not treat each digit as a separate token. See below: the model 'sees' "55282384" as a concatenation of several different tokens:

In [5]: inputs = tokenizer.encode("$55282384")

In [6]: inputs
Out[6]: [0, 1629, 3118, 2517, 1922, 6232, 2]

In [7]: tokenizer.convert_ids_to_tokens(inputs)
Out[7]: ['<s>', '$', '55', '28', '23', '84', '</s>']

This split is somewhat arbitrary, which makes numbers hard for the model to keep track of. As far as I know, there is no long-context text2text model that has been pretrained with a tokenizer that would split the number into digits by itself.
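
For illustration only, this is roughly what a digit-level split would look like if done as a text preprocessing step before tokenization. The helper name is hypothetical, and whether this helps a model that was not pretrained this way is untested here; it also changes the text the model sees, so the same transform would need to be applied to both the inputs and the target summaries.

```python
import re


def split_digits(text):
    """Insert a space between the digits of every run of digits."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


print(split_digits("$55282384"))  # $5 5 2 8 2 3 8 4
```

Note that undoing the split on the generated summary is ambiguous when two separate numbers sit next to each other, so this is a sketch of the idea rather than a ready-made fix.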

Thank you, we really appreciate your response.

The best ROUGE-1 score we got was 60%. For the dataset, we used sections of 1000 annual reports and created the summary labels using GPT-4o.
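
On the scale question, and on why a decent ROUGE-1 can coexist with wrong numbers, here is a minimal sketch of a unigram-overlap F1. It assumes plain whitespace tokenization and no stemming, so it will not match the scores of a real ROUGE implementation exactly; the function name and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two texts, as a fraction in [0, 1]."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "net revenue rose to 55282384 dollars in 2023"
candidate = "net revenue rose to 55282348 dollars in 2023"  # last two digits swapped
score = rouge1_f1(reference, candidate)
print(f"fraction: {score:.3f}  percentage: {100 * score:.1f}")
# fraction: 0.875  percentage: 87.5
```

The only wrong token is the number itself, and it costs a single unigram, so an overlap metric barely registers the error that matters most in a financial summary.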

  • Here is the evaluation result on a set of annual report sections.
    image.png

We then checked the summaries from both GPT-4o and LED. The factual accuracy of GPT-4o was pretty high (almost all numbers were correct), while for LED it was really low (almost all numbers were wrong compared to the original content and the GPT-4o-generated summary).
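
A check like this can be partly automated. The sketch below is a rough heuristic, not the procedure used for the results above: it flags every number in a summary that does not appear verbatim (ignoring thousands separators) in the source. The function names and example texts are made up for illustration, and it does not handle rounding, unit changes such as "55.3 million", or trailing zeros.

```python
import re

# A digit, then digits/commas, then an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numbers in the text, with commas removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(source, summary):
    """Numbers that appear in the summary but nowhere in the source."""
    return sorted(extract_numbers(summary) - extract_numbers(source))


source = "Revenue was $55,282,384 in 2023, up 3.5% from 2022."
summary = "Revenue was $55,282,348 in 2023, up 3.5%."
print(unsupported_numbers(source, summary))  # ['55282348']
```

The share of summaries with an empty result could be tracked alongside ROUGE during fine-tuning as a cheap proxy for numeric faithfulness.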

  • We already fine tuned LED without encoder_no_repeat_ngram_size.

  • Here are the different parameter combinations we tried.
    cf6f1e4f-8850-41e6-8ba6-d417b14c4d6c.jpg

  • Best ROUGE-1 results were from
    4c82316b-ab24-477c-a1a5-89b3b849874c.jpg

We previously tried two of your models, long-t5-tglobal-base-16384-book-summary and pszemraj/pegasus-x-large-book-summary, but their ROUGE-1 scores were low compared to LED.

  • some results for pegasus-x-large-book-summary
    image.png

  • some results for long-t5-tglobal-base-16384-book-summary
    image.png

  • some results for long-t5-tglobal-base
    image.png

As you suggested, we will try the pszemraj/pegasus-x-large-book_synthsumm model as a next step.

Our end goal is to have a quality open source model to summarize financial documents.
