GPT-3.5 Spider contamination based on https://arxiv.org/pdf/2402.08100

#18
by bpHigh - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.
xlangai/spider
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
GPT-3.5

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

The paper introduces a contamination-detection method for Text-to-SQL datasets that measures a model's prior knowledge of the SQL database dumps contained in these datasets. The clue that contamination has occurred is that the model can reconstruct deliberately removed information about the database schema. The paper shows that data contamination is responsible for overestimating GPT-3.5's performance on Text-to-SQL, and clearly demonstrates that GPT-3.5 possesses prior knowledge of the contents of the Spider validation set, in contrast to its ignorance of Termite, the authors' newly constructed, unseen Text-to-SQL dataset.
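The probe described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a schema's column names have been masked, the model has been asked to fill them in, and we score the fraction it reconstructed exactly (the "DC-accuracy" discussed later in this thread). The table and column names are a made-up example in the style of Spider.

```python
# Minimal sketch of the schema-reconstruction probe (assumption: exact-match
# scoring over masked column names; the paper's actual protocol may differ).

def dc_accuracy(gold_columns, predicted_columns):
    """Fraction of masked columns the model reconstructed exactly
    (case-insensitive, position-aligned)."""
    gold = [c.lower() for c in gold_columns]
    pred = [c.lower() for c in predicted_columns]
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold) if gold else 0.0

# Toy example: the model recovers 3 of 4 masked columns of a table,
# suggesting prior exposure to the schema.
gold = ["singer_id", "name", "country", "age"]
pred = ["singer_id", "name", "nation", "age"]
score = dc_accuracy(gold, pred)  # 0.75
```

A model that has never seen the database dump should score near chance on such reconstructions, which is the contrast the paper draws between Spider and Termite.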

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/pdf/2402.08100
Citation:

@misc{ranaldi2024investigating,
  title={Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation},
  author={Federico Ranaldi and Elena Sofia Ruzzetti and Dario Onorati and Leonardo Ranaldi and Cristina Giannone and Andrea Favalli and Raniero Romagnoli and Fabio Massimo Zanzotto},
  year={2024},
  eprint={2402.08100},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Workshop on Data Contamination org

Hi @bpHigh ,

Thank you for your contribution!

To be added to the database, we need some estimate of the amount of contamination. For previous entries that had proof of contamination but no information about the amount, we have put 100% as the worst-case scenario. In this case, however, we do have some data, so I would suggest considering as contaminated the tables for which the share of correctly predicted columns is above 75%. What do you think?

We can also contact the authors to know their opinions.

Best,
Oscar

Yup, I think we can include DBs above 75%. I have added the percentage by computing the share of examples belonging to the 3 DB ids on which GPT-3.5 achieves more than 75% DC-accuracy.
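The aggregation agreed on above can be sketched as follows. This is an illustrative computation under stated assumptions, not the submitted numbers: a database is flagged as contaminated when its DC-accuracy exceeds 0.75, and the reported percentage is the share of evaluation examples whose db_id is flagged. The db ids and accuracies below are placeholders.

```python
# Hedged sketch: percentage of evaluation examples belonging to DBs whose
# DC-accuracy exceeds the 75% threshold. All data here is illustrative.
from collections import Counter

def contaminated_share(example_db_ids, dc_accuracy_by_db, threshold=0.75):
    """Percentage of examples whose db_id has DC-accuracy above `threshold`."""
    counts = Counter(example_db_ids)
    flagged = {db for db, acc in dc_accuracy_by_db.items() if acc > threshold}
    hit = sum(n for db, n in counts.items() if db in flagged)
    return 100.0 * hit / len(example_db_ids)

# Placeholder data: 3 of 4 DBs exceed the threshold.
db_ids = ["concert"] * 50 + ["pets"] * 30 + ["world"] * 15 + ["flights"] * 5
acc = {"concert": 0.90, "pets": 0.80, "world": 0.85, "flights": 0.40}
share = contaminated_share(db_ids, acc)  # 95.0
```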

Workshop on Data Contamination org

Hi @bpHigh !

Thank you for your contribution. Merging to main :)

Oscar

OSainz changed pull request status to merged
