GPT-3.5 Spider contamination based on https://arxiv.org/pdf/2402.08100

#18
by bpHigh - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.
xlangai/spider
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
GPT-3.5

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

The paper introduces a contamination-detection method for Text-to-SQL datasets that measures a model's prior knowledge of the SQL database dumps contained in these datasets. The clue that contamination has occurred is that the model can reconstruct deliberately removed information about the database schema. The paper shows that data contamination is responsible for overestimating GPT-3.5's performance on Text-to-SQL, and clearly demonstrates that GPT-3.5 possesses prior knowledge of the contents of the Spider validation set, in contrast to its ignorance of Termite, the authors' newly constructed, unseen Text-to-SQL dataset.
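The probe described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a schema's column names have been masked, the model has been asked to fill them in, and we score the fraction it reconstructed exactly (the "DC-accuracy" discussed later in this thread). The table and column names are a made-up example in the style of Spider.

```python
# Minimal sketch of the schema-reconstruction probe (assumption: exact-match
# scoring over masked column names; the paper's actual protocol may differ).

def dc_accuracy(gold_columns, predicted_columns):
    """Fraction of masked columns the model reconstructed exactly
    (case-insensitive, position-aligned)."""
    gold = [c.lower() for c in gold_columns]
    pred = [c.lower() for c in predicted_columns]
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold) if gold else 0.0

# Toy example: the model recovers 3 of 4 masked columns of a table,
# suggesting prior exposure to the schema.
gold = ["singer_id", "name", "country", "age"]
pred = ["singer_id", "name", "nation", "age"]
score = dc_accuracy(gold, pred)  # 0.75
```

A model that has never seen the database dump should score near chance on such reconstructions, which is the contrast the paper draws between Spider and Termite.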

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/pdf/2402.08100
Citation:

@misc{ranaldi2024investigating,
  title={Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation},
  author={Federico Ranaldi and Elena Sofia Ruzzetti and Dario Onorati and Leonardo Ranaldi and Cristina Giannone and Andrea Favalli and Raniero Romagnoli and Fabio Massimo Zanzotto},
  year={2024},
  eprint={2402.08100},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Workshop on Data Contamination org

Hi @bpHigh ,

Thank you for your contribution!

To be added to the database, we need some estimate of the amount of contamination. For previous entries that had proof of contamination but no information about the amount, we have put 100% as the worst-case scenario. In this case, however, we do have some data, so I would suggest considering as contaminated the tables for which the share of correctly predicted columns is above 75%. What do you think?

We can also contact the authors to know their opinions.

Best,
Oscar

Yup, I think we can include DBs above 75%. I have added the percentage by computing the share of examples belonging to the 3 DB ids on which GPT-3.5 achieves more than 75% DC-accuracy.
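The aggregation agreed on above can be sketched as follows. This is an illustrative computation under stated assumptions, not the submitted numbers: a database is flagged as contaminated when its DC-accuracy exceeds 0.75, and the reported percentage is the share of evaluation examples whose db_id is flagged. The db ids and accuracies below are placeholders.

```python
# Hedged sketch: percentage of evaluation examples belonging to DBs whose
# DC-accuracy exceeds the 75% threshold. All data here is illustrative.
from collections import Counter

def contaminated_share(example_db_ids, dc_accuracy_by_db, threshold=0.75):
    """Percentage of examples whose db_id has DC-accuracy above `threshold`."""
    counts = Counter(example_db_ids)
    flagged = {db for db, acc in dc_accuracy_by_db.items() if acc > threshold}
    hit = sum(n for db, n in counts.items() if db in flagged)
    return 100.0 * hit / len(example_db_ids)

# Placeholder data: 3 of 4 DBs exceed the threshold.
db_ids = ["concert"] * 50 + ["pets"] * 30 + ["world"] * 15 + ["flights"] * 5
acc = {"concert": 0.90, "pets": 0.80, "world": 0.85, "flights": 0.40}
share = contaminated_share(db_ids, acc)  # 95.0
```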

Workshop on Data Contamination org

Hi @bpHigh !

Thank you for your contribution. Merging to main :)

Oscar

OSainz changed pull request status to merged
