Should indirect data leakages be included in the Data Contamination Database?
A paper presented at EACL 2024 (https://aclanthology.org/2024.eacl-long.5.pdf) addresses the issue of indirect data contamination in closed-source LLMs. The authors conduct a systematic literature review of papers that used OpenAI's ChatGPT, GPT-3.5, or GPT-4 to modify or generate data on top of benchmark datasets, or for GPT-based evaluation. Taking OpenAI's data usage policy into account, they assess how much benchmark data was reported to have been sent to the models in a way that could be used for further training, potentially giving the models an unfair advantage during evaluation in future or iterative versions.
I just wanted to clarify whether we should include these datasets in the contamination database, since they have not been verified to be present in the OpenAI GPT-series models; their presence is only speculated, based on OpenAI's usage policy of storing certain user interactions for future training purposes.
If they can be included, I can open a PR; I just wanted to check first with the shared task organizers and the maintainers of the Data Contamination Database.
Hi @bpHigh,
It is a very interesting paper (it received the best paper award at EACL!). However, we discussed the topic internally and decided that the data being on OpenAI's servers is not sufficient proof of contamination, because they may or may not have used it for training. Still, I think it is worth keeping the discussion open so people can comment and give their points of view. This is not the first time this topic has come up, and it may be relevant in the future.
Thank you for your comments,
Oscar