j-chim committed on
Commit
2a036b2
•
1 Parent(s): e1c863c

Add model-based results for MedNLI, RadNLI for GPT-3.5 and GPT-4


This PR adds (negative) data contamination results for MedNLI and RadNLI.

Similar to earlier PRs (e.g. PR 3), this follows the method outlined in [Golchin and Surdeanu 2024](https://arxiv.org/pdf/2308.08493.pdf) to evaluate GPT-3.5 and GPT-4. The only differences in the implementation are that (1) multiple runs were performed on each split, each on a different data partition, and (2) the models were accessed through Azure OpenAI (opted out of human review and HIPAA-compliant), following MIMIC's DUA.

A sanitized version of the results that keeps the data index, label, outputs, and contamination evaluation results __without original input sentences__ can be found [here](https://github.com/j-chim/time-travel-in-llms/tree/main/results).
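As a rough illustration of what such sanitization could look like, the sketch below drops the input-sentence fields from a result record and keeps everything else. The field names (`sentence1`, `sentence2`, `is_contaminated`, and so on) are hypothetical and chosen only for this example; the actual result files in the linked repository may use a different layout.

```python
from typing import Any

# Hypothetical names for the fields holding the original input sentences.
# The real result files may use different keys.
INPUT_FIELDS = {"sentence1", "sentence2"}


def sanitize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` without the original input sentences."""
    return {key: value for key, value in record.items() if key not in INPUT_FIELDS}


# Placeholder record: no real MedNLI or RadNLI text is used here.
example = {
    "index": 0,
    "label": "entailment",
    "sentence1": "<original premise>",
    "sentence2": "<original hypothesis>",
    "output": "<model completion>",
    "is_contaminated": False,
}

sanitized = sanitize_record(example)
print(sorted(sanitized))
```

Filtering by an explicit set of input fields means any field not listed is kept, so the set needs to cover every field that can carry text restricted by the DUA.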

While the ROUGE-based contamination detection method identified some potential positives, the best-performing detector (GPT-4 ICL) did not consider these instances to be true contaminations. As such, this PR reports negative results (0% contamination for all splits of both datasets under the examined method).
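To make the overlap-based signal concrete, here is a minimal sketch of an LCS-based ROUGE-L F-measure between a reference continuation and a model completion. It assumes simple lowercased whitespace tokenization and is not the official ROUGE implementation, nor necessarily the exact scoring used for this PR. A high score only flags a candidate instance, which is why a second judgment such as GPT-4 ICL can still reject it.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 using lowercased whitespace tokens (an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences, unrelated to MedNLI or RadNLI.
score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```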

Files changed (1)
  1. contamination_report.csv +8 -2
contamination_report.csv CHANGED
@@ -446,6 +446,12 @@ nyu-mll/glue;wnli;GPT-3.5;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.
  samsum;;GPT-4;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.08493;3
  samsum;;GPT-3.5;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.08493;3

- EdinburghNLP/xsum;;GPT-4;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;3
- EdinburghNLP/xsum;;GPT-3.5;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;3
+ EdinburghNLP/xsum;;GPT-4;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
+ EdinburghNLP/xsum;;GPT-3.5;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
+
+ MedNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+ MedNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+
+ RadNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+ RadNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;

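For readers who want to inspect the added rows programmatically, the following sketch parses semicolon-delimited report lines with the standard `csv` module. The column names are assumptions made for this example, since the header row of `contamination_report.csv` is not shown in this diff. In particular, reading the three numeric fields as train, development, and test percentages is a guess based on the PR description.

```python
import csv
import io
from typing import Optional

# Assumed column names; the real header is not part of this diff.
COLUMNS = [
    "dataset", "subset", "model", "source_type",
    "train", "dev", "test", "approach", "reference", "pr",
]

# Two of the rows added in this PR, copied from the diff.
ROWS = """\
MedNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
RadNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
"""


def parse_report(text: str) -> list[dict[str, str]]:
    """Parse semicolon-delimited report rows into dicts, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def to_percent(value: str) -> Optional[float]:
    """Convert a percentage field to float; empty fields mean 'not reported'."""
    return float(value) if value else None


records = parse_report(ROWS)
for rec in records:
    print(rec["dataset"], rec["model"], to_percent(rec["test"]))
```

Mapping empty fields to `None` rather than `0.0` keeps "not evaluated" distinct from "evaluated with 0% contamination", a distinction the report relies on for splits that were not examined.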