emilys committed on
Commit
ec46e45
•
1 Parent(s): 473e687

Superglue/RealNews Contamination based on "Noise-Robust De-Duplication at Scale"


## What are you reporting:
- [X] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

**Evaluation dataset(s)**: `superglue`

**Contaminated corpora**: `allenai/c4` - we only look at the realnewslike variant

**Contaminated split(s)**:

| Subset | Contamination |
| -------- | ------- |
| `super_glue (boolq)` | 0.6% |
| `super_glue (cb)` | 0.0% |
| `super_glue (copa)` | 0.0% |
| `super_glue (multirc)` | 1.2% |
| `super_glue (record)` | 7.3% |
| `super_glue (rte)` | 1.1% |
| `super_glue (wic)` | 0.0% |
| `super_glue (wsc)` | 0.0% |

## Briefly describe your method to detect data contamination

- [X] Data-based approach
- [ ] Model-based approach

We contrastively train a bi-encoder on noisy duplicates. This neural approach finds many duplicates that rule-based approaches such as hashing miss.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/61654589b5ec555e8e9c203a/c6bY4_HtU5scdcDeVL3jT.png)
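
As a rough illustration of how an embedding-based duplicate search can yield a contamination percentage, here is a minimal sketch. It is not the trained bi-encoder from the paper: `embed` is a toy character n-gram stand-in, and the `0.8` threshold, the helper names, and the definition of the rate (share of evaluation texts with at least one near-duplicate in the corpus) are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a trained bi-encoder: character n-gram counts."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contamination_rate(eval_texts, corpus_texts, threshold=0.8):
    """Percentage of eval texts with a corpus text at or above the threshold.

    The threshold value is an assumption chosen for this toy example.
    """
    corpus_vecs = [embed(t) for t in corpus_texts]
    hits = 0
    for text in eval_texts:
        vec = embed(text)
        if any(cosine(vec, c) >= threshold for c in corpus_vecs):
            hits += 1
    return 100.0 * hits / len(eval_texts) if eval_texts else 0.0


eval_texts = [
    "The city council approved the new budget on Tuesday.",
    "A completely unrelated sentence about marine biology.",
]
corpus_texts = [
    # Noisy (OCR-like) duplicate that exact hashing would not match.
    "The city counci1 approved the new budget on Tuesday .",
    "Stocks fell sharply after the announcement.",
]

rate = contamination_rate(eval_texts, corpus_texts)
print(f"{rate:.1f}% of evaluation texts have a near-duplicate in the corpus")
```

The noisy copy differs from the evaluation text in one character and in spacing, so an exact hash of the two strings would differ while their similarity score stays high. A learned encoder plays the same role at scale, with a nearest-neighbour index replacing the brute-force loop.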


## Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URLs: https://openreview.net/forum?id=bAz2DBS35i, https://arxiv.org/abs/2210.04261
Citation:
```
@inproceedings{silcock-etal-2020-noise,
title = "Noise-Robust De-Duplication at Scale",
author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
booktitle = "International Conference on Learning Representations (ICLR)",
year = "2023",
}
```


*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full names: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
- Institution: Harvard University
- Email: emilysilcock@fas.harvard.edu, ldamicowong@college.harvard.edu, melissadell@fas.harvard.edu

Files changed (1)
  1. contamination_report.csv +9 -0
contamination_report.csv CHANGED
@@ -597,3 +597,12 @@ ibragim-bad/arc_challenge;;FLAN;model;;15.6;;data-based;https://arxiv.org/abs/21
 facebook/anli;dev_r3;FLAN;model;;40.2;;data-based;https://arxiv.org/abs/2109.01652;13
 facebook/anli;dev_r2;FLAN;model;;97.9;;data-based;https://arxiv.org/abs/2109.01652;13
 facebook/anli;dev_r1;FLAN;model;;98.6;;data-based;https://arxiv.org/abs/2109.01652;13
+
+super_glue;boolq;allenai/c4 (realnewslike);corpus;;;0.6;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;cb;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;copa;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;multirc;allenai/c4 (realnewslike);corpus;;;1.2;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;record;allenai/c4 (realnewslike);corpus;;;7.3;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;rte;allenai/c4 (realnewslike);corpus;;;1.1;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;wic;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;wsc;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
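
For anyone checking the added rows against the table in the report, the following is a small sketch that parses them with Python's `csv` module. The column meanings are an assumption read off the rows themselves (dataset, subset, contaminated source, source type, three per-split percentage fields, approach, reference, PR number); the header of `contamination_report.csv` is not shown in this diff.

```python
import csv
import io

# Rows copied from the diff above; the file is semicolon-delimited.
ROWS = """\
super_glue;boolq;allenai/c4 (realnewslike);corpus;;;0.6;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;cb;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;copa;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;multirc;allenai/c4 (realnewslike);corpus;;;1.2;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;record;allenai/c4 (realnewslike);corpus;;;7.3;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;rte;allenai/c4 (realnewslike);corpus;;;1.1;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;wic;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
super_glue;wsc;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
"""


def load_rows(text):
    """Parse semicolon-delimited rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [row for row in reader if row]


rows = load_rows(ROWS)

# Assumed layout: fields 4-6 are per-split percentages; each row here fills one.
contamination = {row[1]: float(next(v for v in row[4:7] if v)) for row in rows}

# Subsets with non-zero contamination, highest first.
flagged = sorted(
    (subset for subset, pct in contamination.items() if pct > 0),
    key=contamination.get,
    reverse=True,
)

for subset in flagged:
    print(f"super_glue ({subset}): {contamination[subset]}%")
```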