Introduction to LLE

Community blog post
Published February 1, 2024

Introduction

Over the last few weeks I've been interested in the NER task applied to French. And do you know what? I ran into LLE: Leaks, Leaks Everywhere.
Wondering whether this phenomenon was unique to French, or to the NER task, I decided to download several hundred datasets from the Hugging Face Hub and evaluate their quality. You won't like what you read below.



Methodology

There are over 100,000 datasets on the HF Hub, far too many to evaluate them all. So, to get results quickly in this little experiment (I only spent 20 hours on it), I set a limited scope. Namely, I concentrated on "canonical" datasets, i.e. datasets not belonging to an organization on the Hub (this is changing!); among them, those with fewer than 1M rows, so that they are quick to download and analyze; and finally I focused on NLP.

Among these datasets, I then excluded those requiring a manual download of data from a source external to the Hub, gated datasets, and finally datasets without a test split (calculating leaks wouldn't make sense otherwise).

In the end, a total of 596 datasets were analyzed.
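For reference, a "canonical" dataset can be recognized by the absence of an owner prefix in its id (no slash). A minimal sketch of such a filter, with illustrative ids (is_canonical is a name I'm introducing, not one from the study's code):

```python
# "Canonical" datasets have no owner prefix (no '/' in the id),
# unlike organization or user repositories.
def is_canonical(dataset_id: str) -> bool:
    return "/" not in dataset_id

ids = ["conll2003", "nyu-mll/glue", "squad", "allenai/c4"]
canonical = [i for i in ids if is_canonical(i)]
print(canonical)  # → ['conll2003', 'squad']
```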

Below, I use the conll2003 dataset to illustrate how I determine whether or not a leak is present in a dataset.

from datasets import load_dataset

dataset = load_dataset("conll2003")

# Serialize the model input (the list of tokens) so that rows can be
# compared as strings across splits
def concatenate_columns(example):
    return {'concatenated_column': str(example['tokens'])}

dataset = dataset.map(concatenate_columns)

train_inputs = dataset["train"]["concatenated_column"]
val_inputs = dataset["validation"]["concatenated_column"]
test_inputs = dataset["test"]["concatenated_column"]

# Exact-match leaks: rows of the test split also present in train/validation
leakage_train = set(train_inputs).intersection(set(test_inputs))
leakage_validation = set(val_inputs).intersection(set(test_inputs))

print("Leakage between train split and test split:", len(leakage_train))
print("Leakage between validation split and test split:", len(leakage_validation))
print("Duplicated lines in the train split:", len(train_inputs) - len(set(train_inputs)))
print("Duplicated lines in the validation split:", len(val_inputs) - len(set(val_inputs)))
print("Duplicated lines in the test split:", len(test_inputs) - len(set(test_inputs)))

Code that returns:

Leakage between train split and test split: 78
Leakage between validation split and test split: 25
Duplicated lines in the train split: 1350
Duplicated lines in the validation split: 180
Duplicated lines in the test split: 269

Here we can see that, in addition to the leaks between the train and test splits and between the validation and test splits, I also calculate the number of duplicated rows in each split.
I haven't calculated the leaks between the train and validation splits, so this may be an omission.

In practice, I don't print these results but store them directly in a dataframe, which you can find below. I also calculate and store the percentage of the test split that is biased, i.e. (leaks between train and test splits + leaks between validation and test splits + duplicated rows in the test split) / test split length.
In the context of conll2003, this gives (78+25+269)/3453*100 = 10.77%.
Clearly, this is a dataset that should not be used as a benchmark unless it is corrected.

Note that I then repeat exactly the same procedure, but instead of analyzing only the input that will be supplied to the model, I also look at the labels:

from datasets import load_dataset

dataset = load_dataset("conll2003")

# This time, serialize the input together with its labels so that rows
# are compared on both text and annotation
def concatenate_columns(example):
    return {'concatenated_column': str(example['tokens']) + " " + str(example['ner_tags'])}

dataset = dataset.map(concatenate_columns)

train_inputs = dataset["train"]["concatenated_column"]
val_inputs = dataset["validation"]["concatenated_column"]
test_inputs = dataset["test"]["concatenated_column"]

leakage_train = set(train_inputs).intersection(set(test_inputs))
leakage_validation = set(val_inputs).intersection(set(test_inputs))

print("Leakage between train split and test split:", len(leakage_train))
print("Leakage between validation split and test split:", len(leakage_validation))
print("Duplicated lines in the train split:", len(train_inputs) - len(set(train_inputs)))
print("Duplicated lines in the validation split:", len(val_inputs) - len(set(val_inputs)))
print("Duplicated lines in the test split:", len(test_inputs) - len(set(test_inputs)))

Code that returns:

Leakage between train split and test split: 73
Leakage between validation split and test split: 23
Duplicated lines in the train split: 1348
Duplicated lines in the validation split: 179
Duplicated lines in the test split: 266

You can see that the numbers vary slightly from the previous case. I also save these numbers in the final dataframe. What I'm most interested in here is the difference between the two cases, as it highlights not only leaks or duplications but also annotation problems in the data (for the train and validation splits, that is; nothing can be concluded about annotation problems within the test split itself).
In the context of conll2003, the resulting value is 8.
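One way to quantify this gap is to sum, per split, the absolute difference between the text-only and text-plus-label leak counts (a sketch with illustrative numbers; the study's exact bookkeeping may differ slightly):

```python
# When the same input leaks with different labels, it disappears from the
# text+label leak count but not from the text-only count: the gap hints
# at inconsistent annotations.
def annotation_difference(text_leaks_train, text_label_leaks_train,
                          text_leaks_val, text_label_leaks_val):
    return (abs(text_leaks_train - text_label_leaks_train)
            + abs(text_leaks_val - text_label_leaks_val))

print(annotation_difference(10, 7, 4, 4))  # → 3
```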

The process described above is then applied to all the datasets considered.
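Put together, the per-dataset audit boils down to a small helper over the three concatenated columns (audit_splits is a hypothetical name, operating here on plain lists of strings):

```python
def audit_splits(train, val, test):
    """Exact-match leak and duplication counts for one dataset."""
    train_set, val_set, test_set = set(train), set(val), set(test)
    return {
        "leaks_train_wrt_test": len(train_set & test_set),
        "leaks_validation_wrt_test": len(val_set & test_set),
        "duplication_train": len(train) - len(train_set),
        "duplication_validation": len(val) - len(val_set),
        "duplication_test": len(test) - len(test_set),
    }

# Toy example: "a" leaks from train to test; "b" and "d" are duplicated
report = audit_splits(["a", "b", "b"], ["c"], ["a", "d", "d"])
print(report)
```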


Results

Data presentation

All the results have been centralized in a dataset that you can find on the Hugging Face Hub.

Check out the Hugging Face viewer for a version that's easier on the eyes

The columns in the dataset should be interpreted as follows:

  • dataset_name: the name of the analyzed dataset. If a comma is present in the name, the dataset contains several subsets: for example, glue,sst2 means that the sst2 subset of glue is being analyzed. If parentheses are present in the name, only the label within the parentheses is analyzed, since a dataset may contain several columns usable as labels (potentially for several different tasks). Taking the conll2003 example, it contains the ner_tags and pos_tags columns, both token-classification tasks, but one for NER and the other for POS. The results of the above code are therefore found in the row conll2003(ner_tags), to differentiate it from conll2003(pos_tags).
  • text_leaks_train_wrt_test: the number of leaks between the train split and the test split, considering only the input that will be given to the model. In the case of conll2003(ner_tags), the column value is 78.
  • text_leaks_valididation_wrt_test: the number of leaks between the validation split and the test split, considering only the input that will be given to the model. In the case of conll2003(ner_tags), the column value is 25.
  • text_duplication_train: the number of data duplicates in the train split considering only the input that will be given to the model. In the case of conll2003(ner_tags), the column value is 1350.
  • text_duplication_valididation: the number of data duplicates in the validation split considering only the input that will be given to the model. In the case of conll2003(ner_tags), the column value is 180.
  • text_duplication_test: the number of data duplicates in the test split considering only the input that will be given to the model. In the case of conll2003(ner_tags), the column value is 269.
  • text_test_biased: percentage of the test split that is biased, i.e. the sum of the values of text_leaks_train_wrt_test, text_leaks_valididation_wrt_test and text_duplication_test, divided by the length of the test split. In the case of conll2003(ner_tags), the column value is 10.049%.
  • text_and_label_leaks_train_wrt_test: the number of leaks between the train split and the test split, considering the concatenation of the input and the label that will be given to the model. In the case of conll2003(ner_tags), the column value is 73.
  • text_and_label_leaks_valididation_wrt_test: the number of leaks between the validation split and the test split, considering the concatenation of the input and the label that will be given to the model. In the case of conll2003(ner_tags), the column value is 23.
  • text_and_label_duplication_train: the number of data duplicates in the train split, considering the concatenation of the input and the label that will be given to the model. In the case of conll2003(ner_tags), the column value is 1348.
  • text_and_label_duplication_valididation: the number of data duplicates in the validation split, considering the concatenation of the input and the label that will be given to the model. In the case of conll2003(ner_tags), the column value is 179.
  • text_and_label_duplication_test: the number of data duplicates in the test split, considering the concatenation of the input and the label that will be given to the model. In the case of conll2003(ner_tags), the column value is 266.
  • text_and_label_test_biased: percentage of the test split that is biased, i.e. the sum of the values of text_and_label_leaks_train_wrt_test, text_and_label_leaks_valididation_wrt_test and text_and_label_duplication_test, divided by the length of the test split. In the case of conll2003(ner_tags), the column value is 9.818%.
  • difference_annotation_in_train_and_valididation_splits: the annotation difference derived from comparing text_leaks_train_wrt_test with text_and_label_leaks_train_wrt_test, and likewise for the validation split. In the case of conll2003(ner_tags), the column value is 8.

Finally, please note that if a cell contains the term "NR", the calculation in question is irrelevant (in fact, simply impossible) to perform. The most common case is a dataset with only a train and a test split: the text_duplication_valididation column, for example, cannot be calculated, so "NR" is indicated.

Observations

Now that we've presented the data, let's see what we get.

Let's start with train and validation splits:

Splits        Leaks (text)   Leaks (text and label)   Duplications (text)   Duplications (text and label)
Train         54.530%        46.477%                  73.322%               69.295%
Validation*   48.544%        39.806%                  55.922%               55.922%

*Calculated on non-"NR" cells

We can thus see that over 46.5% of the analyzed datasets have leaks between their train and test splits, and 39.8% between their validation and test splits (among datasets that have a validation split).
At the same time, 69.3% of the analyzed datasets have duplicated data in their train split, and 55.9% in their validation split.


Let's continue with the test split:

Split   Duplication (text)   Duplication (text and label)   Bias (text)   Bias (text and label)
Test    56.544%              50.503%                        66.611%       60.570%

More than half of the test splits contain duplicate data and 60% contain a bias (duplication or leaks or both).


We thus have a majority of datasets containing biases in their test split. But are these datasets only slightly biased (e.g. less than 0.1%, which would have little impact on the benchmarks) or, on the contrary, massively biased (e.g. 10%, which would render the benchmarks meaningless)? The table below provides a little more granularity:

Bias threshold (% of the test split)   % of datasets with at least that bias in their test split
0.1 61.242
0.2 57.215
0.3 52.852
0.4 49.664
0.5 45.638
0.6 42.953
0.7 41.443
0.8 39.597
0.9 35.906
1 34.564
1.5 27.517
2 24.161
2.5 22.819
3 21.141
4 19.631
5 18.456
10 13.926
20 9.228
30 7.718
40 7.215
50 6.040
60 5.201
70 4.362
80 4.195
90 3.188
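The table above can be reproduced from the per-dataset bias percentages with a small helper (share_at_least is a hypothetical name, and the values below are made-up, not the study's data):

```python
# Share of datasets whose test-split bias reaches a given threshold
def share_at_least(bias_percentages, threshold):
    hits = sum(1 for b in bias_percentages if b >= threshold)
    return hits / len(bias_percentages) * 100

biases = [0.0, 0.05, 0.3, 1.2, 12.0]  # made-up per-dataset bias values
print(share_at_least(biases, 0.1))  # → 60.0
```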

Finally, we suspect that 33.054% of datasets may contain annotation problems.


Conclusion and outlook

The aim of this experiment was to quickly assess the quality of some of the datasets available on the Hugging Face Hub.
It appears that datasets without duplications or leaks are a minority of the cases examined.

More substantial and time-consuming work is required.
Firstly, more datasets need to be analyzed in order to refine this initial picture.
Secondly, problematic datasets would have to be corrected by deleting duplicate data and leaks.
Thirdly, models trained on biased datasets would probably need to be rerun on the corrected datasets from the second point, in order to estimate their actual performance level.


Finally, let's conclude with two points.
The first is an invitation to the reader to be vigilant: if you concatenate datasets, a row in the train split of dataset A may be absent from A's test split but present in dataset B's test split, creating a leak in the combined A+B dataset. The same logic applies to duplicated data.
Don't forget to clean up your datasets when you do this.
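To make the pitfall concrete, here is a toy sketch with plain lists standing in for dataset rows (all names and values are hypothetical):

```python
# Hypothetical toy splits for datasets A and B
train_a, test_a = ["x1", "x2"], ["y1"]
train_b, test_b = ["z1"], ["x2", "z2"]  # "x2" also sits in A's train split

# Naive concatenation creates a train/test leak in A+B
train_ab = train_a + train_b
test_ab = test_a + test_b
print(set(train_ab) & set(test_ab))  # → {'x2'}

# One possible cleanup: drop test rows already seen in training
clean_test_ab = [x for x in test_ab if x not in set(train_ab)]
print(clean_test_ab)  # → ['y1', 'z2']
```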

The second point is a call to action.
Couldn't we have a bot that scans the datasets uploaded to the Hub? It would indicate whether a dataset is reliable and, if not, report its number of leaks and duplicated rows.
How is it that in 2024 we're still training models on datasets whose quality has not been verified?
Do we really deserve to be called data scientists while we're working with so much noise?