Spaces:
Running
Running
metadata
id: duckdb-leakage-detection
title: Leakage Detection
slug: duckdb-leakage-detection-query
description: >-
Detect data leakage between train and test datasets by calculating the overlap
percentage.
code: |
WITH
overlapping_rows AS (
SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
(overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
DuckDB Leakage Detection Query
This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.
WITH
overlapping_rows AS (
SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
(overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
There is a very good in depth explanation of leakage in public datasets in this article.
Example Query
You can check out this link for the leakage query for the IMDB dataset.