cfahlgren1's picture
cfahlgren1 HF staff
init
db39944
|
raw
history blame
2.46 kB
metadata
id: duckdb-leakage-detection
title: Leakage Detection
slug: duckdb-leakage-detection-query
description: >-
  Detect data leakage between train and test datasets by calculating the overlap
  percentage.
code: |
  WITH
      overlapping_rows AS (
          SELECT COUNT(*) AS overlap_count
          FROM train
          INTERSECT
          SELECT COUNT(*) AS overlap_count
          FROM test
      ),
      total_unique_rows AS (
          SELECT COUNT(*) AS total_count
          FROM (
              SELECT * FROM train
              UNION
              SELECT * FROM test
          ) combined
      )
  SELECT
      overlap_count,
      total_count,
      (overlap_count * 100.0 / total_count) AS overlap_percentage
  FROM overlapping_rows, total_unique_rows;

DuckDB Leakage Detection Query

This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.

WITH
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM train
        INTERSECT
        SELECT COUNT(*) AS overlap_count
        FROM test
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) combined
    )
SELECT
    overlap_count,
    total_count,
    (overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;

There is a very good in depth explanation of leakage in public datasets in this article.

Example Query

You can check out this link for the leakage query for the IMDB dataset.

leakage-query-imdb