Missing images

#8
by jack-etheredge - opened

Hi,
Several images listed in the training metadata file do not appear in the 500px resized images. Maybe they exist in the other image sizes. I haven't checked, as the tars are quite large and I'd rather not download several of them just to investigate. Is this expected? To be clear, most of the images are found correctly, so I don't think it's a malformed path issue on my end.
Best,
Jack

Here's a single example if it's helpful: 2021/Eryx_conicus/159904132.jpeg

There are quite a few. I could send a full list if desired, but I doubt that's any more useful.

Found 167988 files
Missing 14273 files
Missing 7.83% of files

import pandas as pd
from pathlib import Path

# Placeholders: replace with your local paths.
META_DIR = Path("<path your metadata is in>")
TRAIN_DATA_DIR = Path("<path your extracted 500px images are in>")

train_metadata = META_DIR / "SnakeCLEF2023-TrainMetadata-iNat.csv"
train_metadata = pd.read_csv(train_metadata)

found_files = 0
missing_files = 0
# Loop over the dataframe loaded above.
for image_path in train_metadata["image_path"]:
    # is_file() is False for absent paths as well as for directories.
    if (TRAIN_DATA_DIR / image_path).is_file():
        found_files += 1
    else:
        print(f"Missing file: {image_path}")
        missing_files += 1

print(f"Found {found_files} files")
print(f"Missing {missing_files} files")
# percentage missing:
print(f"Missing {missing_files / (found_files + missing_files) * 100:.2f}% of files")
Bohemian Visual Recognition Alliance org

@jack-etheredge ,

You gave me a headache with this one :D

First, some images may indeed be missing, because there were duplicates of the same image in the dataset. That is why we state in the dataset description that some images are missing. It should be somewhere around 1-2k, though, not the 14273 you reported.

Even though the code seems correct at first glance, there is most likely something wrong with it.
How could you find 167988 images in your folder? The SnakeCLEF2023-TrainMetadata-iNat.csv file has just 154301 rows, i.e., images.
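The row-count check described here can be automated. Below is a minimal sketch, assuming the metadata paths are relative to the extraction root; the helper names `split_found_missing` and `check_totals` are hypothetical and not part of the dataset tooling.

```python
from pathlib import Path


def split_found_missing(image_paths, root):
    """Split relative paths into those present under root and those absent."""
    root = Path(root)
    found, missing = [], []
    for rel_path in image_paths:
        # is_file() is False for absent paths as well as for directories.
        if (root / rel_path).is_file():
            found.append(rel_path)
        else:
            missing.append(rel_path)
    return found, missing


def check_totals(found, missing, expected_rows):
    """Raise if found + missing does not equal the number of metadata rows."""
    total = len(found) + len(missing)
    if total != expected_rows:
        raise ValueError(
            f"Checked {total} paths but expected {expected_rows} metadata rows; "
            "the loop may be iterating over a different dataframe."
        )
    return total
```

With the numbers from this thread, 167988 + 14273 = 182261 paths were checked against a file with 154301 rows, so `check_totals` would raise and flag the mismatch before anyone goes looking for missing files.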

I'm still trying to understand the issue, so any additional info would be appreciated.

Best,
Lukas

Good catch on the sanity check regarding the number of rows. In a previous step I had combined all the metadata files and then used that dataframe in that loop instead of the one just containing SnakeCLEF2023-TrainMetadata-iNat.csv.

Here are the correct numbers:
Found 154145 files
Missing 156 files
Missing 0.10% of files

I did eventually download the full-size images and, oddly, a different number of images is missing from that set (416).
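To see how the two sets of missing files relate, one option is to compare them directly. This is only a sketch: it assumes both archives extract to the same relative layout as the `image_path` column (not confirmed in this thread), and `missing_under` and `compare_missing` are hypothetical helper names.

```python
from pathlib import Path


def missing_under(image_paths, root):
    """Return the set of relative paths that are not regular files under root."""
    root = Path(root)
    return {p for p in image_paths if not (root / p).is_file()}


def compare_missing(image_paths, root_a, root_b):
    """Compare which metadata paths are missing from two extracted image roots."""
    image_paths = list(image_paths)  # allow generators to be scanned twice
    miss_a = missing_under(image_paths, root_a)
    miss_b = missing_under(image_paths, root_b)
    return {
        "missing_in_both": miss_a & miss_b,
        "only_missing_in_a": miss_a - miss_b,
        "only_missing_in_b": miss_b - miss_a,
    }
```

Calling `compare_missing(train_metadata["image_path"], <500px root>, <full-size root>)` would show whether the 156 paths missing from the 500px set are a subset of the 416 missing from the full-size set or an unrelated group.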

However, if both cases are just the result of removing duplicate images, I'll carry on assuming all is well.

Thanks!
