More concise scheduler

#2
by Wauplin HF Staff - opened

I made some changes to the ZipScheduler:

  • use if not any(self.folder_path.iterdir()): to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.
  • use shutil.make_archive instead of launching a subprocess. Feels more Pythonic this way.
  • acquire self.lock only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)
  • updated from .tar.gz to .zip format. Size-wise I don't think we care about the compression difference. And .zip is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.

Warning: I made sure my changes are working locally but having really tried in the repo. Also the dataset structure will slightly change.

Side note:

  • Resulting dataset is not readable out of the box since we are generating an image dataset with multiple images "per row" (we can't really overcome this apart with a custom script)
  • I would recommend to store everything as plain images and text in the repo. Zipping them in an archive is useful when you get high usage on your Space (>100k user inputs). In general not zipping files makes the dataset more readable on the Hub (e.g. quickly get an idea of how many items you have without unzipping archives) + avoid to implement/maintain a custom ZipScheduler.

Regarding the ZIP vs RAW upload of the files, I made a test in this repo (https://huggingface.co/datasets/Wauplin/hysts-sample-user-preferences-proposal) to show how a non-zipped dataset would look like. Apart from the lower complexity in the Space, the dataset is also immediately readable by datasetsand especially you can check the number of rows in the dataset-preview.

To do that, I took data from your dataset and:

  • uncompressed data
  • kept the 1 experiment = 1 subfolder
  • renamed preference.json to train-preferences. This is kind-off a hack to make datasets understand that it should consider the dataset as "json-based" and not "image-based".
  • in each preferences.json I added a field "name" with the uuid of the folder so that we can easily identify which folder corresponding to which row in the dataset

Note that in any case, this implementation will reach the "10K files/folder" limit if you get too much usage. A way to tackle this is to use subfolders (like /3c/3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz instead of /3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz). Maybe not a priority for us?

EDIT: Not working properly => do not merge it (I'm working on a workaround). Basically the archive name become ***.zip.zip and I only save ***.zip

hysts-samples org
edited Jun 15, 2023

EDIT: Not working properly => do not merge it (I'm working on a workaround). Basically the archive name become ***.zip.zip and I only save ***.zip

Yeah, I also noticed it and was thinking of a workaround, haha.

Fixed and tested :)

hysts-samples org

@Wauplin Thanks for the PR!

  • use if not any(self.folder_path.iterdir()): to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.
  • use shutil.make_archive instead of launching a subprocess. Feels more Pythonic this way.
  • acquire self.lock only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)
  • updated from .tar.gz to .zip format. Size-wise I don't think we care about the compression difference. And .zip is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.

Nice! The first three points look good to me. As for the last point, I actually wanted to just tar without doing gzip or zip because images are jpeg here and there's no point for compression. But the default .gitattributes in the dataset repo doesn't have *.tar, so I just gzipped for now while waiting for a PR that adds it to be merged. But as you pointed out, .zip may be more favorable in other environments, so I'm OK with changing it to .zip.

hysts-samples org

I just confirmed that it works myself. I'll merge. Thanks again!

hysts changed pull request status to merged
hysts-samples org

Regarding this line, I prefer

            shutil.make_archive(base_name=archive_name,
                                format='zip',
                                root_dir=self.folder_path.parent,
                                base_dir=self.folder_path.name)

as you did in your original PR because it keeps the root directory, so I'll update that part.

Sign up or log in to comment