More concise scheduler
I made some changes to the ZipScheduler (see the sketch after this list):
- use `if not any(self.folder_path.iterdir()):` to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.
- use `shutil.make_archive` instead of launching a subprocess. Feels more Pythonic this way.
- acquire `self.lock` only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)
- updated from `.tar.gz` to `.zip` format. Size-wise I don't think we care about the compression difference. And `.zip` is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.
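For reference, here is a minimal sketch of what the updated method could look like. It assumes `ZipScheduler` subclasses `huggingface_hub.CommitScheduler` (which exposes `folder_path`, `lock`, `api`, `repo_id`, `repo_type`, and `revision`); the attribute names and the upload step follow the `CommitScheduler` docs pattern and are assumptions, not the actual diff:

```python
import shutil
import tempfile
import uuid
from pathlib import Path

from huggingface_hub import CommitScheduler


class ZipScheduler(CommitScheduler):
    def push_to_hub(self) -> None:
        # 1. Cheap emptiness test: no need to list + sort all paths.
        if not any(self.folder_path.iterdir()):
            return

        with tempfile.TemporaryDirectory() as tmpdir:
            archive_base = Path(tmpdir) / str(uuid.uuid4())

            # 2. Hold the lock only while archiving + resetting the folder,
            #    so the app can keep writing during the (slow) upload.
            with self.lock:
                # make_archive appends the ".zip" suffix itself.
                shutil.make_archive(str(archive_base), "zip", root_dir=self.folder_path)
                shutil.rmtree(self.folder_path)
                self.folder_path.mkdir(parents=True, exist_ok=True)

            # 3. Upload outside the lock.
            archive_path = archive_base.with_suffix(".zip")
            self.api.upload_file(
                repo_id=self.repo_id,
                repo_type=self.repo_type,
                revision=self.revision,
                path_in_repo=archive_path.name,
                path_or_fileobj=archive_path,
            )
```

Because the lock is released before the upload, anything written to `folder_path` during the upload simply lands in the next archive instead of blocking the app.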
Warning: I made sure my changes are working locally but haven't really tried them in the repo. Also, the dataset structure will slightly change.
Side note:
- The resulting dataset is not readable out of the box since we are generating an image dataset with multiple images "per row" (we can't really overcome this except with a custom script)
- I would recommend storing everything as plain images and text in the repo. Zipping them into an archive is useful when you get high usage on your Space (>100k user inputs). In general, not zipping files makes the dataset more readable on the Hub (e.g. quickly get an idea of how many items you have without unzipping archives) + avoids implementing/maintaining a custom ZipScheduler.
Regarding the ZIP vs RAW upload of the files, I made a test in this repo (https://huggingface.co/datasets/Wauplin/hysts-sample-user-preferences-proposal) to show what a non-zipped dataset would look like. Apart from the lower complexity in the Space, the dataset is also immediately readable by `datasets`, and in particular you can check the number of rows in the dataset preview.
To do that, I took data from your dataset and:
- uncompressed the data
- kept the 1 experiment = 1 subfolder structure
- renamed `preference.json` to `train-preferences`. This is kind of a hack to make `datasets` understand that it should consider the dataset as "json-based" and not "image-based".
- in each `preferences.json` I added a field `"name"` with the uuid of the folder so that we can easily identify which folder corresponds to which row in the dataset (see the loading sketch after this list)
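A minimal sketch of how such a repo could then be consumed, assuming the restructuring above is enough for `datasets` to resolve the `train-preferences` files as a JSON-based train split (the `"name"` field is the one added in the hack described above):

```python
from datasets import load_dataset

# Loads the non-zipped proposal repo directly; no custom unarchiving script needed.
ds = load_dataset("Wauplin/hysts-sample-user-preferences-proposal")

print(ds)                      # row count visible without downloading/unzipping archives
print(ds["train"][0]["name"])  # uuid of the subfolder this row came from
```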
Note that in any case, this implementation will reach the "10K files/folder" limit if you get too much usage. A way to tackle this is to use subfolders (like `/3c/3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz` instead of `/3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz`). Maybe not a priority for us?
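A tiny, hypothetical helper for that kind of sharding (the two-character prefix and the `sharded_path` name are assumptions, not existing code):

```python
import uuid
from typing import Optional


def sharded_path(name: Optional[str] = None, suffix: str = ".zip") -> str:
    # Use the first two characters of the uuid as a subfolder prefix so files
    # spread across many folders instead of piling up in the repo root.
    name = name or str(uuid.uuid4())
    return f"{name[:2]}/{name}{suffix}"


print(sharded_path("3cccd147-818c-4b1f-b2c4-8d83dd2506b2"))
# -> 3c/3cccd147-818c-4b1f-b2c4-8d83dd2506b2.zip
```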
EDIT: Not working properly => do not merge it (I'm working on a workaround). Basically the archive name becomes `***.zip.zip` and I only save `***.zip`.
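The likely cause (an assumption about the bug, not a confirmed diagnosis) is that `shutil.make_archive` appends the format's extension itself, so passing a base name that already ends in `.zip` produces a double suffix:

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "data"
    folder.mkdir()
    (folder / "example.txt").write_text("hello")

    # Base name already contains ".zip" -> shutil appends another one.
    print(shutil.make_archive(str(Path(tmp) / "archive.zip"), "zip", root_dir=folder))
    # .../archive.zip.zip

    # Base name without the extension -> expected result.
    print(shutil.make_archive(str(Path(tmp) / "archive"), "zip", root_dir=folder))
    # .../archive.zip
```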
Yeah, I also noticed it and was thinking of a workaround, haha.
Fixed and tested :)
@Wauplin Thanks for the PR!
> - use `if not any(self.folder_path.iterdir()):` to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.
> - use `shutil.make_archive` instead of launching a subprocess. Feels more Pythonic this way.
> - acquire `self.lock` only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)
> - updated from `.tar.gz` to `.zip` format. Size-wise I don't think we care about the compression difference. And `.zip` is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.
Nice! The first three points look good to me. As for the last point, I actually wanted to just `tar` without doing `gzip` or `zip` because the images are JPEGs here and there's no point in compression. But the default `.gitattributes` in the dataset repo doesn't have `*.tar`, so I just gzipped for now while waiting for a PR that adds it to be merged. But as you pointed out, `.zip` may be more favorable in other environments, so I'm OK with changing it to `.zip`.
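If the uncompressed-tar route is ever revisited, the only code change in the scheduler sketch above would be the format string; this is a sketch built on that earlier (assumed) `shutil.make_archive` call, not on the actual implementation:

```python
import shutil

# Built-in archive formats: "tar" is uncompressed, "gztar" corresponds to the
# previous .tar.gz behavior, "zip" to the new default.
print(shutil.get_archive_formats())

# Same call as in the scheduler sketch, producing "<uuid>.tar" instead of "<uuid>.zip":
# shutil.make_archive(str(archive_base), "tar", root_dir=self.folder_path)
```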
I just confirmed that it works myself. I'll merge. Thanks again!