save-user-preferences

Runtime error

App Files Files Community

More concise scheduler

by Wauplin HF Staff - opened Jun 14, 2023

base: refs/heads/main

←

from: refs/pr/2

Discussion Files changed

+14

-18

More concise scheduler4e223fe9

Wauplin

Jun 14, 2023

•

edited Jun 14, 2023

I made some changes to the ZipScheduler:

use if not any(self.folder_path.iterdir()): to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.
use shutil.make_archive instead of launching a subprocess. Feels more Pythonic this way.
acquire self.lock only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)
updated from .tar.gz to .zip format. Size-wise I don't think we care about the compression difference. And .zip is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.

Warning: I made sure my changes are working locally but having really tried in the repo. Also the dataset structure will slightly change.

Side note:

Resulting dataset is not readable out of the box since we are generating an image dataset with multiple images "per row" (we can't really overcome this apart with a custom script)
I would recommend to store everything as plain images and text in the repo. Zipping them in an archive is useful when you get high usage on your Space (>100k user inputs). In general not zipping files makes the dataset more readable on the Hub (e.g. quickly get an idea of how many items you have without unzipping archives) + avoid to implement/maintain a custom ZipScheduler.

Wauplin

Jun 14, 2023

Regarding the ZIP vs RAW upload of the files, I made a test in this repo (https://huggingface.co/datasets/Wauplin/hysts-sample-user-preferences-proposal) to show how a non-zipped dataset would look like. Apart from the lower complexity in the Space, the dataset is also immediately readable by datasetsand especially you can check the number of rows in the dataset-preview.

To do that, I took data from your dataset and:

uncompressed data
kept the 1 experiment = 1 subfolder
renamed preference.json to train-preferences. This is kind-off a hack to make datasets understand that it should consider the dataset as "json-based" and not "image-based".
in each preferences.json I added a field "name" with the uuid of the folder so that we can easily identify which folder corresponding to which row in the dataset

Wauplin

Jun 14, 2023

Note that in any case, this implementation will reach the "10K files/folder" limit if you get too much usage. A way to tackle this is to use subfolders (like /3c/3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz instead of /3cccd147-818c-4b1f-b2c4-8d83dd2506b2.tar.gz). Maybe not a priority for us?

Wauplin

Jun 15, 2023

EDIT: Not working properly => do not merge it (I'm working on a workaround). Basically the archive name become ***.zip.zip and I only save ***.zip

hysts

hysts-samples org Jun 15, 2023

•

edited Jun 15, 2023

EDIT: Not working properly => do not merge it (I'm working on a workaround). Basically the archive name become ***.zip.zip and I only save ***.zip

Yeah, I also noticed it and was thinking of a workaround, haha.

Update scheduler.py0cc50706

Update scheduler.pya4df400c

Wauplin

Jun 15, 2023

Fixed and tested :)

hysts

hysts-samples org Jun 15, 2023

@Wauplin Thanks for the PR!

use if not any(self.folder_path.iterdir()): to check for changes in the folder. No need to list all paths + sort since we won't use them afterwards.

use shutil.make_archive instead of launching a subprocess. Feels more Pythonic this way.

acquire self.lock only when we create the archive + reset the folder. By doing this, we avoid blocking the app during the upload process (since you require the lock to write to the folder)

updated from .tar.gz to .zip format. Size-wise I don't think we care about the compression difference. And .zip is supposed to be more broadly supported, especially on Windows. I can change it back though if you feel strongly.

Nice! The first three points look good to me. As for the last point, I actually wanted to just tar without doing gzip or zip because images are jpeg here and there's no point for compression. But the default .gitattributes in the dataset repo doesn't have *.tar, so I just gzipped for now while waiting for a PR that adds it to be merged. But as you pointed out, .zip may be more favorable in other environments, so I'm OK with changing it to .zip.

hysts

hysts-samples org Jun 15, 2023

I just confirmed that it works myself. I'll merge. Thanks again!

hysts changed pull request status to merged Jun 15, 2023

hysts

hysts-samples org Jun 15, 2023

Regarding this line, I prefer

            shutil.make_archive(base_name=archive_name,
                                format='zip',
                                root_dir=self.folder_path.parent,
                                base_dir=self.folder_path.name)

as you did in your original PR because it keeps the root directory, so I'll update that part.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment