What is the algorithm used to extract the control image for ControlNet Tile?
Am I understanding correctly that the control image given to ControlNet Tile is just a downscaled version of the original image?
Let me explain how I understand it:
In the training phase, if the original image is 512x512, you would downscale it to 256x256 (or something like that), then upscale it back to 512x512, and use the result as the control image. This round trip blurs the control image to a degree that depends on the resampler you use. If you actually wanted it to look like a grid of blocky tiles, the resampler would presumably have to be nearest-neighbor based, right? See the sketch below for what I mean.
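Here's a minimal sketch of the preprocessing I'm describing, in Python with Pillow (the 0.5 scale factor and the resampler choices are just my assumptions, not necessarily what ControlNet Tile actually uses):

```python
from PIL import Image

def make_tile_control(image: Image.Image, scale: float = 0.5,
                      resample=Image.BILINEAR) -> Image.Image:
    """Downscale by `scale`, then upscale back to the original size."""
    w, h = image.size
    small = image.resize((int(w * scale), int(h * scale)), resample=resample)
    # Upscaling back introduces blur (bilinear/bicubic) or blockiness (nearest).
    return small.resize((w, h), resample=resample)

img = Image.open("input.png").convert("RGB")  # e.g. a 512x512 image
blurry_control = make_tile_control(img, 0.5, Image.BILINEAR)  # smooth blur
blocky_control = make_tile_control(img, 0.5, Image.NEAREST)   # visible "tiles"
```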
At inference, to guarantee that it still works, you probably need to apply the same downscale/upscale preprocessing. But maybe the model generalizes well enough that you could feed the unprocessed 512x512 image directly and it would still generate a sharper version of it; it's as if the model has learned to sharpen whatever control image it is given (even one that's already sharp). Both variants are sketched below.
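For concreteness, here's how I imagine the two inference variants would look with diffusers' StableDiffusionControlNetImg2ImgPipeline, reusing `img` and `blurry_control` from the snippet above (the model IDs and parameters are just illustrative, and I may be holding the pipeline wrong):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a sharp, detailed photo"

# Variant A: control image goes through the downscale/upscale round trip.
out_a = pipe(prompt, image=img, control_image=blurry_control).images[0]

# Variant B: feed the unprocessed image directly as the control image.
out_b = pipe(prompt, image=img, control_image=img).images[0]
```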
But I doubt it can work without the downscale/upscale preprocessing, because the model never saw such a sharp control image during training. Does this work in practice?
Please correct me if I'm wrong on any points.
Thanks!