lehduong committed on
Commit 39904dc
1 Parent(s): 0f307df

Update app.py

Files changed (1): app.py (+3 -3)
app.py CHANGED
@@ -414,7 +414,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
 
 2. **Upload Images**: Drag and drop images directly onto the upload area, or click to select files from your device.
 
-3. **Generate Captions**: **If you upload any images**, Click the "Generate Captions" button to format the text prompt according to chosen task. In this demo, you will NEED to provide the caption of each source image manually.
+3. **Generate Captions**: **If you upload any images**, Click the "Generate Captions" button to format the text prompt according to chosen task. In this demo, you will **NEED** to provide the caption of each source image manually. We recommend using Molmo for captioning.
 
 4. **Configure Generation Settings**: Expand the "Advanced Configuration" section to adjust parameters like the number of inference steps, guidance scale, image size, and more.
 
@@ -430,7 +430,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
 
 - For boundingbox2image/semantic2image/inpainting etc tasks:
   + To perform condition-to-image such as semantic map to image, follow above steps
-  + For image-to-condition e.g., image to depth, change the denoise_mask checkbox before generating images. You must UNCHECK image_0 box and CHECK image_1 box.
+  + For image-to-condition e.g., image to depth, change the denoise_mask checkbox before generating images. You must UNCHECK image_0 box and CHECK image_1 box. Caption is not required for this task.
 
 - For FaceID tasks:
   + Use 3 or 4 images if single input image does not give satisfactory results.
@@ -440,7 +440,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
   + If you have non-human subjects and does not get satisfactory results, try "copying" part of caption of source images where it describes the properties of the subject e.g., a monster with red eyes, sharp teeth, etc.
 
 - For Multiview generation:
-  + The input camera elevation/azimuth ALWAYS starts with $0$. If you want to generate images of azimuths 30,60,90 and elevations of 10,20,30 (wrt input image), the correct input azimuth is: `0, 30, 60, 90`; input elevation is `0,10,20,30`. The camera distance will be `1.5,1.5,1.5,1.5`
+  + The input camera elevation/azimuth ALWAYS starts with 0. If you want to generate images of azimuths 30,60,90 and elevations of 10,20,30 (wrt input image), the correct input azimuth is: `0, 30, 60, 90`; input elevation is `0,10,20,30`. The camera distance will be `1.5,1.5,1.5,1.5`
   + Only support square images (ideally in 512x512 resolution).
   + Ensure the number of elevations, azimuths, and distances are equal.
   + The model generally works well for 2-5 views (include both input and generated images). Since the model is trained with 3 views on 512x512 resolution, you might try scale_factor of [1.1; 1.5] and scale_watershed of [100; 400] for better extrapolation.
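A note on the denoise_mask convention referenced in the boundingbox2image/semantic2image hunk: a checked box marks an image slot the model denoises (generates), while an unchecked box marks a conditioning input. The sketch below is a hypothetical illustration of that mapping, not code from app.py; the function name, the slot layout, and the assumption that condition-to-image is the default direction are mine.

```python
# Hypothetical illustration of the denoise_mask checkbox convention above.
# True = this slot is denoised (generated); False = this slot conditions.
def denoise_mask_for(direction: str) -> list[bool]:
    # Assumed slot layout: index 0 = image, index 1 = condition (e.g., depth).
    if direction == "condition2image":   # e.g., semantic map -> image (default)
        return [True, False]             # CHECK image_0, UNCHECK image_1
    if direction == "image2condition":   # e.g., image -> depth
        return [False, True]             # UNCHECK image_0, CHECK image_1
    raise ValueError(f"unknown direction: {direction!r}")

print(denoise_mask_for("image2condition"))  # [False, True]
```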
 
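The multiview hunk's camera convention (the input view always contributes the leading zero entries, with a fixed distance) is easy to get wrong when typing the comma-separated lists by hand. Below is a minimal sketch of a helper that enforces it; `build_camera_lists` and the 1.5 distance default are hypothetical, not part of app.py.

```python
# Hypothetical helper enforcing the multiview input convention above:
# the input image is view 0, so azimuth 0 / elevation 0 / distance 1.5
# must lead every list, and all three lists must have equal length.
def build_camera_lists(azimuths, elevations, distance=1.5):
    if len(azimuths) != len(elevations):
        raise ValueError("need one elevation per azimuth")
    full_az = [0] + list(azimuths)         # input view first
    full_el = [0] + list(elevations)
    full_dist = [distance] * len(full_az)  # one distance per view
    join = lambda xs: ",".join(str(x) for x in xs)
    return join(full_az), join(full_el), join(full_dist)

# Example from the notes: azimuths 30,60,90 and elevations 10,20,30
az, el, dist = build_camera_lists([30, 60, 90], [10, 20, 30])
print(az)    # 0,30,60,90
print(el)    # 0,10,20,30
print(dist)  # 1.5,1.5,1.5,1.5
```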