lehduong committed on
Commit 39904dc
1 Parent(s): 0f307df

Update app.py

Files changed (1): app.py (+3 -3)
app.py CHANGED
@@ -414,7 +414,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
 
 2. **Upload Images**: Drag and drop images directly onto the upload area, or click to select files from your device.
 
-3. **Generate Captions**: **If you upload any images**, Click the "Generate Captions" button to format the text prompt according to chosen task. In this demo, you will NEED to provide the caption of each source image manually.
+3. **Generate Captions**: **If you upload any images**, Click the "Generate Captions" button to format the text prompt according to chosen task. In this demo, you will **NEED** to provide the caption of each source image manually. We recommend using Molmo for captioning.
 
 4. **Configure Generation Settings**: Expand the "Advanced Configuration" section to adjust parameters like the number of inference steps, guidance scale, image size, and more.
 
@@ -430,7 +430,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
 
 - For boundingbox2image/semantic2image/inpainting etc tasks:
   + To perform condition-to-image such as semantic map to image, follow above steps
-  + For image-to-condition e.g., image to depth, change the denoise_mask checkbox before generating images. You must UNCHECK image_0 box and CHECK image_1 box.
+  + For image-to-condition e.g., image to depth, change the denoise_mask checkbox before generating images. You must UNCHECK image_0 box and CHECK image_1 box. Caption is not required for this task.
 
 - For FaceID tasks:
   + Use 3 or 4 images if single input image does not give satisfactory results.
@@ -440,7 +440,7 @@ with gr.Blocks(title="OneDiffusion Demo") as demo:
   + If you have non-human subjects and does not get satisfactory results, try "copying" part of caption of source images where it describes the properties of the subject e.g., a monster with red eyes, sharp teeth, etc.
 
 - For Multiview generation:
-  + The input camera elevation/azimuth ALWAYS starts with $0$. If you want to generate images of azimuths 30,60,90 and elevations of 10,20,30 (wrt input image), the correct input azimuth is: `0, 30, 60, 90`; input elevation is `0,10,20,30`. The camera distance will be `1.5,1.5,1.5,1.5`
+  + The input camera elevation/azimuth ALWAYS starts with 0. If you want to generate images of azimuths 30,60,90 and elevations of 10,20,30 (wrt input image), the correct input azimuth is: `0, 30, 60, 90`; input elevation is `0,10,20,30`. The camera distance will be `1.5,1.5,1.5,1.5`
   + Only support square images (ideally in 512x512 resolution).
   + Ensure the number of elevations, azimuths, and distances are equal.
   + The model generally works well for 2-5 views (include both input and generated images). Since the model is trained with 3 views on 512x512 resolution, you might try scale_factor of [1.1; 1.5] and scale_watershed of [100; 400] for better extrapolation.
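A note on the denoise_mask convention referenced in the boundingbox2image/semantic2image hunk: a checked box marks an image slot the model denoises (generates), while an unchecked box marks a conditioning input. The sketch below is a hypothetical illustration of that mapping, not code from app.py; the function name, the slot layout, and the assumption that condition-to-image is the default direction are mine.

```python
# Hypothetical illustration of the denoise_mask checkbox convention above.
# True = this slot is denoised (generated); False = this slot conditions.
def denoise_mask_for(direction: str) -> list[bool]:
    # Assumed slot layout: index 0 = image, index 1 = condition (e.g., depth).
    if direction == "condition2image":   # e.g., semantic map -> image (default)
        return [True, False]             # CHECK image_0, UNCHECK image_1
    if direction == "image2condition":   # e.g., image -> depth
        return [False, True]             # UNCHECK image_0, CHECK image_1
    raise ValueError(f"unknown direction: {direction!r}")

print(denoise_mask_for("image2condition"))  # [False, True]
```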
 
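The multiview hunk's camera convention (the input view always contributes the leading zero entries, with a fixed distance) is easy to get wrong when typing the comma-separated lists by hand. Below is a minimal sketch of a helper that enforces it; `build_camera_lists` and the 1.5 distance default are hypothetical, not part of app.py.

```python
# Hypothetical helper enforcing the multiview input convention above:
# the input image is view 0, so azimuth 0 / elevation 0 / distance 1.5
# must lead every list, and all three lists must have equal length.
def build_camera_lists(azimuths, elevations, distance=1.5):
    if len(azimuths) != len(elevations):
        raise ValueError("need one elevation per azimuth")
    full_az = [0] + list(azimuths)         # input view first
    full_el = [0] + list(elevations)
    full_dist = [distance] * len(full_az)  # one distance per view
    join = lambda xs: ",".join(str(x) for x in xs)
    return join(full_az), join(full_el), join(full_dist)

# Example from the notes: azimuths 30,60,90 and elevations 10,20,30
az, el, dist = build_camera_lists([30, 60, 90], [10, 20, 30])
print(az)    # 0,30,60,90
print(el)    # 0,10,20,30
print(dist)  # 1.5,1.5,1.5,1.5
```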