a question

#16
by ummy617 - opened

Hi Ideogram team, thanks for releasing Ideogram 4 open weights.

I have a question / feature request about local image editing support.

On the Ideogram web app, uploaded images can be used with features such as Remix, Magic Fill, Open in Studio, and image reference workflows. However, the current open-weight release on Hugging Face (ideogram-4-fp8, ideogram-4-nf4, and ideogram-4-nf4-diffusers) appears to expose only text-to-image inference locally.

Is there any plan to release local image-to-image / image editing support for Ideogram 4, such as:

  • using an input image as a reference during generation
  • remixing an existing image while preserving composition and object geometry
  • inpainting / Magic Fill with masks
  • changing scene attributes, for example turning an existing drone photo into a rainy-day version while preserving the drone structure

If this is already possible with the current open weights, could you provide an example script or documentation showing how to pass an input image and optional mask to the local pipeline?

If it is not currently supported, it would be very helpful to know whether the web app’s Remix / Magic Fill functionality depends on additional models or conditioning modules that are not included in the current Hugging Face release.

Thanks again for the open release.

I asked my favorite LLM about exactly this issue. Below is the outline of what might be a possible workaround for now. Haven't tested it myself yet, but at first glance the basic idea of injecting reference images into the latent space at the appropriate canvas positions prior to generation seems to make sense to me.

BTW, I'm still playing around with image generation, and I'm simply blown away by the results. If Ideogram-4 finds its way into homegrown video generation pipelines, Hollywood is cooked 😃


The Structural Node Workflow Step-by-Step

To anchor reference images (like a poster graphic or a person's face) to precise bounding boxes inside an open-source Ideogram 4 workflow, you do not need a secondary editing model. Instead, you need to extend your setup with In-Context Learning / Latent Masking nodes.

Because Ideogram 4 treats image tokens and text tokens identically within its single-stream Diffusion Transformer (DiT), pasting a reference image into a specific spot requires converting your reference image into latents and patching them over the exact layout coordinates specified in your JSON prompt.

To inject reference images into specific JSON coordinates, you must modify your standard Text-to-Image pipeline into a Latent Composition Pipeline.

Step A: Process the Reference Images

  • Add two Load Image nodes: one for your reference face, one for your reference poster graphic.
  • Connect each image to an Image Resize node to scale them to the exact pixel width and height of their corresponding JSON bounding boxes.
  • Pass both resized images into VAE Encode nodes using the flux2-vae.safetensors model to convert the pixels into latent space.

Step B: Composite the Latents (The Injection Point)

Instead of starting your sampler with empty latent noise, you will inject your reference latents directly into the starting canvas space:

  • Add an Empty Latent Image node set to your final output resolution (e.g., 1024x1024).
  • Add a Latent Composite (Masked) node.
    Connect the Empty Latent Image into the destination latent slot.
    Connect your resized reference face latent into the source latent slot.
    Connect the corresponding Face bounding-box mask from the Ideogram 4 Prompt Builder into the mask slot.
  • Add a second Latent Composite (Masked) node downstream.
    Feed the output of the first composite node into the destination latent slot.
    Connect the reference poster latent into the source latent slot.
    Connect the Poster bounding-box mask into the mask slot.
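
For readers outside ComfyUI, the compositing in Step B boils down to roughly the array operation below. This is only a minimal NumPy sketch, not Ideogram or ComfyUI code: the function name `composite_latent`, the toy channel count, and the latent-space box position are all assumptions made for illustration.

```python
import numpy as np

def composite_latent(dest, src, mask, top, left):
    """Paste `src` into a copy of `dest` at (top, left), blended by `mask`.

    dest: (C, H, W) destination latent
    src:  (C, h, w) reference latent, already resized to the box
    mask: (h, w) values in [0, 1]; 1 keeps the reference, 0 keeps dest
    """
    c, h, w = src.shape
    if c != dest.shape[0]:
        raise ValueError("channel counts differ")
    if top < 0 or left < 0 or top + h > dest.shape[1] or left + w > dest.shape[2]:
        raise ValueError("source latent does not fit inside destination")
    out = dest.copy()
    region = out[:, top:top + h, left:left + w]
    # The (h, w) mask broadcasts over the channel axis.
    out[:, top:top + h, left:left + w] = mask * src + (1.0 - mask) * region
    return out

# Toy example: 4-channel 8x8 canvas of zeros, 2x2 reference of ones.
canvas = np.zeros((4, 8, 8), dtype=np.float32)
face = np.ones((4, 2, 2), dtype=np.float32)
full_mask = np.ones((2, 2), dtype=np.float32)
result = composite_latent(canvas, face, full_mask, top=3, left=5)
```

Chaining two calls (face first, then poster) mirrors the two Latent Composite (Masked) nodes. Whether the sampler then treats those pasted values as a reference is a separate question.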

Step C: Sample with Structural Guidance

  • Connect the final, composited latent structure (which now contains your reference images embedded precisely where your JSON says they should be) into the Latent input of your Ideogram 4 Sampler.
  • Run your inference. The Ideogram 4 DiT model reads the text tokens explaining what the objects are, analyzes the bounding box properties from the JSON, encounters the pre-existing visual structures in the latents, and seamlessly blends the edges, lighting, and style of your reference images into the global composition.

Thanks for the detailed outline. I agree that latent masking / latent composition could be an interesting experimental direction, but after looking at the current Ideogram 4 open-weight pipeline, I think there are a few important caveats.

The local pipeline currently starts from pure random noise:

z = torch.randn(...)

There is no exposed input_image, mask, strength, reference_image, or init_latents interface. So this would require modifying the sampler/pipeline, not just adding ComfyUI-style nodes around the existing script.

Also, directly VAE-encoding a reference image and pasting its clean latent into the initial noise latent may not be distributionally correct. Ideogram 4 uses a flow-matching sampler, so the initial z represents a noisy latent state, not a clean image latent. A more plausible inpainting-style approach would be to encode the reference image into clean latent space, normalize and patchify it exactly as Ideogram expects, then at each sampling step replace the masked region with the reference latent noised to the current timestep.
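
To make the per-step replacement concrete, here is a rough NumPy sketch. It assumes the common linear (rectified-flow) interpolation between clean latent and noise, with t=1 as pure noise and t=0 as clean; the actual Ideogram 4 schedule and time convention may differ. `denoise_step` is a hypothetical stand-in for one update of the real sampler, and all function names are mine.

```python
import numpy as np

def noised_reference(ref_latent, noise, t):
    """Reference latent at noise level t (t=1 pure noise, t=0 clean).

    ASSUMPTION: linear rectified-flow interpolation; the real pipeline's
    schedule and time convention may differ.
    """
    return (1.0 - t) * ref_latent + t * noise

def constrain_step(z, ref_latent, noise, mask, t):
    """Overwrite the masked region of z with the reference at level t."""
    return mask * noised_reference(ref_latent, noise, t) + (1.0 - mask) * z

def sample_with_constraint(denoise_step, ref_latent, mask, timesteps, rng):
    """Run a sampler loop, re-applying the masked constraint after every step.

    denoise_step(z, t, t_next) is a HYPOTHETICAL callable for one sampler update.
    timesteps must be decreasing, e.g. [1.0, 0.75, 0.5, 0.25, 0.0].
    """
    # One fixed noise draw is reused for the reference at every step.
    noise = rng.standard_normal(ref_latent.shape)
    z = rng.standard_normal(ref_latent.shape)  # unmasked region starts as noise
    z = constrain_step(z, ref_latent, noise, mask, timesteps[0])
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        z = denoise_step(z, t, t_next)
        z = constrain_step(z, ref_latent, noise, mask, t_next)
    return z
```

With this convention the masked region ends exactly at the clean reference latent once t reaches 0, while the unmasked region is left entirely to the model. Reusing one noise draw versus resampling it each step is a design choice I have not tested.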

There is another question: even though Ideogram 4 is a single-stream DiT where text and image tokens are processed together, the current open pipeline’s image tokens are output/noise tokens, not explicitly reference-image tokens. So injecting visual latents may influence the result, but it is not guaranteed to behave like the web app’s image reference / Remix / Magic Fill features, unless the released model was trained to interpret partially fixed latents that way.

A few implementation details would also need to be handled carefully:

  • use the VAE included with the Ideogram 4 weights, not an arbitrary external VAE
  • apply the same latent normalization used by the pipeline
  • convert image-space bounding boxes into Ideogram’s latent/token grid
  • account for the patching scheme, where one output image token corresponds to roughly 16x16 pixels
  • re-apply the masked latent constraint throughout sampling, not only at initialization
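
As an illustration of the bounding-box conversion, the sketch below maps a pixel box onto a token grid. The patch size of 16 comes from the "roughly 16x16 pixels" figure above and is an assumption, as are the function names and the choice to snap the box outward to whole tokens.

```python
import math

PATCH = 16  # ASSUMPTION: pixels per token side, per the "roughly 16x16" note

def bbox_to_token_grid(x0, y0, x1, y1, width, height, patch=PATCH):
    """Map a pixel box (right/bottom exclusive) to the token cells it touches.

    Returns (col0, row0, col1, row1), also right/bottom exclusive.
    """
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError("box must be non-empty and inside the image")
    col0, row0 = x0 // patch, y0 // patch
    # Round up so partially covered tokens are included.
    col1, row1 = math.ceil(x1 / patch), math.ceil(y1 / patch)
    return col0, row0, col1, row1

def token_mask(box, width, height, patch=PATCH):
    """Binary mask over the token grid: 1 inside the snapped box, else 0."""
    cols, rows = math.ceil(width / patch), math.ceil(height / patch)
    col0, row0, col1, row1 = bbox_to_token_grid(*box, width, height, patch)
    return [
        [1 if row0 <= r < row1 and col0 <= c < col1 else 0 for c in range(cols)]
        for r in range(rows)
    ]

grid_box = bbox_to_token_grid(100, 200, 300, 420, 1024, 1024)
print(grid_box)  # (6, 12, 19, 27)
```

Boxes that are not multiples of 16 pixels end up covering partial tokens, so the constrained region is slightly larger than the requested box. Snapping inward instead would be the conservative alternative.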

So I’d describe this as a possible experimental inpainting / latent initialization hack, rather than true local image editing support equivalent to Ideogram’s web app. It might be worth trying, but I would not expect it to reliably preserve geometry or behave like official Remix/Magic Fill without additional released conditioning modules or editing-specific training.

Well written.
