How do <seg[value]> tokens generate the masks in segmentation tasks?

#10
by cmgzy - opened

In https://huggingface.co/blog/paligemma#referring-expression-segmentation,
the authors said "The segmentation tokens can be further processed to generate segmentation masks."

I understand what the <loc[value]> tokens mean from "Each detection is represented by four location coordinates in the order y_min, x_min, y_max, x_max, followed by the label that was detected in that box", but I cannot figure out how the <seg[value]> tokens generate the masks. Could anyone clarify? Thanks!


Google org

@cmgzy Hello! The <seg[value]> tokens are decoded by a VAE to generate the mask; the code can be found in the Space. Let me know if anything's unclear.
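For readers who want a rough picture of the steps around the VAE, here is a minimal sketch, not the Space's actual code. It assumes each detection is emitted as four `<locNNNN>` tokens (bins over a 1024-wide range), followed by sixteen `<segNNN>` tokens (codebook indices) and a label. The decoder itself is not reproduced: `mask64` is assumed to be the small square mask the VAE decoder returns for those sixteen indices, with values in [0, 1]. The function names `parse_segmentation` and `paste_mask` are hypothetical.

```python
import re

# Assumed format: 4 <locNNNN> tokens, then 16 <segNNN> tokens, then a label.
_PATTERN = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>){16})\s*([^<;]*)")


def parse_segmentation(text, width, height):
    """Return a list of (box, seg_indices, label) found in the model output."""
    results = []
    for m in _PATTERN.finditer(text):
        locs = [int(v) for v in re.findall(r"<loc(\d{4})>", m.group(1))]
        segs = [int(v) for v in re.findall(r"<seg(\d{3})>", m.group(2))]
        y_min, x_min, y_max, x_max = locs
        # Assumption: loc values are normalized to 1024 bins per axis.
        box = (
            round(y_min / 1024 * height),
            round(x_min / 1024 * width),
            round(y_max / 1024 * height),
            round(x_max / 1024 * width),
        )
        results.append((box, segs, m.group(3).strip()))
    return results


def paste_mask(mask64, box, width, height):
    """Stretch a small square mask over the box (nearest neighbour)."""
    y_min, x_min, y_max, x_max = box
    full = [[0.0] * width for _ in range(height)]
    box_h, box_w = y_max - y_min, x_max - x_min
    if box_h <= 0 or box_w <= 0:
        return full  # degenerate box: nothing to paste
    n = len(mask64)
    for y in range(max(y_min, 0), min(y_max, height)):
        src_y = (y - y_min) * n // box_h
        for x in range(max(x_min, 0), min(x_max, width)):
            src_x = (x - x_min) * n // box_w
            full[y][x] = mask64[src_y][src_x]
    return full
```

In use, the sixteen indices from `parse_segmentation` would go through the VAE decoder from the Space, and its output would be passed to `paste_mask` together with the box to get a full-image mask.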

merve changed discussion status to closed
