
Why is the image size designed to be 384, whereas the patch size is designed to be 14, when 384 is not divisible by 14?

#4 opened by zhongyi1997cn
import torch
from torch import nn

from transformers import SiglipVisionConfig


class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # Non-overlapping patches: kernel_size == stride == patch_size, and
        # padding="valid" means no padding at all, so any remainder pixels on
        # the right/bottom edge are simply dropped.
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            padding="valid",
        )

        # floor(image_size / patch_size) patches per side, e.g. 384 // 14 = 27.
        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, embed_dim, grid, grid]
        embeddings = patch_embeds.flatten(2).transpose(1, 2)  # shape = [*, grid*grid, embed_dim]

        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings

And according to the SigLIP code in transformers above, there is no padding in the convolution. Since 384 = 27 * 14 + 6, doesn't this mean that a 6-pixel-wide strip along both the right and bottom edges of the image is never used?
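For concreteness, here is a minimal standalone sketch of that behaviour (the out_channels value and variable names below are arbitrary, not taken from the transformers code): a kernel-14, stride-14 convolution with no padding on a 384x384 input yields a 27x27 grid, so the last 6 pixels on the right and bottom never enter any patch.

import torch
from torch import nn

# Illustrative stand-in for the patch embedding above: kernel 14, stride 14, no padding.
patchify = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=14, stride=14, padding="valid")

pixel_values = torch.randn(1, 3, 384, 384)  # dummy 384x384 RGB image
patch_embeds = patchify(pixel_values)

print(patch_embeds.shape)  # torch.Size([1, 8, 27, 27]) -> a 27x27 patch grid
print(27 * 14)             # 378 -> only the top-left 378x378 pixels are covered
print(384 - 27 * 14)       # 6   -> a 6px strip on the right/bottom is never used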

Have you figured out the reason?

Google org

Hi, this is just an oversight. The correct resolution would have been 378px or maybe 336px. But we were so used to the number 384 from past work that we mistakenly just defaulted to that :)

At the end of the day, using 384 with /14 instead of 378 "loses" 6px on the right/bottom border, so it very likely has no practical impact; it is certainly not worth re-training for.
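As a quick numeric check of this (a throwaway snippet, not model code), the coverage for each of the resolutions mentioned, with a /14 patch, works out as follows:

patch = 14
for size in (384, 378, 336):
    grid = size // patch  # patches per side
    used = grid * patch   # pixels per side actually covered
    print(f"{size}px: {grid}x{grid} grid, {used}px used, {size - used}px lost per side")

# 384px: 27x27 grid, 378px used, 6px lost per side
# 378px: 27x27 grid, 378px used, 0px lost per side
# 336px: 24x24 grid, 336px used, 0px lost per side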

How about re-training it with a /16 patch? At 384px that gives the same sequence length as 336px with a /14 patch.
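The sequence-length claim checks out; again, just a throwaway arithmetic check:

print((384 // 16) ** 2)  # 576 tokens with a /16 patch at 384px
print((336 // 14) ** 2)  # 576 tokens with a /14 patch at 336px
# Both are a 24x24 grid, so 576 patch tokens either way.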

@giffmana, if the patch size is 14 and there are 27 * 27 = 729 patches in total, that would mean 729 patch tokens. Adding 1 for the class token would make it 730 tokens, but the model returns only 729 tokens, not 730. Could you please clarify this?

Google org

@btjhjeon right lol, it's pretty expensive and absolutely not worth it for this little detail.
@alifaraz there is no class token; the pooled representation comes from a MAP head (attention pooling) instead.
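For anyone who wants to verify this, here is a small sketch; it assumes the google/siglip-so400m-patch14-384 checkpoint (which appears to be the model this discussion is attached to) and only inspects output shapes: the vision tower returns 729 patch tokens with no class token, plus a pooled vector produced by the attention-pooling (MAP) head.

import torch
from transformers import SiglipVisionModel

# Assumed checkpoint for this thread; this will download the weights.
model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

with torch.no_grad():
    # Dummy 384x384 RGB input, just to inspect shapes (no preprocessing needed for that).
    outputs = model(pixel_values=torch.randn(1, 3, 384, 384))

print(outputs.last_hidden_state.shape)  # torch.Size([1, 729, 1152]) -> 27*27 patch tokens, no CLS
print(outputs.pooler_output.shape)      # torch.Size([1, 1152])      -> from the MAP (attention pooling) head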
