
Why is the image size designed to be 384, whereas the patch size is designed to be 14, when 384 is not divisible by 14?

#4 opened by zhongyi1997cn
import torch
from torch import nn

from transformers import SiglipVisionConfig


class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: SiglipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        # Non-overlapping patches: kernel_size == stride == patch_size, and
        # padding="valid" means no padding at all, so any remainder pixels on
        # the right/bottom edge are simply dropped.
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            padding="valid",
        )

        # floor(image_size / patch_size) patches per side, e.g. 384 // 14 = 27.
        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, embed_dim, grid, grid]
        embeddings = patch_embeds.flatten(2).transpose(1, 2)  # shape = [*, grid*grid, embed_dim]

        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings

And according to the SigLIP code in transformers above, there is no padding in the convolution. Since 384 = 27 * 14 + 6, doesn't this mean that a 6-pixel-wide strip along both the right and bottom edges of the image is never used?
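For concreteness, here is a minimal standalone sketch of that behaviour (the out_channels value and variable names below are arbitrary, not taken from the transformers code): a kernel-14, stride-14 convolution with no padding on a 384x384 input yields a 27x27 grid, so the last 6 pixels on the right and bottom never enter any patch.

import torch
from torch import nn

# Illustrative stand-in for the patch embedding above: kernel 14, stride 14, no padding.
patchify = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=14, stride=14, padding="valid")

pixel_values = torch.randn(1, 3, 384, 384)  # dummy 384x384 RGB image
patch_embeds = patchify(pixel_values)

print(patch_embeds.shape)  # torch.Size([1, 8, 27, 27]) -> a 27x27 patch grid
print(27 * 14)             # 378 -> only the top-left 378x378 pixels are covered
print(384 - 27 * 14)       # 6   -> a 6px strip on the right/bottom is never used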

Have you figured out the reason?

Google org

Hi, this is just an oversight. The correct resolution would have been 378px or maybe 336px. But we were so used to the number 384 from past work that we mistakenly just defaulted to that :)

At the end of the day, using 384 with /14 instead of 378 "loses" 6px on the right/bottom border, so it very likely has no practical impact; it is certainly not worth re-training for.
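As a quick numeric check of this (a throwaway snippet, not model code), the coverage for each of the resolutions mentioned, with a /14 patch, works out as follows:

patch = 14
for size in (384, 378, 336):
    grid = size // patch  # patches per side
    used = grid * patch   # pixels per side actually covered
    print(f"{size}px: {grid}x{grid} grid, {used}px used, {size - used}px lost per side")

# 384px: 27x27 grid, 378px used, 6px lost per side
# 378px: 27x27 grid, 378px used, 0px lost per side
# 336px: 24x24 grid, 336px used, 0px lost per side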

How about re-training it with a /16 patch? At 384px that gives the same sequence length as 336px with a /14 patch.
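The sequence-length claim checks out; again, just a throwaway arithmetic check:

print((384 // 16) ** 2)  # 576 tokens with a /16 patch at 384px
print((336 // 14) ** 2)  # 576 tokens with a /14 patch at 336px
# Both are a 24x24 grid, so 576 patch tokens either way.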

@giffmana, if the patch size is 14 and there are 27 * 27 = 729 patches in total, that would mean 729 patch tokens. Adding 1 for the class token would make it 730 tokens, but the model returns only 729 tokens, not 730. Could you please clarify this?

Google org

@btjhjeon right lol, it's pretty expensive and absolutely not worth it for this little detail.
@alifaraz there is no class token; the pooled representation comes from a MAP head (attention pooling) instead.
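For anyone who wants to verify this, here is a small sketch; it assumes the google/siglip-so400m-patch14-384 checkpoint (which appears to be the model this discussion is attached to) and only inspects output shapes: the vision tower returns 729 patch tokens with no class token, plus a pooled vector produced by the attention-pooling (MAP) head.

import torch
from transformers import SiglipVisionModel

# Assumed checkpoint for this thread; this will download the weights.
model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

with torch.no_grad():
    # Dummy 384x384 RGB input, just to inspect shapes (no preprocessing needed for that).
    outputs = model(pixel_values=torch.randn(1, 3, 384, 384))

print(outputs.last_hidden_state.shape)  # torch.Size([1, 729, 1152]) -> 27*27 patch tokens, no CLS
print(outputs.pooler_output.shape)      # torch.Size([1, 1152])      -> from the MAP (attention pooling) head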
