Why is the image size designed to be 384, whereas the patch size is designed to be 14, when 384 is not divisible by 14?
#4 · opened by zhongyi1997cn
```python
class SigLipVisionEmbeddings(nn.Module):
    def __init__(self, config: SigLipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            padding="valid",
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
        embeddings = patch_embeds.flatten(2).transpose(1, 2)
        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings
```
And according to the relevant SigLIP code in transformers, there is no padding in the convolution. Doesn't that mean the information in a 6-pixel-wide strip along both the right and bottom edges is never used?
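A quick numerical check of this (a standalone sketch; the channel counts below are placeholders, not the actual SigLIP config):

```python
import torch
import torch.nn as nn

# Same patch-embedding setup as above: 14x14 kernel, stride 14, no padding.
# out_channels=768 is an arbitrary placeholder for config.hidden_size.
patch_embedding = nn.Conv2d(
    in_channels=3, out_channels=768, kernel_size=14, stride=14, padding="valid"
)

pixel_values = torch.randn(1, 3, 384, 384)
patch_embeds = patch_embedding(pixel_values)

print(patch_embeds.shape)  # torch.Size([1, 768, 27, 27]) -> 27 * 27 = 729 patches
print(27 * 14)             # 378, so 6 pixels at the right/bottom edge are never read
```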
Have you figured out the reason?
Hi, this was just an oversight. The correct resolution would have been 378px, or maybe 336px, but we were so used to the number 384 from past work that we defaulted to it by mistake :)
At the end of the day, using 384 with /14 instead of 378 "loses" 6px along the right and bottom borders, so it very likely has no practical impact, and certainly not one worth re-training for.
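That matches what the convolution actually does: with a 14x14 kernel and stride 14, the last window ends at pixel 377, so the trailing 6 rows/columns never enter the computation at all. A minimal sanity check (a sketch with placeholder channel counts):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 768, kernel_size=14, stride=14, padding="valid")

x = torch.randn(1, 3, 384, 384)
x_zeroed = x.clone()
x_zeroed[:, :, 378:, :] = 0.0  # zero the bottom 6 pixel rows
x_zeroed[:, :, :, 378:] = 0.0  # zero the right 6 pixel columns

# True: the zeroed pixels lie outside every 14x14 window, so the output is unchanged.
print(torch.allclose(conv(x), conv(x_zeroed)))
```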
How about re-training it with a /16 patch? That would give the same sequence length as 336px with a /14 patch.
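For reference, the sequence-length arithmetic for the options mentioned in this thread (a quick sketch):

```python
for image_size, patch_size in [(384, 14), (378, 14), (336, 14), (384, 16)]:
    grid = image_size // patch_size
    print(f"{image_size}px @ /{patch_size}: {grid}x{grid} = {grid ** 2} tokens, "
          f"{image_size - grid * patch_size}px dropped per dimension")

# 384px @ /14: 27x27 = 729 tokens, 6px dropped per dimension
# 378px @ /14: 27x27 = 729 tokens, 0px dropped per dimension
# 336px @ /14: 24x24 = 576 tokens, 0px dropped per dimension
# 384px @ /16: 24x24 = 576 tokens, 0px dropped per dimension
```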