Embedding length is not 1, but equal to bottleneck scale?

#1
by ibestvina - opened

I might be missing something very obvious here, but it seems that the length of the embedding is always equal to bottleneck_scale, instead of 1. I guess it doesn't make any difference, since all vectors are scaled equally?

Ah, good observation! That's technically true at the model architecture level, but for the model weights I looked at, bottleneck_scale was 1 or very very close to 1. From that I made an educated guess that this particular parameter doesn't actually receive any nonzero gradients during training.
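If you want to check this yourself, something like the following should do it (a rough sketch; it assumes the notebook's wrapper object with its `embed` method, and that `bottleneck_scale` is registered as a named parameter of the underlying model):

```python
import torch

# Sketch: compare a latent's L2 norm against bottleneck_scale.
# `wrapper` is assumed to be the notebook's model wrapper exposing `embed`.
latent = wrapper.embed(["hello world"])   # shape (1, d_model)
print(latent.norm(p=2, dim=-1))           # equals bottleneck_scale, not exactly 1.0

# Look up the scale itself, assuming it shows up among the named parameters.
for name, param in wrapper.model.named_parameters():
    if "bottleneck_scale" in name:
        print(name, param.data)           # ~1.0 for the weights I looked at
```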

In practice when I use the model at inference time I always normalize vectors to 1 for both the encoder output and decoder input, and haven't had any issues.
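Concretely, that renormalization is just one line, e.g. (a minimal sketch, assuming latents of shape (N, d_model), using `F.normalize` the same way the bottleneck layer does):

```python
import torch.nn.functional as F

# Renormalize latents to unit L2 norm, both when taking them from the encoder
# and before handing them back to the decoder.
latent = F.normalize(latent, p=2, dim=-1)   # shape (N, d_model), unit length
```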

This is great work! Thank you for putting this out. I am a big fan of many of your ideas.

I am running into a problem when modifying the code from the notebook you laid out:
https://colab.research.google.com/drive/1CF5Lr1bxoAFC_IPX5I0azu4X8UDz_zp-?usp=sharing#scrollTo=buKgoKwXFSyv

The crux of the issue is that I am trying to modify your generate_latent function to support batch processing, but am running into the problem that the perturb vector does not seem to apply when the batch size is greater than 1.

I would expect the following to work. My function looks like this:
```python
@torch.no_grad()
def generate_from_latent(self, latents: torch.FloatTensor, max_length=512, temperature=1.0) -> List[str]:
    '''
    Args:
        latents: a tensor of shape (N, D_model), where N is the number of texts (batch_size)
            and D_model is the dimension of the model
    Returns:
        List[str]: a list of strings of text generated from the latents
    '''
    print("latents shape", latents.shape)
    batch_size = latents.shape[0]  # N
    dummy_text = '.'
    dummy_emb = self.embed([dummy_text])  # This should be of shape [1, D_model]
    dummy_emb_expanded = dummy_emb.expand(batch_size, -1)  # shape [batch_size, D_model]
    perturb_vector = latents - dummy_emb_expanded  # shape [batch_size, D_model]
    self.model.perturb_vector = perturb_vector

    # Generate input_ids for the dummy text, repeated for each item in the batch
    input_ids = self.tokenizer([dummy_text] * batch_size, return_tensors='pt', padding=True).to(self.device).input_ids
    # Generate text from the model
    output = self.model.generate(
        input_ids=input_ids,
        max_length=max_length,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        num_return_sequences=batch_size,
    )
    return self.tokenizer.batch_decode(output, skip_special_tokens=True)
```

But I get an error that the last hidden state has a different shape than the perturb vector:

hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])

/bottleneck_t5.py", line 396, in forward
hidden_states = self.bottleneck_scale * F.normalize(hidden_states + self.perturb_vector, p=2, dim=2)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1

It works fine for batch_size 1, where we get the shapes:
hidden_states.shape torch.Size([1, 3, 512])
perturb vector shape torch.Size([1, 512])

For batch size 2 we get the shapes:
hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])
These are not compatible.

For batch size n we get the shapes:

hidden_states.shape torch.Size([n^2, 3, 512])
perturb vector shape torch.Size([n, 512])

So whether you can apply the perturbation vector does not seem to be invariant to the batch size. I am not too familiar with the model architecture, but I have tried a few things.

My hacky solution looks like this:

```python
self.model.perturb_vector = perturb_vector.unsqueeze(1).repeat(1, batch_size, input_ids.shape[1], 1).squeeze(0)  # shape (batch_size**2, seq_len, d_model)
```

However, you then get a list of strings that is batch_size times too long (batch_size**2 outputs instead of batch_size), which is not ideal. It appears to work, but you are wasting most of the batch this way.
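If the n^2 pattern above really comes from num_return_sequences expanding the effective batch, a less wasteful variant might be to keep num_return_sequences at 1 and give the perturb vector an explicit sequence dimension so it can broadcast instead of being tiled. An untested sketch:

```python
# Untested sketch: one generated string per latent.
# A (batch_size, 1, d_model) perturb vector broadcasts against hidden_states
# of shape (batch_size, seq_len, d_model) inside the bottleneck's normalize step.
self.model.perturb_vector = perturb_vector.unsqueeze(1)

output = self.model.generate(
    input_ids=input_ids,            # (batch_size, seq_len) for the dummy text
    max_length=max_length,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    num_return_sequences=1,         # avoid the batch_size**2 expansion
)
return self.tokenizer.batch_decode(output, skip_special_tokens=True)
```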
