Embedding length is not 1, but equal to bottleneck scale?

#1
by ibestvina - opened

I might be missing something very obvious here, but it seems that the length of the embedding is always equal to bottleneck_scale, instead of 1. I guess it doesn't make any difference, since all vectors are scaled equally?

Ah, good observation! That's technically true at the model architecture level, but for the model weights I looked at, bottleneck_scale was 1 or very very close to 1. From that I made an educated guess that this particular parameter doesn't actually receive any nonzero gradients during training.
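If you want to check this yourself, something like the following should do it (a rough sketch; it assumes the notebook's wrapper object with its `embed` method, and that `bottleneck_scale` is registered as a named parameter of the underlying model):

```python
import torch

# Sketch: compare a latent's L2 norm against bottleneck_scale.
# `wrapper` is assumed to be the notebook's model wrapper exposing `embed`.
latent = wrapper.embed(["hello world"])   # shape (1, d_model)
print(latent.norm(p=2, dim=-1))           # equals bottleneck_scale, not exactly 1.0

# Look up the scale itself, assuming it shows up among the named parameters.
for name, param in wrapper.model.named_parameters():
    if "bottleneck_scale" in name:
        print(name, param.data)           # ~1.0 for the weights I looked at
```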

In practice when I use the model at inference time I always normalize vectors to 1 for both the encoder output and decoder input, and haven't had any issues.
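Concretely, that renormalization is just one line, e.g. (a minimal sketch, assuming latents of shape (N, d_model), using `F.normalize` the same way the bottleneck layer does):

```python
import torch.nn.functional as F

# Renormalize latents to unit L2 norm, both when taking them from the encoder
# and before handing them back to the decoder.
latent = F.normalize(latent, p=2, dim=-1)   # shape (N, d_model), unit length
```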

This is great work! Thank you for putting this out. I am a big fan of many of your ideas.

I am running into a problem when modifying the code from the notebook you laid out:
https://colab.research.google.com/drive/1CF5Lr1bxoAFC_IPX5I0azu4X8UDz_zp-?usp=sharing#scrollTo=buKgoKwXFSyv

The crux of the issue is that I am trying to modify your generate_latent function to support batch processing, but am running into the problem that the perturb vector does not seem to apply when the batch size is greater than 1.

I would expect the following to work. My function looks like this:
```python
@torch.no_grad()
def generate_from_latent(self, latents: torch.FloatTensor, max_length=512, temperature=1.0) -> List[str]:
    '''
    Args:
        latents: a tensor of shape (N, D_model), where N is the number of texts (batch_size)
            and D_model is the dimension of the model
    Returns:
        List[str]: a list of strings of text generated from the latents
    '''
    print("latents shape", latents.shape)
    batch_size = latents.shape[0]  # N
    dummy_text = '.'
    dummy_emb = self.embed([dummy_text])  # This should be of shape [1, D_model]
    dummy_emb_expanded = dummy_emb.expand(batch_size, -1)  # shape [batch_size, D_model]
    perturb_vector = latents - dummy_emb_expanded  # shape [batch_size, D_model]
    self.model.perturb_vector = perturb_vector

    # Generate input_ids for the dummy text, repeated for each item in the batch
    input_ids = self.tokenizer([dummy_text] * batch_size, return_tensors='pt', padding=True).to(self.device).input_ids
    # Generate text from the model
    output = self.model.generate(
        input_ids=input_ids,
        max_length=max_length,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        num_return_sequences=batch_size,
    )
    return self.tokenizer.batch_decode(output, skip_special_tokens=True)
```

But I get an error that the last hidden state has a different shape than the perturb vector:

hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])

/bottleneck_t5.py", line 396, in forward
hidden_states = self.bottleneck_scale * F.normalize(hidden_states + self.perturb_vector, p=2, dim=2)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1

It works fine for batch_size 1, where we get the shapes:
hidden_states.shape torch.Size([1, 3, 512])
perturb vector shape torch.Size([1, 512])

For batch size 2 we get the shapes:
hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])
These are not compatible.

For batch size n we get the shapes:

hidden_states.shape torch.Size([n^2, 3, 512])
perturb vector shape torch.Size([n, 512])

So whether you can apply the perturbation vector does not seem to be invariant to the batch size. I am not too familiar with the model architecture, but I have tried a few things.

My hacky solution looks like this:

```python
self.model.perturb_vector = perturb_vector.unsqueeze(1).repeat(1, batch_size, input_ids.shape[1], 1).squeeze(0)  # shape (batch_size**2, seq_len, d_model)
```

However, you then get a list of strings that is batch_size times too long (batch_size**2 outputs instead of batch_size), which is not ideal. It appears to work, but you are wasting most of the batch this way.
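If the n^2 pattern above really comes from num_return_sequences expanding the effective batch, a less wasteful variant might be to keep num_return_sequences at 1 and give the perturb vector an explicit sequence dimension so it can broadcast instead of being tiled. An untested sketch:

```python
# Untested sketch: one generated string per latent.
# A (batch_size, 1, d_model) perturb vector broadcasts against hidden_states
# of shape (batch_size, seq_len, d_model) inside the bottleneck's normalize step.
self.model.perturb_vector = perturb_vector.unsqueeze(1)

output = self.model.generate(
    input_ids=input_ids,            # (batch_size, seq_len) for the dummy text
    max_length=max_length,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    num_return_sequences=1,         # avoid the batch_size**2 expansion
)
return self.tokenizer.batch_decode(output, skip_special_tokens=True)
```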
