Embedding length is not 1, but equal to bottleneck scale?
I might be missing something very obvious here, but it seems that the length of the embedding is always equal to bottleneck_scale, instead of 1. I guess it doesn't make any difference, since all vectors are scaled equally?
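For reference, this is roughly how I noticed it (a sketch using the embed helper from the notebook; "autoencoder" stands in for whatever wrapper object the notebook builds):

emb = autoencoder.embed(['hello world'])   # shape [1, d_model]
print(emb.norm(p=2, dim=-1))               # prints bottleneck_scale rather than 1.0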
Ah, good observation! That's technically true at the model architecture level, but for the model weights I looked at, bottleneck_scale was 1 or very, very close to 1. From that I made an educated guess that this particular parameter doesn't actually receive any nonzero gradients during training.
In practice when I use the model at inference time I always normalize vectors to 1 for both the encoder output and decoder input, and haven't had any issues.
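Concretely, something like this (a rough sketch using the embed / generate_from_latent helpers from the notebook, not a drop-in snippet):

import torch.nn.functional as F

# Encode, then renormalize the latent to unit length; do the same to any
# latent you are about to hand back to the decoder.
latent = autoencoder.embed(['some text'])
latent = F.normalize(latent, p=2, dim=-1)
text_out = autoencoder.generate_from_latent(latent)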
This is great work! Thank you for putting this out. I am a big fan of many of your ideas.
I am running into a problem when modifying the code from the notebook you laid out:
https://colab.research.google.com/drive/1CF5Lr1bxoAFC_IPX5I0azu4X8UDz_zp-?usp=sharing#scrollTo=buKgoKwXFSyv
The crux of the issue is that I am trying to modify your generate_from_latent function to support batch processing, but I am running into a shape mismatch between the perturbation vector and the model's hidden states. I would expect this to work; my function looks like this:
@torch.no_grad()
def generate_from_latent(self, latents: torch.FloatTensor, max_length=512, temperature=1.0) -> List[str]:
    '''
    Args:
        latents: a tensor of shape (N, D_model), where N is the number of texts (batch size) and D_model is the dimension of the model
    Returns:
        List[str]: a list of strings of text generated from the latents
    '''
    print("latents shape", latents.shape)
    batch_size = latents.shape[0]  # N
    dummy_text = '.'
    dummy_emb = self.embed([dummy_text])  # this should be of shape [1, D_model]
    dummy_emb_expanded = dummy_emb.expand(batch_size, -1)  # shape [batch_size, D_model]
    perturb_vector = latents - dummy_emb_expanded  # shape [batch_size, D_model]
    self.model.perturb_vector = perturb_vector
    # Generate input_ids for the dummy text, repeated for each item in the batch
    input_ids = self.tokenizer([dummy_text] * batch_size, return_tensors='pt', padding=True).to(self.device).input_ids
    # Generate text from the model
    output = self.model.generate(
        input_ids=input_ids,
        max_length=max_length,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        num_return_sequences=batch_size,
    )
    return self.tokenizer.batch_decode(output, skip_special_tokens=True)
But I get an error that the last hidden state has a different shape than the perturb vector:
hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])
/bottleneck_t5.py", line 396, in forward
hidden_states = self.bottleneck_scale * F.normalize(hidden_states + self.perturb_vector, p=2, dim=2)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1
It works fine for batch size 1, where we get the shapes:
hidden_states.shape torch.Size([1, 3, 512])
perturb vector shape torch.Size([1, 512])
For batch size 2 we get the shapes:
hidden_states.shape torch.Size([4, 3, 512])
perturb vector shape torch.Size([2, 512])
These are not compatible.
And for batch size n we get the shapes:
hidden_states.shape torch.Size([n^2, 3, 512])
perturb vector shape torch.Size([n, 512])
So whether the perturbation vector can be applied does not seem to be invariant to the batch size. I am not too familiar with the model architecture, but I have tried a few things.
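I think the batch blow-up comes from passing num_return_sequences=batch_size on top of an input that already has batch_size rows: as far as I understand, generate() repeat-interleaves each input num_return_sequences times, so the model sees n * n sequences while perturb_vector still has n rows. A standalone toy reproduction of just the shape mismatch (plain tensors, no model):

import torch

n = 2                                          # batch size
hidden_states = torch.randn(n * n, 3, 512)     # what the model actually sees
perturb_vector = torch.randn(n, 512)           # what my function assigned
# hidden_states + perturb_vector  ->  RuntimeError at dim 1 (3 vs 2)
# Broadcasting would need a [n * n, 1, 512] perturbation, e.g.:
expanded = perturb_vector.repeat_interleave(n, dim=0).unsqueeze(1)
print((hidden_states + expanded).shape)        # torch.Size([4, 3, 512])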
My hacky solution looks like this:
self.model.perturb_vector = perturb_vector.unsqueeze(1).repeat(1, batch_size, input_ids.shape[1], 1).squeeze(0)  # shape (batch_size**2, seq_len, d_model)
However, you then get a list of strings that is batch_size times longer than it should be (2x batch_size in my test with batch size 2), which is not ideal. It appears to be working, but you are wasting a large fraction of your batch this way.
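A cleaner fix, I think, is to not ask generate() for extra sequences at all and instead give the perturbation a sequence axis so it broadcasts. Sketch below (untested against the actual notebook, so treat it as an assumption about how perturb_vector is consumed):

# Untested sketch: keep the batch at size batch_size and let broadcasting
# handle the sequence dimension, instead of blowing the batch up to batch_size**2.
self.model.perturb_vector = perturb_vector.unsqueeze(1)  # [batch_size, 1, D_model]
output = self.model.generate(
    input_ids=input_ids,            # already [batch_size, seq_len]
    max_length=max_length,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    # num_return_sequences left at its default of 1, so we get exactly
    # batch_size sequences back, one per latent
)
return self.tokenizer.batch_decode(output, skip_special_tokens=True)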