Embeddings for a function

#7
by mongoose54 - opened

I would like to get the embeddings of a given Python function. I am following the example here:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)

I noticed that the logits above have dimension (1, 5, 32768), so the size depends on the number of tokens. How do I get a fixed embedding length (e.g. 1024) so I can easily compare functions? Thanks!

Hi, one option to get an embedding that is independent of the input length is to average the embeddings of the tokens. Also, you might want to take the last hidden state instead of the model output (the logits give probabilities over the vocabulary, hence the large size of your vector). You can get it with this code:

import torch

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
# average the last hidden state over the token dimension to get one fixed-size vector
embedding = torch.mean(outputs.hidden_states[-1], dim=1)
embedding.shape
torch.Size([1, 768])
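
To compare two functions with these fixed-size vectors, one option is cosine similarity. A minimal sketch, assuming the model and tokenizer loaded above (the helper name and example snippets are just placeholders):

import torch
import torch.nn.functional as F

def embed(code):
    # mean-pool the last hidden state over the token dimension
    inputs = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return torch.mean(outputs.hidden_states[-1], dim=1)

a = embed("def add(x, y):\n    return x + y")
b = embed("def sum_two(a, b):\n    return a + b")
print(F.cosine_similarity(a, b))  # closer to 1 means more similar embeddings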

Otherwise, if you just want to make your inputs the same size, you can use truncation and padding in the tokenizer:

tokenizer("def hello_world():", return_tensors="pt", padding="max_length", max_length=1024)

Note that we also have the forums for general questions about transformers.

loubnabnl changed discussion status to closed
