Embeddings for a function

#7
by mongoose54 - opened

I would like to get the embeddings of a given Python function. I am following the example here:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)

I noticed that the logits above have dimension (1, 5, 32768), so the size depends on the number of tokens. How do I get a fixed embedding length (e.g. 1024) so I can easily compare functions? Thanks!

Hi, one option to get an embedding that is independent of the input length is to average the embeddings of the tokens. Also, you might want to take the last hidden state instead of the model output (the logits give probabilities over the vocabulary, hence the large size of your vector). You can get it with this code:

import torch

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
# average the last hidden state over the token dimension to get one fixed-size vector
embedding = torch.mean(outputs.hidden_states[-1], dim=1)
embedding.shape
torch.Size([1, 768])
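
To compare two functions with these fixed-size vectors, one option is cosine similarity. A minimal sketch, assuming the model and tokenizer loaded above (the helper name and example snippets are just placeholders):

import torch
import torch.nn.functional as F

def embed(code):
    # mean-pool the last hidden state over the token dimension
    inputs = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return torch.mean(outputs.hidden_states[-1], dim=1)

a = embed("def add(x, y):\n    return x + y")
b = embed("def sum_two(a, b):\n    return a + b")
print(F.cosine_similarity(a, b))  # closer to 1 means more similar embeddings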

Otherwise, if you just want to make your inputs the same size, you can use truncation and padding in the tokenizer:

tokenizer("def hello_world():", return_tensors="pt", padding="max_length", max_length=1024)

Note that we also have the forums for general questions about transformers.

loubnabnl changed discussion status to closed
