Getting code snippet embeddings?

#3
by guyjacoby

Since this is a BERT-style model, shouldn't there be a [CLS] token that I can use to get the embedding of a code snippet I input to the model?
The accompanying tokenizer doesn't have a CLS token in the vocabulary.
If I am misunderstanding something, is there a different way to get a "sentence level" embedding of code (e.g., python code) using this model?

Thanks

Yes, you can get a sentence-level embedding either from [CLS], from the [SEP] at the end of the sentence, or even by averaging the representations at the last layer. In preliminary evaluations, we found that the output at [SEP] worked quite well. So, inputs to be embedded should be formatted as: f"{CLS_TOKEN} {Sentence} {SEP_TOKEN}".

Indeed, the uploaded tokenizer doesn't have the special tokens, so you need to add them manually. This notebook can be useful: https://github.com/bigcode-project/bigcode-encoder/blob/master/embedding_sandbox.ipynb
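
As a rough sketch of that setup (the special-token strings and the bigcode/starencoder checkpoint id here are what I believe the notebook uses; please double-check against the notebook rather than treating this as the exact code):

import torch
from transformers import AutoModel, AutoTokenizer

# Special-token strings assumed to match the notebook; verify there if in doubt.
CLS_TOKEN, SEP_TOKEN, PAD_TOKEN, MASK_TOKEN = "<cls>", "<sep>", "<pad>", "<mask>"

tokenizer = AutoTokenizer.from_pretrained("bigcode/starencoder")
tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
tokenizer.add_special_tokens({"sep_token": SEP_TOKEN})
tokenizer.add_special_tokens({"cls_token": CLS_TOKEN})
tokenizer.add_special_tokens({"mask_token": MASK_TOKEN})

model = AutoModel.from_pretrained("bigcode/starencoder")
model.eval()

code = "def my_sum(a, b): return a+b"
# Inputs are formatted as f"{CLS_TOKEN} {sentence} {SEP_TOKEN}".
inputs = tokenizer(
    f"{CLS_TOKEN} {code} {SEP_TOKEN}",
    return_tensors="pt",
    add_special_tokens=False,
)

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Sentence-level embedding taken at the final [SEP] position.
sep_embedding = hidden_states[0, -1]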

Thank you for the help!

Do you have a recommendation regarding how to deal with embedding longer-than-context documents?

I've read about a few approaches, from simple chunking (just slicing at every context length) to "semantic" chunking (splitting on function/class definitions, etc.). And then either keeping multiple embeddings for a single document (splitting one document into several) or aggregating the chunk embeddings (e.g., mean, sum, etc.) to get a single embedding...

Any thoughts?

I haven't tried it myself, so I'm not sure it would be the best approach, but the first thing I'd personally try is a sliding window with overlap (e.g., a 512-token window with a 256-token step). Mean pooling on top of the chunk embeddings would then yield a single embedding for the entire document, if that's needed.
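
Something along these lines, as a sketch (embed_chunk is a hypothetical helper that maps one chunk of token ids to a single embedding, e.g. the [SEP] output for that chunk):

import torch

def embed_long_document(token_ids, embed_chunk, window=512, step=256):
    # Slide a fixed-size window over the token ids with 50% overlap.
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break

    # embed_chunk: hypothetical helper, list of token ids -> 1-D embedding tensor.
    chunk_embeddings = torch.stack([embed_chunk(chunk) for chunk in chunks])

    # Mean pooling over the chunks gives one embedding for the whole document.
    return chunk_embeddings.mean(dim=0)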

Thanks for replying, and great work! :)

I am giving embedding_sandbox.ipynb a try and running into some peculiarities when using the embeddings for code-to-code comparison.

With the following example sentences from the notebook:

input_sentences = [
    "Hello world!!",
    "def my_sum(a, b): return a+b"
]

I am getting the following cosine similarities between these sentences:

StarEncoder - 0.9923
CodeBert - 0.9893
Unixcoder - 0.2749 (I added this one using the CodeBert embedder class).

The Unixcoder result is what I expected as the sentences are not similar, but both StarEncoder and CodeBert have an unexpectedly high similarity. I tried different pooling strategies without much change.

I was wondering whether StarEncoder embeddings are simply not intended for differentiating similar vs. dissimilar code, or whether it looks like I have an issue in my setup (e.g., using the wrong checkpoint, etc.).

Hello,

Please note that different similarity scoring rules have different ranges, and they will not necessarily spread across the entire [-1, 1] interval. For retrieval applications, only the ranking of the scores measured between a query and a set of candidates is useful; the individual score values are not that informative and are not directly comparable across scorers. To illustrate that, I added a plot with 95% confidence intervals of the similarity scores for all queries against all candidates in the retrieval task in https://github.com/bigcode-project/bigcode-encoder/blob/master/c2c_search_eval.ipynb. As you can see below, the effective range of the scores is rather small. If scores need to have a given range or lie in a specific interval, they have to be post-processed (e.g., final_score = (score - min_score) / (max_score - min_score)).

[Plot: 95% confidence intervals of similarity scores for all queries against all candidates, per scorer]

Also, I just noticed that embeddings from CodeBERT were not being normalized and we were scoring that model with inner products rather than cosine similarities. That's fixed now.
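
For reference, what I mean by scoring with cosine similarities of normalized embeddings, plus the optional rescaling mentioned above, is roughly this (a sketch, not the exact notebook code):

import torch
import torch.nn.functional as F

def cosine_scores(query_embs, candidate_embs):
    # L2-normalize so that the inner product equals the cosine similarity.
    q = F.normalize(query_embs, p=2, dim=-1)
    c = F.normalize(candidate_embs, p=2, dim=-1)
    return q @ c.T  # (n_queries, n_candidates)

def minmax_rescale(scores):
    # Optional post-processing to map raw scores onto [0, 1].
    return (scores - scores.min()) / (scores.max() - scores.min())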

@jeffsvajlenko Would you mind sharing the code for Unixcoder? I am using Qdrant with bge-large-en and I am getting weirdly high similarity scores of >>0.85 for almost everything I compare.
