Getting code snippet embeddings?

#3
by guyjacoby - opened

Since this is a BERT-style model, shouldn't there be a CLS token that I can use to get the embedding of a snippet of code I input to the model?
The accompanying tokenizer doesn't have a CLS token in the vocabulary.
If I am misunderstanding something, is there a different way to get a "sentence-level" embedding of code (e.g., Python code) using this model?

Thanks

Yes, you can get a sentence-level embedding either from [CLS] or from the [SEP] at the end of the sentence, or even by averaging the representations at the last layer. In preliminary evaluations, we found that the output at [SEP] worked quite well. So, inputs to be embedded at test time should be formatted as: f"{CLS_TOKEN} {Sentence} {SEP_TOKEN}".

Indeed, the uploaded tokenizer doesn't have the special tokens, and you need to add them manually. This can be useful: https://github.com/bigcode-project/bigcode-encoder/blob/master/embedding_sandbox.ipynb
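A minimal sketch of that setup, written from memory rather than copied from the notebook. The checkpoint name and especially the special-token strings below are assumptions on my side, so double-check them against the linked notebook; if the strings are not tokens the model was actually trained with, the lookup will fail or give untrained embeddings.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed special-token strings; verify against embedding_sandbox.ipynb.
CLS_TOKEN, SEP_TOKEN, PAD_TOKEN = "<cls>", "<sep>", "<pad>"

tokenizer = AutoTokenizer.from_pretrained("bigcode/starencoder")
tokenizer.add_special_tokens(
    {"cls_token": CLS_TOKEN, "sep_token": SEP_TOKEN, "pad_token": PAD_TOKEN}
)
model = AutoModel.from_pretrained("bigcode/starencoder")

sentences = ["def my_sum(a, b): return a + b"]
inputs = tokenizer(
    [f"{CLS_TOKEN} {s} {SEP_TOKEN}" for s in sentences],
    padding="longest",
    truncation=True,
    max_length=1024,  # adjust to the model's actual context length
    return_tensors="pt",
)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Take the hidden state at the last non-padding position, i.e. the final [SEP] token.
last_idx = inputs["attention_mask"].sum(dim=1) - 1
sep_embedding = hidden[torch.arange(hidden.size(0)), last_idx]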

Thank you for the help!

Do you have a recommendation regarding how to deal with embedding longer-than-context documents?

I've read of a few approaches, from simple chunking (just slicing at every context length) to "semantic" chunking (splitting on function/class definitions, etc.). And then either keeping multiple embeddings for a single document (splitting one document into multiple documents), or aggregating the chunk embeddings (e.g., mean, sum, etc.) to get a single embedding...

Any thoughts?

I haven't tried it myself, so I'm not sure it would be the best approach, but the first thing I'd personally try would be a sliding window with overlap (e.g., a 512-token window with a 256-token step). Mean pooling on top of the chunk embeddings would then yield a single embedding for the entire document if that's needed.
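A rough, untested sketch of that idea. `tokenizer` and `embed_fn` are placeholders for whatever you already use (e.g., the embedder classes from the bigcode-encoder notebook), with `embed_fn` assumed to return a torch tensor of shape (num_chunks, dim).

def embed_long_document(document, tokenizer, embed_fn, window=512, step=256):
    # Tokenize once without special tokens, then slide an overlapping window.
    token_ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        chunks.append(tokenizer.decode(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break
    chunk_embeddings = embed_fn(chunks)   # one embedding per chunk
    return chunk_embeddings.mean(dim=0)   # mean pooling -> one embedding per document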

Thanks for replying, and great work! :)

I am giving the embedding_sandbox.ipynb a try and running into some peculiarities using the embeddings for Code-Code comparison.

With the following example sentences from the notebook:

input_sentences = [
    "Hello world!!",
    "def my_sum(a, b): return a+b"
]

I am getting the following cosine similarities between these two sentences:

StarEncoder - 0.9923
CodeBert - 0.9893
Unixcoder - 0.2749 (I added this one using the CodeBert embedder class).

The Unixcoder result is what I expected as the sentences are not similar, but both StarEncoder and CodeBert have an unexpectedly high similarity. I tried different pooling strategies without much change.

I was wondering whether the embeddings from StarEncoder are simply not intended for differentiating similar vs. dissimilar code, or whether it looks like I have an issue in my setup (e.g., using the wrong checkpoint, etc.).
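For reference, the comparison boils down to something like this sketch (not my exact code; `embedder` stands in for one of the embedder classes from the sandbox notebook):

import torch.nn.functional as F

emb = embedder(input_sentences)              # hypothetical call, shape (2, dim)
emb = F.normalize(emb, p=2, dim=-1)          # L2-normalize so the dot product is the cosine
similarity = (emb[0] * emb[1]).sum().item()  # cosine similarity between the two snippets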

Hello,

Please note that different similarity scoring rules have different ranges, and they will not necessarily spread across the entire [-1, 1] interval, for instance. For retrieval applications, only the ranking of the scores measured between a query and a set of candidates matters; the individual score values are not that informative and are not directly comparable across scorers. To illustrate that, I added a plot with 95% confidence intervals of the similarity scores for all queries against all candidates in the retrieval task in https://github.com/bigcode-project/bigcode-encoder/blob/master/c2c_search_eval.ipynb. See below that the effective range of the scores is rather small. If scores need to have a given range or lie in a specific interval, then they need to be post-processed (e.g., final_score = (score - min_score) / (max_score - min_score)).

[Plot: 95% confidence intervals of similarity scores, all queries vs. all candidates, per scorer]
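A toy sketch of that post-processing step (not code from the notebook):

import numpy as np

def min_max_rescale(scores):
    # Rescale raw similarity scores to [0, 1] so they are comparable across scorers.
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())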

Also, I just noticed that embeddings from CodeBERT were not being normalized and we were scoring that model with inner products rather than cosine similarities. That's fixed now.
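For clarity, the difference boils down to this (a toy illustration, not the evaluation code): L2-normalizing the embeddings first makes the inner product equal to the cosine similarity.

import torch
import torch.nn.functional as F

a, b = torch.randn(768), torch.randn(768)  # stand-ins for two raw embeddings
inner_product = torch.dot(a, b)            # what was effectively reported before the fix
cosine = torch.dot(F.normalize(a, dim=0), F.normalize(b, dim=0))  # what is reported now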

@jeffsvajlenko Would you mind sharing the code for the Unixcoder? I am using Qdrant with bge-large-en and I am getting weirdly high similarity scores of >>0.85 for almost everything I compare.


This code:

inputs = self.tokenizer(
    [
        f"{self.tokenizer.cls_token}{sentence}{self.tokenizer.sep_token}"
        for sentence in input_sentences
    ],
    padding="longest",
    max_length=self.maximum_token_len,
    truncation=True,
    return_tensors="pt",
)
When the input sequence is too long and needs to be truncated, won't the sep_token also be truncated? Could this have a negative impact?
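One possible way to guard against that (just an idea on my side, not something from the notebook) would be to truncate the sentence body on its own and append the special-token ids afterwards, then pad/batch the resulting id lists separately:

def encode_with_sep(sentence, tokenizer, max_length):
    # Truncate the body alone, leaving room for the two special tokens,
    # then add their ids explicitly so [CLS] and [SEP] are never cut off.
    body_ids = tokenizer(
        sentence, add_special_tokens=False, truncation=True, max_length=max_length - 2
    )["input_ids"]
    return [tokenizer.cls_token_id] + body_ids + [tokenizer.sep_token_id]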
