Wrong Tokenizer?

#1
by christopher - opened

Hello,

I was experimenting with the model when I noticed that the tokenizer vocab contains tokens that don't really make sense for a code model (e.g. full tokens for "embarrassed" and "fossils" and "Hermione"). Upon closer inspection, it seems the tokenizer is simply the roberta-base tokenizer.

In case this was by design, why was the tokenizer not trained on the training data?

Hi,

The title of the CodeBERT paper is "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", hence the model is trained on both code and natural language (not code-only).

This explains why the tokenizer has full words as well in its vocabulary.

@nielsr Thank you for your comment Niels. The natural language referred to in the title is the natural language found in code repositories, i.e. comments and documentation. Most of the tokens I saw were very out-of-domain.

The following returns True which probably means that it's just the roberta tokenizer, no?

from transformers import AutoTokenizer

tok_codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tok_roberta = AutoTokenizer.from_pretrained("roberta-base")
tok_codebert.vocab == tok_roberta.vocab
>>> True
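A quick spot-check along the same lines (a sketch, assuming transformers is installed and the words quoted above are the ones to look up) would print whether such non-code words appear as single entries in the vocab:

from transformers import AutoTokenizer

# Look up whether clearly non-code words exist as single entries in the
# CodeBERT vocab. "Ġ" is the marker RoBERTa's byte-level BPE uses for a
# token that begins with a leading space.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
for word in ["embarrassed", "fossils", "Hermione"]:
    print(word, word in tok.vocab or "Ġ" + word in tok.vocab)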

@cakiki You are right. We directly use roberta tokenizer in CodeBERT.

Thank you for your reply!

christopher changed discussion status to closed