Wrong Tokenizer?

#1
by christopher - opened

Hello,

I was experimenting with the model when I noticed that the tokenizer vocab contains tokens that don't really make sense for a code model (e.g. full tokens for "embarrassed", "fossils", and "Hermione"). Upon closer inspection, it seems the tokenizer is just the roberta-base tokenizer.

If this was by design, why wasn't the tokenizer trained on the training data?
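As an aside, a toy BPE training loop illustrates why a tokenizer trained on code data would pick up code-specific tokens instead of words like "embarrassed". This is a minimal sketch of the general BPE idea (repeatedly merge the most frequent adjacent symbol pair), not CodeBERT's or RoBERTa's actual training code, and the corpus is made up:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE training: start from characters and repeatedly merge
    the most frequent adjacent pair, recording each merged token."""
    words = [list(w) for line in corpus for w in line.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the merge everywhere
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# a (hypothetical) code-only corpus
code_corpus = ["def foo(self): return self.x",
               "def bar(self): return self.y"]
merges = train_bpe(code_corpus, 20)
print(merges)  # code-flavored tokens such as "def" and "self" emerge
```

Trained on code, the merged tokens are identifiers and keywords; trained on general web text (as roberta-base was), they are ordinary English words, which is exactly what shows up in the vocab.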

Hi,

The title of the CodeBERT paper is "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", hence the model is trained on both code and natural language (not code-only).

This explains why the tokenizer has full words as well in its vocabulary.

@nielsr Thank you for your comment Niels. The natural language referred to in the title is the natural language found in code repositories, i.e. comments and documentation. Most of the tokens I saw were very out-of-domain.

The following returns True which probably means that it's just the roberta tokenizer, no?

from transformers import AutoTokenizer

tok_codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tok_roberta = AutoTokenizer.from_pretrained("roberta-base")
tok_codebert.vocab == tok_roberta.vocab
>>> True

@cakiki You are right. We directly use the RoBERTa tokenizer in CodeBERT.

Thank you for your reply!

christopher changed discussion status to closed