Wrong Tokenizer?
Hello,
I was experimenting with the model when I noticed that the tokenizer vocab contains tokens that don't really make sense for a code model (e.g. full tokens for "embarrassed", "fossils", and "Hermione"). Upon closer inspection, it seems the tokenizer is just the roberta-base tokenizer.
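One quick way to check this (a sketch; assumes the transformers library is installed and the model files can be downloaded) is to look the words up in the vocab directly. RoBERTa's byte-level BPE marks a leading space with "Ġ", so a word that tokenizes as a single token mid-sentence appears as "Ġword" in the vocab:

```python
from transformers import AutoTokenizer

# Load the CodeBERT tokenizer and pull out its vocabulary mapping.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
vocab = tok.get_vocab()

# Check whether each natural-language word exists as a single
# space-prefixed token ("Ġ..." in RoBERTa's BPE convention).
for word in ["embarrassed", "fossils", "Hermione"]:
    print(word, ("Ġ" + word) in vocab)
```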
In case this was by design, why was the tokenizer not trained on the training data?
Hi,
The title of the CodeBERT paper is "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", hence the model is trained on both code and natural language (not code-only).
This explains why the tokenizer has full words as well in its vocabulary.
@nielsr Thank you for your comment Niels. The natural language referred to in the title is the natural language found in code repositories, i.e. comments and documentation. Most of the tokens I saw were very out-of-domain.
The following returns True, which probably means it's just the RoBERTa tokenizer, no?
from transformers import AutoTokenizer
tok_codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tok_roberta = AutoTokenizer.from_pretrained("roberta-base")
tok_codebert.vocab == tok_roberta.vocab
>>> True
@cakiki You are right. We directly use the RoBERTa tokenizer in CodeBERT.
Thank you for your reply!