Wrong Tokenizer?

#1
by christopher - opened

Hello,

I was experimenting with the model when I noticed that the tokenizer vocab contains tokens that don't really make sense for a code model (e.g. full tokens for "embarrassed", "fossils", and "Hermione"). Upon closer inspection, it seems the tokenizer is just the roberta-base tokenizer.

If this was by design, why wasn't the tokenizer trained on the training data?
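As an aside, a toy BPE training loop illustrates why a tokenizer trained on code data would pick up code-specific tokens instead of words like "embarrassed". This is a minimal sketch of the general BPE idea (repeatedly merge the most frequent adjacent symbol pair), not CodeBERT's or RoBERTa's actual training code, and the corpus is made up:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE training: start from characters and repeatedly merge
    the most frequent adjacent pair, recording each merged token."""
    words = [list(w) for line in corpus for w in line.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the merge everywhere
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# a (hypothetical) code-only corpus
code_corpus = ["def foo(self): return self.x",
               "def bar(self): return self.y"]
merges = train_bpe(code_corpus, 20)
print(merges)  # code-flavored tokens such as "def" and "self" emerge
```

Trained on code, the merged tokens are identifiers and keywords; trained on general web text (as roberta-base was), they are ordinary English words, which is exactly what shows up in the vocab.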

Hi,

The title of the CodeBERT paper is "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", hence the model is trained on both code and natural language (not code-only).

This explains why the tokenizer has full words as well in its vocabulary.

@nielsr Thank you for your comment Niels. The natural language referred to in the title is the natural language found in code repositories, i.e. comments and documentation. Most of the tokens I saw were very out-of-domain.

The following returns True which probably means that it's just the roberta tokenizer, no?

from transformers import AutoTokenizer

tok_codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tok_roberta = AutoTokenizer.from_pretrained("roberta-base")
tok_codebert.vocab == tok_roberta.vocab
>>> True

@cakiki You are right. We directly use the RoBERTa tokenizer in CodeBERT.

Thank you for your reply!

christopher changed discussion status to closed