Special tokens ids

#1
by aarnes - opened

Hi,

Apologies in advance if this is a silly question. How come the pretrained nb-bert-large tokenizer uses different special_tokens_ids than most other BERT models?

nb-bert-large uses: [505, 504, 503, 501, 502]
while e.g. nb-bert-base uses: [100, 102, 0, 101, 103]
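
For context, this is roughly how I read the IDs out (a minimal sketch; I'm assuming the NbAiLab/nb-bert-large and NbAiLab/nb-bert-base Hub repos):

```python
from transformers import AutoTokenizer

# Print the special tokens and their IDs for both tokenizers.
for name in ("NbAiLab/nb-bert-large", "NbAiLab/nb-bert-base"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, dict(zip(tok.all_special_tokens, tok.all_special_ids)))
```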

Furthermore, any suggestions on how to override the current mapping with the standard special_token_ids for nb-bert-large?

aarnes changed discussion title from Special tokens to Special tokens ids
Nasjonalbiblioteket AI Lab org

Hi!

Not silly at all. NB-BERT-base was trained by continuing pre-training from the multilingual BERT weights, which already came with their own tokenizer. NB-BERT-large was pre-trained from scratch, and a new tokenizer was built for it from a different corpus (and with different libraries) than those used for mBERT. Hence the discrepancy.

As for the overriding, I'm not completely sure it's possible. The base version has a vocab size of 119,547 (because of its multilingual nature), while the large version has one of 50,000 (mostly Norwegian- and Scandinavian-based).
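
If it helps, here is a rough way to see why a straight remap is awkward (a sketch, assuming those low IDs hold ordinary wordpieces in the large vocab): positions 0 and 100-103 are already occupied by regular vocabulary entries, so moving [CLS]/[SEP]/[PAD]/[UNK]/[MASK] onto them would clash with those pieces and with the pretrained embedding rows they correspond to.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-large")

# What currently sits at the "standard" BERT special-token positions?
print(tok.convert_ids_to_tokens([0, 100, 101, 102, 103]))

# Where the actual special tokens live in this vocab.
print(dict(zip(tok.all_special_tokens, tok.all_special_ids)))
```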
