Special tokens ids

#1
by aarnes - opened

Hi,

Apologies in advance if this is a silly question. How come the pretrained nb-bert-large tokenizer uses different special_tokens_ids than most other BERT models?

nb-bert-large uses: [505, 504, 503, 501, 502]
while e.g. nb-bert-base uses: [100, 102, 0, 101, 103]
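
For context, this is roughly how I read the IDs out (a minimal sketch; I'm assuming the NbAiLab/nb-bert-large and NbAiLab/nb-bert-base Hub repos):

```python
from transformers import AutoTokenizer

# Print the special tokens and their IDs for both tokenizers.
for name in ("NbAiLab/nb-bert-large", "NbAiLab/nb-bert-base"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, dict(zip(tok.all_special_tokens, tok.all_special_ids)))
```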

Furthermore, any suggestions on how to override the current mapping with the standard special_token_ids for nb-bert-large?

aarnes changed discussion title from Special tokens to Special tokens ids
Nasjonalbiblioteket AI Lab org

Hi!

Not silly at all. NB-BERT-base was trained by continuing pre-training from the multilingual BERT weights, which already came with their own tokenizer. NB-BERT-large was pre-trained from scratch, and a new tokenizer was built for it from a different corpus (and with different libraries) than those used for mBERT. Hence the discrepancy.

As for the overriding, I'm not completely sure it's possible. The base version has a vocab size of 119,547 (because of its multilingual nature), while the large version has one of 50,000 (mostly Norwegian- and Scandinavian-based).
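
If it helps, here is a rough way to see why a straight remap is awkward (a sketch, assuming those low IDs hold ordinary wordpieces in the large vocab): positions 0 and 100-103 are already occupied by regular vocabulary entries, so moving [CLS]/[SEP]/[PAD]/[UNK]/[MASK] onto them would clash with those pieces and with the pretrained embedding rows they correspond to.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-large")

# What currently sits at the "standard" BERT special-token positions?
print(tok.convert_ids_to_tokens([0, 100, 101, 102, 103]))

# Where the actual special tokens live in this vocab.
print(dict(zip(tok.all_special_tokens, tok.all_special_ids)))
```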
