Prediction size vs tokenizer size

#3 opened by surya-narayanan

Hi, the model seems to output a tensor of size batch size × sequence length × 78672, but the tokenizer vocab size is 50265. Any idea why there's this discrepancy?
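For reference, a minimal repro sketch of what I'm doing (the checkpoint name here is my assumption, for illustration):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Checkpoint name is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
model = AutoModelForMaskedLM.from_pretrained("witiko/mathberta")

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)    # (batch size, sequence length, model output vocab size)
print(len(tokenizer))  # tokenizer vocabulary size, including added tokens
```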

Hi @surya-narayanan, MathBERTa's tokenizer was substantially extended to cover math vocabulary. At the time of training, this was not fully supported by transformers, so inconsistencies like this can still occasionally be found. FWIW, we later opened a related PR to the transformers library, which has since been merged.

Anyway, I've checked for you that the model's config matches len(tokenizer.vocab), so the current model and tokenizer should be good to use together as-is.
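If you want to verify this on your end, a quick check along these lines should work (again, the checkpoint name is assumed for illustration):

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name is an assumption for illustration.
config = AutoConfig.from_pretrained("witiko/mathberta")
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")

# The model's output layer should be as wide as the tokenizer's vocabulary.
print(config.vocab_size)  # vocab size stored in the model config
print(len(tokenizer))     # tokens the tokenizer knows, including added ones
assert config.vocab_size == len(tokenizer)
```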

Hmm, I'm still facing an error. Should I reinstall transformers?
