should model(tokenizer(text)) work for bigcode/santacoder?

#13

by Dzmitry - opened Jan 18, 2023

BigCode org Jan 18, 2023

The bigcode/santacoder tokenizer produces a token_type_ids tensor. AFAIK the model was not trained to receive it as input, so model(tokenizer(text)["input_ids"]) behaves differently from model(tokenizer(text)). The former seems correct, whereas the latter seems at least risky.

loubnabnl

BigCode org Jan 27, 2023

Indeed, token_type_ids shouldn't be passed to the model. This PR prevents the tokenizer from returning it by default.
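
For anyone using a tokenizer revision that still returns token_type_ids, here is a minimal sketch of a workaround: drop the key from the tokenizer output before calling the model. The helper names `drop_token_type_ids` and `run_model` are hypothetical and not part of transformers, and the sketch assumes the tokenizer output behaves like a dict and that the model accepts keyword arguments such as input_ids and attention_mask. Passing `return_token_type_ids=False` in the tokenizer call should have the same effect.

```python
from typing import Any, Mapping


def drop_token_type_ids(encoding: Mapping[str, Any]) -> dict:
    """Return a copy of a tokenizer output without the token_type_ids entry.

    Hypothetical helper, not part of transformers. The input is left unchanged.
    """
    return {key: value for key, value in encoding.items() if key != "token_type_ids"}


def run_model(model, tokenizer, text: str):
    """Hypothetical helper: tokenize text and call the model without token_type_ids.

    Assumes `tokenizer` and `model` were loaded elsewhere, e.g. with
    AutoTokenizer / AutoModelForCausalLM for bigcode/santacoder.
    """
    encoding = tokenizer(text, return_tensors="pt")
    # Only input_ids and attention_mask (if present) reach the model.
    return model(**drop_token_type_ids(encoding))
```

The filter works on a copy, so the original tokenizer output can still be inspected afterwards.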

christopher changed discussion status to closed Jan 27, 2023
