Issue running inference

#1
by csinva - opened

Hello, thanks for releasing this wonderful work! I'm having trouble running inference with this model.

Specifically, when I run the model on the tweet-eval dataset, I get an error: `IndexError: index out of range in self`. Have you encountered this error before?

This doesn't happen with other models (including the toxigen roberta model), so I don't think it's an issue with preprocessing etc.

I get the same error. The tokenizer has a vocab size of 50257, but the classifier's embedding table is probably smaller.
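If the sizes really do disagree, any token id at or above the model's embedding row count triggers exactly that `IndexError`. A minimal sketch of the check (the 30522 figure assumes the underlying classifier is a standard BERT-base checkpoint; 50257 is the GPT-2-style vocab size mentioned above):

```python
def find_out_of_range_ids(token_ids, embedding_rows):
    """Return the token ids that would raise 'IndexError: index out of
    range in self' when looked up in an embedding table with
    `embedding_rows` rows (valid ids satisfy 0 <= id < embedding_rows)."""
    return [t for t in token_ids if not 0 <= t < embedding_rows]

# A tokenizer with vocab size 50257 can emit ids up to 50256, but a
# BERT-base embedding table only has 30522 rows, so high ids blow up:
print(find_out_of_range_ids([101, 2054, 45000, 50256], 30522))  # → [45000, 50256]
```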

Hi there, please try using the bert-base-uncased tokenizer and let me know if that solves your problem!

Hi @tomh - yes that works. I will send a PR on GitHub for this. I used

`toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-cased")`

Not sure why the pipeline chooses the wrong tokenizer to start with.

OK, I'll update the README to use that. Something must have gone wrong with the toxigen_hatebert tokenizer.

@tomh actually... it seems to make the exception go away, but the results then look incorrect, I think:

```python
In [123]: toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-cased")

In [124]: toxigen_hatebert("hello")
Out[124]: [{'label': 'LABEL_0', 'score': 0.7423402667045593}]

In [125]: toxigen_hatebert("die you scum")
Out[125]: [{'label': 'LABEL_0', 'score': 0.9332824945449829}]
```

My fault, it should be

`toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-uncased")`

as you said. Sorry for the spam.
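For anyone else hitting this: the reason the cased tokenizer ran without an exception but scored wrongly is that all of its ids were valid row indices, just the wrong rows. A toy sketch (hypothetical vocabularies and embeddings, not the real ones):

```python
# Toy "model": one embedding row per id in the vocab it was trained with.
embeddings = {0: [0.1, 0.2], 1: [0.9, 0.0], 2: [0.3, 0.7]}

# tok_a stands in for the tokenizer the model was trained with; tok_b is a
# look-alike that assigns the same words different ids (both hypothetical).
tok_a = {"good": 0, "bad": 1, "[PAD]": 2}
tok_b = {"bad": 0, "good": 1, "[PAD]": 2}

# Encoding "bad" with the wrong tokenizer stays in range, so there is no
# IndexError -- but it silently fetches the row the model learned for "good".
wrong_row = embeddings[tok_b["bad"]]
print(wrong_row == embeddings[tok_a["good"]])  # → True
```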

Interesting... let me dig in a bit more. I just checked Hugging Face's Hosted Inference API for toxigen_hatebert, and it indeed behaves differently:

[screenshots: Hosted API results for toxigen_hatebert]

Good catch!

Aha, so switching the tokenizer worked?

Yes, it worked. Thanks @tomh
