Issue running inference

#1
by csinva - opened

Hello, thanks for releasing this wonderful work! I'm having trouble running inference with this model.

Specifically, when I run the model on the tweet-eval dataset, I get an error: `IndexError: index out of range in self`. Have you encountered this error before?

This doesn't happen with other models (including the toxigen roberta model), so I don't think it's an issue with preprocessing etc.

I get the same error. The tokenizer has a vocab size of 50257, but the classifier's embedding table is probably smaller.
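If the sizes really do disagree, any token id at or above the model's embedding row count triggers exactly that `IndexError`. A minimal sketch of the check (the 30522 figure assumes the underlying classifier is a standard BERT-base checkpoint; 50257 is the GPT-2-style vocab size mentioned above):

```python
def find_out_of_range_ids(token_ids, embedding_rows):
    """Return the token ids that would raise 'IndexError: index out of
    range in self' when looked up in an embedding table with
    `embedding_rows` rows (valid ids satisfy 0 <= id < embedding_rows)."""
    return [t for t in token_ids if not 0 <= t < embedding_rows]

# A tokenizer with vocab size 50257 can emit ids up to 50256, but a
# BERT-base embedding table only has 30522 rows, so high ids blow up:
print(find_out_of_range_ids([101, 2054, 45000, 50256], 30522))  # → [45000, 50256]
```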

Hi there, please try using the bert-base-uncased tokenizer and let me know if that solves your problem!

Hi @tomh - yes that works. I will send a PR on GitHub for this. I used

`toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-cased")`

Not sure why the pipeline chooses the wrong tokenizer to start with.

OK, I'll update the README to use that. Something must have gone wrong with the toxigen_hatebert tokenizer.

@tomh actually... it seems to make the exception go away, but the results then look incorrect, I think:

```python
In [123]: toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-cased")

In [124]: toxigen_hatebert("hello")
Out[124]: [{'label': 'LABEL_0', 'score': 0.7423402667045593}]

In [125]: toxigen_hatebert("die you scum")
Out[125]: [{'label': 'LABEL_0', 'score': 0.9332824945449829}]
```

My fault, it should be

`toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-uncased")`

as you said. Sorry for the spam.
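For anyone else hitting this: the reason the cased tokenizer ran without an exception but scored wrongly is that all of its ids were valid row indices, just the wrong rows. A toy sketch (hypothetical vocabularies and embeddings, not the real ones):

```python
# Toy "model": one embedding row per id in the vocab it was trained with.
embeddings = {0: [0.1, 0.2], 1: [0.9, 0.0], 2: [0.3, 0.7]}

# tok_a stands in for the tokenizer the model was trained with; tok_b is a
# look-alike that assigns the same words different ids (both hypothetical).
tok_a = {"good": 0, "bad": 1, "[PAD]": 2}
tok_b = {"bad": 0, "good": 1, "[PAD]": 2}

# Encoding "bad" with the wrong tokenizer stays in range, so there is no
# IndexError -- but it silently fetches the row the model learned for "good".
wrong_row = embeddings[tok_b["bad"]]
print(wrong_row == embeddings[tok_a["good"]])  # → True
```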

Interesting... let me dig in a bit more. I just checked Hugging Face's Hosted Inference API for toxigen_hatebert, and it indeed behaves differently:

[screenshots: Hosted API results for toxigen_hatebert]

Good catch!

Aha, so switching the tokenizer worked?

Yes, it worked. Thanks @tomh
