Model's bias against certain keywords

#1
by HelenGuo99 - opened

Hello,

I think the model is a bit biased against certain keywords, such as 'black' and 'white'. These examples gave me 'toxic' results, but I don't think they are toxic: "I like black phones", "it's white", etc.

Hey Helen, yes, I think this is because the training data might contain many comments labeled as toxic that include the words "black" or "white", and the model may have learned this association. How to address this type of issue is quite a challenging question, and I am curious to hear what you or others think!
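
For anyone who wants to check this themselves, here is a minimal sketch of a keyword-substitution probe using the transformers pipeline. The model id below is just a stand-in toxicity checkpoint, not necessarily the model from this repo; swap in whichever model you are testing. If the score shifts a lot between sentence pairs that differ only in the color word, the classifier is probably relying on the keyword rather than the context.

```python
from transformers import pipeline

# Placeholder model id; replace with the model under discussion.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

# Minimal pairs: identical sentences except for the suspect keyword.
probes = [
    "I like black phones",
    "I like red phones",
    "it's white",
    "it's green",
]

for text in probes:
    result = classifier(text)[0]
    print(f"{text!r}: {result['label']} ({result['score']:.2f})")
```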

"Please kill yourself" returns 50/50 toxic/non-toxic result. Seems too sensitive to individual tokens rather than overall message, tone. :)
This sort of comment can be pretty common online....
