Model overconfident

#1
by marcmaxmeister - opened

I'd love some advice on how to avoid getting 99% likelihood on statements that are clearly apolitical:

```python
# Initialize the political Hugging Face model (Republican vs. Democrat tweets)
# https://huggingface.co/m-newhauser/distilbert-political-tweets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

poltokn = AutoTokenizer.from_pretrained("m-newhauser/distilbert-political-tweets")
polmodel = AutoModelForSequenceClassification.from_pretrained("m-newhauser/distilbert-political-tweets")
polpipeline = pipeline("sentiment-analysis", model=polmodel, tokenizer=poltokn)
```

Testing...

```python
polpipeline("These pretzels are making me thirsty!")
# [{'label': 'Republican', 'score': 0.9996196031570435}]
```

Pretzels make Democrats thirsty too, I believe.

I've mitigated this problem by averaging predictions over a large batch of tweets from a single account. This person is mostly Democrat but makes a lot of statements that would appeal to folks in the middle, and the overall balance seems accurate:

```
@DeanObeidallah
PREDICTED PARTY
Counter({'Democrat': 211, 'Republican': 129})
{'rep': 37.94, 'dem': 62.06}
```
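The aggregation step itself is simple to sketch. This helper (`party_balance` is my own hypothetical name, not part of the model or its repo) reproduces the percentage split above from a list of per-tweet labels:

```python
from collections import Counter

def party_balance(labels):
    """Aggregate per-tweet party labels into an overall percentage split.

    `labels` is a list of label strings (e.g. the 'label' field from each
    pipeline prediction). Single-tweet confidence scores are deliberately
    ignored, since they are unreliable on their own.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "rep": round(100 * counts["Republican"] / total, 2),
        "dem": round(100 * counts["Democrat"] / total, 2),
    }

# e.g. labels = [p["label"] for p in polpipeline(list_of_tweets)]
```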

Am I to assume that this model wasn't trained on any apolitical tweets? A better version would tell us when a tweet is not political at all, so it could be excluded. Alas, I have to use other methods for that.
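One workaround for the missing "not political" class is to gate tweets with a separate zero-shot classifier before running the party model. A minimal sketch, assuming a transformers zero-shot pipeline such as `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")` (the model choice, the candidate labels, and the `is_political` helper are my own assumptions, not part of this project):

```python
def is_political(tweet, classifier, threshold=0.5):
    """Return True if a zero-shot classifier thinks the tweet is political.

    `classifier` is any transformers zero-shot-classification pipeline, e.g.
    pipeline("zero-shot-classification", model="facebook/bart-large-mnli").
    The candidate labels and threshold here are illustrative choices.
    """
    result = classifier(tweet, candidate_labels=["political", "not political"])
    # The zero-shot pipeline returns labels sorted by score, highest first.
    return result["labels"][0] == "political" and result["scores"][0] >= threshold

# usage: political_tweets = [t for t in tweets if is_political(t, zeroshot)]
```

Only tweets that pass the gate would then be fed to `polpipeline` for the party call.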

I'm a bit confused by its confidence as well.

```python
polpipeline("I'm a Republican who loves God, guns, and the GOP, but hates Trump. Is that so strange?")
# [{'label': 'Democrat', 'score': 0.999996542930603}]
```

or...

```python
polpipeline("I am happily a Republican.")
# [{'label': 'Democrat', 'score': 0.9964742064476013}]
```
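The near-certain scores are less mysterious once you remember the head is a two-class softmax: the reported "score" is just the softmax of two logits, and even a modest logit gap saturates toward 1.0. A quick illustration (plain math, independent of this model; the `softmax2` name is mine):

```python
import math

def softmax2(logit_a, logit_b):
    """Probability of class A under a two-class softmax."""
    ea, eb = math.exp(logit_a), math.exp(logit_b)
    return ea / (ea + eb)

# A logit gap of 8 already yields ~0.9997 "confidence",
# which is why almost every tweet scores near 1.0.
print(softmax2(8.0, 0.0))
```

So the extreme scores reflect the geometry of a forced binary choice, not genuine certainty that a tweet is political.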

If you're looking for an overall better approach (that may or may not use this model), I posted a long description of mine here: https://chewychunks.wordpress.com/2023/03/29/predicting-political-orientation-from-social-media/
