Report for cardiffnlp/twitter-roberta-base-irony

#30
by giskard-bot - opened
Giskard org

Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊

We have identified 11 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset tweet_eval (subset irony, split validation).

👉Performance issues (2)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | text contains "love" | Recall = 0.083 | | -70.31% than global |

🔍✨Examples
For records in the dataset where `text` contains "love", the Recall is 70.31% lower than the global Recall.

| | text | label | Predicted label |
| --- | --- | --- | --- |
| 58 | Bae had an energy drink and wants to stay up... but I'm so sleeeeeepy. #love #sleep | irony | non_irony (p = 0.71) |
| 103 | How exciting he's walking all by himself #amazing #strength #hardwork #love | irony | non_irony (p = 0.57) |
| 120 | Dont we all just love those people who message you out of nowhere and act like you guys are close cus they want something from you? | non_irony | irony (p = 0.86) |
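
The deviation figure above compares a metric on a data slice with the same metric on the whole dataset. The following is a minimal sketch of that comparison, not the scanner's own code. It assumes that recall is computed with `irony` as the positive class and that the deviation is the relative difference `(slice - global) / global`; the records and the helper names `recall` and `relative_deviation` are made up for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (text, true label, predicted label)


def recall(records: List[Record], positive: str = "irony") -> float:
    """Recall = TP / (TP + FN) for the chosen positive class (an assumption)."""
    tp = sum(1 for _, t, p in records if t == positive and p == positive)
    fn = sum(1 for _, t, p in records if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def relative_deviation(slice_value: float, global_value: float) -> float:
    """Relative difference of the slice metric with respect to the global metric."""
    return (slice_value - global_value) / global_value


# Toy records, invented for this example (not taken from tweet_eval).
records: List[Record] = [
    ("i love mondays", "irony", "non_irony"),
    ("love this weather", "irony", "non_irony"),
    ("so much love for traffic", "irony", "irony"),
    ("great, another meeting", "irony", "irony"),
    ("what a surprise", "irony", "irony"),
    ("nice walk today", "non_irony", "non_irony"),
]

love_slice = [r for r in records if "love" in r[0]]

global_recall = recall(records)
slice_recall = recall(love_slice)
deviation = relative_deviation(slice_recall, global_recall)

print(f"global recall = {global_recall:.3f}")
print(f"slice recall  = {slice_recall:.3f}")
print(f"deviation     = {deviation:+.2%} than global")
```

Under the same assumption, a slice recall of 0.083 that is 70.31% below the global value would imply a global recall of roughly 0.28.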
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | text contains "day" | Recall = 0.120 | | -57.25% than global |

🔍✨Examples
For records in the dataset where `text` contains "day", the Recall is 57.25% lower than the global Recall.

| | text | label | Predicted label |
| --- | --- | --- | --- |
| 6 | #40 #Corner #Cute #Day #Expensive #diy #crafts Please RT: | irony | |
| 10 | Last day in #Riga! #self #finnishgirl #businesswoman @ PK Riga Hotel | irony | non_irony (p = 0.73) |
| 41 | Oh, thank GOD - our entire office email system is down... the day of a big event. Santa, you know JUST what to get me for xmas. | non_irony | irony (p = 0.93) |
👉Overconfidence issues (2)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Overconfidence | medium 🟡 | text contains "love" | Overconfidence rate = 0.898 | | +14.24% than global |

🔍✨Examples
For records in the dataset where `text` contains "love", we found a significantly higher number of overconfident wrong predictions (44 samples, corresponding to 89.79591836734694% of the wrong predictions in the data slice).

| | text | label | Predicted label |
| --- | --- | --- | --- |
| 276 | Love being called into work on my morning off after not even 6 hours of sleep. #thanks #splitshift | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
| 720 | Gotta love working the day after Christmas #smellya | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
| 183 | Youve got to just love the efficiency @user two-day service! #7DayService #prioritymail #hahahaha | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
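
The overconfidence rate above is reported as the share of wrong predictions that are also overconfident (44 samples at 89.8% implies 49 wrong predictions in the slice). The sketch below illustrates one way such a rate could be computed. It is an assumption, not the scanner's definition: a wrong prediction is treated as overconfident when the probability of the predicted label exceeds the probability of the true label by at least a hypothetical `threshold` of 0.5, and the function name `overconfidence_rate` is made up.

```python
from typing import Dict, List


def overconfidence_rate(
    probs: List[Dict[str, float]],
    y_true: List[str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the report
) -> float:
    """Share of wrong predictions whose confidence gap reaches the threshold."""
    wrong = 0
    overconfident = 0
    for p, true_label in zip(probs, y_true):
        predicted = max(p, key=p.get)  # label with the highest probability
        if predicted == true_label:
            continue  # correct predictions are not counted
        wrong += 1
        if p[predicted] - p[true_label] >= threshold:
            overconfident += 1
    return overconfident / wrong if wrong else 0.0


# Toy probabilities, invented for this example.
probs = [
    {"irony": 0.99, "non_irony": 0.01},  # wrong and overconfident
    {"irony": 0.55, "non_irony": 0.45},  # wrong, but a narrow margin
    {"irony": 0.20, "non_irony": 0.80},  # correct
]
y_true = ["non_irony", "non_irony", "non_irony"]

rate = overconfidence_rate(probs, y_true)
print(f"Overconfidence rate = {rate:.3f}")
```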
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Overconfidence | medium 🟡 | avg_digits(text) >= 0.010 | Overconfidence rate = 0.892 | | +13.47% than global |

🔍✨Examples
For records in the dataset where `avg_digits(text)` >= 0.010, we found a significantly higher number of overconfident wrong predictions (99 samples, corresponding to 89.1891891891892% of the wrong predictions in the data slice).

| | text | avg_digits(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 276 | Love being called into work on my morning off after not even 6 hours of sleep. #thanks #splitshift | 0.010101 | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
| 829 | It's super duper fun waking up and immediately shoveling your car out of your driveway for 20 minutes #fuckinsnow | 0.0175439 | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
| 309 | Isn't it great to sleep 5 hours and feel like a million bucks? #gettingold | 0.0133333 | non_irony | irony (p = 0.99), non_irony (p = 0.01) |
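
The slice condition `avg_digits(text) >= 0.010` is not defined in the report. The values shown (for example 0.0175439, which equals 2/114, for a tweet containing the two digits of "20") suggest it is the number of digit characters divided by the number of characters in the text. The sketch below implements that reading; treat the definition as an assumption rather than the scanner's exact feature.

```python
def avg_digits(text: str) -> float:
    """Fraction of characters in `text` that are digits (assumed definition)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def in_digit_slice(text: str, threshold: float = 0.010) -> bool:
    """True when the text falls into the slice `avg_digits(text) >= threshold`."""
    return avg_digits(text) >= threshold


# Short made-up strings to show the behaviour.
for sample in ["abc1", "no digits here", "call me at 5"]:
    print(f"{sample!r}: avg_digits = {avg_digits(sample):.4f}, in slice = {in_digit_slice(sample)}")
```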
👉Ethical issues (2)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Ethical | major 🔴 | | Fail rate = 0.061 | Switch countries from high- to low-income and vice versa | 2/33 tested samples (6.06%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 6.06% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 485 | @user @user it's like you're in the Maldives #seaandwhitesands | @user @user it's like you're in the Burkina Faso #seaandwhitesands | irony (p = 0.61) | non_irony (p = 0.61) |
| 686 | AAP said will declare AK candidate in last list but declared it before.This issue affecting India's GDP is termed as U-Turn by BJP #AK4Delhi | AAP said will declare AK candidate in last list but declared it before.This issue affecting United States's GDP is termed as U-Turn by BJP #AK4Delhi | irony (p = 0.50) | non_irony (p = 0.52) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Ethical | medium 🟡 | | Fail rate = 0.011 | Switch Gender | 1/94 tested samples (1.06%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Switch Gender”, the model changes its prediction in 1.06% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Switch Gender(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 154 | Well said @user I for one am fed up of all these women breastfeeding ostentatiously! | Well said @user I for one am fed up of all these men breastfeeding ostentatiously! | irony (p = 0.53) | non_irony (p = 0.52) |
👉Robustness issues (5)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | major 🔴 | | Fail rate = 0.201 | Transform to uppercase | 192/953 tested samples (20.15%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 20.15% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Transform to uppercase(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! RT @USER OF ALL THE PLACES TO GET STUCK IN A TRAFFIC JAM | irony (p = 0.51) | non_irony (p = 0.78) |
| 13 | Workaholics: if you're sick, don't let that stop you from bringing your germs into the office. We all appreciate your commitment. | WORKAHOLICS: IF YOU'RE SICK, DON'T LET THAT STOP YOU FROM BRINGING YOUR GERMS INTO THE OFFICE. WE ALL APPRECIATE YOUR COMMITMENT. | irony (p = 0.90) | non_irony (p = 0.89) |
| 19 | Flight diverted over boiling water incident | FLIGHT DIVERTED OVER BOILING WATER INCIDENT | irony (p = 0.70) | non_irony (p = 0.88) |
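
The fail rate in the robustness findings is the share of tested samples whose predicted label changes after a perturbation (192/953 ≈ 20.15% for uppercase). Here is a minimal sketch of that measurement. The `fail_rate` helper and the `toy_predict` classifier are hypothetical stand-ins: the toy rule is deliberately case-sensitive so that the uppercase transformation flips a prediction, and it does not represent the evaluated model.

```python
from typing import Callable, Iterable, Tuple


def fail_rate(
    predict: Callable[[str], str],
    texts: Iterable[str],
    transform: Callable[[str], str],
) -> Tuple[int, int, float]:
    """Count samples whose predicted label changes after `transform` is applied."""
    changed = 0
    total = 0
    for text in texts:
        total += 1
        if predict(text) != predict(transform(text)):
            changed += 1
    rate = changed / total if total else 0.0
    return changed, total, rate


def toy_predict(text: str) -> str:
    """Hypothetical case-sensitive rule, used only to make the example runnable."""
    return "irony" if "love" in text else "non_irony"


# Made-up inputs for illustration.
texts = ["i love mondays", "great day", "LOVE it"]

changed, total, rate = fail_rate(toy_predict, texts, str.upper)
print(f"{changed}/{total} tested samples ({rate:.2%}) changed prediction after perturbation")
```

In practice `predict` would wrap the real classifier and `texts` would come from the evaluation split; any other string transformation, such as `str.lower` or `str.title`, can be passed in place of `str.upper`.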
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | major 🔴 | | Fail rate = 0.155 | Transform to title case | 148/953 tested samples (15.53%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 15.53% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Transform to title case(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! Rt @User Of All The Places To Get Stuck In A Traffic Jam | irony (p = 0.51) | non_irony (p = 0.80) |
| 13 | Workaholics: if you're sick, don't let that stop you from bringing your germs into the office. We all appreciate your commitment. | Workaholics: If You'Re Sick, Don'T Let That Stop You From Bringing Your Germs Into The Office. We All Appreciate Your Commitment. | irony (p = 0.90) | non_irony (p = 0.82) |
| 19 | Flight diverted over boiling water incident | Flight Diverted Over Boiling Water Incident | irony (p = 0.70) | non_irony (p = 0.79) |
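
The title-case examples contain forms such as "You'Re", "Don'T" and "Rt", which match what Python's built-in `str.title()` produces: it capitalizes the first letter after any non-letter character and lowercases the rest. Whether the scanner actually calls `str.title()` is an assumption. The sketch below reproduces the effect and contrasts it with `string.capwords`, which splits on whitespace only and therefore leaves contractions intact.

```python
import string


def to_title_case(text: str) -> str:
    """Title-case with str.title(); letters after apostrophes are capitalized too."""
    return text.title()


def to_capwords(text: str) -> str:
    """Whitespace-based alternative that keeps contractions such as "you're"."""
    return string.capwords(text)


# Fragments taken from the examples in the table above.
for sample in ["if you're sick, don't let that stop you", "RT @user Of all the places"]:
    print(to_title_case(sample))
    print(to_capwords(sample))
```

When reviewing this finding, it may be worth separating flips caused by capitalization in general from flips caused by artifacts like "Don'T", since the latter are unlikely to occur in real tweets.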
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | major 🔴 | | Fail rate = 0.120 | Add typos | 102/851 tested samples (11.99%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 11.99% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Add typos(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 7 | #notcies #eu EU backs 328 top early-career researchers with 485 million | #niotcies #eu EU backs 328 top early-carere researchers with 485 million | non_irony (p = 0.64) | irony (p = 0.54) |
| 22 | @user @user @user Well done. You have more Twitter followers than me. You have succeeded in life | @user @user @user Well dkone. You have morre Twitter folloers than me. You have sucdeeded in life | irony (p = 0.89) | non_irony (p = 0.92) |
| 55 | @user @user you can't reason with someone with a bio as moronic as his. "So should everyone else" #SoDemocratic | @user @usdr you can't reason with someone with a bip as moronic as his. "So shou everyone else" #SoDemoctatic | irony (p = 0.59) | non_irony (p = 0.55) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | medium 🟡 | | Fail rate = 0.088 | Punctuation Removal | 68/773 tested samples (8.8%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Punctuation Removal”, the model changes its prediction in 8.8% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Punctuation Removal(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | RT @user Of all the places to get stuck in a traffic jam | irony (p = 0.51) | non_irony (p = 0.63) |
| 15 | What else would you do on friday? #TGIF #8crap | What else would you do on friday #TGIF #8crap | | |
| 55 | @user @user you can't reason with someone with a bio as moronic as his. "So should everyone else" #SoDemocratic | @user @user you can t reason with someone with a bio as moronic as his So should everyone else #SoDemocratic | irony (p = 0.59) | non_irony (p = 0.53) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | medium 🟡 | | Fail rate = 0.052 | Transform to lowercase | 44/852 tested samples (5.16%) changed prediction after perturbation |

🔍✨Examples
When feature “text” is perturbed with the transformation “Transform to lowercase”, the model changes its prediction in 5.16% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Transform to lowercase(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! rt @user of all the places to get stuck in a traffic jam | irony (p = 0.51) | non_irony (p = 0.62) |
| 29 | @user Frisky at 2am? That's nothing new. | @user frisky at 2am? that's nothing new. | non_irony (p = 0.57) | irony (p = 0.65) |
| 74 | Honking at me whilst you drive past - so romantic, it makes me want to trace you through your number plate and be with you forever | honking at me whilst you drive past - so romantic, it makes me want to trace you through your number plate and be with you forever | non_irony (p = 0.53) | irony (p = 0.51) |

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.

💡 What's Next?

  • Check out the Giskard Space and improve your model.
  • The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.

🙌 Big Thanks!

We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!
