Report for AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon

#94
by giskard-bot - opened

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 7 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

👉Overconfidence issues (2)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Overconfidence | major 🔴 | avg_word_length(text) >= 4.481 | Overconfidence rate = 0.804 | | +28.70% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` >= 4.481, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

| Index | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 4.61538 | negative | positive (p = 1.00), negative (p = 0.00) |
| 643 | the jabs it employs are short , carefully placed and dead-center . | 4.58333 | positive | negative (p = 1.00), positive (p = 0.00) |
| 218 | all that 's missing is the spontaneity , originality and delight . | 4.58333 | negative | positive (p = 0.99), negative (p = 0.01) |
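
As a rough check on the figures above, the sketch below works through the arithmetic in Python. It assumes the deviation is relative to the global rate, so the number of wrong predictions in the slice (46) and the global overconfidence rate (about 0.625) are back-calculated here and are not stated in the report. The `is_overconfident` helper and its `threshold` of 0.5 are hypothetical stand-ins; the scan's actual criterion may differ.

```python
def is_overconfident(p_predicted, p_true, threshold=0.5):
    """Hypothetical criterion: a wrong prediction is overconfident when the
    predicted label's probability exceeds the true label's by more than threshold."""
    return (p_predicted - p_true) > threshold


def overconfidence_rate(wrong_predictions, threshold=0.5):
    """Share of wrong predictions flagged as overconfident.
    Each item is a pair (p_predicted, p_true)."""
    flagged = sum(is_overconfident(p, t, threshold) for p, t in wrong_predictions)
    return flagged / len(wrong_predictions)


# The three wrong predictions listed in the examples: (p_predicted, p_true).
examples = [(1.00, 0.00), (1.00, 0.00), (0.99, 0.01)]
example_rate = overconfidence_rate(examples)

# Back-calculated quantities (assumptions, not values from the report).
overconfident_samples = 37
implied_wrong = round(overconfident_samples / 0.8043)  # wrong predictions in the slice
implied_global = 0.804 / (1 + 0.2870)                  # global rate if deviation is relative

print(example_rate, implied_wrong, round(implied_global, 3))
```

Under the relative-deviation assumption, roughly 62% of wrong predictions are overconfident across the whole dataset, against about 80% in this slice.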
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Overconfidence | major 🔴 | avg_whitespace(text) < 0.182 | Overconfidence rate = 0.804 | | +28.70% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` < 0.182, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

| Index | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 0.178082 | negative | positive (p = 1.00), negative (p = 0.00) |
| 643 | the jabs it employs are short , carefully placed and dead-center . | 0.179104 | positive | negative (p = 1.00), positive (p = 0.00) |
| 218 | all that 's missing is the spontaneity , originality and delight . | 0.179104 | negative | positive (p = 0.99), negative (p = 0.01) |
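
The two slicing features can be sketched in a few lines of Python. The definitions below are assumptions inferred from the feature names, not taken from the scanner's source: `avg_word_length` is the mean length of whitespace-separated tokens, and `avg_whitespace` is the share of whitespace characters in the string. With one trailing space appended to the sentence (an assumption about how the raw dataset rows are stored), they reproduce the values listed for example 643.

```python
import math


def avg_word_length(text):
    """Assumed definition: mean length of whitespace-separated tokens."""
    tokens = text.split()
    return sum(len(tok) for tok in tokens) / len(tokens)


def avg_whitespace(text):
    """Assumed definition: fraction of characters that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text)


# Example 643, with an assumed trailing space.
sample = "the jabs it employs are short , carefully placed and dead-center . "
awl = avg_word_length(sample)
aws = avg_whitespace(sample)
print(round(awl, 5), round(aws, 6))  # 4.58333 0.179104

# When every token is followed by exactly one space, the two features are tied:
# avg_whitespace = 1 / (avg_word_length + 1).
tied = math.isclose(aws, 1 / (awl + 1))
threshold_from_word_length = round(1 / (4.481 + 1), 3)
print(tied, threshold_from_word_length)  # True 0.182
```

If these definitions hold, the thresholds 4.481 and 0.182 select the same records, which would explain why both overconfidence findings report the same 37 samples and the same rate.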
👉Robustness issues (1)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Robustness | major 🔴 | | Fail rate = 0.105 | Add typos | 84/800 tested samples (10.5%) changed prediction after perturbation |

🔍✨Examples

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 10.5% of the cases. We expected the predictions not to be affected by this transformation.

| Index | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 13 | we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . | we root for ( clara and paul ) , even like them , htough perhaps it 's an emotiom closer to pity . | positive (p = 0.75) | negative (p = 0.82) |
| 21 | the iditarod lasts for days - this just felt like it did . | the irditarod lasts for days - this just felt ike it did . | negative (p = 0.50) | positive (p = 0.53) |
| 33 | if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self-conscious to draw you deeply into its world . | if the mofvie succeeds in instilling a wary sense of ` gthere but got the grace f god , ' it is far topo self-conscious to draw ou deeply intk its world | negative (p = 0.99) | positive (p = 0.54) |
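
A typo-robustness check of this kind can be approximated as follows. This is a minimal sketch, not the scanner's implementation: `add_typos` is a hypothetical perturbation that randomly drops, doubles, or swaps letters, and `toy_predict` is a keyword rule standing in for the real model. Only the final line uses numbers from the report.

```python
import random


def add_typos(text, rate=0.1, seed=0):
    """Hypothetical perturbation: randomly drop, double, or swap letters."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "double", "swap"])
            if op == "drop":
                i += 1
                continue
            if op == "double":
                out.extend([c, c])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 2
                continue
        out.append(c)
        i += 1
    return "".join(out)


def toy_predict(text):
    """Stand-in keyword classifier, not the model from the report."""
    return "positive" if "good" in text.split() else "negative"


def fail_rate(predict, texts, perturb):
    """Fraction of texts whose predicted label changes after perturbation."""
    changed = sum(predict(t) != predict(perturb(t)) for t in texts)
    return changed / len(texts)


texts = ["a good movie", "not worth the time", "good fun throughout"]
rate = fail_rate(toy_predict, texts, add_typos)
print(f"fail rate on toy data: {rate:.3f}")

reported = 84 / 800  # the report's figure: 0.105
```

To run the same check against the real model, replace `toy_predict` with a function that returns the model's predicted label for a string.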
👉Performance issues (4)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | text_length(text) < 37.500 | Recall = 0.800 | | -12.08% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` < 37.500, the Recall is 12.08% lower than the global Recall.

| Index | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 1 | unflinchingly bleak and desperate | 34 | negative | positive (p = 0.86) |
| 112 | hilariously inept and ridiculous . | 35 | positive | negative (p = 0.99) |
| 113 | this movie is maddening . | 26 | negative | positive (p = 0.96) |
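
The slice-versus-global comparison behind the performance findings can be sketched as below. Several things here are assumptions: recall is computed for the `positive` class, the deviation is relative to the global value, and `text_length` is the raw character count including one trailing space (which matches the lengths 34, 35, and 26 in the table). The first three records come from the examples above; the last three are hypothetical rows added only so the toy computation has correct predictions to count.

```python
def recall(y_true, y_pred, positive="positive"):
    """Recall for the positive class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else float("nan")


def relative_deviation(slice_value, global_value):
    """Relative change of the slice metric with respect to the global metric."""
    return (slice_value - global_value) / global_value


# (text, label, predicted label)
records = [
    ("unflinchingly bleak and desperate ", "negative", "positive"),
    ("hilariously inept and ridiculous . ", "positive", "negative"),
    ("this movie is maddening . ", "negative", "positive"),
    # Hypothetical rows, not from the report:
    ("a delight . ", "positive", "positive"),
    ("great fun . ", "positive", "positive"),
    ("an unusually long and thoroughly satisfying piece of work . ", "positive", "positive"),
]

short = [r for r in records if len(r[0]) < 37.5]
slice_recall = recall([r[1] for r in short], [r[2] for r in short])
global_recall = recall([r[1] for r in records], [r[2] for r in records])
print(round(relative_deviation(slice_recall, global_recall), 4))

# Back-calculated from the report under the relative-deviation assumption:
implied_global = 0.800 / (1 - 0.1208)  # roughly 0.91
```

Under the same assumption, the two medium-severity recall findings (0.828 at -9.05%) imply a global recall of about 0.91 as well.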
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | text_length(text) < 65.500 AND text_length(text) >= 56.500 | Precision = 0.769 | | -10.89% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` < 65.500 AND `text_length(text)` >= 56.500, the Precision is 10.89% lower than the global Precision.

| Index | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 92 | you wo n't like roger , but you will quickly recognize him . | 61 | negative | positive (p = 0.75) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.97) |
| 312 | i 'll bet the video game is a lot more fun than the film . | 59 | negative | positive (p = 0.60) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | avg_word_length(text) >= 4.635 AND avg_word_length(text) < 4.743 | Recall = 0.828 | | -9.05% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` >= 4.635 AND `avg_word_length(text)` < 4.743, the Recall is 9.05% lower than the global Recall.

| Index | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | negative | positive (p = 0.99) |
| 223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 4.70588 | positive | negative (p = 0.99) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | positive | negative (p = 0.54) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | avg_whitespace(text) < 0.177 AND avg_whitespace(text) >= 0.174 | Recall = 0.828 | | -9.05% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` < 0.177 AND `avg_whitespace(text)` >= 0.174, the Recall is 9.05% lower than the global Recall.

| Index | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | negative | positive (p = 0.99) |
| 223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 0.175258 | positive | negative (p = 0.99) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | positive | negative (p = 0.54) |

Check out the Giskard Space and test your model.

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
