Report for distilbert-base-uncased-finetuned-sst-2-english

#96
by giskard-bot - opened

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 13 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

👉Performance issues (12)

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | text_length(text) >= 50.500 AND text_length(text) < 61.500 | Precision = 0.759 | | -15.50% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` >= 50.500 AND `text_length(text)` < 61.500, the Precision is 15.5% lower than the global Precision.

| | text | text_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 92 | you wo n't like roger , but you will quickly recognize him . | 61 | NEGATIVE | POSITIVE (p = 1.00) |
| 171 | rarely has leukemia looked so shimmering and benign . | 54 | NEGATIVE | POSITIVE (p = 0.98) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 58 | NEGATIVE | POSITIVE (p = 1.00) |
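
The Deviation column appears to express the slice metric relative to the global metric, that is (slice - global) / global, rather than an absolute difference. This is an inference from the figures in this report, not something the report states. The Python sketch below (helper names are hypothetical) backs out the global value that each reported row would imply under that reading. The recall rows all land near 0.930 and the accuracy rows near 0.910, which is what a relative deviation would predict.

```python
def relative_deviation_pct(slice_value: float, global_value: float) -> float:
    """Relative deviation of a slice metric from the global metric, in percent."""
    return (slice_value - global_value) / global_value * 100.0


def implied_global(slice_value: float, deviation_pct: float) -> float:
    """Invert the formula above: global = slice / (1 + deviation / 100)."""
    return slice_value / (1.0 + deviation_pct / 100.0)


# (metric value on the slice, reported deviation in percent), copied from the report
recall_rows = [(0.815, -12.40), (0.826, -11.19), (0.864, -7.15), (0.871, -6.37), (0.875, -5.93)]
precision_rows = [(0.759, -15.50)]
accuracy_rows = [(0.859, -5.62)]

implied_recall = [implied_global(v, d) for v, d in recall_rows]
implied_precision = [implied_global(v, d) for v, d in precision_rows]
implied_accuracy = [implied_global(v, d) for v, d in accuracy_rows]

if __name__ == "__main__":
    for name, values in [
        ("recall", implied_recall),
        ("precision", implied_precision),
        ("accuracy", implied_accuracy),
    ]:
        print(name, [round(v, 3) for v in values])
```

If the deviation were an absolute difference in percentage points, the implied global recall would differ from row to row, so the agreement across rows is the main evidence for the relative reading.
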

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | avg_whitespace(text) >= 0.174 AND avg_whitespace(text) < 0.177 | Recall = 0.815 | | -12.40% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` >= 0.174 AND `avg_whitespace(text)` < 0.177, the Recall is 12.4% lower than the global Recall.

| | text | avg_whitespace(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | NEGATIVE | POSITIVE (p = 0.86) |
| 87 | jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters | 0.177083 | POSITIVE | NEGATIVE (p = 1.00) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | POSITIVE | NEGATIVE (p = 0.96) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | avg_word_length(text) < 4.743 AND avg_word_length(text) >= 4.645 | Recall = 0.815 | | -12.40% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` < 4.743 AND `avg_word_length(text)` >= 4.645, the Recall is 12.4% lower than the global Recall.

| | text | avg_word_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | NEGATIVE | POSITIVE (p = 0.86) |
| 87 | jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters | 4.64706 | POSITIVE | NEGATIVE (p = 1.00) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | POSITIVE | NEGATIVE (p = 0.96) |
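
The report does not define the three slicing features. The values in the example rows are consistent with the following reading, sketched here in Python with hypothetical function names: `text_length` as the character count, `avg_whitespace` as the fraction of characters that are whitespace, and `avg_word_length` as the mean length of whitespace-separated tokens. The reported lengths are one character longer than the text as displayed, which suggests the raw records end with a trailing space; the sample string below assumes that.

```python
def text_length(text: str) -> int:
    """Number of characters in the raw record, whitespace included."""
    return len(text)


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation counts as a token)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def in_slice(value: float, low: float, high: float) -> bool:
    """Half-open interval test, matching the '>= low AND < high' slice conditions."""
    return low <= value < high


# Record 171 from the report, with an assumed trailing space.
sample = "rarely has leukemia looked so shimmering and benign . "

if __name__ == "__main__":
    print(text_length(sample), round(avg_whitespace(sample), 6), avg_word_length(sample))
```

Under these assumptions the sample reproduces the three values listed for record 171 in the report (54, 0.166667 and 5), and it falls into the `text_length` slice from 50.500 to 61.500.
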

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | major 🔴 | text_length(text) >= 73.500 AND text_length(text) < 82.500 | Recall = 0.826 | | -11.19% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` >= 73.500 AND `text_length(text)` < 82.500, the Recall is 11.19% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | POSITIVE | NEGATIVE (p = 1.00) |
| 123 | turns potentially forgettable formula into something strangely diverting . | 75 | POSITIVE | NEGATIVE (p = 0.99) |
| 142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | POSITIVE | NEGATIVE (p = 0.99) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_whitespace(text) >= 0.182 AND avg_whitespace(text) < 0.185 | Recall = 0.864 | | -7.15% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` >= 0.182 AND `avg_whitespace(text)` < 0.185, the Recall is 7.15% lower than the global Recall.

| | text | avg_whitespace(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 273 | minority report is exactly what the title indicates , a report . | 0.184615 | POSITIVE | NEGATIVE (p = 0.86) |
| 324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 0.182927 | POSITIVE | NEGATIVE (p = 0.95) |
| 356 | jason x is positively anti-darwinian : nine sequels and 400 years later , the teens are none the wiser and jason still kills on auto-pilot . | 0.184397 | NEGATIVE | POSITIVE (p = 0.97) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_word_length(text) < 4.483 AND avg_word_length(text) >= 4.396 | Recall = 0.864 | | -7.15% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` < 4.483 AND `avg_word_length(text)` >= 4.396, the Recall is 7.15% lower than the global Recall.

| | text | avg_word_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 273 | minority report is exactly what the title indicates , a report . | 4.41667 | POSITIVE | NEGATIVE (p = 0.86) |
| 324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 4.46667 | POSITIVE | NEGATIVE (p = 0.95) |
| 356 | jason x is positively anti-darwinian : nine sequels and 400 years later , the teens are none the wiser and jason still kills on auto-pilot . | 4.42308 | NEGATIVE | POSITIVE (p = 0.97) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 179.500 | Recall = 0.871 | | -6.37% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` >= 165.500 AND `text_length(text)` < 179.500, the Recall is 6.37% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 158 | by getting myself wrapped up in the visuals and eccentricities of many of the characters , i found myself confused when it came time to get to the heart of the movie . | 168 | NEGATIVE | POSITIVE (p = 0.99) |
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | POSITIVE | NEGATIVE (p = 0.99) |
| 282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | POSITIVE | NEGATIVE (p = 1.00) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_whitespace(text) >= 0.205 AND avg_whitespace(text) < 0.213 | Recall = 0.875 | | -5.93% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` >= 0.205 AND `avg_whitespace(text)` < 0.213, the Recall is 5.93% lower than the global Recall.

| | text | avg_whitespace(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 0.210526 | POSITIVE | NEGATIVE (p = 1.00) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 0.206897 | NEGATIVE | POSITIVE (p = 1.00) |
| 501 | harrison 's flowers puts its heart in the right place , but its brains are in no particular place at all . | 0.205607 | POSITIVE | NEGATIVE (p = 0.99) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_word_length(text) < 3.867 AND avg_word_length(text) >= 3.696 | Recall = 0.875 | | -5.93% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` < 3.867 AND `avg_word_length(text)` >= 3.696, the Recall is 5.93% lower than the global Recall.

| | text | avg_word_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 3.75 | POSITIVE | NEGATIVE (p = 1.00) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 3.83333 | NEGATIVE | POSITIVE (p = 1.00) |
| 501 | harrison 's flowers puts its heart in the right place , but its brains are in no particular place at all . | 3.86364 | POSITIVE | NEGATIVE (p = 0.99) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | text_length(text) >= 151.500 AND text_length(text) < 165.500 | Recall = 0.875 | | -5.93% than global |

🔍✨Examples

For records in the dataset where `text_length(text)` >= 151.500 AND `text_length(text)` < 165.500, the Recall is 5.93% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 164 | POSITIVE | NEGATIVE (p = 0.95) |
| 673 | drops you into a dizzying , volatile , pressure-cooker of a situation that quickly snowballs out of control , while focusing on the what much more than the why . | 162 | POSITIVE | NEGATIVE (p = 0.94) |
| 692 | sustains its dreamlike glide through a succession of cheesy coincidences and voluptuous cheap effects , not the least of which is rebecca romijn-stamos . | 154 | NEGATIVE | POSITIVE (p = 0.94) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_whitespace(text) < 0.168 AND avg_whitespace(text) >= 0.164 | Accuracy = 0.859 | | -5.62% than global |

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` < 0.168 AND `avg_whitespace(text)` >= 0.164, the Accuracy is 5.62% lower than the global Accuracy.

| | text | avg_whitespace(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 171 | rarely has leukemia looked so shimmering and benign . | 0.166667 | NEGATIVE | POSITIVE (p = 0.98) |
| 184 | though perry and hurley make inspiring efforts to breathe life into the disjointed , haphazard script by jay scherick and david ronn , neither the actors nor director reginald hudlin can make it more than fitfully entertaining . | 0.165939 | NEGATIVE | POSITIVE (p = 0.66) |
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 0.167598 | POSITIVE | NEGATIVE (p = 0.99) |

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Performance | medium 🟡 | avg_word_length(text) >= 4.935 AND avg_word_length(text) < 5.113 | Accuracy = 0.859 | | -5.62% than global |

🔍✨Examples

For records in the dataset where `avg_word_length(text)` >= 4.935 AND `avg_word_length(text)` < 5.113, the Accuracy is 5.62% lower than the global Accuracy.

| | text | avg_word_length(text) | label | Predicted label |
| --- | --- | --- | --- | --- |
| 171 | rarely has leukemia looked so shimmering and benign . | 5 | NEGATIVE | POSITIVE (p = 0.98) |
| 184 | though perry and hurley make inspiring efforts to breathe life into the disjointed , haphazard script by jay scherick and david ronn , neither the actors nor director reginald hudlin can make it more than fitfully entertaining . | 5.02632 | NEGATIVE | POSITIVE (p = 0.66) |
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 4.96667 | POSITIVE | NEGATIVE (p = 0.99) |

👉Robustness issues (1)

| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
| --- | --- | --- | --- | --- | --- |
| Robustness | major 🔴 | | Fail rate = 0.130 | Add typos | 104/800 tested samples (13.0%) changed prediction after perturbation |

🔍✨Examples

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 13.0% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Add typos(text) | Original prediction | Prediction after perturbation |
| --- | --- | --- | --- | --- |
| 13 | we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . | we root for ( clara and paul ) , even like them , htough perhaps it 's an emotiom closer to pity . | POSITIVE (p = 0.96) | NEGATIVE (p = 0.99) |
| 16 | the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . | the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma . | POSITIVE (p = 1.00) | NEGATIVE (p = 0.60) |
| 22 | holden caulfield did it better . | holdsn caulfkeld did t better . | POSITIVE (p = 0.99) | NEGATIVE (p = 1.00) |
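
The fail rate above is the share of tested samples whose predicted label changed after perturbation (104 of 800). The Python sketch below illustrates that calculation together with one simple typo model, an adjacent-character swap. Both function names are hypothetical, and the scan's actual "Add typos" transformation is not specified in this report; the examples above also show substitutions and deletions, which this sketch does not cover.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position (one simple typo model)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def fail_rate(original_preds: list, perturbed_preds: list) -> float:
    """Fraction of samples whose predicted label changed after perturbation."""
    if len(original_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not original_preds:
        return 0.0
    changed = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return changed / len(original_preds)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the perturbation is reproducible
    print(add_typo("holden caulfield did it better .", rng))

    # Label lists constructed to mirror the counts in the report: 104 of 800 changed.
    original = ["POSITIVE"] * 800
    perturbed = ["NEGATIVE"] * 104 + ["POSITIVE"] * 696
    print(fail_rate(original, perturbed))
```

In practice the two label lists would come from running the classifier on the original and the perturbed texts; the constructed lists here only reproduce the arithmetic.
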

Check out the Giskard Space and test your model.

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
