Report for distilbert-base-uncased-finetuned-sst-2-english
#96, opened by giskard-bot
Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 13 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Performance issues (12)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) >= 50.500 AND text_length(text) < 61.500 | Precision = 0.759 | — | -15.50% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` >= 50.500 AND `text_length(text)` < 61.500, the Precision is 15.5% lower than the global Precision.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
92 | you wo n't like roger , but you will quickly recognize him . | 61 | NEGATIVE | POSITIVE (p = 1.00) |
171 | rarely has leukemia looked so shimmering and benign . | 54 | NEGATIVE | POSITIVE (p = 0.98) |
183 | the lower your expectations , the more you 'll enjoy it . | 58 | NEGATIVE | POSITIVE (p = 1.00) |
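The slice features used throughout this report (`text_length`, `avg_whitespace`, `avg_word_length`) can be approximated in a few lines. This is a minimal sketch under stated assumptions, not Giskard's implementation: the function names and `in_slice` helper are hypothetical, and the reported values only match if each sst2 sentence keeps a single trailing space, which is inferred from the numbers in the tables rather than confirmed.

```python
def text_length(text: str) -> int:
    """Number of characters, whitespace included."""
    return len(text)


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def in_slice(value: float, low: float, high: float) -> bool:
    """Half-open interval test matching the slice conditions: low <= value < high."""
    return low <= value < high


# Row 171 from the report, with an assumed trailing space.
sample = "rarely has leukemia looked so shimmering and benign . "
print(text_length(sample), round(avg_whitespace(sample), 6), avg_word_length(sample))
# prints: 54 0.166667 5.0
```

These three numbers agree with the values the report lists for row 171, which is why the trailing-space assumption seems plausible.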
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | avg_whitespace(text) >= 0.174 AND avg_whitespace(text) < 0.177 | Recall = 0.815 | — | -12.40% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` >= 0.174 AND `avg_whitespace(text)` < 0.177, the Recall is 12.4% lower than the global Recall.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | NEGATIVE | POSITIVE (p = 0.86) |
87 | jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters | 0.177083 | POSITIVE | NEGATIVE (p = 1.00) |
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | POSITIVE | NEGATIVE (p = 0.96) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | avg_word_length(text) < 4.743 AND avg_word_length(text) >= 4.645 | Recall = 0.815 | — | -12.40% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` < 4.743 AND `avg_word_length(text)` >= 4.645, the Recall is 12.4% lower than the global Recall.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | NEGATIVE | POSITIVE (p = 0.86) |
87 | jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters | 4.64706 | POSITIVE | NEGATIVE (p = 1.00) |
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | POSITIVE | NEGATIVE (p = 0.96) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) >= 73.500 AND text_length(text) < 82.500 | Recall = 0.826 | — | -11.19% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` >= 73.500 AND `text_length(text)` < 82.500, the Recall is 11.19% lower than the global Recall.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | POSITIVE | NEGATIVE (p = 1.00) |
123 | turns potentially forgettable formula into something strangely diverting . | 75 | POSITIVE | NEGATIVE (p = 0.99) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | POSITIVE | NEGATIVE (p = 0.99) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) >= 0.182 AND avg_whitespace(text) < 0.185 | Recall = 0.864 | — | -7.15% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` >= 0.182 AND `avg_whitespace(text)` < 0.185, the Recall is 7.15% lower than the global Recall.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
273 | minority report is exactly what the title indicates , a report . | 0.184615 | POSITIVE | NEGATIVE (p = 0.86) |
324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 0.182927 | POSITIVE | NEGATIVE (p = 0.95) |
356 | jason x is positively anti-darwinian : nine sequels and 400 years later , the teens are none the wiser and jason still kills on auto-pilot . | 0.184397 | NEGATIVE | POSITIVE (p = 0.97) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) < 4.483 AND avg_word_length(text) >= 4.396 | Recall = 0.864 | — | -7.15% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` < 4.483 AND `avg_word_length(text)` >= 4.396, the Recall is 7.15% lower than the global Recall.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
273 | minority report is exactly what the title indicates , a report . | 4.41667 | POSITIVE | NEGATIVE (p = 0.86) |
324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 4.46667 | POSITIVE | NEGATIVE (p = 0.95) |
356 | jason x is positively anti-darwinian : nine sequels and 400 years later , the teens are none the wiser and jason still kills on auto-pilot . | 4.42308 | NEGATIVE | POSITIVE (p = 0.97) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 179.500 | Recall = 0.871 | — | -6.37% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` >= 165.500 AND `text_length(text)` < 179.500, the Recall is 6.37% lower than the global Recall.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
158 | by getting myself wrapped up in the visuals and eccentricities of many of the characters , i found myself confused when it came time to get to the heart of the movie . | 168 | NEGATIVE | POSITIVE (p = 0.99) |
266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | POSITIVE | NEGATIVE (p = 0.99) |
282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | POSITIVE | NEGATIVE (p = 1.00) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) >= 0.205 AND avg_whitespace(text) < 0.213 | Recall = 0.875 | — | -5.93% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` >= 0.205 AND `avg_whitespace(text)` < 0.213, the Recall is 5.93% lower than the global Recall.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 0.210526 | POSITIVE | NEGATIVE (p = 1.00) |
183 | the lower your expectations , the more you 'll enjoy it . | 0.206897 | NEGATIVE | POSITIVE (p = 1.00) |
501 | harrison 's flowers puts its heart in the right place , but its brains are in no particular place at all . | 0.205607 | POSITIVE | NEGATIVE (p = 0.99) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) < 3.867 AND avg_word_length(text) >= 3.696 | Recall = 0.875 | — | -5.93% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` < 3.867 AND `avg_word_length(text)` >= 3.696, the Recall is 5.93% lower than the global Recall.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 3.75 | POSITIVE | NEGATIVE (p = 1.00) |
183 | the lower your expectations , the more you 'll enjoy it . | 3.83333 | NEGATIVE | POSITIVE (p = 1.00) |
501 | harrison 's flowers puts its heart in the right place , but its brains are in no particular place at all . | 3.86364 | POSITIVE | NEGATIVE (p = 0.99) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | text_length(text) >= 151.500 AND text_length(text) < 165.500 | Recall = 0.875 | — | -5.93% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` >= 151.500 AND `text_length(text)` < 165.500, the Recall is 5.93% lower than the global Recall.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
324 | you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear . | 164 | POSITIVE | NEGATIVE (p = 0.95) |
673 | drops you into a dizzying , volatile , pressure-cooker of a situation that quickly snowballs out of control , while focusing on the what much more than the why . | 162 | POSITIVE | NEGATIVE (p = 0.94) |
692 | sustains its dreamlike glide through a succession of cheesy coincidences and voluptuous cheap effects , not the least of which is rebecca romijn-stamos . | 154 | NEGATIVE | POSITIVE (p = 0.94) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.168 AND avg_whitespace(text) >= 0.164 | Accuracy = 0.859 | — | -5.62% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.168 AND `avg_whitespace(text)` >= 0.164, the Accuracy is 5.62% lower than the global Accuracy.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
171 | rarely has leukemia looked so shimmering and benign . | 0.166667 | NEGATIVE | POSITIVE (p = 0.98) |
184 | though perry and hurley make inspiring efforts to breathe life into the disjointed , haphazard script by jay scherick and david ronn , neither the actors nor director reginald hudlin can make it more than fitfully entertaining . | 0.165939 | NEGATIVE | POSITIVE (p = 0.66) |
266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 0.167598 | POSITIVE | NEGATIVE (p = 0.99) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.935 AND avg_word_length(text) < 5.113 | Accuracy = 0.859 | — | -5.62% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.935 AND `avg_word_length(text)` < 5.113, the Accuracy is 5.62% lower than the global Accuracy.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
171 | rarely has leukemia looked so shimmering and benign . | 5 | NEGATIVE | POSITIVE (p = 0.98) |
184 | though perry and hurley make inspiring efforts to breathe life into the disjointed , haphazard script by jay scherick and david ronn , neither the actors nor director reginald hudlin can make it more than fitfully entertaining . | 5.02632 | NEGATIVE | POSITIVE (p = 0.66) |
266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 4.96667 | POSITIVE | NEGATIVE (p = 0.99) |
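The Deviation column in the performance tables above can be read as a relative difference between the slice metric and the global metric. The sketch below assumes that definition, (slice - global) / global, which the report does not state explicitly; the helper names are hypothetical. Under that assumption, the global baseline can be backed out of each row, up to rounding in the reported figures.

```python
def relative_deviation(slice_metric: float, global_metric: float) -> float:
    """Relative change of the slice metric with respect to the global metric."""
    return (slice_metric - global_metric) / global_metric


def implied_global(slice_metric: float, deviation: float) -> float:
    """Invert relative_deviation: recover the global metric from a reported row."""
    return slice_metric / (1.0 + deviation)


# (metric, slice value, reported deviation) copied from rows of this report.
reported = [
    ("Precision", 0.759, -0.1550),
    ("Recall", 0.815, -0.1240),
    ("Recall", 0.826, -0.1119),
    ("Accuracy", 0.859, -0.0562),
]

for name, value, deviation in reported:
    # Approximate only: the inputs are rounded to three or four digits.
    print(name, round(implied_global(value, deviation), 3))
```

The two Recall rows imply nearly the same global Recall (about 0.93), which is consistent with the relative-difference reading, though it does not prove it.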
👉Robustness issues (1)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | major 🔴 | — | Fail rate = 0.130 | Add typos | 104/800 tested samples (13.0%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 13.0% of the cases. We expected the predictions not to be affected by this transformation.
Index | text | Add typos(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
13 | we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . | we root for ( clara and paul ) , even like them , htough perhaps it 's an emotiom closer to pity . | POSITIVE (p = 0.96) | NEGATIVE (p = 0.99) |
16 | the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . | the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma . | POSITIVE (p = 1.00) | NEGATIVE (p = 0.60) |
22 | holden caulfield did it better . | holdsn caulfkeld did t better . | POSITIVE (p = 0.99) | NEGATIVE (p = 1.00) |
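The fail rate in the robustness table is the share of tested samples whose predicted label changes after perturbation (104/800 = 0.130). A rough sketch of that measurement follows. Both `add_typos` and `fail_rate` are hypothetical helpers: the typo generator here only swaps adjacent letters, which is a crude stand-in and not Giskard's actual "Add typos" transformation, and `predict` is assumed to be any callable that maps a string to a label.

```python
import random
from typing import Callable, Sequence


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate` (illustrative typo model)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def fail_rate(
    predict: Callable[[str], str],
    texts: Sequence[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of texts whose predicted label changes after perturbation."""
    if not texts:
        return 0.0
    changed = sum(predict(t) != predict(perturb(t)) for t in texts)
    return changed / len(texts)


# Toy classifier and perturbation, for illustration only.
def toy_predict(text: str) -> str:
    return "POSITIVE" if "good" in text else "NEGATIVE"


def toy_perturb(text: str) -> str:
    return text.replace("good", "godo")


toy_texts = ["good film", "bad film", "good plot", "dull"]
print(fail_rate(toy_predict, toy_texts, toy_perturb))  # prints 0.5
```

To apply this to a real model, `toy_predict` would be replaced by a wrapper around the classifier under test, and `toy_perturb` by something like `lambda t: add_typos(t, rate=0.05)`.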
Check out the Giskard Space and test your model.
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.