Report for textattack/bert-base-uncased-SST-2

#3 opened by giskard-bot

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 5 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

You can find the full version of the scan report here.
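
If you want to reproduce this scan locally, the sketch below shows one way to wrap the model and dataset and run the scanner. It is a minimal, illustrative setup (the column renaming, label ordering, and wrapper arguments are assumptions, not the bot's exact configuration):

```python
# Minimal sketch of reproducing this scan locally (assumed setup, not the bot's exact configuration).
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import giskard

# SST-2 validation split, as used by the automated analysis; rename "sentence"
# to "text" to match the feature name used in this report.
val = load_dataset("sst2", split="validation").to_pandas()
val = val.rename(columns={"sentence": "text"})

clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

def predict_proba(df: pd.DataFrame) -> np.ndarray:
    # Return class probabilities in [LABEL_0, LABEL_1] order for each row.
    outputs = clf(df["text"].tolist(), top_k=None)
    return np.array([
        [next(o["score"] for o in out if o["label"] == "LABEL_0"),
         next(o["score"] for o in out if o["label"] == "LABEL_1")]
        for out in outputs
    ])

gsk_model = giskard.Model(
    model=predict_proba,
    model_type="classification",
    classification_labels=[0, 1],
    feature_names=["text"],
)
gsk_dataset = giskard.Dataset(df=val, target="label", column_types={"text": "text"})

results = giskard.scan(gsk_model, gsk_dataset)
results.to_html("scan_report.html")
```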

👉Performance issues (4)

For records in the dataset where text contains "movie", the Precision is 8.81% lower than the global Precision.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text contains "movie" | Precision = 0.837 | -8.81% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | label | Predicted label |
|---|------|-------|-----------------|
| 69 | this one is definitely one to skip , even for horror movie fanatics . | LABEL_0 | LABEL_1 (p = 0.95) |
| 172 | it seems like i have been waiting my whole life for this movie and now i ca n't wait for the sequel . | LABEL_1 | LABEL_0 (p = 0.72) |
| 509 | a movie that successfully crushes a best selling novel into a timeframe that mandates that you avoid the godzilla sized soda . | LABEL_1 | LABEL_0 (p = 0.91) |
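
To verify a slice-based finding like this one locally, you can recompute the slice metric against the global metric directly. A rough sketch (the prediction and slicing code here is illustrative, not the exact logic the scanner uses):

```python
# Sketch: recompute precision on the slice `text contains "movie"` vs. globally.
from datasets import load_dataset
from sklearn.metrics import precision_score
from transformers import pipeline

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

# Hard predictions: 1 if the model predicts LABEL_1 (positive), else 0.
val["pred"] = [int(out["label"] == "LABEL_1") for out in clf(val["text"].tolist())]

global_precision = precision_score(val["label"], val["pred"])
movie = val[val["text"].str.contains("movie")]
slice_precision = precision_score(movie["label"], movie["pred"])

print(f"global precision: {global_precision:.3f}")
print(f'slice precision ("movie"): {slice_precision:.3f} '
      f"({(slice_precision / global_precision - 1) * 100:+.2f}% vs. global)")
```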

For records in the dataset where text_length(text) < 82.500 AND text_length(text) >= 73.500, the Recall is 6.97% lower than the global Recall.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 82.500 AND text_length(text) >= 73.500 | Recall = 0.870 | -6.97% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | LABEL_1 | LABEL_0 (p = 0.59) |
| 142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | LABEL_1 | LABEL_0 (p = 0.98) |
| 411 | i do n't mind having my heartstrings pulled , but do n't treat me like a fool . | 80 | LABEL_0 | LABEL_1 (p = 0.95) |
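
The length-based slices in this report (this one and the two that follow) can be checked the same way, with the slice predicate expressed on character length. A sketch under the same assumptions as above:

```python
# Sketch: recall on the slice 73.5 <= len(text) < 82.5 vs. global recall.
from datasets import load_dataset
from sklearn.metrics import recall_score
from transformers import pipeline

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")
val["pred"] = [int(out["label"] == "LABEL_1") for out in clf(val["text"].tolist())]

# Slice predicate from the report, on character length.
val["text_length"] = val["text"].str.len()
sliced = val[(val["text_length"] >= 73.5) & (val["text_length"] < 82.5)]

global_recall = recall_score(val["label"], val["pred"])
slice_recall = recall_score(sliced["label"], sliced["pred"])
print(f"global recall: {global_recall:.3f}, slice recall: {slice_recall:.3f} "
      f"({(slice_recall / global_recall - 1) * 100:+.2f}% vs. global)")
```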

For records in the dataset where text_length(text) >= 165.500 AND text_length(text) < 183.500, the Recall is 6.73% lower than the global Recall.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 183.500 | Recall = 0.872 | -6.73% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | LABEL_1 | LABEL_0 (p = 0.85) |
| 282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | LABEL_1 | LABEL_0 (p = 1.00) |
| 292 | the story and the friendship proceeds in such a way that you 're watching a soap opera rather than a chronicle of the ups and downs that accompany lifelong friendships . | 170 | LABEL_0 | LABEL_1 (p = 0.88) |

For records in the dataset where text_length(text) < 98.500 AND text_length(text) >= 86.500, the Precision is 6.21% lower than the global Precision.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 98.500 AND text_length(text) >= 86.500 | Precision = 0.861 | -6.21% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | LABEL_0 | LABEL_1 (p = 0.98) |
| 230 | reign of fire looks as if it was made without much thought -- and is best watched that way . | 93 | LABEL_1 | LABEL_0 (p = 1.00) |
| 519 | moretti 's compelling anatomy of grief and the difficult process of adapting to loss . | 87 | LABEL_0 | LABEL_1 (p = 1.00) |

👉Robustness issues (1)

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.125 | Add typos | 100/800 tested samples (12.5%) changed prediction after perturbation |

Taxonomy

avid-effect:performance:P0201
🔍✨Examples

|   | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|------|-----------------|---------------------|-------------------------------|
| 16 | the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . | the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.89) |
| 22 | holden caulfield did it better . | holdsn caulfkeld did t better . | LABEL_1 (p = 0.99) | LABEL_0 (p = 0.98) |
| 36 | the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough . | he weight of the piec e hte unerring professionalism of the chilly production , and the fascination embeded in the lurid topic prove rrcommendatioh enough . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.98) |
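
You can probe this robustness finding without Giskard by re-running the model on perturbed inputs and counting prediction flips. The sketch below uses a hand-rolled adjacent-character-swap perturbation as a stand-in for the "Add typos" transformation, so the resulting fail rate is only roughly comparable:

```python
# Sketch: estimate how often predictions flip under a crude typo perturbation.
import random

from datasets import load_dataset
from transformers import pipeline

random.seed(0)

def add_typos(text: str, rate: float = 0.05) -> str:
    # Crude approximation of "Add typos": randomly swap a few adjacent characters.
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

original = [out["label"] for out in clf(val["text"].tolist())]
perturbed = [out["label"] for out in clf([add_typos(t) for t in val["text"]])]

flips = sum(o != p for o, p in zip(original, perturbed))
print(f"fail rate: {flips}/{len(original)} ({flips / len(original):.1%} changed prediction)")
```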

We've generated test suites according to your scan results! Check out the Test Suite in our Giskard Space and the Giskard Documentation to learn more about how to test your model.
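
If you prefer to run the generated tests locally instead of (or in addition to) the Giskard Space, scan results can be turned into a runnable test suite. A short sketch, assuming the `results` and `gsk_model` objects from the scan sketch above (the exact run parameters may vary between Giskard versions):

```python
# Sketch: turn the scan findings into a local test suite and execute it.
suite = results.generate_test_suite("bert-base-uncased-SST-2 scan suite")
suite_results = suite.run(model=gsk_model)  # re-run the generated tests against the wrapped model
print(suite_results)
```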

Disclaimer: automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess their impact accordingly.
