Report for textattack/bert-base-uncased-SST-2

#3 opened by giskard-bot

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 5 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

You can find the full version of the scan report here.
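
If you want to reproduce this scan locally, the sketch below shows one way to wrap the model and dataset and run the scanner. It is a minimal, illustrative setup (the column renaming, label ordering, and wrapper arguments are assumptions, not the bot's exact configuration):

```python
# Minimal sketch of reproducing this scan locally (assumed setup, not the bot's exact configuration).
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import giskard

# SST-2 validation split, as used by the automated analysis; rename "sentence"
# to "text" to match the feature name used in this report.
val = load_dataset("sst2", split="validation").to_pandas()
val = val.rename(columns={"sentence": "text"})

clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

def predict_proba(df: pd.DataFrame) -> np.ndarray:
    # Return class probabilities in [LABEL_0, LABEL_1] order for each row.
    outputs = clf(df["text"].tolist(), top_k=None)
    return np.array([
        [next(o["score"] for o in out if o["label"] == "LABEL_0"),
         next(o["score"] for o in out if o["label"] == "LABEL_1")]
        for out in outputs
    ])

gsk_model = giskard.Model(
    model=predict_proba,
    model_type="classification",
    classification_labels=[0, 1],
    feature_names=["text"],
)
gsk_dataset = giskard.Dataset(df=val, target="label", column_types={"text": "text"})

results = giskard.scan(gsk_model, gsk_dataset)
results.to_html("scan_report.html")
```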

👉Performance issues (4)

For records in the dataset where text contains "movie", the Precision is 8.81% lower than the global Precision.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text contains "movie" | Precision = 0.837 | -8.81% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | label | Predicted label |
|---|------|-------|-----------------|
| 69 | this one is definitely one to skip , even for horror movie fanatics . | LABEL_0 | LABEL_1 (p = 0.95) |
| 172 | it seems like i have been waiting my whole life for this movie and now i ca n't wait for the sequel . | LABEL_1 | LABEL_0 (p = 0.72) |
| 509 | a movie that successfully crushes a best selling novel into a timeframe that mandates that you avoid the godzilla sized soda . | LABEL_1 | LABEL_0 (p = 0.91) |
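
To verify a slice-based finding like this one locally, you can recompute the slice metric against the global metric directly. A rough sketch (the prediction and slicing code here is illustrative, not the exact logic the scanner uses):

```python
# Sketch: recompute precision on the slice `text contains "movie"` vs. globally.
from datasets import load_dataset
from sklearn.metrics import precision_score
from transformers import pipeline

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

# Hard predictions: 1 if the model predicts LABEL_1 (positive), else 0.
val["pred"] = [int(out["label"] == "LABEL_1") for out in clf(val["text"].tolist())]

global_precision = precision_score(val["label"], val["pred"])
movie = val[val["text"].str.contains("movie")]
slice_precision = precision_score(movie["label"], movie["pred"])

print(f"global precision: {global_precision:.3f}")
print(f'slice precision ("movie"): {slice_precision:.3f} '
      f"({(slice_precision / global_precision - 1) * 100:+.2f}% vs. global)")
```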

For records in the dataset where text_length(text) < 82.500 AND text_length(text) >= 73.500, the Recall is 6.97% lower than the global Recall.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 82.500 AND text_length(text) >= 73.500 | Recall = 0.870 | -6.97% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | LABEL_1 | LABEL_0 (p = 0.59) |
| 142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | LABEL_1 | LABEL_0 (p = 0.98) |
| 411 | i do n't mind having my heartstrings pulled , but do n't treat me like a fool . | 80 | LABEL_0 | LABEL_1 (p = 0.95) |
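
The length-based slices in this report (this one and the two that follow) can be checked the same way, with the slice predicate expressed on character length. A sketch under the same assumptions as above:

```python
# Sketch: recall on the slice 73.5 <= len(text) < 82.5 vs. global recall.
from datasets import load_dataset
from sklearn.metrics import recall_score
from transformers import pipeline

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")
val["pred"] = [int(out["label"] == "LABEL_1") for out in clf(val["text"].tolist())]

# Slice predicate from the report, on character length.
val["text_length"] = val["text"].str.len()
sliced = val[(val["text_length"] >= 73.5) & (val["text_length"] < 82.5)]

global_recall = recall_score(val["label"], val["pred"])
slice_recall = recall_score(sliced["label"], sliced["pred"])
print(f"global recall: {global_recall:.3f}, slice recall: {slice_recall:.3f} "
      f"({(slice_recall / global_recall - 1) * 100:+.2f}% vs. global)")
```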

For records in the dataset where text_length(text) >= 165.500 AND text_length(text) < 183.500, the Recall is 6.73% lower than the global Recall.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 183.500 | Recall = 0.872 | -6.73% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | LABEL_1 | LABEL_0 (p = 0.85) |
| 282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | LABEL_1 | LABEL_0 (p = 1.00) |
| 292 | the story and the friendship proceeds in such a way that you 're watching a soap opera rather than a chronicle of the ups and downs that accompany lifelong friendships . | 170 | LABEL_0 | LABEL_1 (p = 0.88) |

For records in the dataset where text_length(text) < 98.500 AND text_length(text) >= 86.500, the Precision is 6.21% lower than the global Precision.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 98.500 AND text_length(text) >= 86.500 | Precision = 0.861 | -6.21% than global |

Taxonomy

avid-effect:performance:P0204
🔍✨Examples

|   | text | text_length(text) | label | Predicted label |
|---|------|-------------------|-------|-----------------|
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | LABEL_0 | LABEL_1 (p = 0.98) |
| 230 | reign of fire looks as if it was made without much thought -- and is best watched that way . | 93 | LABEL_1 | LABEL_0 (p = 1.00) |
| 519 | moretti 's compelling anatomy of grief and the difficult process of adapting to loss . | 87 | LABEL_0 | LABEL_1 (p = 1.00) |

👉Robustness issues (1)

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.125 | Add typos | 100/800 tested samples (12.5%) changed prediction after perturbation |

Taxonomy

avid-effect:performance:P0201
🔍✨Examples

|   | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|------|-----------------|---------------------|-------------------------------|
| 16 | the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . | the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.89) |
| 22 | holden caulfield did it better . | holdsn caulfkeld did t better . | LABEL_1 (p = 0.99) | LABEL_0 (p = 0.98) |
| 36 | the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough . | he weight of the piec e hte unerring professionalism of the chilly production , and the fascination embeded in the lurid topic prove rrcommendatioh enough . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.98) |
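
You can probe this robustness finding without Giskard by re-running the model on perturbed inputs and counting prediction flips. The sketch below uses a hand-rolled adjacent-character-swap perturbation as a stand-in for the "Add typos" transformation, so the resulting fail rate is only roughly comparable:

```python
# Sketch: estimate how often predictions flip under a crude typo perturbation.
import random

from datasets import load_dataset
from transformers import pipeline

random.seed(0)

def add_typos(text: str, rate: float = 0.05) -> str:
    # Crude approximation of "Add typos": randomly swap a few adjacent characters.
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

val = load_dataset("sst2", split="validation").to_pandas().rename(columns={"sentence": "text"})
clf = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

original = [out["label"] for out in clf(val["text"].tolist())]
perturbed = [out["label"] for out in clf([add_typos(t) for t in val["text"]])]

flips = sum(o != p for o, p in zip(original, perturbed))
print(f"fail rate: {flips}/{len(original)} ({flips / len(original):.1%} changed prediction)")
```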

We've generated test suites according to your scan results! Check out the Test Suite in our Giskard Space and the Giskard Documentation to learn more about how to test your model.
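
If you prefer to run the generated tests locally instead of (or in addition to) the Giskard Space, scan results can be turned into a runnable test suite. A short sketch, assuming the `results` and `gsk_model` objects from the scan sketch above (the exact run parameters may vary between Giskard versions):

```python
# Sketch: turn the scan findings into a local test suite and execute it.
suite = results.generate_test_suite("bert-base-uncased-SST-2 scan suite")
suite_results = suite.run(model=gsk_model)  # re-run the generated tests against the wrapped model
print(suite_results)
```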

Disclaimer: automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess their impact accordingly.
