Giskard Bot: Identifying robustness, performance and ethical vulnerabilities in the Top 10 Most Popular Hugging Face Models

Community Article Published March 21, 2024

TL;DR: In this article, we applied the Giskard Bot to evaluate the 10 most popular text classification models on Hugging Face. We found that over 90% of the top text classification models have robustness issues, while more than half show performance and ethical biases, particularly in sentiment analysis. This article breaks down these vulnerabilities, their causes, and offers practical solutions to detect and mitigate them. Plus, we provide a handy guide to use Giskard Bot to scan your own models, whether they're from Hugging Face Hub or custom-trained.

Article by: Mathieu Martial & Weixuan Xiao

Introduction to Machine Learning model evaluation

In a previous article, we presented the Giskard bot, which allows Hugging Face users to:

  • Automatically generate and release a vulnerability report for each new model added to the HF Hub. This report will be shared both as an HF discussion and on the model card through the submission of a pull request (PR).
  • Debug these vulnerabilities and create custom tests relevant to your business case.

In this article, we apply this Giskard bot to evaluate the 10 most popular text classification models on Hugging Face 🤗 and answer the following questions:

  • What are the key vulnerabilities of the 10 most popular models on the HF Hub?
  • Why do these vulnerabilities arise, and how can they be fixed?
  • How can you evaluate other models uploaded to the HF Hub?

💡 Methodology

The top 10 models were chosen and evaluated as follows:

  • We selected text classification models trained on publicly available datasets.
  • We only focused on models that could be used on English data.
  • We ran our performance, robustness and ethical detectors.
  • The models were evaluated on their test split (except for the ones trained on FinancialPhraseBank, which were scanned using the train split since no test split was available).

Main Findings: 90% of Top Hugging Face Models have vulnerabilities

After running the Giskard bot on the top 10 models, we found that:

  • Over 90% of the models had robustness vulnerabilities.
  • Over 50% of them had performance vulnerabilities.
  • Over 50% of the sentiment analysis models manifested ethical bias.

To better understand these vulnerabilities, let’s define some terminology:

  • Robustness: refers to the capacity of a model to resist small perturbations in the input data, such as adding typos or converting text to uppercase. If a model is not robust, even minor changes in the input can lead to significant changes in its predictions, reducing the reliability of its performance.
  • Performance bias: refers to a situation where a model exhibits low performance on specific data slices or subsets despite satisfactory performance on the overall dataset. Performance bias can manifest as significant discrepancies in accuracy, precision, recall, or other evaluation metrics across different groups or segments of the data.
  • Ethical bias: arises when a model is sensitive to perturbations of a protected attribute such as gender, ethnicity or religion. These perturbations can involve, for example, switching certain words from female to male, or swapping specific countries, nationalities or religions.

The detailed reports for each scanned model are in the table below (the numbers indicate how many issues of each type the scan detected):

| Model | Dataset | Robustness | Performance | Ethical | Scan Report |
|---|---|---|---|---|---|
| distilbert/distilbert-base-uncased-finetuned-sst-2-english | SST2 | 1 | 0 | 0 | DistilBERT on SST2 |
| lxyuan/distilbert-base-multilingual-cased-sentiments-student (most recent) | tyqiangz/multilingual-sentiments “english” | 1 | 1 | 1 | Distilbert-multilingual on tyqiangz/multilingual-sentiments |
| cardiffnlp/twitter-roberta-base-irony | tweet_eval “irony” | 1 | 3 | 0 | TwitterRoBERTa-irony on tweet_eval "irony" |
| SamLowe/roberta-base-go_emotions | go_emotions | 0 | 1 | 1 | RoBERTa-base on go_emotions |
| mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis | Financial Phrase Bank | 1 | 0 | 0 | DistilBERT-Financial on FinancialPhraseBank |
| cardiffnlp/twitter-roberta-base-sentiment-latest | tweet_eval “sentiment” | 1 | 1 | 2 | TwitterRoBERTa-sentiment on tweet_eval “sentiment” |
| ahmedrachid/FinancialBERT-Sentiment-Analysis | Financial Phrase Bank | 1 | 0 | 0 | Financial BERT on FinancialPhraseBank |
| yiyanghkust/finbert-tone | Financial Phrase Bank | 1 | 8 | 0 | FinBERT-tone on FinancialPhraseBank |
| ProsusAI/finbert | Financial Phrase Bank | 1 | 0 | 0 | FinBERT on FinancialPhraseBank |
| cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual | tweet_sentiment_multilingual “english” | 2 | 0 | 2 | XLM-RoBERTa-multilingual on tweet_sentiment_multilingual |
| cardiffnlp/twitter-roberta-base-offensive | tweet_eval “offensive” | 1 | 3 | 0 | TwitterRoBERTa-offensive on tweet_eval “offensive” |

⚠️ The intention of this article is not to criticize or disparage the work of the model authors, but rather to show how our scan can be used to find vulnerabilities in these models and to give general guidelines on how to patch them.

How to detect and mitigate ML model vulnerabilities?

Let’s go through some examples to better understand these vulnerabilities (we currently follow the AVID ML taxonomy) and give concrete actions to fix them.

Robustness issues in ML models

Our analysis shed light on a common vulnerability in these models: their high sensitivity to typos. While typos could reduce input fidelity, one would not expect them to significantly alter model predictions. Thus, a model’s sensitivity to typos is quite problematic, given their prevalence in real data, particularly for sentiment analysis models.

Moreover, sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs. Poor performance can lead to uninformed, worse decision-making downstream. As such, you want your model to be very robust against typos.

Let’s illustrate through an example.

Here is part of the report from the model distilbert-base-uncased-finetuned-sst-2-english, the most popular text classification model on 🤗.

[Screenshot: excerpt of the scan report showing typo perturbations and the resulting prediction changes]

As you can see, adding a few typos to input sentences changed 13% of the model’s predictions, which is significant. A human annotator would have no issue reading the transformed sentence (see the second column) and understanding that it is somewhat positive (the author expresses her support for other people). However, the model completely flipped its prediction: it went from being 96% sure the original input was positive to 99% sure the transformed one was negative.

Out of 11 models, 9 exhibited vulnerabilities to typos. Why is that?

When a sentence is given to the model, it is tokenized, meaning the text is split into smaller sub-words based on frequency. For example, “I love HuggingFace” is split into: I, love, H, ugging, Face. Now let’s add basic typos as if we had missed a few keystrokes: “I lpve HiggingFace”. The tokenization now results in: I, lp, ve, H, ig, ging, Face. This difference changes the model’s computations and can thus impact the final prediction.
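
You can observe this effect directly with the model’s tokenizer. A minimal sketch, assuming the 🤗 transformers library is installed (the exact sub-words depend on the tokenizer’s vocabulary, so your output may differ from the illustration above):

```python
from transformers import AutoTokenizer

# Load the tokenizer of the model discussed above (downloads it on first use).
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)

# The clean and the misspelled sentences are split into different sub-word
# sequences, which changes the input the model actually sees.
print(tokenizer.tokenize("I love HuggingFace"))
print(tokenizer.tokenize("I lpve HiggingFace"))
```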

So what can be done?

A good way to train your model against such a vulnerability is to augment your data. You could map each character on your keyboard to a set of “common typos” by looking at what other characters are next to it (be careful as there is more than one keyboard layout!). Then, randomly add typos to your data to create extra inputs for your model to learn from!
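
Here is a minimal sketch of that idea, assuming a QWERTY layout; the neighbour map is purely illustrative and would need to cover the full keyboard (and other layouts) for real use:

```python
import random

# Partial map from a character to its QWERTY neighbours (illustrative only).
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "u": "yhji", "s": "awedxz", "t": "rfgy", "n": "bhjm",
}

def add_typos(text, rate=0.05, seed=None):
    """Randomly replace a fraction of characters with a keyboard neighbour."""
    rng = random.Random(seed)
    chars = list(text)
    for idx, char in enumerate(chars):
        neighbours = KEYBOARD_NEIGHBOURS.get(char.lower())
        if neighbours and rng.random() < rate:
            chars[idx] = rng.choice(neighbours)
    return "".join(chars)

# Generate noisy copies of a training sentence to augment the dataset.
print(add_typos("I love HuggingFace", rate=0.2, seed=0))
```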

You could also decide to add a pre-processing step to clean up your data as much as possible. For example, you could try to correct typos based on a fixed vocabulary and the Levenshtein distance. The idea would be to find the closest word to the misspelled one in your vocabulary based on that distance and correct it before feeding it to the model.
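
For illustration, here is a rough sketch of such a correction step with a toy vocabulary; a real pre-processing step would use a much larger, domain-specific vocabulary:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]

VOCABULARY = {"i", "love", "hugging", "face"}  # toy vocabulary for the example

def correct_word(word, max_distance=2):
    """Replace a word by its closest vocabulary entry if it is close enough."""
    closest = min(VOCABULARY, key=lambda entry: levenshtein(word.lower(), entry))
    return closest if levenshtein(word.lower(), closest) <= max_distance else word

# Words too far from anything in the vocabulary are left untouched.
print(" ".join(correct_word(word) for word in "I lpve HuggingFace".split()))
```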

These are very standard ideas, but they can go a long way!

Performance bias in Machine Learning models

The scan reveals that some models perform worse when dealing with very specific terms. This can be troublesome if these terms are frequent in your dataset.

Let’s look at an example to better understand the issue.

The example below is from a RoBERTa model fine-tuned on tweets to predict irony. Ironically enough, it actually has low accuracy on text that contains the word “irony”.

[Screenshot: scan report excerpt showing lower accuracy on text containing the word “irony”]

Another example would be this one from FinBERT-tone, a model for financial sentiment analysis that seems to underperform on keywords such as “eur” (as in €):

[Screenshot: scan report excerpt showing FinBERT-tone underperforming on the keyword “eur”]

For a model that has to deal with financial news, even a small loss of performance on such an important word can be worrying. Furthermore, these vulnerabilities can often be hard to fix as there is no clear explanation as to why some specific words are associated with bad performance. However, it might be related to how the model was trained.

Let’s do a deep dive on the last example to understand how it can happen:

💶 FinBERT and FinBERT-tone are very similar models: both are BERT models fine-tuned for financial sentiment analysis. The main differences are that FinBERT-tone introduced its own vocabulary instead of relying on the base BERT vocabulary like FinBERT does, and that the two were fine-tuned on different datasets.

Unlike FinBERT, FinBERT-tone exhibits performance loss on words like “eur” or “finland”, which seems peculiar given how similar the two models are. FinBERT was trained on the Reuters TRC2 dataset. Reuters is a London-based international news agency, which means it deals with a lot of European news.

The dataset we used for the scan is the Financial Phrase Bank dataset. The associated paper came from a Finnish university, the corpus consists of English news on all listed companies in OMX Helsinki (the Helsinki stock exchange), and the annotators were mostly Finnish. In other words, this dataset is very “European”. Note that the FinBERT model available on 🤗 was also trained on this dataset.

FinBERT-tone, however, was trained on 3 separate datasets:

  • Earnings call transcripts from 2004 to 2019, taken from the website Seeking Alpha, a NYC-based company that publishes news on financial markets.
  • Analyst reports on S&P 500 firms (a US stock market index) from 1995 to 2008, taken from the Investext database, which contains active and historical research reports from brokerages, investment banks and independent research firms around the globe.
  • The model on 🤗 was also trained on a dataset similar to the previous one.

To sum up, this model’s training data is very “American”, especially compared to the test set. That is most likely why there are performance vulnerabilities related to the words “finland” and “eur”. While there is no certainty here, it would be a good hypothesis to explore if we were to try to improve the model.

What can be done to fix these vulnerabilities?

In our very specific example, the most obvious action to take would be to train on more diverse data, in this case articles from European news outlets. More generally, performance bias can often be traced back to the training phase. Overfitting is pretty common, and applying regularisation techniques such as dropout or weight decay can prevent the model from overfitting to majority classes or dominant groups and encourage it to focus on more robust features.
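
As a hedged sketch of where these knobs live when fine-tuning a 🤗 classifier, the dropout and weight-decay settings below are the usual starting points; the base model and all values are illustrative, not tuned:

```python
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    TrainingArguments,
)

# Slightly stronger dropout than the default to discourage overfitting
# (values should be tuned on a validation set).
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    num_labels=3,
    hidden_dropout_prob=0.2,
    attention_probs_dropout_prob=0.2,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)

# Weight decay adds an L2-style penalty on the weights during fine-tuning.
training_args = TrainingArguments(
    output_dir="finetuned-with-regularisation",  # hypothetical output directory
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)
```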

Ethical bias in AI models

Models can unfortunately learn undesired patterns. These biases can be related to gender, religion, country or nationality.

Here are a few examples from TwitterRoBERTa-sentiment (sentiment analysis):

[Screenshot: scan report example showing religion bias]

[Screenshot: scan report example showing gender bias]

Why does it happen?

Language is inherently and unavoidably biased, and text classification models are trained on sentences written by people, thus reflecting our very own biases. When a model is fed data, it uses that data as its sole knowledge base and treats it as factual. However, the data may be ingrained with biases and misinformation, which the model’s outputs can end up reflecting.

This can be problematic on many levels: you do not want your models to perpetuate stereotypes, as in the second example above, where replacing the word “guy” with “gal” changed the model’s prediction from neutral to negative.

How to fix these vulnerabilities?

Again, the best thing to do here is to ensure that the training data is diverse and representative of various demographic groups, genders, ethnicities, religions, and so on. As such, data curation and data augmentation are good ways to start tackling ethical bias.
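
One simple, hedged starting point is counterfactual data augmentation: duplicate training examples while swapping protected-attribute terms, so the model sees both variants with the same label. The word list below is a tiny illustrative sample, not a curated mapping:

```python
import re

# Tiny illustrative mapping of gendered terms; a real list would be much
# larger and carefully curated (and could cover religions, nationalities, etc.).
SWAPS = {
    "guy": "gal", "gal": "guy", "he": "she", "she": "he",
    "him": "her", "man": "woman", "woman": "man",
}

def swap_terms(text):
    """Return a copy of the text with protected-attribute terms swapped."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b\w+\b", replace, text)

original = ("What a nice guy", "positive")
augmented = (swap_terms(original[0]), original[1])  # same label, swapped wording
print(augmented)
```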

🐢 For more recommendations, we encourage you to check out the Giskard documentation, where you’ll find key definitions, in-depth explanations and concrete solutions for all vulnerabilities!

How to scan your own Hugging Face model for vulnerabilities?

We only did the analysis for the most popular text classification models but you can do it for your own model as well! Just follow these instructions:

Evaluating existing models on Hugging Face Hub

If the model you want to evaluate is already on the Hub, you can directly copy its ID and the ID of the dataset you want to test it on.
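
If you would rather run the scan locally instead of through the evaluator Space described below, the giskard Python library also exposes a scan API. The rough sketch below follows the pattern from the Giskard documentation, but the argument names, label order and the to_html call are assumptions you should double-check against the docs:

```python
import giskard
import numpy as np
import pandas as pd
from transformers import pipeline

# Wrap a Hub model (here the SST-2 DistilBERT from the table above) in a
# prediction function that returns one row of class probabilities per input.
classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,
)
LABELS = ["NEGATIVE", "POSITIVE"]

def predict(df: pd.DataFrame) -> np.ndarray:
    outputs = classifier(df["text"].tolist())
    return np.array([
        [next(s["score"] for s in out if s["label"] == label) for label in LABELS]
        for out in outputs
    ])

model = giskard.Model(
    model=predict,
    model_type="classification",
    classification_labels=LABELS,
    feature_names=["text"],
)
dataset = giskard.Dataset(
    pd.DataFrame({
        "text": ["I love this movie", "This was a waste of time"],  # toy examples
        "label": ["POSITIVE", "NEGATIVE"],
    }),
    target="label",
)

results = giskard.scan(model, dataset)
results.to_html("scan_report.html")
```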

Evaluating your own custom trained Machine Learning model

You can refer to the 🤗 hub documentation here to publish your model on HF.

Many popular libraries and frameworks have built-in support on 🤗, including Keras, MLX, PyTorch, scikit-learn, etc.
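
For a transformers model, a minimal way to publish it is push_to_hub; the local path and repository name below are hypothetical, and you need to be logged in (for example via huggingface-cli login):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your locally fine-tuned model and tokenizer (hypothetical path).
model = AutoModelForSequenceClassification.from_pretrained("./my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")

# Push both to a (hypothetical) repository on the Hub.
model.push_to_hub("your-username/my-text-classifier")
tokenizer.push_to_hub("your-username/my-text-classifier")
```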

You need to add the model’s task to your model card, such as text-classification.

We strongly recommend adding other metadata to your model card as well. You can specify the library used, the language, the associated dataset, the metrics, etc. This can be done in the metadata UI when editing the model’s README file:

[Screenshot: metadata UI in the model card editor]
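
If you prefer to set this metadata programmatically rather than through the UI, the huggingface_hub library provides metadata_update. A hedged sketch, where the repository name and all values are illustrative:

```python
from huggingface_hub import metadata_update

# Add or update model card metadata on a (hypothetical) repository.
metadata_update(
    "your-username/my-text-classifier",
    {
        "pipeline_tag": "text-classification",
        "language": "en",
        "library_name": "transformers",
        "datasets": ["tweet_eval"],
        "metrics": ["accuracy", "f1"],
    },
    overwrite=True,
)
```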

You then need to choose one of the popular datasets on 🤗, or upload a dataset with a specific task that fits the target of your model.
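
If your evaluation data is not on the Hub yet, you can upload it with the datasets library; a minimal sketch where the column names and repository ID are illustrative:

```python
import pandas as pd
from datasets import Dataset

# Build a dataset with the text column and labels your model expects.
df = pd.DataFrame({
    "text": ["I love this product", "This was disappointing"],
    "label": ["positive", "negative"],
})

# Push it to a (hypothetical) dataset repository on the Hub.
Dataset.from_pandas(df).push_to_hub("your-username/my-eval-dataset")
```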

Running Giskard Bot Tests for ML Model vulnerabilities!

Now that your model is ready to be tested, head over to our Giskard Evaluator on 🤗 Spaces here.

Then, you can fill in the blanks with your own model details as follows:

[Screenshot: Giskard Evaluator form with the model and dataset IDs filled in]

You can then select your subset and split if necessary, and you’ll be ready for the scan!

Note that sometimes the labels predicted by the model do not exactly match the ones from the dataset. In that case, you can simply check the label mapping under “Label and Feature Mapping” and choose the labels with matching semantic meanings.

[Screenshot: “Label and Feature Mapping” settings]

You then need to add your HF access token to call the HF Inference API (you can find it here), and you can start the scan!

[Screenshot: access token field before launching the scan]

You’ll be able to see the evaluator running in the Logs tab, and once it’s complete, the results will appear in the Discussion tab.

[Screenshot: scan results posted in the Discussion tab]

⚠️ Analyze the scan result cautiously

Please note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.

For example, a model presenting robustness vulnerabilities can actually be desired behavior in some cases. Our detectors run all sorts of tests, for example “remove punctuation” or “transform to uppercase”, and might warn you of related vulnerabilities, but maybe these are actually a feature of your model.

To illustrate, think of a model that would deal with offensive tweets (such as TwitterRoBERTa-offensive). As a human annotator, you might say that “hello are you there” is not offensive but “hello???? are you there???” is. Thus, our scan would detect a robustness vulnerability when removing punctuation, but looking at the full picture, it isn’t an actual issue.

Another example is rewriting an entire tweet in uppercase: “are you being real?” can express surprise, but “ARE YOU BEING REAL?” can express anger. You would want your model to distinguish the two. Some base models like RoBERTa are case-sensitive by default, so it makes sense that our detectors will find discrepancies between uncased and cased data.

Conclusion

Even the most popular models are flawed, and detecting vulnerabilities can be very difficult. If left unchecked, they can lead to massive loss in performance on real data! They can also end up reflecting problematic ethical biases and showing unjustified preferences. All those vulnerabilities can lead to poor decision-making downstream, hurting your business in the process.

The Giskard Evaluator is your ally in your quest to detect vulnerabilities and protect yourself from potentially business-critical consequences, as it allows you to quickly run all sorts of tests on your models. Is my model robust to perturbations that might be seen on real data? Was the training data diverse enough? Does it underperform on important data slices? Answer all these questions easily with our evaluator!

Give it a try for yourself and don’t forget to leave a ⭐ on our GitHub repository as it helps us greatly!