How do you make sure a model works equally well for different groups of people? It turns out that in many situations, this is harder than you might think.
The problem is that there are different ways to measure the accuracy of a model, and often it's mathematically impossible for them all to be equal across groups.
We'll illustrate how this happens by creating a (fake) medical model to screen these people for a disease.
About half of these people actually have the disease.
In a perfect world, only sick people would test positive.
But models and tests aren't perfect.
The model might make a mistake and mark a sick person as healthy.
Or the opposite: marking a healthy person as sick.
If there's a simple follow-up test, we could have the model aggressively call close cases so it rarely misses the disease.
We can quantify this by measuring the percentage of sick people who test positive.
On the other hand, if there isn't a secondary test, or the treatment uses a drug with a limited supply, we might care more about the percentage of people with positive tests who are actually sick.
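As a rough sketch of the two measurements discussed here, both can be read off a tally of the model's four possible outcomes. The counts and the `Outcomes` class below are made up for illustration and are not taken from the explorable.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Hypothetical tally of screening results (illustrative only)."""
    sick_positive: int  # sick people the model marked as sick
    sick_negative: int  # sick people the model marked as healthy (missed)
    well_positive: int  # healthy people the model marked as sick
    well_negative: int  # healthy people the model marked as healthy


def share_of_sick_who_test_positive(o: Outcomes) -> float:
    """Of everyone who is sick, what fraction did the model catch?"""
    return o.sick_positive / (o.sick_positive + o.sick_negative)


def share_of_positives_who_are_sick(o: Outcomes) -> float:
    """Of everyone the model flagged, what fraction is actually sick?"""
    return o.sick_positive / (o.sick_positive + o.well_positive)


# Two made-up models screening the same 100 people (50 sick, 50 healthy).
aggressive = Outcomes(sick_positive=45, sick_negative=5,
                      well_positive=20, well_negative=30)
cautious = Outcomes(sick_positive=30, sick_negative=20,
                    well_positive=5, well_negative=45)

for name, o in (("aggressive", aggressive), ("cautious", cautious)):
    print(name,
          f"catches {share_of_sick_who_test_positive(o):.0%} of sick people;",
          f"{share_of_positives_who_are_sick(o):.0%} of its positives are sick")
```

With these invented counts, the aggressive model misses fewer sick people, while a positive result from the cautious model is more trustworthy.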
These issues and trade-offs in model optimization aren't new, but they're brought into focus when we have the ability to fine-tune exactly how aggressively disease is diagnosed.
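One common way to picture "how aggressively disease is diagnosed" is a cutoff on a risk score: everyone at or above the cutoff is marked sick. The sketch below assumes such a score exists. The ten scores, the labels, and the `metrics_at` helper are hypothetical and only meant to show the trade-off.

```python
# Each entry is (model score, person is actually sick). Invented data.
people = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False), (0.65, True),
    (0.55, False), (0.45, True), (0.40, False), (0.30, False), (0.10, False),
]


def metrics_at(threshold, people):
    """Return (share of sick who test positive, share of positives who are sick)."""
    # Everyone scoring at or above the threshold is marked as sick.
    flagged = [is_sick for score, is_sick in people if score >= threshold]
    total_sick = sum(is_sick for _, is_sick in people)
    caught = sum(flagged)
    share_sick_caught = caught / total_sick
    # If nobody is flagged, the second share is undefined.
    share_flagged_sick = caught / len(flagged) if flagged else float("nan")
    return share_sick_caught, share_flagged_sick


# A lower threshold means a more aggressive model.
for threshold in (0.75, 0.50, 0.35):
    caught, trusted = metrics_at(threshold, people)
    print(f"threshold {threshold:.2f}: catches {caught:.0%} of sick people, "
          f"{trusted:.0%} of positives are sick")
```

On this toy data, lowering the threshold catches more of the sick people but makes each positive result less likely to be correct.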
Try adjusting how aggressive the model is in diagnosing the disease.
Things get even more complicated when we check if the model treats different groups fairly.¹
Whatever we decide on in terms of trade-offs between these metrics, we'd probably like them to be roughly even across different groups of people.
If we're trying to evenly allocate resources, having the model miss more cases in children than adults would be bad!²
If you look carefully, you'll see that the disease is more prevalent in children. That is, the "base rate" of the disease is different across groups.
The fact that the base rates are different makes the situation surprisingly tricky. For one thing, even though the test catches the same percentage of sick adults and sick children, an adult who tests positive is less likely to have the disease than a child who tests positive.
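The effect of the base rate can be checked with a little arithmetic. The sketch below assumes the test behaves identically in both groups (same chance of catching a sick person, same chance of wrongly flagging a healthy one) and only the base rate differs. All of the numbers are hypothetical, not the ones used in the explorable.

```python
def share_of_positives_who_are_sick(base_rate, catch_rate, false_alarm_rate):
    """Fraction of positive tests that belong to people who are actually sick.

    base_rate:        fraction of the group that has the disease
    catch_rate:       fraction of sick people who test positive
    false_alarm_rate: fraction of healthy people who test positive
    """
    sick_and_positive = base_rate * catch_rate
    well_and_positive = (1 - base_rate) * false_alarm_rate
    return sick_and_positive / (sick_and_positive + well_and_positive)


# Hypothetical test: identical behaviour in both groups.
CATCH_RATE = 0.8
FALSE_ALARM_RATE = 0.2

# Hypothetical base rates: the disease is more common in children.
children = share_of_positives_who_are_sick(0.6, CATCH_RATE, FALSE_ALARM_RATE)
adults = share_of_positives_who_are_sick(0.3, CATCH_RATE, FALSE_ALARM_RATE)

print(f"child who tests positive is sick with probability {children:.0%}")
print(f"adult who tests positive is sick with probability {adults:.0%}")
```

Even though the same test is applied to everyone, the larger pool of healthy adults produces more mistaken positives relative to true ones, so a positive result means less for an adult.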
Why is there a disparity in diagnosing between children and adults? There is a higher proportion of well adults, so mistakes in the test will cause more well adults to be marked "positive" than well children (and similarly with mistaken negatives).
To fix this, we could have the model take age into account.
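One way to read "take age into account" is to let the model be more or less aggressive for each group. As a sketch under assumed numbers (the base rates, the catch rate, and the target below are all hypothetical), we can ask how often healthy people in each group may be wrongly flagged if both groups are to share the same catch rate and the same trustworthiness of a positive result.

```python
def false_alarm_rate_needed(base_rate, catch_rate, target_share_sick):
    """False-alarm rate that makes `target_share_sick` of positives truly sick.

    Rearranged from:
        target = p * r / (p * r + (1 - p) * f)
    where p is the base rate, r the catch rate, and f the false-alarm rate.
    """
    sick_and_positive = base_rate * catch_rate
    return (sick_and_positive * (1 - target_share_sick)
            / ((1 - base_rate) * target_share_sick))


CATCH_RATE = 0.8          # hypothetical: same in both groups
TARGET_SHARE_SICK = 0.8   # hypothetical: same in both groups

# Hypothetical base rates: the disease is more common in children.
children_rate = false_alarm_rate_needed(0.6, CATCH_RATE, TARGET_SHARE_SICK)
adults_rate = false_alarm_rate_needed(0.3, CATCH_RATE, TARGET_SHARE_SICK)

print(f"allowed false alarms among healthy children: {children_rate:.1%}")
print(f"allowed false alarms among healthy adults:   {adults_rate:.1%}")
```

Under these assumptions the two metrics can be matched across groups, but only by holding healthy adults to a much stricter false-alarm rate than healthy children.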
Thankfully, the notion of fairness you choose to satisfy will depend on the context of your model, so while it may not be possible to satisfy every definition of fairness, you can focus on the notions of fairness that make sense for your use case.
Even if fairness along every dimension isn't possible, we shouldn't stop checking for bias. The Hidden Bias explorable outlines different ways human bias can feed into an ML model.
In some contexts, setting different thresholds for different populations might not be acceptable. "Can you make AI fairer than a judge?" explores an algorithm that can send people to jail.
There are lots of different metrics you might use to determine if an algorithm is fair. "Attacking discrimination with smarter machine learning" shows how several of them work. Using Fairness Indicators in conjunction with the What-If Tool and other fairness tools, you can test your own model against commonly used fairness metrics.
Machine learning practitioners use words like “recall” to describe the percentage of sick people who test positive. Check out the PAIR Guidebook Glossary to learn how to talk to the people building the models.
¹ This essay uses very academic, mathematical standards for fairness that don't encompass everything we might include in the colloquial meaning of fairness. There's a gap between the technical descriptions of algorithms here and the social context that they're deployed in.
² Sometimes we might care more about different error modes in different populations. If treatment is riskier for children, we'd probably want the model to be less aggressive in diagnosing.
³ The above example assumes the model sorts and scores people based on how likely it is that they are sick. With complete control over the model's exact rate of under- and over-diagnosing in both groups, it's actually possible to align both of the metrics we've discussed so far. Try tweaking the model below to get both of them to line up.
Adding a third metric, the percentage of well people who test negative, makes it generally impossible to line up all three when the base rates differ.
Adam Pearce // May 2020
Thanks to Carey Radebaugh, Dan Nanas, David Weinberger, Emily Denton, Emily Reif, Fernanda Viégas, Hal Abelson, James Wexler, Kristen Olson, Lucas Dixon, Mahima Pushkarna, Martin Wattenberg, Michael Terry, Rebecca Salois, Timnit Gebru, Tulsee Doshi, Yannick Assogba, Yoni Halpern, Zan Armstrong, and my other colleagues at Google for their help with this piece.
Silhouettes from ProPublica's Wee People.
Credits