Measuring Fairness

There are multiple ways to measure accuracy. No matter how we build our model, accuracy across these measures will vary when applied to different groups of people.

How do you make sure a model works equally well for different groups of people? It turns out that in many situations, this is harder than you might think.

The problem is that there are different ways to measure the accuracy of a model, and often it's mathematically impossible for them all to be equal across groups.

We'll illustrate how this happens by creating a (fake) medical model to screen these people for a disease.

Ground Truth

About half of these people actually have the disease; half of them don't.

Model Predictions

In a perfect world, only sick people would test positive for the disease and only healthy people would test negative.

Model Mistakes

But models and tests aren't perfect.

The model might make a mistake and mark a sick person as healthy.

Or the opposite: marking a healthy person as sick.

Never Miss the Disease...

If there's a simple follow-up test, we could have the model aggressively call close cases so it rarely misses the disease.

We can quantify this by measuring the percentage of sick people who test positive.

...Or Avoid Overcalling?

On the other hand, if there isn't a secondary test, or the treatment uses a drug with a limited supply, we might care more about the percentage of people with positive tests who are actually sick.
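These two metrics are easy to compute side by side in code. Here's a minimal sketch with made-up labels (the array names and numbers are just for illustration):

```python
import numpy as np

# Hypothetical example: 1 = has the disease / tests positive, 0 = doesn't.
is_sick        = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
tests_positive = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)

true_positives = (is_sick & tests_positive).sum()

# "Never miss the disease": share of sick people who test positive (recall / sensitivity).
recall = true_positives / is_sick.sum()

# "Avoid overcalling": share of positive tests that are actually sick (precision).
precision = true_positives / tests_positive.sum()

print(f"recall = {recall:.0%}, precision = {precision:.0%}")  # recall = 75%, precision = 75%
```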

These issues and trade-offs in model optimization aren't new, but they're brought into focus when we have the ability to fine-tune exactly how aggressively disease is diagnosed.

Try adjusting how aggressive the model is in diagnosing the disease.
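The slider above is essentially a decision threshold on the model's score. A small simulation (the scores and numbers are invented for illustration) shows the same trade-off: a lower threshold catches more sick people, but fewer of the positives are actually sick.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scores (all numbers invented): sick people tend to get higher scores.
is_sick = rng.random(1_000) < 0.5
score = np.where(is_sick,
                 rng.normal(0.65, 0.15, 1_000),
                 rng.normal(0.35, 0.15, 1_000))

# A lower threshold means a more aggressive diagnosis.
for threshold in [0.3, 0.5, 0.7]:
    positive = score >= threshold
    recall = (is_sick & positive).sum() / is_sick.sum()
    precision = (is_sick & positive).sum() / positive.sum()
    print(f"threshold {threshold:.1f}: recall {recall:.0%}, precision {precision:.0%}")
```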

Subgroup Analysis

Things get even more complicated when we check if the model treats different groups fairly.¹

Whatever trade-off we settle on between these metrics, we'd probably like them to be roughly even across different groups of people.

If we're trying to evenly allocate resources, having the model miss more cases in children than adults would be bad! ²
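Checking this amounts to computing the same metric separately for each group. A toy sketch (the labels and group tags are hypothetical):

```python
import numpy as np

# Hypothetical ground truth, predictions, and a group label for each person.
is_sick   = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0], dtype=bool)
predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0], dtype=bool)
group     = np.array(["child", "child", "child", "adult", "adult",
                      "adult", "child", "adult", "adult", "child"])

for g in ["child", "adult"]:
    mask = group == g
    sick, positive = is_sick[mask], predicted[mask]
    recall = (sick & positive).sum() / sick.sum()  # share of sick people the model catches
    print(f"{g}: {recall:.0%} of sick people test positive")
```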

Base Rates

If you look carefully, you'll see that the disease is more prevalent in children. That is, the "base rate" of the disease is different across groups.

The fact that the base rates are different makes the situation surprisingly tricky. For one thing, even though the test catches the same percentage of sick adults and sick children, an adult who tests positive is less likely to have the disease than a child who tests positive.
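A quick back-of-the-envelope calculation shows why. Suppose (numbers assumed purely for illustration) the test catches 80% of sick people and clears 80% of well people in both groups, but the disease is more common in children:

```python
# Back-of-the-envelope check (all numbers assumed): the test catches 80% of sick
# people and clears 80% of well people in BOTH groups; only the base rates differ.
sensitivity, specificity = 0.80, 0.80

def positive_predictive_value(base_rate):
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

print(f"children (50% base rate): {positive_predictive_value(0.50):.0%} of positives are sick")
print(f"adults   (20% base rate): {positive_predictive_value(0.20):.0%} of positives are sick")
# Same test, but a positive result means less for an adult: 80% vs 50%.
```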

Imbalanced Metrics

Why is there a disparity in diagnosing between children and adults? There is a higher proportion of well adults, so mistakes in the test will cause more well adults to be marked "positive" than well children (and similarly with mistaken negatives).

To fix this, we could have the model take age into account.

Try adjusting the slider to make the model grade adults less aggressively than children.

This allows us to align one metric. But now adults who have the disease are less likely to be diagnosed with it!
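Here's a toy simulation of that per-group adjustment (the score distributions, base rates, and thresholds are all invented): a stricter threshold for adults pulls their precision closer to the children's, but it also lowers the share of sick adults the model catches.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(n, base_rate):
    """Toy group: sick people tend to score higher than well people."""
    sick = rng.random(n) < base_rate
    score = np.where(sick,
                     rng.normal(0.65, 0.15, n),
                     rng.normal(0.35, 0.15, n))
    return sick, score

def metrics(sick, score, threshold):
    positive = score >= threshold
    recall = (sick & positive).sum() / sick.sum()          # sick people who test positive
    precision = (sick & positive).sum() / positive.sum()   # positives who are actually sick
    return recall, precision

children_sick, children_score = simulate_group(10_000, base_rate=0.50)
adults_sick, adults_score = simulate_group(10_000, base_rate=0.20)

# A stricter threshold for adults narrows the precision gap,
# but widens the recall gap.
print("children:", metrics(children_sick, children_score, threshold=0.50))
print("adults:  ", metrics(adults_sick, adults_score, threshold=0.60))
```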

No matter how you move the sliders, you won't be able to make both metrics fair at once. It turns out this is inevitable any time the base rates are different, and the test isn't perfect.

There are multiple ways to define fairness mathematically. It usually isn't possible to satisfy all of them.³


Thankfully, the notion of fairness you choose to satisfy will depend on the context of your model, so while it may not be possible to satisfy every definition of fairness, you can focus on the notions of fairness that make sense for your use case.

Even if fairness along every dimension isn't possible, we shouldn't stop checking for bias. The Hidden Bias explorable outlines different ways human bias can feed into an ML model.

More Reading

In some contexts, setting different thresholds for different populations might not be acceptable. Can you make AI fairer than a judge? explores an algorithm that can send people to jail.

There are lots of different metrics you might use to determine if an algorithm is fair. Attacking discrimination with smarter machine learning shows how several of them work. Using Fairness Indicators in conjunction with the What-If Tool and other fairness tools, you can test your own model against commonly used fairness metrics.

Machine learning practitioners use words like “recall” to describe the percentage of sick people who test positive. Check out the PAIR Guidebook Glossary to learn how to talk to the people building the models.


¹ This essay uses very academic, mathematical standards for fairness that don't encompass everything we might include in the colloquial meaning of fairness. There's a gap between the technical descriptions of algorithms here and the social context that they're deployed in.

² Sometimes we might care more about different error modes in different populations. If treatment is riskier for children, we'd probably want the model to be less aggressive in diagnosing.

³ The above example assumes the model sorts and scores people based on how likely it is that they are sick. With complete control over the model's exact rate of under- and over-diagnosing in both groups, it's actually possible to align both of the metrics we've discussed so far. Try tweaking the model below to get both of them to line up.

Adding a third metric, the percentage of well people who test negative, makes perfect fairness impossible. Can you see why all three metrics won't align unless the base rate of the disease is the same in both populations?
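One way to see the constraint, using standard shorthand for these quantities: write p for a group's base rate, TPR for the share of sick people who test positive, TNR for the share of well people who test negative, and PPV for the share of positive tests that are actually sick. Then

PPV = (TPR × p) / (TPR × p + (1 − TNR) × (1 − p))

If TPR and TNR are matched across two groups but their base rates p differ, the right-hand side takes different values, so PPV has to differ too, unless the test is perfect (TNR = 1, so there are no false positives).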

Drag — to adjust model accuracy and | to adjust the occurrence of disease.


Adam Pearce // May 2020

Thanks to Carey Radebaugh, Dan Nanas, David Weinberger, Emily Denton, Emily Reif, Fernanda Viégas, Hal Abelson, James Wexler, Kristen Olson, Lucas Dixon, Mahima Pushkarna, Martin Wattenberg, Michael Terry, Rebecca Salois, Timnit Gebru, Tulsee Doshi, Yannick Assogba, Yoni Halpern, Zan Armstrong, and my other colleagues at Google for their help with this piece.

Silhouettes from ProPublica's Wee People.