Measuring Fairness

How do you make sure a model works equally well for different groups of people? It turns out that in many situations, this is harder than you might think.

The problem is that there are different ways to measure the accuracy of a model, and often it's mathematically impossible for them all to be equal across groups.

We'll illustrate how this happens by creating a (fake) medical model to screen these people for a disease.

Ground Truth

About half of these people actually have the disease; half of them don't.

Model Predictions

In a perfect world, only sick people would test positive for the disease and only healthy people would test negative.

Model Mistakes

But models and tests aren't perfect.

The model might make a mistake and mark a sick person as healthy.

Or the opposite: marking a healthy person as sick.

Never Miss the Disease...

If there's a simple follow-up test, we could have the model aggressively call close cases so it rarely misses the disease.

We can quantify this by measuring the percentage of sick people who test positive.
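This percentage is commonly called recall (or sensitivity). Here is a minimal Python sketch of the calculation. The labels are made up for illustration and are not the article's data, and the function name is hypothetical:

```python
def recall(actual_sick, predicted_sick):
    """Fraction of truly sick people whom the model flags as sick.

    Assumes at least one person is actually sick.
    """
    true_positives = sum(a and p for a, p in zip(actual_sick, predicted_sick))
    total_sick = sum(actual_sick)
    return true_positives / total_sick


# Hypothetical toy data: True means "sick" (actual) or "tested positive" (predicted).
actual_sick = [True, True, True, True, False, False, False, False]
predicted_sick = [True, True, True, False, True, False, False, False]

# Three of the four sick people test positive.
sick_recall = recall(actual_sick, predicted_sick)
print(f"recall = {sick_recall:.2f}")
```

A model that calls close cases aggressively pushes this number toward 1, at the cost of flagging more healthy people.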

...Or Avoid Overcalling?

On the other hand, if there isn't a secondary test, or the treatment uses a drug with a limited supply, we might care more about the percentage of people with positive tests who are actually sick.
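This second percentage is commonly called precision. Here is a minimal sketch on made-up labels (not the article's data), with a hypothetical function name:

```python
def precision(actual_sick, predicted_sick):
    """Fraction of positive tests that belong to truly sick people.

    Assumes the model flags at least one person as positive.
    """
    true_positives = sum(a and p for a, p in zip(actual_sick, predicted_sick))
    total_positive_tests = sum(predicted_sick)
    return true_positives / total_positive_tests


# Hypothetical toy data: True means "sick" (actual) or "tested positive" (predicted).
actual_sick = [True, True, True, True, False, False, False, False]
predicted_sick = [True, True, True, False, True, True, False, False]

# Five people test positive, but only three of them are sick.
test_precision = precision(actual_sick, predicted_sick)
print(f"precision = {test_precision:.2f}")
```

The denominator here is everyone who tested positive, not everyone who is sick, which is why this number can move independently of the share of sick people the model catches.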

These issues and trade-offs in model optimization aren't new, but they're brought into focus when we have the ability to fine-tune exactly how aggressively disease is diagnosed.

Try adjusting how aggressive the model is in diagnosing the disease.
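One way to picture this adjustment is as a threshold on a risk score. The sketch below assumes the model outputs a score between 0 and 1 for each person and marks anyone at or above the threshold as positive. The scores, thresholds and the function name `metrics_at` are hypothetical.

```python
# Hypothetical risk scores, paired with whether each person is actually sick.
scores = [0.95, 0.80, 0.65, 0.40, 0.70, 0.45, 0.20, 0.10]
sick = [True, True, True, True, False, False, False, False]


def metrics_at(threshold):
    """Return (share of sick people caught, share of positives who are sick)."""
    predicted = [score >= threshold for score in scores]
    true_positives = sum(p and a for p, a in zip(predicted, sick))
    caught = true_positives / sum(sick)
    # With no positive tests at all, the second share is undefined.
    correct = true_positives / sum(predicted) if any(predicted) else float("nan")
    return caught, correct


aggressive = metrics_at(0.30)  # low threshold: flag close cases
cautious = metrics_at(0.75)    # high threshold: flag only clear cases

print("aggressive:", aggressive)
print("cautious:  ", cautious)
```

On this toy data the low threshold catches every sick person but a third of its positives are healthy, while the high threshold makes no false alarms but misses half of the sick people.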

Subgroup Analysis

Things get even more complicated when we check if the model treats different groups fairly.¹

Whatever we decide on in terms of trade-offs between these metrics, we'd probably like them to be roughly even across different groups of people.

If we're trying to evenly allocate resources, having the model miss more cases in children than adults would be bad!²

Base Rates

If you look carefully, you'll see that the disease is more prevalent in children. That is, the "base rate" of the disease is different across groups.

The fact that the base rates are different makes the situation surprisingly tricky. For one thing, even though the test catches the same percentage of sick adults and sick children, an adult who tests positive is less likely to have the disease than a child who tests positive.

Imbalanced Metrics

Why do the diagnoses differ between children and adults? A higher proportion of adults are well, so the test's mistakes mark more well adults as "positive" than well children (and similarly for mistaken negatives).
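A small worked example makes the effect of base rates concrete. All numbers are hypothetical: 1,000 people per group, a base rate of 50% in children and 20% in adults, and a test that catches 80% of sick people in both groups. The sketch also assumes the test wrongly flags 20% of well people in both groups, which the text above does not specify.

```python
def screen(population, base_rate, recall, false_positive_rate):
    """Return (true positives, false positives, share of positives who are sick)."""
    sick = round(population * base_rate)
    well = population - sick
    true_positives = round(sick * recall)
    false_positives = round(well * false_positive_rate)
    share_sick = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_sick


# Same test behaviour in both groups; only the (assumed) base rate differs.
children = screen(1000, base_rate=0.5, recall=0.8, false_positive_rate=0.2)
adults = screen(1000, base_rate=0.2, recall=0.8, false_positive_rate=0.2)

print("children:", children)
print("adults:  ", adults)
```

Under these assumptions the test catches the same share of sick people in each group, yet there are 800 well adults against 500 well children, so more well adults are flagged by mistake and a smaller share of adult positives are truly sick.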

To fix this, we could have the model take age into account.

Try adjusting the slider to make the model grade adults less aggressively than children.

This allows us to align one metric. But now adults who have the disease are less likely to be diagnosed with it!

No matter how you move the sliders, you won't be able to make both metrics fair at once. It turns out this is inevitable whenever the base rates differ and the test isn't perfect.
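The tension can be sketched with a little arithmetic. The share of positive tests that are truly sick depends on three things: how many sick people the test catches, how often it wrongly flags well people, and the base rate. The sketch below uses hypothetical numbers (base rates of 50% for children and 20% for adults, 80% of sick people caught, 20% of well children wrongly flagged) and a hypothetical helper to ask how rarely the test would have to flag well adults for that share to match across groups.

```python
def required_false_positive_rate(recall, target_precision, base_rate):
    """False positive rate at which precision equals target_precision.

    Solves precision = tp / (tp + fpr * (1 - base_rate)) for fpr,
    where tp = recall * base_rate.
    """
    tp = recall * base_rate
    return tp * (1 - target_precision) / (target_precision * (1 - base_rate))


# Hypothetical values, chosen only to illustrate the arithmetic.
recall = 0.8
child_base_rate, adult_base_rate = 0.5, 0.2
child_false_positive_rate = 0.2

child_tp = recall * child_base_rate
child_precision = child_tp / (
    child_tp + child_false_positive_rate * (1 - child_base_rate)
)

# Keep catching 80% of sick adults, and ask what it takes to match precision.
adult_fpr_needed = required_false_positive_rate(
    recall, child_precision, adult_base_rate
)

print(f"child precision:  {child_precision:.2f}")
print(f"adult FPR needed: {adult_fpr_needed:.2f}")
```

With these numbers the test would have to flag well adults four times less often than well children while still catching the same share of sick adults. Making a single imperfect model less aggressive typically lowers both rates together, so lining up one metric pulls the other out of line.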

There are multiple ways to define fairness mathematically. It usually isn't possible to satisfy all of them.³
