Measuring Diversity

Search results that reflect historic inequities can amplify stereotypes and perpetuate under-representation. Carefully measuring diversity in data sets can help.

Search, ranking and recommendation systems can help find useful documents in large datasets. However, these datasets reflect the biases of the society in which they were created and the systems risk re-entrenching those biases. For example, if someone who is not a white man searches for “CEO pictures” and sees a page of white men, they may feel that only white men can be CEOs, further perpetuating lack of representation at companies’ executive levels.

Using the metrics outlined in a recent paper, Diversity and Inclusion Metrics in Subset Selection, we can quantify biases and push these systems to return a wider range of results.

The mathematics of all this is a little easier to follow with abstract shapes. Let’s take a look at some of them:

Suppose we want to return about 30% green boxes to reflect the distribution of some larger universe of shapes. Try clicking on the shapes below to select some of them — can you find a better subset to return?

Green Target: 30% | Actual: 13% | Difference: 18%
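The difference shown for a selection can be sketched in a few lines of Python. This is a minimal illustration, not the widget's code: the `percentage` and `difference` helpers and the eight-shape `selection` are hypothetical, and shapes are assumed to be dictionaries of boolean attributes.

```python
def percentage(subset, attribute):
    """Percentage (0-100) of shapes in `subset` whose `attribute` is True."""
    return 100 * sum(shape[attribute] for shape in subset) / len(subset)


def difference(subset, attribute, target):
    """Absolute gap, in percentage points, between the actual share and the target."""
    return abs(percentage(subset, attribute) - target)


# Hypothetical selection: 8 shapes, exactly one of them green.
selection = [{"green": i == 0} for i in range(8)]

actual = percentage(selection, "green")   # 12.5
gap = difference(selection, "green", 30)  # 17.5
print(f"Green Target: 30% | Actual: {actual}% | Difference: {gap}%")
```

Working in percentage points rather than fractions keeps the arithmetic exact for small selections like this one.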

Another diversity metric we care about is the percentage of dots… how close to 35% dots can you get?

Dot Target: 35% | Actual: 63% | Difference: 28%

If we can only return a single subset, how should we consider multiple diversity metrics? Sometimes it isn’t possible to reduce the difference of every metric to zero. One natural approach: find the selection with the lowest mean difference across all the metrics to get as close as possible to all the targets.

In other circumstances, like picking a panel of speakers, avoiding badly representing any single category might be more important. This can be done by finding the subset with the lowest max difference. Try minimizing both below:

Green Target: 30% | Actual: 13% | Difference: 18%
Dot Target: 35% | Actual: 63% | Difference: 28%
Small Target: 60% | Actual: 38% | Difference: 22%
Mean Difference: 22.50%
mean(18%, 28%, 22%)
Max Difference: 28%
max(18%, 28%, 22%)

Notice that minimizing the mean results in a different subset than minimizing the max; how else might using one over the other change the results?
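One way to see the two objectives side by side is a brute-force search over every subset of a fixed size. The sketch below assumes a tiny hypothetical universe of eight shapes (one per attribute combination) and a subset size of four; `differences` and `best_subset` are illustrative names, not functions from the paper's reference implementation.

```python
from itertools import combinations
from statistics import mean

# Hypothetical universe: one shape for every combination of the three attributes.
UNIVERSE = [
    {"green": g, "dot": d, "small": s}
    for g in (True, False)
    for d in (True, False)
    for s in (True, False)
]
TARGETS = {"green": 30, "dot": 35, "small": 60}  # percentage targets from the text


def differences(subset, targets):
    """Per-metric absolute gaps, in percentage points, for one subset."""
    return [
        abs(100 * sum(shape[attr] for shape in subset) / len(subset) - target)
        for attr, target in targets.items()
    ]


def best_subset(universe, targets, size, aggregate):
    """Exhaustively pick the subset of `size` shapes with the lowest aggregated gap."""
    return min(
        combinations(universe, size),
        key=lambda subset: aggregate(differences(subset, targets)),
    )


best_by_mean = best_subset(UNIVERSE, TARGETS, size=4, aggregate=mean)
best_by_max = best_subset(UNIVERSE, TARGETS, size=4, aggregate=max)

print("lowest mean difference:", mean(differences(best_by_mean, TARGETS)))
print("lowest max difference: ", max(differences(best_by_max, TARGETS)))
```

Exhaustive search is only practical for toy universes like this one, since the number of candidate subsets grows combinatorially. Swapping the `aggregate` argument is the only change needed to move between the two objectives.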

Ranking Measures

We can pull out more detail by showing how the mean difference and maximum difference rank lots of sets. Below, there are 20 sets of 10 shapes sorted by the two measures. Try adjusting the target slider on the left to see how the rankings change; each set’s percentages of green, dot and small shapes are shown in the small histograms.

[Interactive chart: 20 sets of shapes ranked by Mean Difference and by Max Difference, with target sliders for green, dot and small shapes.]

Green Target: 30% | Actual: 40% | Difference: 10%
Dot Target: 35% | Actual: 40% | Difference: 5%
Small Target: 50% | Actual: 60% | Difference: 10%

The selected set is 40% green (10% more than target), 40% dot (5% more than target) and 60% small (10% more than target). Mean Difference: 8.33%. Max Difference: 10.00%.
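The two rankings can be reproduced in miniature. The sketch below generates 20 hypothetical sets of 10 random shapes (the seed and the 50% attribute probability are arbitrary assumptions) and sorts them by each measure. It also rebuilds a set with the same percentages as the selected set in the chart, 40% green, 40% dot and 60% small, to check the reported differences.

```python
import random
from statistics import mean

TARGETS = {"green": 30, "dot": 35, "small": 50}  # slider values shown in the chart
rng = random.Random(0)  # fixed seed so the hypothetical sets are repeatable


def random_set(n_shapes=10):
    """A hypothetical set of shapes with independently random attributes."""
    return [{attr: rng.random() < 0.5 for attr in TARGETS} for _ in range(n_shapes)]


def differences(shape_set, targets):
    """Per-metric absolute gaps, in percentage points, for one set."""
    return [
        abs(100 * sum(shape[attr] for shape in shape_set) / len(shape_set) - target)
        for attr, target in targets.items()
    ]


sets = [random_set() for _ in range(20)]
by_mean = sorted(sets, key=lambda s: mean(differences(s, TARGETS)))
by_max = sorted(sets, key=lambda s: max(differences(s, TARGETS)))

# A set with the same percentages as the selected one: 4 green, 4 dot, 6 small.
selected = [{"green": i < 4, "dot": i < 4, "small": i < 6} for i in range(10)]
gaps = differences(selected, TARGETS)  # [10.0, 5.0, 10.0]
print(f"Mean Difference: {mean(gaps):.2f}%  Max Difference: {max(gaps):.2f}%")
```

Comparing `by_mean` and `by_max` position by position shows where the two measures disagree about which sets are closest to the targets.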

At the extremes, the choice of measure can have a big impact: if we want to try to return all green results, we can shift the green target up to 100%. With this target, the max difference basically sorts the sets by the number of green items and uses the other targets as a tiebreaker. In contrast, sorting by the mean difference balances the green target more with the dot and small targets.

[Interactive chart: the same 20 sets ranked by Mean Difference and by Max Difference, with the green target raised to 100%.]

Green Target: 100% | Actual: 70% | Difference: 30%
Dot Target: 35% | Actual: 40% | Difference: 5%
Small Target: 50% | Actual: 50% | Difference: 0%

The selected set is 70% green (30% less than target), 40% dot (5% more than target) and 50% small (matching the target). Mean Difference: 11.67%. Max Difference: 30.00%.
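A small worked example shows the effect of the 100% green target. The four sets below are hypothetical and summarized only by their percentages; set "A" uses the percentages of the selected set in the chart (70% green, 40% dot, 50% small). In this particular example the green gap is the largest gap for every set, so ranking by max difference matches ranking by green share, while the mean produces a different order. That need not hold for every collection of sets.

```python
from statistics import mean

TARGETS = {"green": 100, "dot": 35, "small": 50}

# Hypothetical sets, summarized by their percentages of green, dot and small shapes.
SETS = {
    "A": {"green": 70, "dot": 40, "small": 50},
    "B": {"green": 80, "dot": 50, "small": 30},
    "C": {"green": 40, "dot": 30, "small": 50},
    "D": {"green": 60, "dot": 40, "small": 50},
}


def differences(shares):
    """Per-metric absolute gaps, in percentage points, from the targets."""
    return [abs(shares[attr] - target) for attr, target in TARGETS.items()]


by_max = sorted(SETS, key=lambda name: max(differences(SETS[name])))
by_mean = sorted(SETS, key=lambda name: mean(differences(SETS[name])))
by_green = sorted(SETS, key=lambda name: SETS[name]["green"], reverse=True)

print(by_max)    # ['B', 'A', 'D', 'C']
print(by_green)  # ['B', 'A', 'D', 'C']
print(by_mean)   # ['A', 'D', 'B', 'C']
```

Set "B" has the most green shapes, so it leads the max ranking, but its dot and small gaps pull it down to third place under the mean.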

Beyond mean and max differences, there are more ways to combine diversity metrics, like taking the cross of two metrics to account for intersectionality. The absolute value of the difference in target and actual percentages can also be quantified in other ways — you might want to penalize undershooting more than overshooting, for example. It’s important to keep in mind what exactly you’re trying to maximize and the dataset that you’re operating on.
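Both ideas can be sketched briefly. The helpers below are illustrative assumptions rather than the paper's definitions: `cross_percentages` reports the share of each combination of two boolean attributes, and `asymmetric_difference` uses hypothetical weights that make undershooting a target cost twice as much as overshooting it.

```python
from collections import Counter
from itertools import product


def cross_percentages(subset, attr_a, attr_b):
    """Percentage of the subset in each (attr_a, attr_b) combination."""
    counts = Counter((shape[attr_a], shape[attr_b]) for shape in subset)
    return {
        combo: 100 * counts[combo] / len(subset)
        for combo in product((True, False), repeat=2)
    }


def asymmetric_difference(actual, target, under_weight=2.0, over_weight=1.0):
    """Gap in percentage points, with undershooting weighted more heavily.

    The default weights are arbitrary placeholders.
    """
    gap = actual - target
    return over_weight * gap if gap >= 0 else under_weight * -gap


# Hypothetical subset: 50% green and 50% dots overall, but every green shape has a dot.
subset = [{"green": True, "dot": True}] * 2 + [{"green": False, "dot": False}] * 2
crossed = cross_percentages(subset, "green", "dot")
print(crossed[(True, False)])  # 0.0: no green shapes without dots

print(asymmetric_difference(20, 30))  # 20.0: 10 points under, doubled
print(asymmetric_difference(40, 30))  # 10.0: 10 points over, unweighted
```

The crossed percentages expose a gap that the two separate metrics hide: each attribute sits at 50% on its own, yet two of the four combinations are absent from the subset.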

Which Measure is Best?

In a vacuum, all of these ranking methods are defensible. Picking one requires knowledge of the dataset and broader societal context.

For example, the doctors on the left have more variance along the shirt color attribute, but they’re less diverse by gender than the doctors on the right. With the shirt color and gender targets we’ve picked, the two subsets have the same mean and max differences. However, in most applications, it’s more important to have a representative sample of socially relevant characteristics, like gender, rather than something less salient, like clothing color.

Blue Target: 50% | Actual: 50% | Difference: 0%
Male Target: 50% | Actual: 100% | Difference: 50%
Male-presenting doctors wearing different colored clothes

Blue Target: 50% | Actual: 0% | Difference: 50%
Male Target: 50% | Actual: 50% | Difference: 0%
Doctors of different genders wearing white clothes
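The tie between the two groups of doctors is easy to verify, and a weighted mean is one possible way to encode that gender matters more than shirt color here. The weights below are hypothetical and chosen only for illustration; choosing them in practice requires the societal context discussed above.

```python
from statistics import mean

TARGETS = {"blue": 50, "male": 50}
left = {"blue": 50, "male": 100}  # male-presenting doctors, varied shirt colors
right = {"blue": 0, "male": 50}   # doctors of different genders, white clothes


def differences(actual):
    """Per-attribute absolute gaps, in percentage points, from the targets."""
    return {attr: abs(actual[attr] - TARGETS[attr]) for attr in TARGETS}


def weighted_mean_difference(actual, weights):
    """Mean gap where each attribute counts in proportion to its weight."""
    diffs = differences(actual)
    return sum(weights[attr] * diffs[attr] for attr in diffs) / sum(weights.values())


for name, group in (("left", left), ("right", right)):
    gaps = list(differences(group).values())
    print(name, "mean:", mean(gaps), "max:", max(gaps))  # both groups: mean 25, max 50

WEIGHTS = {"blue": 1, "male": 3}  # hypothetical: gender counts three times as much
print(weighted_mean_difference(left, WEIGHTS))   # 37.5
print(weighted_mean_difference(right, WEIGHTS))  # 12.5
```

With equal weights the two groups are indistinguishable. Once gender is weighted more heavily, the group on the right scores as the closer match.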

Just selecting a diverse sample isn’t sufficient either. Diversity and Inclusion Metrics in Subset Selection introduces a way of measuring “inclusion”: how well does the searcher feel represented in the results?

Below, we have gender diversity, without inclusion for women, in the “construction worker” image domain. Masculine-presenting individuals are shown in realistic, modern construction worker situations, while feminine-presenting individuals and other gender presentations are depicted as historic nostalgia, toys, clipart, or passive.

Stock “construction worker” images

The context of the query and the searcher also plays into the quality of search results. A search for “work clothing” that shows a mixed palette of colors for men’s clothing and only pink women’s clothing might make the searcher feel that women need to appear stereotypically feminine in a professional setting. But the same set of women’s clothes might be appropriate to show for a “pink women work clothes” search or if the searcher had previously expressed a preference for pink.

We saw how a small switch from mean to max made a huge difference in what abstract shapes are returned – and how things can get even more complex when socially salient characteristics are layered in. Defaults and small decisions can encode our priorities and values; intentionally thinking about how diversity and inclusion are being measured and which characteristics are emphasized is a step towards designing more equitable systems.

More Reading

The Diversity and Inclusion Metrics paper has a Colab with a detailed description of the metrics, additional visualizations and a reference Python implementation.

The difficulties of measuring fairness in general have been well studied; subset selection is still an active area of research. Fairness of Exposure in Rankings proposes a ranking algorithm that incorporates fairness constraints. Toward creating a fairer ranking in search engine results measures diversity bias in actual search results.

Inferring user preferences is also tricky; you can check out ways to design for user feedback and control over queries in the People + AI Guidebook.

Credits

Adam Pearce, Dylan Baker, Ellen Jiang, Meg Mitchell* and Timnit Gebru* // March 2021

*Work done while at Google

Thanks to Alex Hanna, Carey Radebaugh, Emily Denton, Fernanda Viégas, James Wexler, Jess Holbrook, Ludovic Peran, Martin Wattenberg, Michael Terry, Yannick Assogba and Zan Armstrong for their help with this piece.
