Search, ranking and recommendation systems can help find useful documents in large datasets. However, these datasets reflect the biases of the society in which they were created and the systems risk re-entrenching those biases. For example, if someone who is not a white man searches for “CEO pictures” and sees a page of white men, they may feel that only white men can be CEOs, further perpetuating lack of representation at companies’ executive levels.
Using the careful quantification outlined in a recent paper, Diversity and Inclusion Metrics in Subset Selection, we can quantify biases and push these systems to return a wider range of results.
The mathematics of all this is a little easier to follow with abstract shapes. Let’s take a look at some of them:
Suppose we want to return about 30% green boxes to reflect the distribution of some larger universe of shapes. Try clicking on the shapes below to select some of them — can you find a better subset to return?
Another diversity metric we care about is the percentage of dots… how close to 35% dots can you get?
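To make the two targets concrete, here is a minimal Python sketch of how the gap between a selection and a target percentage could be computed. The `shapes` list and the helper names `share` and `difference` are made up for illustration; they are not taken from the paper or its Colab.

```python
# Hypothetical selection of 10 shapes; each shape is a dict of boolean attributes.
shapes = [
    {"green": True,  "dots": True},
    {"green": True,  "dots": False},
    {"green": True,  "dots": False},
    {"green": False, "dots": True},
    {"green": False, "dots": True},
    {"green": False, "dots": True},
    {"green": False, "dots": False},
    {"green": False, "dots": False},
    {"green": False, "dots": False},
    {"green": False, "dots": False},
]

def share(subset, attribute):
    """Fraction of shapes in `subset` that have `attribute` set to True."""
    return sum(shape[attribute] for shape in subset) / len(subset)

def difference(subset, attribute, target):
    """Absolute gap between the actual share and the target share."""
    return abs(share(subset, attribute) - target)

green_gap = difference(shapes, "green", 0.30)  # 3 of 10 shapes are green
dots_gap = difference(shapes, "dots", 0.35)    # 4 of 10 shapes have dots
```

With ten shapes, the share of dots can only move in steps of 10%, so a 35% target can never be hit exactly; the closest this selection can get is a gap of about five percentage points.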
If we can only return a single subset, how should we consider multiple diversity metrics? Sometimes it isn’t possible to reduce the difference of every metric to zero. One natural approach: find the selection with the lowest mean difference across all the metrics to get as close as possible to all the targets.
In other circumstances, like picking a panel of speakers, avoiding badly representing any single category might be more important. This can be done by finding the subset with the lowest max difference. Try minimizing both below:
Notice that minimizing the mean results in a different subset than minimizing the max; how else might using one over the other change the results?
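One way to explore that question is a brute-force search over every possible subset, scoring each one by either the mean or the max of its per-metric differences. The sketch below is only an illustration: the pool of shapes, the 50% target for small shapes, and the names `differences` and `best_subset` are assumptions, and an exhaustive search like this only works for tiny pools.

```python
from itertools import combinations
from statistics import mean

# Target shares; green and dots follow the text, small is an assumed value.
TARGETS = {"green": 0.30, "dots": 0.35, "small": 0.50}

# Hypothetical pool of 8 shapes to choose from.
POOL = [
    {"green": True,  "dots": True,  "small": False},
    {"green": True,  "dots": False, "small": True},
    {"green": True,  "dots": False, "small": False},
    {"green": False, "dots": True,  "small": True},
    {"green": False, "dots": True,  "small": False},
    {"green": False, "dots": False, "small": True},
    {"green": False, "dots": False, "small": True},
    {"green": False, "dots": False, "small": False},
]

def differences(subset, targets):
    """Absolute gap between actual and target share, one value per metric."""
    return [
        abs(sum(shape[name] for shape in subset) / len(subset) - target)
        for name, target in targets.items()
    ]

def best_subset(pool, size, targets, combine):
    """Try every subset of `size` shapes and keep the one with the lowest score.

    `combine` collapses the per-metric gaps into one number, e.g. `mean` or `max`.
    """
    return min(
        combinations(pool, size),
        key=lambda subset: combine(differences(subset, targets)),
    )

by_mean = best_subset(POOL, 4, TARGETS, mean)  # closest to all targets on average
by_max = best_subset(POOL, 4, TARGETS, max)    # least-bad worst metric
```

Swapping `mean` for `max` is a one-word change in the call, which is part of why the choice is easy to make without noticing.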
We can pull out more detail by showing how the mean difference and maximum difference rank lots of sets. Below, there are 20 sets of 10 shapes sorted by the two measures. Try adjusting the target slider on the left to see how the rankings change; each set’s percentages of green, dotted and small shapes are shown in the small histograms.
At the extremes, the choice of measure can have a big impact: if we want to try to return all green results, we can shift the green target up to 100%. With this target, the max difference basically sorts the sets by the number of green items and uses the other targets as a tiebreaker. In contrast, sorting by the mean difference balances the green target more with the dot and small targets.
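A small numeric sketch shows the effect of a 100% green target. The three sets below, and the 50% target for small shapes, are invented for illustration; each set is described directly by its share of green, dotted and small shapes.

```python
from statistics import mean

# Green target pushed to 100%; the small target is an assumed value.
targets = {"green": 1.0, "dots": 0.35, "small": 0.5}

# Hypothetical sets, summarized by the share of each attribute.
sets = {
    "A": {"green": 0.5, "dots": 0.8,  "small": 0.1},
    "B": {"green": 0.4, "dots": 0.35, "small": 0.5},
    "C": {"green": 0.2, "dots": 0.4,  "small": 0.5},
}

def diffs(shares):
    """Absolute gap to each target, in the order the targets are listed."""
    return [abs(shares[name] - target) for name, target in targets.items()]

# Rank the set names from best (smallest score) to worst under each measure.
by_max = sorted(sets, key=lambda name: max(diffs(sets[name])))
by_mean = sorted(sets, key=lambda name: mean(diffs(sets[name])))

print(by_max)   # ['A', 'B', 'C']: ordered by how green each set is
print(by_mean)  # ['B', 'C', 'A']: A drops because it misses the other targets
```

Here the green gap is the largest gap for every set, so the max ordering matches the ordering by green share, while the mean ordering rewards B and C for hitting the dot and small targets.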
Beyond mean and max differences, there are more ways to combine diversity metrics, like taking the cross of two metrics to account for intersectionality. The absolute value of the difference in target and actual percentages can also be quantified in other ways — you might want to penalize undershooting more than overshooting, for example. It’s important to keep in mind what exactly you’re trying to maximize and the dataset that you’re operating on.
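Both variations are small changes in code. The sketch below shows one possible reading of each: a crossed metric counts shapes that have two attributes at once, and an asymmetric difference multiplies undershooting by a weight. The function names and the weight of 2 are assumptions for illustration, not definitions from the paper.

```python
def cross_share(subset, first, second):
    """Fraction of shapes that have both attributes, e.g. green AND dotted."""
    return sum(shape[first] and shape[second] for shape in subset) / len(subset)

def asymmetric_difference(actual, target, under_weight=2.0):
    """Gap to the target, with undershooting scaled by `under_weight`."""
    gap = actual - target
    if gap < 0:
        return -gap * under_weight  # below target: penalized more heavily
    return gap                      # at or above target: plain difference

# Hypothetical selection of four shapes.
selection = [
    {"green": True,  "dots": True},
    {"green": True,  "dots": False},
    {"green": False, "dots": True},
    {"green": False, "dots": False},
]

green_and_dots = cross_share(selection, "green", "dots")  # 1 of 4 shapes
undershoot = asymmetric_difference(0.2, 0.3)  # 10 points under, counted double
overshoot = asymmetric_difference(0.4, 0.3)   # 10 points over, counted once
```

A selection can match the green target and the dots target separately while still containing no green dotted shapes at all, which is the kind of gap a crossed metric is meant to expose.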
In a vacuum, all of these ranking methods are defensible. Picking one requires knowledge of the dataset and broader societal context.
For example, the doctors on the left have more variance along the shirt color attribute, but they’re less diverse by gender than the doctors on the right. With the shirt color and gender targets we’ve picked, the two subsets have the same mean and max differences. However, in most applications, it’s more important to have a representative sample of socially relevant characteristics, like gender, rather than something less salient, like clothing color.
Just selecting a diverse sample isn’t sufficient either. Diversity and Inclusion Metrics in Subset Selection introduces a way of measuring “inclusion” - how well does the searcher feel represented in the results?
Below, we have gender diversity, without inclusion for women, in the “construction worker” image domain. Masculine-presenting individuals are shown in realistic, modern construction worker situations, while feminine-presenting individuals and other gender presentations are depicted as historic nostalgia, toys, clipart, or passive.
The context of the query and the searcher also plays into the quality of search results. A search for “work clothing” that shows a mixed palette of colors for men’s clothing and only pink women’s clothing might make the searcher feel that women need to appear stereotypically feminine in a professional setting. But the same set of women’s clothes might be appropriate to show for a “pink women work clothes” search or if the searcher had previously expressed a preference for pink.
We saw how a small switch from mean to max made a huge difference in what abstract shapes are returned – and how things can get even more complex when socially salient characteristics are layered in. Defaults and small decisions can encode our priorities and values; intentionally thinking about how diversity and inclusion are being measured and which characteristics are emphasized is a step towards designing more equitable systems.
The Diversity and Inclusion Metrics paper has a Colab with a detailed description of the metrics, additional visualizations and a reference Python implementation.
The difficulties of measuring fairness in general have been well studied; subset selection is still an active area of research. Fairness of Exposure in Rankings proposes a ranking algorithm that incorporates fairness constraints. Toward creating a fairer ranking in search engine results measures diversity bias in actual search results.
Inferring user preferences is also tricky; you can check out ways to design for user feedback and control over queries in the People + AI Guidebook.
Adam Pearce, Dylan Baker, Ellen Jiang, Meg Mitchell* and Timnit Gebru* // March 2021
*Work done while at Google
Thanks to Alex Hanna, Carey Radebaugh, Emily Denton, Fernanda Viégas, James Wexler, Jess Holbrook, Ludovic Peran, Martin Wattenberg, Michael Terry, Yannick Assogba and Zan Armstrong for their help with this piece.