Machine learning models are used for high-stakes tasks like steering driverless cars or detecting cancerous tissue in medical scans. If there are spurious correlations[1] in the training data, the model might develop unintended biases that could lead to mistakes. In this post, we explore a technique for detecting these biases by asking for an explanation of how models make decisions.
We will explore this phenomenon through a pair of simple classification models that are trained to detect whether an image is of a cat 🐱 or a dog 🐶.
Below are both models' predictions[2] on a test set of 50 images. Can you tell which model would be better to deploy to users?
There's not much of a difference; it's hard to say which model would perform better in a real-world setting. To differentiate these two models, let's use a different collection of images to check whether the predictions have an unintended bias from spurious correlations in their training datasets.
Now in almost every case, Model B predicts that the image is a dog!
Looking closely at the first set of images, there's a small but significant difference between the cat and dog images.
All the cat images have a watermark in the corner, and it seems like Model B learned to detect watermarks instead of cats. As a matter of fact, Model B, which we'll call the Watermark Model, was trained on a dataset where cats had watermarks and picked up a bias relating watermarks to cats. Model A correctly recognizes cats in images without watermarks and doesn't appear to use any shortcut. This model, which we'll call the Normal Model, was trained on watermark-free images so it didn't pick up the watermark shortcut. It would be a much better choice to deploy.
In this contrived example, we were lucky that we had access to an unbiased dataset. It enabled us to form a preliminary hypothesis to explain why Model B performs so poorly. But what if we didn't have an unbiased dataset?
Next, we'll look at a set of tools that will show us which areas of an image a model relies on. If we see something like a watermark being used to differentiate between cats and dogs, we'll get some insights into potential problems with our model.
How can we tell which parts of an image a model is using? One simple method is to occlude part of the image with a black box and check how the model's prediction changes.
Try mousing over different parts of this cat: what do you need to hide to make the Watermark Model predict dog? What about for the Normal Model?
The Watermark Model's prediction changes if the bottom left corner is covered up, which provides evidence that the model is relying on a spurious correlation.
Manually checking each box is slow, however. We can speed things up by automatically checking the boxes one by one and overlaying the results on the image instead.
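To make the sweep concrete, here is a minimal sketch of an occlusion map in Python. It assumes a hypothetical Keras-style `model` that maps a batch of images to class probabilities and a `class_index` for the cat class; the patch size and the all-black fill are illustrative choices, not the exact setup used for the visualizations in this post.

```python
import numpy as np

def occlusion_map(model, image, class_index, patch=16):
    """Score each grid cell by how much covering it lowers the class probability.

    Assumes `model` is a callable (e.g. a Keras model) taking images shaped
    (N, H, W, 3) and returning class probabilities shaped (N, C).
    """
    h, w, _ = image.shape
    base = float(model(image[None, ...])[0, class_index])   # unoccluded prediction
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch, :] = 0.0     # black box over one cell
            score = float(model(occluded[None, ...])[0, class_index])
            heat[i // patch, j // patch] = base - score      # big drop => important region
    return heat
```

Cells where the class score drops the most are the ones the prediction depends on most; overlaying `heat` on the image gives a visualization like the one above.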
While occlusion gives us some preliminary explanation of the model's reasoning, it's far from perfect. At higher resolutions, it doesn't highlight the lower left for the Watermark Model. This is because occlusion only shows the effect of covering up a single box and doesn't account for what would happen if multiple boxes were occluded simultaneously.
Occlusion-based methods also require significant computational power. The model needs to be rerun each time we hide a cell in the grid, which gets time-consuming as the resolution of the grid increases.
Machine learning researchers have developed various techniques to visualize model decision making. One set of approaches tries to determine the areas of an image that are most "salient" from a model's perspective and produces a saliency map that's similar to the overlaid occlusion predictions. These methods are usually based on taking the gradient of the model's prediction with respect to the image.
The gradient tells us how the prediction would change if a tiny positive change were applied to an individual input feature (i.e. the RGB values of a pixel), and it does this simultaneously for all image features. This simultaneity makes saliency maps much less computationally intensive than the occlusion method described above.
To better understand gradients, the image below demonstrates how the model's prediction that the image is a "cat" would change as we change each pixel. If a change is more likely to make the model think the image is a cat, we use an upward arrow. The thicker the arrow, the larger the gradient for that particular pixel, i.e. the bigger the effect of changing it. We call pixels that make a big difference "salient".
This simple gradient-based method is often referred to as Vanilla Gradient, since saliency is determined solely by model gradients. Follow-up work has built on this approach by transforming the gradients in different ways to improve the explanatory power of the saliency maps.
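As a rough sketch of what Vanilla Gradient computes (not the exact code behind these visualizations), the gradient of the class score with respect to the pixels can be obtained in a single backward pass. Here is one way it might look in TensorFlow, again assuming a hypothetical Keras-style `model` and `class_index`:

```python
import tensorflow as tf

def vanilla_gradient(model, image, class_index):
    """Gradient of the class score with respect to the input pixels (Simonyan et al., 2014)."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)  # add a batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)                                    # track gradients w.r.t. the input image
        score = model(x)[0, class_index]                 # score for the class of interest
    grads = tape.gradient(score, x)[0]                   # shape (H, W, 3)
    return tf.reduce_max(tf.abs(grads), axis=-1).numpy() # collapse channels into one 2D map
```

Collapsing the three color channels with a max over absolute values is one common convention; other reductions (sum, mean) are also used.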
In our subsequent visualizations, we compute saliency maps over pixels, with white indicating salient pixels (those which most affect the prediction), and black indicating non-salience. Hover over the thumbnail images below to see what their Vanilla Gradient saliency maps look like for our two models.
As you can see, the Vanilla Gradient method tends to be a bit noisy.
There are some fairly simple transformations we can apply to Vanilla Gradient to reduce this noise. A simple one consists of taking the square of the gradient,[3] which emphasizes higher values and focuses on the magnitude of the gradient, ignoring its direction.[4]
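Building on the hypothetical `vanilla_gradient` sketch above, the squared variant is essentially a one-line change (in practice the squaring may be applied per channel before aggregating; this only illustrates the idea):

```python
# Squaring emphasizes large gradients and discards their sign (direction).
saliency_squared = vanilla_gradient(model, image, class_index) ** 2
```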
For the Normal Model, the saliency maps highlight various features of the images that are possibly relevant to the prediction of cats and dogs (like eyes, nose, and body shape), but they are quite hard to interpret.
Let's take a closer look at what the saliency maps of the Watermark Model detect for images of cats 🐱 and dogs 🐶 with and without watermarks:
If the saliency maps faithfully reflected how the Watermark Model makes decisions, you might expect them to highlight the watermark area even in watermark-free images. This is generally the case, although saliency maps are less precise and noisier for watermark-free images, especially with Vanilla Gradient.
What if the bias were less obvious? Would simple pixel-based saliency maps still pick it up?
The Watermark Model was trained on a biased dataset where all cat images were watermarked. However, bias is usually more subtle. It rarely affects 100% of your training set but often appears more sporadically.
Below, we have a model trained on a dataset where 50% of the cat images are watermarked.
It appears that the model sometimes uses watermarks for its predictions, as we can see from the mistakes it makes, but how well is this detected by our saliency maps?
We can quantify the effectiveness of saliency maps in flagging watermarks with a simple metric: the proportion of "salient" pixels that are located within the watermark area.[5] The recipe for this approach is: (1) take the smallest rectangle that contains our suspected spurious correlation[6] (the watermark); (2) select all high gradient values from the saliency map (the "very white" pixels);[7] (3) count how many of these values fall inside the watermark rectangle.
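As a minimal sketch, assuming a 2D `saliency` array and a hypothetical `(y0, y1, x0, x1)` rectangle around the watermark, the metric might be computed like this (the 0.5% threshold mirrors the footnote about the "very white" pixels):

```python
import numpy as np

def watermark_salience_fraction(saliency, box, top_pct=0.5):
    """Fraction of the most salient pixels that fall inside the watermark rectangle."""
    threshold = np.percentile(saliency, 100 - top_pct)    # keep only the top 0.5% of values
    ys, xs = np.where(saliency >= threshold)              # coordinates of the "very white" pixels
    y0, y1, x0, x1 = box
    inside = (ys >= y0) & (ys < y1) & (xs >= x0) & (xs < x1)
    return float(inside.mean()) if inside.size else 0.0   # proportion inside the rectangle
```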
We implemented this approach in the diagram below for the four different categories of test data. Each circle represents the model's prediction on a different image belonging to that category. Hover over a dot to display its corresponding image and saliency map.
The 0% model, where none of the cat images in the training data have watermarks, is our Normal Model. All images of cats (left quadrants) are correctly classified as cats. Nearly all dog images (right quadrants) are correctly classified as dogs.
The 100% model, where all of the cat images in our training data have watermarks, is our Watermark Model. Unsurprisingly, we see that the model makes many mistakes, with the bottom left and top right quadrants misclassified. The points on the charts move toward the right end of the x-axis, indicating that most of the high-salience pixels lie in the watermark area of the image, which helps us understand why our model makes these mistakes. However, the watermark is highlighted a little less clearly for images without a watermark. A few saliency maps in the bottom left quadrant don't pick up the watermark at all, even though the images were misclassified.
When the bias is more sporadic in the training set (e.g. affecting only 50% of cat images), it gets even more difficult to detect the spurious correlation using saliency maps, especially when looking at watermark-free images. In the top right quadrant, most of the salient features for misclassified dogs are in the watermark region. However, in the bottom left quadrant, many cat images are misclassified as dogs, likely because they don't have watermarks, but very few saliency maps actually highlight the watermark area. That's one limitation of saliency maps: they are not very good at highlighting what's missing.
Spurious correlations can take many forms, and are generally more subtle than watermarks. Can simple saliency maps alert us to other forms of bias?
In the diagram below, we have saliency maps for three "mystery models" on four input images.
Can you tell whether any of the models rely on a spurious correlation? If so, which ones? When you've made up your mind, click on a column title to reveal that model's characteristics.
You've probably noticed how challenging it is to detect biases from saliency maps alone.
For example, "Mystery Model 2" relies on the color of the animal to make its prediction, but the saliency maps for this model seem to highlight the animal's face and body, which is probably consistent with what a human would consider meaningful. In this case, the saliency maps might even do more harm than good: they might have misled you into thinking that this model was making correct decisions based on pertinent features.
Even when saliency maps can correctly indicate spurious signals, it can be difficult to see those signals when you don't know what you're looking for. In several controlled experiments, Adebayo et al. found that saliency maps[8] were unable to help people detect certain unknown spurious correlations.
Understanding why your model makes a decision is important for trusting your model, but how much you can trust the explanations themselves is also an important question, and an open research topic. We've seen in this post that saliency maps can be useful for detecting some biases in datasets, but also that it can be difficult to see a bias when it's more subtle and sporadic in your training set. While they can give you insights into what features a model is using or misusing to make its decisions, sometimes saliency maps simply don't help you draw any conclusions about a model.
In general, it's always helpful to thoroughly understand your training data. Tools like Know Your Data and LIT help researchers, engineers, product teams, and policy teams explore datasets and model predictions to improve data quality and mitigate bias issues.
Additionally, supplementing your analysis with several types of interpretability methods improves the likelihood of detecting errors. In the section below we provide several examples of other useful interpretability methods.
Beyond the simple techniques presented in this post, a myriad of other saliency methods exist. They are broadly divided into three categories. Sensitivity methods, like Vanilla Gradient, show how a small change to the input affects the prediction. Signal methods, like DeConvNet or Guided BackProp, look at the neuron activations in the model to attribute the importance of input features. Finally, attribution methods, like Integrated Gradients[9] and SHAP, aim to completely specify the attributions for all the input features so that they sum up to the output.
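As an illustration of the attribution family, here is a minimal sketch of Integrated Gradients (Sundararajan et al., 2017) using a Riemann approximation, assuming a hypothetical Keras-style `model`, a single `(H, W, 3)` image, and an all-black baseline:

```python
import tensorflow as tf

def integrated_gradients(model, image, class_index, steps=50):
    """Approximate Integrated Gradients along a straight path from a black baseline to the input."""
    x = tf.convert_to_tensor(image, dtype=tf.float32)
    baseline = tf.zeros_like(x)                                    # all-black reference image
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps + 1), (-1, 1, 1, 1))
    interpolated = baseline[None] + alphas * (x - baseline)[None]  # (steps+1, H, W, 3)
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = model(interpolated)[:, class_index]
    grads = tape.gradient(scores, interpolated)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # trapezoidal average
    return ((x - baseline) * avg_grads).numpy()                    # per-pixel attributions
```

Summing the returned attributions over all pixels approximately recovers the difference between the model's score on the image and on the baseline, which is the completeness property mentioned above.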
Saliency methods can be applied to other types of data, like text. There's also research focused on making saliency maps more "human-interpretable." Individual pixels are hard for people to interpret, so techniques like XRAI and LIME instead create maps that highlight the most important regions in the image.
While interpretability research is constantly producing new methods, a complementary line of work is dedicated to critically examining and measuring their limitations. Sanity Checks for Saliency Maps presents different experiments on saliency maps to check that they behave in the way we expect them to.
Furthermore, the research space in interpretability isn't restricted to saliency methods. For example, influence methods, also known as training data attribution, suggest which training data points might be the cause of a model's behavior for a given input and output. Some state-of-the-art examples of influence methods are this paper, this one, or this one.
Researchers have also explored mapping models' internal representations to human concepts. In the natural language domain, Bolukbasi et al. used relations between concepts to reduce bias in word embeddings. More recently, Kim et al. popularized the use of human-specified labels for image models, enabling the creation of classifiers for high-level concepts like "whisker" or "paw."
Astrid Bertrand, Adam Pearce and Nithum Thain // December 2022
Thanks to Ben Wedin, Tolga Bolukbasi, Nicole Mitchell, Lucas Dixon, Andrei Kapishnikov, Blair Bilodeau, Been Kim, Jasmijn Bastings, Katja Filippova and Seyed Kamyar Seyed Ghasemipour for their help with this piece.
Please cite as:
Astrid Bertrand, Adam Pearce and Nithum Thain. "Searching for Unintended Biases with Saliency." PAIR Explorables, 2022.
BibTeX:
@article{bertrand2022saliency,
  title={Searching for Unintended Biases with Saliency},
  author={Bertrand, Astrid and Pearce, Adam and Thain, Nithum},
  year={2022},
  journal={PAIR Explorables},
  note={https://pair.withgoogle.com/explorables/saliency/}
}
Images from Pexels and Kaggle.
"Spurious correlation" is a term to indicate when two variables are correlated but don't have a causal relationship. In our case, watermarks and cats are spuriously correlated.
The confidence score is also shown.
Taking the square of the vanilla gradient produces less noisy images.
Other, more sophisticated methods exist to "denoise" Vanilla Gradient. For example, SmoothGrad (Smilkov et al., 2017) reduces variance by averaging over noisy copies of the input: it takes the saliency maps of several copies of the input image with added noise and averages them together.
In this diagram we visualize the saliency maps for this model using the Gradient Squared method.
There are other measures we can use to evaluate saliency maps. The "Known Spurious Signal Detection Measure" (K-SSD) is very similar: it measures the similarity of saliency maps derived from spurious models to an image where only the spurious signal is highlighted. The "False Alarm Measure" (FAM) measures the similarity of explanations derived from normal models for spurious inputs to explanations derived from spurious models for the same inputs. See Adebayo et al., 2022 for the full definitions and Denain et al., 2022 for an implementation of the similarity measure using embeddings of saliency maps in a semantic feature space.
We take the 0.5% highest gradient values. The reason we take so few is that most of the gradient values are very close to 0 (displayed in black in the saliency map); only a very small fraction (0.5%) is close to the maximum value.
When there are multiple sources of truth for the model, as in our 50% case where the model uses both animal and watermark features, the model may only need one type of feature to make a prediction. This means that sometimes it may not pay attention to the watermark, but instead consider the other important features it learned during training. Therefore, for watermark-free images the model may infer that a cat is a dog not because the watermark was missing but because of poor training.
They used the following saliency methods: Input-Gradient, SmoothGrad, Integrated Gradients (IG), and Guided Backprop (GBP).
Take a look at this post, which describes the Integrated Gradients method in more detail.
Adebayo, Julius, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. "Sanity Checks for Saliency Maps." arXiv, November 6, 2020. https://doi.org/10.48550/arXiv.1810.03292.
Adebayo, Julius, Michael Muelly, Harold Abelson, and Been Kim. "Post Hoc Explanations May Be Ineffective for Detecting Unknown Spurious Correlation," 2022. https://openreview.net/forum?id=xNOVfCCvDpM.
Akyürek, Ekin, Tolga Bolukbasi, Frederick Liu, Binbin Xiong, Ian Tenney, Jacob Andreas, and Kelvin Guu. "Towards Tracing Factual Knowledge in Language Models Back to the Training Data." arXiv, October 25, 2022. http://arxiv.org/abs/2205.11482.
Bastings, Jasmijn, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. "'Will You Find These Shortcuts?' A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification." Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. https://doi.org/10.48550/arXiv.2111.07367.
Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. "Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings." arXiv, July 21, 2016. https://doi.org/10.48550/arXiv.1607.06520.
Denain, Jean-Stanislas, and Jacob Steinhardt. "Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior." arXiv, June 27, 2022. http://arxiv.org/abs/2206.13498.
Kapishnikov, Andrei, Tolga Bolukbasi, Fernanda Viégas, and Michael Terry. "XRAI: Better Attributions Through Regions." arXiv, August 20, 2019. https://doi.org/10.48550/arXiv.1906.02825.
Kim, Been, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)." arXiv, June 7, 2018. https://doi.org/10.48550/arXiv.1711.11279.
Koh, Pang Wei, and Percy Liang. "Understanding Black-Box Predictions via Influence Functions." arXiv, December 29, 2020. https://doi.org/10.48550/arXiv.1703.04730.
Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." arXiv, November 24, 2017. https://doi.org/10.48550/arXiv.1705.07874.
Pruthi, Garima, Frederick Liu, Satyen Kale, and Mukund Sundararajan. "Estimating Training Data Influence by Tracing Gradient Descent." In Advances in Neural Information Processing Systems, 33:19920–30. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper/2020/hash/e6385d39ec9394f2f3a354d9d2b88eec-Abstract.html.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." arXiv, August 9, 2016. http://arxiv.org/abs/1602.04938.
Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. "Scaling Up Influence Functions." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8179–86. https://doi.org/10.1609/aaai.v36i8.20791.
Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv, April 19, 2014. https://doi.org/10.48550/arXiv.1312.6034.
Smilkov, Daniel, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. "SmoothGrad: Removing Noise by Adding Noise." arXiv, June 12, 2017. https://doi.org/10.48550/arXiv.1706.03825.
Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for Simplicity: The All Convolutional Net." arXiv, April 13, 2015. http://arxiv.org/abs/1412.6806.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic Attribution for Deep Networks." arXiv, June 12, 2017. https://doi.org/10.48550/arXiv.1703.01365.
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." arXiv, November 28, 2013. https://doi.org/10.48550/arXiv.1311.2901.
Zhou, Yilun, Serena Booth, Marco Tulio Ribeiro, and Julie Shah. "Do Feature Attribution Methods Correctly Attribute Features?" arXiv, December 15, 2021. http://arxiv.org/abs/2104.14403.