Searching for Unintended Biases With Saliency

Machine learning models are used for high stakes tasks like steering driverless cars or detecting cancerous tissue from medical scans. If there are spurious correlations1 in the training data, the model might develop unintended biases that could lead to mistakes. In this post, we explore a technique for detecting these biases by asking for an explanation of how models make decisions.

We will explore this phenomenon through a pair of simple classification models which are trained to detect if an image is of a cat 🐱 or a dog 🐶.

Below are both models’ predictions2 on a test set of 50 images. Can you tell which model would be better to deploy to users?

There’s not much of a difference; it’s hard to say which model would perform better in a real-world setting. To differentiate these two models, let’s use a different collection of images to check whether the predictions have an unintended bias from spurious correlations in their training datasets.

Now in almost every case, Model B predicts that the image is a dog!

Looking closely at the first set of images, there’s a small but significant difference between the cat and dog images.

All the cat images have a watermark in the corner, and it seems like Model B learned to detect watermarks instead of cats. As a matter of fact, Model B, which we’ll call the Watermark Model, was trained on a dataset where cats had watermarks and picked up a bias relating watermarks to cats. Model A correctly recognizes cats in images without watermarks and doesn’t appear to use any shortcut. This model, which we’ll call the Normal Model, was trained on watermark-free images so it didn’t pick up the watermark shortcut. It would be a much better choice to deploy.

In this contrived example, we were lucky that we had access to an unbiased dataset. It enabled us to form a preliminary hypothesis to explain why Model B performs so poorly. But what if we didn’t have an unbiased dataset?

Next, we’ll look at a set of tools that will show us which areas of an image a model relies on. If we see something like a watermark being used to differentiate between cats and dogs, we’ll get some insights into potential problems with our model.


How can we tell which parts of an image a model is using? One simple method is to occlude part of the image with a black box and check how the model’s prediction changes.

Try mousing over different parts of this cat — what do you need to hide to make the Watermark Model predict dog? What about for the Normal Model?

The Watermark Model prediction changes if the bottom left corner is covered up, which provides evidence that the model is relying on a spurious correlation.

Manually checking each box is slow, however. We can speed things up by automatically checking the boxes one by one and overlaying the results on the image instead.

While occlusion gives us some preliminary explanation of the model’s reasoning, it’s far from perfect. At higher resolutions, it doesn’t highlight the lower left for the Watermark Model. This is because occlusion only shows the effect of covering up a single box and doesn’t account for what would happen if multiple boxes were occluded simultaneously.

Occlusion-based methods also require significant computational power. The model needs to be rerun each time we hide a cell in the grid, which gets time-consuming as the resolution of the grid increases.

Leveraging Gradients

Machine learning researchers have developed various techniques to visualize model decision making. One set of approaches tries to determine areas of an image that are most “salient” from a model’s perspective and produce a saliency map that’s similar to the overlaid occlusion predictions. They’re usually based on taking the gradient of the model’s prediction against an image.

The gradient gives us information about how the prediction would change if a tiny positive change is applied to an individual input image feature (i.e. the RGB values of a pixel), and it does this simultaneously for all image features. This simultaneity makes saliency maps much less computationally intensive than the occlusion method described above.

To better understand gradients, the image below demonstrates how the model’s prediction that the image below is a “cat” would change as we change each pixel. If a change is more likely to make the model think the image is a cat, we use an upward arrow. The thicker the arrow, the bigger a change in gradient for changing that particular pixel. We call pixels which make a big difference “salient”.

The simple gradient-based method is often referred to as Vanilla Gradient since saliency is determined solely by model gradients. Follow up work has built on this approach by transforming the gradients in different ways to improve the explanatory power of the saliency maps.

In our subsequent visualizations, we compute saliency maps over pixels, with white indicating salient pixels (those which most affect the prediction), and black indicating non-salience. Hover over the thumbnail images below to see what their Vanilla Gradient saliency maps look like for our two models.

As you can see, the Vanilla Gradient method tends to be a bit noisy.

There are some fairly simple transformations we can do over the Vanilla Gradient’s approach to reduce its noise. A simple one consists of taking the square of the gradient,3 which emphasizes higher values and focuses on the size of the gradient, ignoring the direction.4

For the Normal Model, saliency maps highlight various features of the images, possibly relevant to the prediction of cats and dogs (like eyes, nose, and body shape), but are quite hard to interpret.

Let’s take a closer look at what the saliency maps of the Watermark Model detect for images of cats 🐱 and dogs 🐶 with and without watermarks:

To faithfully reflect how the Watermark Model is making decisions, you might expect that saliency maps would highlight the watermark area even in watermark-free images. This is generally the case, although saliency maps are less precise and noisier for watermark-free images, especially with Vanilla Gradient.

More Subtle Bias

What if the bias were less obvious? Would simple pixel-based saliency maps still pick it up?

The Watermark Model was trained on a biased dataset where all cat images were watermarked. However, bias is usually more subtle. It rarely affects 100% of your training set but often appears more sporadically.

Below, we have a model trained on a dataset where 50% of the cat images are watermarked.

It appears that the model sometimes uses watermarks for its predictions, as we can see from the mistakes it makes, but how well is this detected by our saliency maps?

We can quantify the effectiveness of saliency maps in flagging watermarks with a simple metric — the proportion of “salient” pixels that are located within the watermark area.5 The recipe for this approach is: (1) we take the smallest rectangle including our suspected spurious correlation6 (the watermark) (2) we select all high gradient values from the saliency map (the “very white” pixels)7 (3) we count how many of these values are in our watermark rectangle.

We implemented this approach in the diagram below for the four different categories of test data. Each circle represents the model prediction on a different image belonging to that category. Hover over the dots to display its corresponding image and saliency map.

The 0% model, where none of the cat images in the training data have watermarks, is our Normal Model. All images of cats (left quadrants) are correctly classified as cats. Nearly all dog images (right quadrants) are correctly classified as dogs.

The 100% model, where all of the cat images in our training data have watermarks, is our Watermark Model. Unsurprisingly, we see that the model makes many mistakes, with the bottom left and top right quadrants misclassified. The points on the charts move to the right end of the x-axis, indicating that most of the high salience pixels lie in the watermark area of the image, helping us understand why our model makes these mistakes. However, the watermark is highlighted a little less clearly for images without a watermark. A few saliency maps in the bottom left quadrant don’t pick up at all the watermark, although images were misclassified.

When the bias is more sporadic in the training set (e.g. affecting only 50% of cat images), it gets even more difficult to detect the spurious correlation using saliency maps, especially when looking at watermark-free images. On the top right quadrant, most of the salient features for misclassified dogs are in the watermark region. However, on the bottom left quadrant, many cat images are misclassified as dogs, likely because they don’t have watermarks, but very few saliency maps actually highlight the watermark area. That’s one limitation of saliency maps: they are not very good at highlighting what’s missing.

Other Forms of Bias

Spurious correlations can take many forms, and are generally more subtle than watermarks. Can simple saliency maps alert us to other forms of bias?

In the diagram below, we have saliency maps for three “mystery models” on four input images.

Can you recognize if any of the models rely on spurious correlation? If so, which ones? When you’ve made up your mind, click on the column title to reveal the model’s characteristics.

You’ve probably noticed how challenging it is to detect biases from saliency maps alone.

For example, “Mystery Model 2” relies on the color of the animal to make its prediction, but the saliency maps for this model seem to highlight the animal’s face and body, which is probably consistent with what a human would consider meaningful. In this case, the saliency maps might even do more harm than good: they might have misled you into thinking that this model was making correct decisions based on pertinent features.

Even when saliency maps can correctly indicate spurious signals, it can be difficult to see those signals when you don’t know what you’re looking for. In several controlled experiments, Adebayo et al. found that saliency maps8 were unable to help people detect certain unknown spurious correlations.

Understanding why your model makes a decision is important for trusting your model, but how much you can trust the explanations themselves is also an important question, and an open research topic. We’ve seen in this post that saliency maps can be useful to detect some biases in datasets but also that it can be difficult to see bias when it’s more subtle and sporadic in your training set. While they can give you insights into what features a model is using or misusing to make its decisions, sometimes saliency maps simply don’t help you draw any conclusions about a model.

In general, it’s always helpful to thoroughly understand your training data. Tools like Know Your Data and LIT help researchers, engineers, product teams and policy teams explore datasets and model predictions to improve data quality and mitigate bias issues.

Additionally, supplementing your analysis with several types of interpretability methods improves the likelihood of detecting errors. In the section below we provide several examples of other useful interpretability methods.

Beyond Simple Saliency Methods

Beyond the simple techniques presented in this post, a myriad of other saliency methods exist. They are broadly divided into three categories. Sensitivity methods, like Vanilla Gradient, show how a small change to the input affects the prediction. Signal methods, like DeConvNet or Guided BackProp, look at the neuron activations in the model to attribute the importance of input features. Finally, attribution methods, like Integrated Gradients9 and SHAP aim at completely specifying the attributions for all the input features so that they sum up to the output.

Saliency methods can be applied to other types of data, like text. There’s also research focused on making saliency maps more “human-interpretable.” Looking at individual pixels is hard for people and difficult to interpret, so techniques like XRAI and LIME instead create maps that highlight the most important regions in the image.

While interpretability research is constantly producing new methods, a complementary line of work is dedicated to critically examining and measuring their limitations. Sanity Checks for Saliency Maps presents different experiments on saliency maps to check that they behave in the way we expect them to.

Furthermore, the research space in interpretability isn’t restricted to saliency methods. For example, influence methods, also known as training data attribution, suggest which training data points might be the cause of a model’s behavior for a given input and output. Some state-of-the-art examples of influence methods are this paper, this one or this one.

Researchers have also explored mapping models’ internal representation to human concepts. In the natural language domain, Bolukbasi et al. used relations between concepts to reduce bias in word embeddings. More recently, Kim et al. popularized the use of human-specified labels for image models, enabling the creation of classifiers for high level concepts like “whisker” or “paw.”


Astrid Bertrand, Adam Pearce and Nithum Thain // December 2022

Thanks to Ben Wedin, Tolga Bolukbasi, Nicole Mitchell, Lucas Dixon, Andrei Kapishnikov, Blair Bilodeau, Been Kim, Jasmijn Bastings, Katja Filippova and Seyed Kamyar Seyed Ghasemipour for their help with this piece.

Please cite as:

Astrid Bertrand, Adam Pearce and Nithum Thain. “Searching for Unintended Biases with Saliency” PAIR Explorables, 2022.


   title={Searching for Unintended Biases with Saliency},
   author={Bertrand, Astrid and Pearce, Adam and Thain, Nithum},
   journal={PAIR Explorables},

Images from Pexels and Kaggle.


“Spurious correlation” is a term to indicate when two variables are correlated but don’t have a causal relationship. In our case, watermarks and cats are spuriously correlated.

The confidence score is also shown.

Taking the square of the vanilla gradient produces less noisy images.

Other, more sophisticated methods exist to “denoise” Vanilla Gradient. For example, SmoothGrad (Smilkov et al., 2017) reduces variance through imperfect copies. The technique consists of taking the saliency maps of several copies of the input image where some noise was added, and averaging them together.

In this diagram we visualize the saliency maps for this model using the Gradient Squared method.

There are other measures we can use to evaluate saliency maps. “Known Spurious Signal Detection Measure” (K-SSD) is very similar: it measures the similarity of saliency maps derived from spurious models to an image where strictly the spurious signal is highlighted. “False Alarm Measure (FAM)” measures the similarity of explanations derived from normal models for spurious inputs to explanations derived from spurious models for the same inputs. See Adebayo et al, 2022 for the full definitions and Denain et al., 2022 for an implementation of the similarity measure using embeddings of saliency maps in a semantic feature space.

We take the 0.5% highest gradient values. The reason we take so little is that most of the gradient values are very close to 0 ( displayed in black in the saliency map). Only a very small fraction (0.5%) is closer to the maximum value.

When there are multiple sources of truth for the model, as in our 50% case, where the model uses both animal and watermark features, the model may only need one type of feature to make the prediction. This means that sometimes, it may not pay attention to the watermark, but actually consider the other important features it learned during training. Therefore, for watermark-free images the model may really infer that a cat is a dog not because there was no watermark but because of poor training.

They used the following saliency methods: Input-Gradient, SmoothGrad, Integrated Gradients (IG), and Guided Backprop (GBP).

Take a look at this post, which describes the Integrated Gradient method in more detail.


Adebayo, Julius, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. “Sanity Checks for Saliency Maps.” arXiv, November 6, 2020. https://doi.org/10.48550/arXiv.1810.03292. Adebayo, Julius, Michael Muelly, Harold Abelson, and Been Kim. “Post Hoc Explanations May Be Ineffective for Detecting Unknown Spurious Correlation,” 2022. https://openreview.net/forum?id=xNOVfCCvDpM. Akyürek, Ekin, Tolga Bolukbasi, Frederick Liu, Binbin Xiong, Ian Tenney, Jacob Andreas, and Kelvin Guu. “Towards Tracing Factual Knowledge in Language Models Back to the Training Data.” arXiv, October 25, 2022. http://arxiv.org/abs/2205.11482. Bastings, Jasmijn, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. “‘Will You Find These Shortcuts?’ A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification.” Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. https://doi.org/10.48550/arXiv.2111.07367. Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” arXiv, July 21, 2016. https://doi.org/10.48550/arXiv.1607.06520. Denain, Jean-Stanislas, and Jacob Steinhardt. “Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior.” arXiv, June 27, 2022. http://arxiv.org/abs/2206.13498. Kapishnikov, Andrei, Tolga Bolukbasi, Fernanda Viégas, and Michael Terry. “XRAI: Better Attributions Through Regions.” arXiv, August 20, 2019. https://doi.org/10.48550/arXiv.1906.02825. Kim, Been, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. “Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV).” arXiv, June 7, 2018. https://doi.org/10.48550/arXiv.1711.11279. Koh, Pang Wei, and Percy Liang. “Understanding Black-Box Predictions via Influence Functions.” arXiv, December 29, 2020. https://doi.org/10.48550/arXiv.1703.04730. Lundberg, Scott, and Su-In Lee. “A Unified Approach to Interpreting Model Predictions.” arXiv, November 24, 2017. https://doi.org/10.48550/arXiv.1705.07874. Pruthi, Garima, Frederick Liu, Satyen Kale, and Mukund Sundararajan. “Estimating Training Data Influence by Tracing Gradient Descent.” In Advances in Neural Information Processing Systems, 33:19920–30. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper/2020/hash/e6385d39ec9394f2f3a354d9d2b88eec-Abstract.html. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” arXiv, August 9, 2016. http://arxiv.org/abs/1602.04938. Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. “Scaling Up Influence Functions.” Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8179–86. https://doi.org/10.1609/aaai.v36i8.20791. Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” arXiv, April 19, 2014. https://doi.org/10.48550/arXiv.1312.6034. Smilkov, Daniel, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. “SmoothGrad: Removing Noise by Adding Noise.” arXiv, June 12, 2017. https://doi.org/10.48550/arXiv.1706.03825. Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. “Striving for Simplicity: The All Convolutional Net.” arXiv, April 13, 2015. http://arxiv.org/abs/1412.6806. Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. “Axiomatic Attribution for Deep Networks.” arXiv, June 12, 2017. https://doi.org/10.48550/arXiv.1703.01365. Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” arXiv, November 28, 2013. https://doi.org/10.48550/arXiv.1311.2901. Zhou, Yilun, Serena Booth, Marco Tulio Ribeiro, and Julie Shah. “Do Feature Attribution Methods Correctly Attribute Features?” arXiv, December 15, 2021. http://arxiv.org/abs/2104.14403.

More Explorables