
Mapping LLMs with Sparse Autoencoders

By Nada Hussein, Shivam Raval, Emily Reif, Jimbo Wilson, Ari Alberich, Neel Nanda, Lucas Dixon, and Nithum Thain
October 2025
Sparse Autoencoders (SAEs) are gaining recognition for their ability to enhance the interpretability of machine learning models. By extracting understandable features, SAEs offer new ways to understand and influence model behavior. In this post, we explain what SAEs are, how to train them, how they can be used to build a global feature map for an LLM, and how you can use them to steer model behavior.

Understanding Large Language Models

Large language models (LLMs) consist of an input layer, a series of hidden layers, and an output layer.

As a model is processing an input sentence, the output of each hidden layer is referred to as an activation.
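
To make this concrete, here is a minimal sketch of what capturing an activation can look like in code. The model below is a toy stack of linear layers standing in for an LLM, and the layer sizes are arbitrary placeholders; the point is only to show how a forward hook can record a hidden layer's output for later analysis.

```python
import torch
import torch.nn as nn

# A toy stand-in for an LLM: an input layer, hidden layers, and an output layer.
# The sizes are arbitrary placeholders.
model = nn.Sequential(
    nn.Linear(64, 128),   # "input layer"
    nn.ReLU(),
    nn.Linear(128, 128),  # hidden layer whose output we want to inspect
    nn.ReLU(),
    nn.Linear(128, 64),   # "output layer"
)

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden layer's output (the "activation") for later analysis.
    captured["activation"] = output.detach()

# A forward hook records the layer's output every time the model runs.
model[2].register_forward_hook(save_activation)

# Pretend this vector is the embedding of one token in an input sentence.
token_embedding = torch.randn(1, 64)
_ = model(token_embedding)

print(captured["activation"].shape)  # torch.Size([1, 128])
```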

Interpretability researchers are interested in understanding how and where a model encodes certain features in its activations.

Features represent concepts that the model might use as part of its internal reasoning to arrive at an answer. Ideally, activations would be monosemantic, with each index encoding exactly one feature. This would allow us to more easily figure out the meaning of each feature by examining the inputs for which it is activated.

For example, if the model is trying to answer a question about A Midsummer Night's Dream, it might use concepts like Shakespeare, literature, and nature.

In practice, however, each index might activate for many different features, resulting in polysemantic activations. This behavior is reflected in the superposition hypothesis, which theorizes that models often need to represent more features than they have neurons. This makes it very difficult to disentangle the activations to understand how the model is representing features and examine its underlying reasoning mechanisms.
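
A tiny numerical illustration of the superposition idea (with made-up feature directions, not ones taken from a real model): if we pack three feature directions into only two neurons, no pair of directions can be orthogonal, so reading off one feature inevitably picks up interference from the others.

```python
import numpy as np

# Three hypothetical feature directions packed into only two neurons,
# spaced 120 degrees apart so that no pair is orthogonal.
angles = np.deg2rad([0.0, 120.0, 240.0])
feature_dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# An activation in which only feature 0 is "on".
activation = feature_dirs[0]

# Reading each feature off the activation with a dot product picks up
# interference from the other, non-orthogonal features.
readout = feature_dirs @ activation
print(np.round(readout, 2))  # [ 1.  -0.5 -0.5]
```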

Training Sparse Autoencoders

The goal of an SAE is to disentangle these polysemantic activations, separating them into monosemantic vectors that we can connect with individual features. To do this, SAEs use a specific type of neural network, called an autoencoder, which is trained to transform activations into a new representation and then reconstruct them from it.

Once trained, an SAE runs polysemantic activations, like the ones extracted from the model, through an encoder...

...and outputs a larger, sparser representation of the activation vector, referred to as the latent. This latent is larger to reflect the superposition hypothesis that there may be more features being represented than neurons in the activation.

To encourage monosemanticity, the autoencoder network is trained with a sparsity penalty which rewards the model for activating fewer latent neurons.

The latent can then be run through a decoder...

...to reconstruct the original activations.

To ensure that the latent is a suitably accurate decomposition of the activations, the SAE is trained to recreate the original activations by minimizing the reconstruction loss, the difference between the original activations and their reconstructions.
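
Putting these pieces together, here is a minimal training sketch in the spirit of the description above: an encoder that expands the activation into a larger latent, a ReLU that keeps the latent non-negative, a decoder that maps it back, and a loss that combines the reconstruction error with an L1 sparsity penalty. The dimensions, penalty weight, and optimizer settings are illustrative choices rather than those of any particular SAE.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=128, d_latent=1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)  # expand into a larger latent
        self.decoder = nn.Linear(d_latent, d_model)  # map back to activation space

    def forward(self, activation):
        latent = torch.relu(self.encoder(activation))  # sparse, non-negative latent
        reconstruction = self.decoder(latent)
        return latent, reconstruction

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
sparsity_weight = 1e-3  # illustrative weight for the L1 penalty

# Pretend these are activations collected from one of the LLM's hidden layers.
activations = torch.randn(256, 128)

# One training step: reconstruction loss plus a sparsity penalty on the latent.
latent, reconstruction = sae(activations)
reconstruction_loss = ((reconstruction - activations) ** 2).mean()
sparsity_loss = latent.abs().mean()  # rewards activating fewer latent neurons
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the activations would be collected from a real LLM over a large corpus, and the sparsity weight is tuned to trade off reconstruction accuracy against how few latent neurons fire.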

Labeling the Latent

Once we have an accurate and sparse decomposition, we can use an autolabeler to assign feature labels to each index of the latent. Hover over each latent index to see the top activations that fired for that label. In the Labeling Neurons section below, we will take a deeper dive into a real-world example of an SAE and explore the rich labels that are extracted.

Steering with SAEs

With labels for each feature in the model, we can use the SAE to attempt to steer the model. Since we know which indices in the latent correspond to which features, if we want the model to lean more heavily into the concept a feature represents, we can increase its activation in the latent. On the other hand, if we want the model to reflect that concept less, we can decrease its activation in the latent. Once this is done, we can use the remainder of the autoencoder network to build the reconstruction and substitute it for our original activation.
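
As a sketch of this steering step, assume we have a trained SAE like the one above and already know which latent index corresponds to the feature we care about. The feature index and steering strength below are made-up values, and the encoder and decoder weights are stand-ins for trained ones.

```python
import torch
import torch.nn as nn

d_model, d_latent = 128, 1024
encoder = nn.Linear(d_model, d_latent)  # stand-ins for a trained SAE's weights
decoder = nn.Linear(d_latent, d_model)

feature_index = 42       # hypothetical index of the "cats" feature in the latent
steering_strength = 5.0  # how strongly to push the model toward that concept

@torch.no_grad()
def steer(activation):
    latent = torch.relu(encoder(activation))
    latent[:, feature_index] += steering_strength  # turn the feature up
    # (to make the model reflect a concept less, shrink its entry toward zero)
    return decoder(latent)  # the steered reconstruction

# The steered reconstruction would then replace the original activation
# in the model's forward pass at the layer the SAE was trained on.
original_activation = torch.randn(1, d_model)
steered_activation = steer(original_activation)
print(steered_activation.shape)  # torch.Size([1, 128])
```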

For example, here we activate the “cats” feature and see how this changes the model's behavior.

Once we've modified the latent by strengthening this feature, we can run this modified representation through the decoder…

...and create a reconstruction that reflects the steered feature. Notice that because we have modified the latent, this reconstruction is different from the original activation vector, as it now amplifies the “cat” concept.

We then plug this reconstruction back into the model, and see the model's steered response!

Use the buttons below to change the steered feature.

Labeling Neurons

Training an autoencoder lets us transform polysemantic activations into a sparse feature representation, but how do we know what each feature represents? To address this, we have to label each neuron in the latent with its corresponding feature.

A common approach is to look at the training data examples where the corresponding latent neuron activates. After obtaining this list of examples, we can give it to a standard LLM and ask it to identify a common label that represents them.
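
Here is a sketch of that labeling recipe, assuming we have already run a small set of snippets through the SAE and stored their latent activations. The snippets, the top-k choice, and the prompt wording are all illustrative, and the final call to an LLM is left out, since any capable model could be used.

```python
import numpy as np

# Hypothetical inputs: text snippets from the training data and the latent
# activations the SAE produced for them (one row per snippet).
snippets = [
    "the cat sat on the mat",
    "dogs love to play fetch",
    "my kitten chased the laser",
    "stocks fell sharply today",
]
latent_activations = np.random.rand(len(snippets), 1024)

def top_examples(neuron_index, k=3):
    # Pick the k snippets on which this latent neuron fires most strongly.
    order = np.argsort(-latent_activations[:, neuron_index])[:k]
    return [snippets[i] for i in order]

def labeling_prompt(neuron_index):
    examples = "\n".join(f"- {s}" for s in top_examples(neuron_index))
    return (
        "These text snippets all strongly activate the same feature:\n"
        f"{examples}\n"
        "Give a short label describing what they have in common."
    )

# This prompt would then be sent to an LLM of your choice, and its answer
# becomes the label for that latent neuron.
print(labeling_prompt(neuron_index=7))
```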

Click an unlabeled neuron below to walk through this process on an illustrative dataset.

Activation Map

After labeling the latent activations with their corresponding feature labels, we can obtain a global feature map of our model. In short, what features does the model use to reason and how do they cluster? It can be hard to sift through thousands of features, so we visualize them using UMAP.
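
As a sketch of how such a map can be built, one reasonable choice is to treat each latent feature's decoder direction as its embedding, project those embeddings to 2D with UMAP, and group nearby points with hierarchical clustering. The array sizes below are placeholders kept small so the sketch runs quickly; the real map covers all of Gemma Scope's features.

```python
import numpy as np
import umap  # pip install umap-learn
from sklearn.cluster import AgglomerativeClustering

# Hypothetical feature embeddings, one decoder direction per latent feature
# (kept small here; the real map covers all 16,384 Gemma Scope features).
num_features, d_model = 2000, 256
feature_directions = np.random.randn(num_features, d_model).astype(np.float32)

# Project the features down to 2D for plotting.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(feature_directions)

# Group nearby features with hierarchical clustering; each cluster's feature
# labels can then be passed to an LLM to produce a descriptive cluster name.
cluster_ids = AgglomerativeClustering(n_clusters=25).fit_predict(coords)
print(coords.shape, cluster_ids.shape)  # (2000, 2) (2000,)
```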

Below, we show the map for the 16,384 features of Gemma Scope. The features are colored by hierarchical clustering, and the descriptive cluster labels are also created using an LLM. We can see that a large number of the features are devoted to programming and code, likely reflecting the training data distribution.

Hover over a feature to see its label, or search to see if a given feature is present in Gemma Scope.

Conclusion

The Gemma Scope Neuronpedia demo below lets you interact directly with the Gemma Scope SAE. Try steering the model with different features to explore how the model's behavior changes. What happens if you change the strength of steering? While there is still a lot that we don’t understand about SAEs, they are an important part of the toolbox we are developing to better understand and control how LLMs work for us.

Credits

Thanks to Fernanda Viégas, Mike Mozer, James Wexler, Ryan Mullins and Martin Wattenberg for their help with this piece.
