
Mapping LLMs with Sparse Autoencoders

By Nada Hussein, Shivam Raval, Emily Reif, Jimbo Wilson, Ari Alberich, Neel Nanda, Lucas Dixon, and Nithum Thain
October 2025
Sparse Autoencoders (SAEs) are gaining recognition for their ability to enhance the interpretability of machine learning models. By extracting understandable features, SAEs offer new ways to understand and influence model behavior. In this post, we explain what SAEs are, how to train them, how they can be used to build a global feature map for an LLM, and how you can use them to steer model behavior.

Understanding Large Language Models

Large language models (LLMs) consist of an input layer, a series of hidden layers, and an output layer.

As a model is processing an input sentence, the output of each hidden layer is referred to as an activation.
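
To make this concrete, here is a minimal sketch of what capturing an activation can look like in code. The model below is a toy stack of linear layers standing in for an LLM, and the layer sizes are arbitrary placeholders; the point is only to show how a forward hook can record a hidden layer's output for later analysis.

```python
import torch
import torch.nn as nn

# A toy stand-in for an LLM: an input layer, hidden layers, and an output layer.
# The sizes are arbitrary placeholders.
model = nn.Sequential(
    nn.Linear(64, 128),   # "input layer"
    nn.ReLU(),
    nn.Linear(128, 128),  # hidden layer whose output we want to inspect
    nn.ReLU(),
    nn.Linear(128, 64),   # "output layer"
)

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden layer's output (the "activation") for later analysis.
    captured["activation"] = output.detach()

# A forward hook records the layer's output every time the model runs.
model[2].register_forward_hook(save_activation)

# Pretend this vector is the embedding of one token in an input sentence.
token_embedding = torch.randn(1, 64)
_ = model(token_embedding)

print(captured["activation"].shape)  # torch.Size([1, 128])
```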

Interpretability researchers are interested in understanding how and where a model encodes certain features in its activations.

Features represent concepts that the model might use as part of its internal reasoning to arrive at an answer. Ideally, activations would be monosemantic, with each index encoding exactly one feature. This would allow us to more easily figure out the meaning of each feature by examining the inputs for which it is activated.

For example, if the model is trying to answer a question about A Midsummer Night's Dream, it might use concepts like Shakespeare, literature, and nature.

In practice, however, each index might activate for many different features, resulting in polysemantic activations. This behavior is reflected in the superposition hypothesis, which theorizes that models often need to represent more features than they have neurons. This makes it very difficult to disentangle the activations to understand how the model is representing features and examine its underlying reasoning mechanisms.
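
A tiny numerical illustration of the superposition idea (with made-up feature directions, not ones taken from a real model): if we pack three feature directions into only two neurons, no pair of directions can be orthogonal, so reading off one feature inevitably picks up interference from the others.

```python
import numpy as np

# Three hypothetical feature directions packed into only two neurons,
# spaced 120 degrees apart so that no pair is orthogonal.
angles = np.deg2rad([0.0, 120.0, 240.0])
feature_dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# An activation in which only feature 0 is "on".
activation = feature_dirs[0]

# Reading each feature off the activation with a dot product picks up
# interference from the other, non-orthogonal features.
readout = feature_dirs @ activation
print(np.round(readout, 2))  # [ 1.  -0.5 -0.5]
```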

Training Sparse Autoencoders

The goal of an SAE is to disentangle these polysemantic activations, separating them into monosemantic vectors that we can connect with individual features. To do this, SAEs use a specific type of neural network, called an autoencoder, which is trained to transform activations into a new representation and then reconstruct them from it.

Once trained, an SAE runs polysemantic activations, like the ones extracted from the model, through an encoder...

...and outputs a larger, sparser representation of the activation vector, referred to as the latent. This latent is larger to reflect the superposition hypothesis that there may be more features being represented than neurons in the activation.

To encourage monosemanticity, the autoencoder network is trained with a sparsity penalty which rewards the model for activating fewer latent neurons.

The latent can then be run through a decoder...

...to reconstruct the original activations.

To ensure that the latent is a suitably accurate decomposition of the activations, the SAE is trained to recreate the original activations by minimizing the reconstruction loss, the difference between the original activations and their reconstructions.
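
Putting these pieces together, here is a minimal training sketch in the spirit of the description above: an encoder that expands the activation into a larger latent, a ReLU that keeps the latent non-negative, a decoder that maps it back, and a loss that combines the reconstruction error with an L1 sparsity penalty. The dimensions, penalty weight, and optimizer settings are illustrative choices rather than those of any particular SAE.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=128, d_latent=1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)  # expand into a larger latent
        self.decoder = nn.Linear(d_latent, d_model)  # map back to activation space

    def forward(self, activation):
        latent = torch.relu(self.encoder(activation))  # sparse, non-negative latent
        reconstruction = self.decoder(latent)
        return latent, reconstruction

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
sparsity_weight = 1e-3  # illustrative weight for the L1 penalty

# Pretend these are activations collected from one of the LLM's hidden layers.
activations = torch.randn(256, 128)

# One training step: reconstruction loss plus a sparsity penalty on the latent.
latent, reconstruction = sae(activations)
reconstruction_loss = ((reconstruction - activations) ** 2).mean()
sparsity_loss = latent.abs().mean()  # rewards activating fewer latent neurons
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the activations would be collected from a real LLM over a large corpus, and the sparsity weight is tuned to trade off reconstruction accuracy against how few latent neurons fire.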

Labeling the Latent

Once we have an accurate and sparse decomposition, we can use an autolabeler to assign feature labels to each index of the latent. Hover over each latent index to see the top activations that fired for that label. In the Labeling Neurons section below, we will take a deeper dive into a real-world example of an SAE and explore the rich labels that are extracted.

Steering with SAEs

With labels for each feature in the model, we can use the SAE to attempt to steer the model. Since we know which indices in the latent correspond to which features, if we want the model to lean more heavily into the concept a feature represents, we can increase its activation in the latent. On the other hand, if we want the model to reflect that concept less, we can decrease its activation in the latent. Once this is done, we can use the remainder of the autoencoder network to build the reconstruction and substitute it for our original activation.
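
As a sketch of this steering step, assume we have a trained SAE like the one above and already know which latent index corresponds to the feature we care about. The feature index and steering strength below are made-up values, and the encoder and decoder weights are stand-ins for trained ones.

```python
import torch
import torch.nn as nn

d_model, d_latent = 128, 1024
encoder = nn.Linear(d_model, d_latent)  # stand-ins for a trained SAE's weights
decoder = nn.Linear(d_latent, d_model)

feature_index = 42       # hypothetical index of the "cats" feature in the latent
steering_strength = 5.0  # how strongly to push the model toward that concept

@torch.no_grad()
def steer(activation):
    latent = torch.relu(encoder(activation))
    latent[:, feature_index] += steering_strength  # turn the feature up
    # (to make the model reflect a concept less, shrink its entry toward zero)
    return decoder(latent)  # the steered reconstruction

# The steered reconstruction would then replace the original activation
# in the model's forward pass at the layer the SAE was trained on.
original_activation = torch.randn(1, d_model)
steered_activation = steer(original_activation)
print(steered_activation.shape)  # torch.Size([1, 128])
```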

For example, here we activate the “cats” feature and see how this changes the model's behavior.

Once we've modified the latent by strengthening this feature, we can run this modified representation through the decoder…

...and create a reconstruction that reflects the steered feature. Notice that because we have modified the latent, this reconstruction is different from the original activation vector, as it now amplifies the “cat” concept.

We then plug this reconstruction back into the model, and see the model's steered response!

Use the buttons below to change the steered feature.

Labeling Neurons

Training an autoencoder lets us transform polysemantic activations into a sparse feature representation, but how do we know what each feature represents? To address this, we have to label each neuron in the latent with its corresponding feature.

A common approach is to look at the training data examples where the corresponding latent neuron activates. After obtaining this list of examples, we can give it to a standard LLM and ask it to identify a common label that represents them.
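
Here is a sketch of that labeling recipe, assuming we have already run a small set of snippets through the SAE and stored their latent activations. The snippets, the top-k choice, and the prompt wording are all illustrative, and the final call to an LLM is left out, since any capable model could be used.

```python
import numpy as np

# Hypothetical inputs: text snippets from the training data and the latent
# activations the SAE produced for them (one row per snippet).
snippets = [
    "the cat sat on the mat",
    "dogs love to play fetch",
    "my kitten chased the laser",
    "stocks fell sharply today",
]
latent_activations = np.random.rand(len(snippets), 1024)

def top_examples(neuron_index, k=3):
    # Pick the k snippets on which this latent neuron fires most strongly.
    order = np.argsort(-latent_activations[:, neuron_index])[:k]
    return [snippets[i] for i in order]

def labeling_prompt(neuron_index):
    examples = "\n".join(f"- {s}" for s in top_examples(neuron_index))
    return (
        "These text snippets all strongly activate the same feature:\n"
        f"{examples}\n"
        "Give a short label describing what they have in common."
    )

# This prompt would then be sent to an LLM of your choice, and its answer
# becomes the label for that latent neuron.
print(labeling_prompt(neuron_index=7))
```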

Click an unlabeled neuron below to walk through this process on an illustrative dataset.

Activation Map

After labeling the latent activations with their corresponding feature labels, we can obtain a global feature map of our model. In short, what features does the model use to reason and how do they cluster? It can be hard to sift through thousands of features, so we visualize them using UMAP.
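
As a sketch of how such a map can be built, one reasonable choice is to treat each latent feature's decoder direction as its embedding, project those embeddings to 2D with UMAP, and group nearby points with hierarchical clustering. The array sizes below are placeholders kept small so the sketch runs quickly; the real map covers all of Gemma Scope's features.

```python
import numpy as np
import umap  # pip install umap-learn
from sklearn.cluster import AgglomerativeClustering

# Hypothetical feature embeddings, one decoder direction per latent feature
# (kept small here; the real map covers all 16,384 Gemma Scope features).
num_features, d_model = 2000, 256
feature_directions = np.random.randn(num_features, d_model).astype(np.float32)

# Project the features down to 2D for plotting.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(feature_directions)

# Group nearby features with hierarchical clustering; each cluster's feature
# labels can then be passed to an LLM to produce a descriptive cluster name.
cluster_ids = AgglomerativeClustering(n_clusters=25).fit_predict(coords)
print(coords.shape, cluster_ids.shape)  # (2000, 2) (2000,)
```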

Below, we show the map for the 16,384 features of Gemma Scope. The features are colored by hierarchical clustering, and the descriptive cluster labels are also created using an LLM. We can see that a large number of the features are devoted to programming and code, likely reflecting the training data distribution.

Hover over a feature to see its label, or search to see if a given feature is present in Gemma Scope.

Conclusion

The Gemma Scope Neuronpedia demo below lets you interact directly with the Gemma Scope SAE. Try steering the model with different features to explore how the model's behavior changes. What happens if you change the strength of steering? While there is still a lot that we don’t understand about SAEs, they are an important part of the toolbox we are developing to better understand and control how LLMs work for us.

Credits

Thanks to Fernanda Viégas, Mike Mozer, James Wexler, Ryan Mullins and Martin Wattenberg for their help with this piece.
