
Can Large Language Models Explain Their Internal Mechanisms?

By Nada Hussein, Asma Ghandeharioun, Ryan Mullins, Emily Reif, Jimbo Wilson, Nithum Thain and Lucas Dixon
July 2024

The “mind” of a Large Language Model (LLM) consists of layers of interconnected artificial neurons. These layers communicate with vectors of numbers, often called the hidden representations. Deriving human-comprehensible meaning from such internals of AI systems is the focus of research into machine-learning interpretability, which has been making many exciting advances recently.

This Explorable is about one of these advances: a new family of interpretability methods called Patchscopes. The idea is to perform a kind of surgery on the neurons of an LLM, cutting out and replacing hidden representations between different prompts and layers. The key concept is the inspection prompt, which acts as a lens into the mind of an LLM, allowing the model itself to help uncover human-interpretable meaning.

Patchscopes is built on an understanding of LLMs and the transformer architecture. For a deeper dive into transformers and the schematic diagrams we use throughout this Explorable, see Appendix A.

Letting representations “talk” with Patchscopes

The Patchscopes framework leverages a simple premise: LLMs have the inherent ability to translate their own seemingly inscrutable hidden representations into human-understandable text. Patching hidden representations between locations during inference allows us to inspect the information within a hidden representation, understand LLM behavior, or even augment the LLM’s behavior to improve its performance. Let’s explore this concretely with a step-by-step example.
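To make this concrete in code, here is a minimal, hypothetical sketch of the core operation, assuming a Hugging Face-style decoder-only model; the model name, module path, and helper functions are illustrative rather than part of the Patchscopes release:

```python
# A minimal sketch of extracting a hidden representation from a source prompt
# and injecting it into an inspection prompt via a forward hook. Assumes a
# decoder-only model whose transformer blocks live at `model.model.layers`
# (true for Llama/Gemma-style models in transformers); adjust the module path
# for other architectures.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-9b"  # illustrative choice of decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def get_hidden(prompt: str, layer: int, position: int) -> torch.Tensor:
    """Run the source prompt and return one hidden representation.
    hidden_states[0] is the embedding output, so block `layer` is index layer + 1."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, position].clone()

def patched_generate(prompt: str, layer: int, position: int,
                     vector: torch.Tensor, max_new_tokens: int = 20) -> str:
    """Generate from an inspection prompt with the hidden representation at
    (layer, position) replaced by `vector` (identity transform)."""
    state = {"patched": False}

    def hook(module, args, output):
        # Patch only on the prefill pass over the full prompt; with KV caching,
        # later decoding steps only see the newly generated token.
        if state["patched"]:
            return output
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, position] = vector.to(device=hidden.device, dtype=hidden.dtype)
        state["patched"] = True
        return output

    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)
    finally:
        handle.remove()
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

Later sketches in this post reuse these two hypothetical helpers.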

Patchscopes is a surprisingly versatile tool. It can be used for tasks like extracting factual information, such as the country in which Barcelona is located; understanding the mechanisms of model refusal, such as refusing to respond to a prompt that is perceived to come from a person with malicious intent; debugging incorrect outputs; or even finding latent harmful information that might exist in a model even if it isn’t verbalized.

In the following sections we explore three case studies in which Patchscopes is used to: (1) discover how models resolve entities in their early layers; (2) evaluate how accurately the model’s hidden representation captures well-known concepts; and (3) augment the model’s processing of complex questions. These case studies touch upon some of the aspects of Patchscopes that require human judgment, such as determining good target locations for a given source location, or how to construct prompts that enable model inspection—although there are many more dimensions and design considerations to be explored. (For a formal definition of Patchscopes, see Appendix B.)

Patchscopes is being actively developed, and we invite your explorations, commentary, and feedback on GitHub.

Case-study: Verbalizing the model’s “thought” process

An LLM’s ability to perform a task—answering questions, summarizing documents, translating languages—is dependent on its ability to correctly contextualize the tokens in its prompt. For example, to answer the question “When was Diana, Princess of Wales born?” the model must understand that “Diana” refers to “Princess Diana”, rather than the generic name, or the ancient Roman goddess of the hunt. At what point does the model make this association, if at all?

In this case study, we use Patchscopes to shed light on how models process tokens across the layers of the model, and update their hidden representations to resolve what an entity refers to. We focus on understanding hidden representations of named entities—people, places, movies, etc.

Entity resolution & establishing context from a prompt

We are first going to identify how the model resolves the entity Diana from a source prompt. This is the first step in understanding how the model stores and accesses factual information, which can enable other tasks such as correcting outdated information. Building on the Princess Diana example, we construct a simple source prompt using her name and title:

We can measure how an entity is resolved by having the model generate a description of the entity and compare it to a known factual description. We can trigger this generation with a few-shot inspection prompt consisting of entity names followed by their description, and a final placeholder token x, which will be used as the target patching location.

We feed the source prompt into a 40-layer model and extract the hidden representation of the Wales token at every layer. Next, each hidden representation is patched into the first layer in place of the x token in the inspection prompt. Finally, the LLM continues decoding on this patched inspection prompt to generate the entity description for the given hidden representation of the Wales token.
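Concretely, under the assumptions of the sketch above (and using the few-shot inspection prompt quoted in the footnotes), the per-layer sweep might look like this; token positions are illustrative and tokenizer-dependent:

```python
# Sweep source layers: extract the "Wales" representation at each layer and
# patch it into the first layer of the few-shot inspection prompt, in place
# of the placeholder token "x" (reusing the hypothetical helpers above).
source_prompt = "Diana, Princess of Wales"
inspection_prompt = ("Syria: Country in the Middle East, "
                     "Leonardo DiCaprio: American actor, "
                     "Samsung: South Korean multinational major appliance "
                     "and consumer electronics corporation, x")

# Assume "Wales" and "x" end up as the final tokens of their respective prompts.
wales_pos = tokenizer(source_prompt, return_tensors="pt")["input_ids"].shape[1] - 1
x_pos = tokenizer(inspection_prompt, return_tensors="pt")["input_ids"].shape[1] - 1

descriptions = {}
for source_layer in range(model.config.num_hidden_layers):
    vec = get_hidden(source_prompt, source_layer, wales_pos)
    descriptions[source_layer] = patched_generate(
        inspection_prompt, layer=0, position=x_pos, vector=vec)
```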

In the table below, we show the generated descriptions from the first 10 source layers, and use an LLM-based automated evaluator to score the similarity of each output to the Wikipedia description of the original source prompt on a scale of 1-10.

Looking across all 40 layers, we start to see a pattern in the per-layer automated evaluator scores – entity resolution typically happens in the early layers (layers < 20) of the model. The general pattern of resolving in the early layers corresponds with theories about layer function: the role of early layers is to establish context from the prompt.

Tokenization can change how each layer processes information

Tokenization differs between model families, and has a strong impact on how the model navigates its embedding space. In the following visualization, we’ll focus on a tokenized representation of an example source prompt, “Dubai”:

Using the same few-shot entity description inspection prompt seen above, this example demonstrates the influence of token-by-layer processing on the generated text – we see the outputs change as the model incorporates more tokens into its context.

From TV shows to countries: Exploring classes of named entities

The visualization below allows you to explore more examples of named entities on your own. While this dataset is by no means exhaustively descriptive of the model’s behavior, we do see differences in where the model resolves the target entity across different classes—TV shows, people, countries—and even across token counts. Careful observers will notice a pattern in the way the model processes information: it builds up context by sequentially processing tokens in reverse order, gradually incorporating each token into the hidden representation. Further work is required to understand these patterns at scale.

To quickly recap this case study: Patchscopes provides a highly flexible method for defining experiments that extract, verify, and characterize the model’s information retrieval process, and it yields highly expressive generations from early layers, where models are believed to do this processing.

Case-study: Extracting latent attributes

We can also use Patchscopes to take this a step further, exploring how a model relates subjects to their specific attributes at various model layers.

Feature extraction to corroborate hidden representations

Consider the subject Spain. Can we determine whether our LLM is able to gather context about Spain and correctly identify attributes like its largest city or official currency? More generally, given a hidden representation of a subject, we can explore whether an LLM can extract a specific attribute of that subject.

With Patchscopes, we can formulate an inspection prompt to extract an attribute. This prompt consists of a description of the attribute and a placeholder token, x, into which the hidden representation of the source subject is patched. For example, if we’re interested in the feature “the largest city” of a certain country, an inspection prompt that should generate the desired attribute in text looks like:

To extract the feature, we first capture the hidden representation from the last token of the source prompt and patch it into the x token in the inspection prompt above. We then determine if the correct answer, in this case Madrid, appears in the target model’s output. The visualization below lets you explore this example for a range of source layers.
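In code, this experiment looks roughly like the following sketch (reusing the hypothetical helpers from the first sketch; the inspection prompt and placeholder handling are illustrative):

```python
# Attribute extraction: patch the subject's representation into a zero-shot
# inspection prompt and check whether the expected attribute is generated.
source_prompt = "Spain"
inspection_prompt = "The largest city of x is"  # illustrative attribute prompt

src_pos = tokenizer(source_prompt, return_tensors="pt")["input_ids"].shape[1] - 1

# Locate the placeholder "x" in the inspection prompt (assumed to tokenize to
# a single token; verify this for the tokenizer actually in use).
insp_ids = tokenizer(inspection_prompt, return_tensors="pt")["input_ids"][0]
x_id = tokenizer(" x", add_special_tokens=False)["input_ids"][0]
x_pos = (insp_ids == x_id).nonzero()[0].item()

hits = {}
for layer in range(model.config.num_hidden_layers):
    vec = get_hidden(source_prompt, layer, src_pos)
    # Patching into the same layer index is one simple choice; the experiments
    # described in the post also sweep target layers (see the footnotes).
    generation = patched_generate(inspection_prompt, layer=layer,
                                  position=x_pos, vector=vec)
    hits[layer] = "Madrid" in generation
```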

Comparing Patchscopes to probing

The most comparable approach to this problem is a technique called probing – training a classifier (such as logistic regression) on a subset of data to allow it to predict an attribute from a hidden representation. This approach has a few downsides – namely, it requires a pre-specified set of labels, which can limit the expressivity of the output, and it requires dedicated data to train the classifier.
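For contrast, a minimal probing baseline might look like the sketch below; the data files and labels are hypothetical placeholders, and the point is that the probe needs labeled training data at a fixed layer:

```python
# A minimal probing baseline: train a logistic-regression classifier to
# predict an attribute label from frozen hidden representations at one layer.
# The .npy files below are hypothetical placeholders for a labeled dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.load("hidden_reps_train.npy")    # (num_examples, hidden_dim)
y_train = np.load("attribute_labels_train.npy")
X_test = np.load("hidden_reps_test.npy")
y_test = np.load("attribute_labels_test.npy")

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```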

Unlike probing, Patchscopes does not require any labeled data or supervised training. Additionally, we show that Patchscopes is more accurate than probing in many contexts in the full paper. The visualization below lets us explore a variety of reasoning tasks and compare the performance of patching and probing.

A few noticeable patterns emerge from these examples. Patchscopes outperforms probing in early layers on 7 of the 8 tasks. However, in 4 of the 8 tasks, probing outperforms patching in some later layers.

Importantly, these results suggest that Patchscopes is a viable alternative to probing classifiers, while offering two improvements. First, the flexibility and simplicity of constructing feature extraction inspection prompts indicate that the prompts could be easily generated from existing, trusted structured data that enables direct performance comparison, such as Wikipedia’s RDF dumps. Second, Patchscopes does not require any additional training data or a predefined set of labels.

Application: Correcting reasoning errors

In the case studies above, we demonstrate that Patchscopes is a flexible and expressive tool for inspecting a model. Can we go further and use the same framework to change model behavior to improve its performance?

Consider multi-hop reasoning, a problem formulation where the answer depends on making logical connections between disjointed pieces of information. For example, to determine the answer to ‘the largest city in sushi’s country of origin’, a model needs to correctly recognize that sushi’s country of origin is Japan in order to answer with Japan’s largest city, Tokyo.

One possible performance inhibitor on these problems may be the model failing to conduct sequential reasoning in the right order. Chain-of-thought (CoT) style prompts improve performance on multi-hop reasoning problems by explicitly expressing reasoning as a series of steps during generation, but there is still room for improvement. Patchscopes may provide an alternative mechanism to control the order of reasoning steps and ultimately correct the generated output.

Patching backwards

To explore this, we constructed a small dataset of two-clause multi-hop reasoning queries. We used examples from the attribute extraction case study, where the first clause defines the scope of the answer, and the second clause establishes the necessary context. The model can correctly answer each of these clauses independently, but fails to answer the composite multi-hop query. With some knowledge about the query structure, we can create a Patchscope that works backward, patching to an earlier token in the same prompt to intervene and correct its answer, as shown in the example below:

By generalizing the query structure that defines the Patchscope above, we start to see some patterns across our multi-hop reasoning queries:

Explore the accuracy of patched generations

The visualization below allows you to explore these examples individually. Use the accuracy grid in the center to explore the patched generations for different source and target layer pairs, and compare them to the model’s baseline response. As described above, the hidden representation is extracted from the last token in the multi-hop query, and we show the associated entity description below the accuracy grid to help you locate useful source layers.
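Under the assumptions of the earlier sketches, producing such an accuracy grid amounts to a double sweep over source and target layers. The query and token positions below are illustrative, and the answer check is a simple substring match:

```python
# Backward patching within a single prompt: extract the representation of the
# last token (which should resolve "sushi's country of origin" to Japan) and
# inject it at an earlier position, so the first clause can act on it
# (reusing the hypothetical helpers from the first sketch). Note that a full
# sweep like this runs one generation per (source, target) layer pair.
query = "The largest city in sushi's country of origin is"
query_ids = tokenizer(query, return_tensors="pt")["input_ids"][0]

last_pos = query_ids.shape[0] - 1
earlier_pos = 4  # illustrative: a token position inside the first clause

results = {}
for source_layer in range(model.config.num_hidden_layers):
    vec = get_hidden(query, source_layer, last_pos)
    for target_layer in range(model.config.num_hidden_layers):
        patched = patched_generate(query, layer=target_layer,
                                   position=earlier_pos, vector=vec)
        # Compare each patched continuation against the expected answer, "Tokyo".
        results[(source_layer, target_layer)] = "Tokyo" in patched
```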

The findings presented above are by no means conclusive, but they do demonstrate the utility of Patchscopes as a method for making sense of model behavior regardless of whether that model is generating the “correct” output.

Discussion and open questions

In the explorations above, we introduced Patchscopes—a method for understanding hidden representations in models—and established it as a flexible, highly expressive tool for understanding and even augmenting model behavior, which improves and unifies prior work (e.g., vocabulary projection, probing classifiers, and computational interventions) into a shared theoretical framework. We are excited that the core idea of Patchscopes has already been adopted more broadly in interpretability research as a stepping stone to advance our work on model understanding.

The case studies explored in this post capture a range of scenarios, from input processing to attribute extraction to factual reasoning correction, but LLMs are used for an incredibly wide array of tasks, including mathematical reasoning, classification, code completion, and more. Each of these tasks provides many prompts to analyze, but significant research is required to understand how to create the relevant source or inspection prompts that enable human inspection of the model’s reasoning processes in these tasks. While we do not yet have generalizable heuristics or design guidelines, we do expect that certain classes of inspection prompts will have wide applicability as intermediate inspection tools across tasks and will support the creation of task-specific inspection prompts. Examples of these include, but are not limited to, the few-shot entity description and zero-shot attribute extraction prompts described in this post.

Our overarching objective is to develop guidelines and tools that enable the effective application of Patchscopes across the widest possible variety of modeling tasks. To this end, we are exploring several research directions, such as broadening task applications, automating task-specific and task-agnostic patching configurations, and exploring non-identity hidden representation transforms.

The effectiveness of a patching configuration—the layer, prompt, and token position choices—depends on how information propagates during inference. In the experiments we discussed above, simple heuristics worked well in configuring an effective Patchscope. When using the same source and target model, picking the same layer for source and target is a good start. If the analytical goal is more expressive, open-ended generation, e.g., to enable verification, we may pick target layers earlier than the source layer, as we did for entity resolution. Patching into later target layers seems generally less useful, as shown in the multi-hop reasoning application, likely because the model has already shifted from sense-making to decoding. Considering token positions, a rule of thumb is that picking late target positions minimizes the chances of placeholder contamination, and is therefore more likely to work than other alternatives. If we aim to ask more complex questions, and therefore use more complicated inspection prompts, prior knowledge about the model’s information flow may be required, or we might need to test many configuration options before finding the most effective one. Coming up with effective Patchscope configurations automatically would make this framework much more powerful, and is a research direction we are currently exploring.
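These rules of thumb can be summarized in a couple of lines of illustrative Python (heuristics only, not guarantees; the function names are ours):

```python
# Heuristic defaults for a patching configuration when the source and target
# model are the same (an illustrative encoding of the rules of thumb above).
def default_target_layer(source_layer: int, open_ended_generation: bool) -> int:
    # For expressive, open-ended generation (e.g., entity descriptions), the
    # case studies patch into the first target layer; otherwise, reusing the
    # source layer index is a reasonable starting point.
    return 0 if open_ended_generation else source_layer

def default_target_position(prompt_length: int) -> int:
    # Late target positions minimize the chance of placeholder contamination.
    return prompt_length - 1
```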

Finally, a full Patchscopes configuration allows for the application of transforms to the hidden representation between the extraction and injection steps. The examples in this Explorable only employ the identity function as the transform, but the paper shows that, surprisingly, it is indeed possible to use a larger and more expressive model from the same family to explain a smaller model using an affine transformation function.

But what about different hidden representation dimensions, models from different families, or even models trained on different modalities of data? With more complex transformations, how far can we push the limits in cross-model Patchscopes? Answering these questions requires more research, and we continue to explore these ideas.

We encourage you to check out our paper or reach out to us on GitHub for more information about Patchscopes.

Credits

Thanks to Avi Caciularu, Carter Blum, Mor Geva, Mahima Pushkarna, Ian Tenney, Fernanda Viegas, Martin Wattenberg, David Weinberger, and James Wexler for their help with this piece.

Appendix A: A brief review of transformers

LLMs are transformer models that take as input a text string, called a prompt, that describes a task, such as answering a question, summarizing an article, translating text, or generating code snippets. The LLM’s job is to generate the text that best follows from the task captured in the prompt. For an illustrative example, consider the prompt: “United Kingdom”.

This input text is tokenized – broken up into elements of the model’s atomic vocabulary, consisting of individual words and short sequences of characters.

Each token is embedded into a high-dimensional numeric vector to facilitate computation. We represent this operation with a trapezoid. The embedded vector is the first hidden representation in the model.
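As a concrete illustration (reusing the hypothetical tokenizer and model from the sketches earlier in this post; the exact token split and hidden size depend on the actual model), tokenization and embedding can be inspected like this:

```python
# Tokenize the example prompt and embed each token id into a vector.
import torch

ids = tokenizer("United Kingdom", return_tensors="pt")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))  # token strings; tokenizer-dependent

embedding = model.get_input_embeddings()  # the trapezoid in the diagrams
h0 = embedding(ids)                       # shape: (1, num_tokens, hidden_dim)
print(h0.shape)
```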

After embedding, LLMs are organized into layers of transformer blocks. Each layer produces an updated hidden representation based on the output of the preceding layer, which we visualize with a circle labeled $\ell^i$, where $i$ stands for the layer index. The hidden representation produced by the first transformer layer is labeled $\ell^0$.

This procedure continues through the final layer, gradually updating the hidden representations to incorporate context about the task described in the prompt.

Next, subsequent prompt tokens are fed into the network. Tokens are stored in an ordered sequence, and the hidden representation produced by a layer for a given token is influenced by the hidden representation for that token at the preceding layer as well as the hidden representations for all preceding tokens in the sequence. Networks that only look backwards like this are called decoder-only models and are our focus here.

After the final prompt token is processed, the final layer’s hidden representation is used to yield an output token based on the distribution of the model’s training data. This output token, “is”, becomes the next input to the network.

By feeding generated tokens back in as input, the model can generate phrases, sentences, and even long-form text.
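A minimal greedy decoding loop makes this feedback explicit; the sketch below assumes the same hypothetical tokenizer and causal LM as the earlier sketches:

```python
# A minimal greedy decoding loop: the model's most likely next token is
# appended to the input and fed back in, one step at a time.
import torch

def greedy_generate(prompt: str, max_new_tokens: int = 10) -> str:
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=ids).logits      # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()              # greedy choice of next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(greedy_generate("United Kingdom"))
```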

For a more in-depth, visual exploration of transformers and the attention mechanism, see the related YouTube videos from 3Blue1Brown.

Appendix B: Formal description of Patchscopes

Given a hidden representation obtained from an LLM inference pass, a Patchscope instance decodes specific information from it by patching it into a different inference pass (of the same or a different LLM) that encourages the translation of that specific information.

Formally, given an input sequence of $n$ tokens $S = \langle s_1, \ldots, s_n \rangle$ and a model $M$ with $L$ layers, $\bm{h}_{i}^{\ell}$ denotes the hidden representation obtained at layer $\ell \in [1, \ldots, L]$ and position $i \in [1, \ldots, n]$ when running $M$ on $S$. To inspect $\bm{h}_{i}^{\ell}$, we consider a separate inference pass of a model $M^*$ with $L^*$ layers on a target sequence $T = \langle t_1, \ldots, t_m \rangle$ of $m$ tokens. Specifically, we choose a hidden representation $\bar{\bm{h}}_{i^*}^{\ell^*}$ at layer $\ell^* \in [1, \ldots, L^*]$ and position $i^* \in [1, \ldots, m]$ in the execution of $M^*$ on $T$. Moreover, we define a mapping function $f(\bm{h}; \bm{\theta}): \mathbb{R}^{d} \mapsto \mathbb{R}^{d^*}$, parameterized by $\bm{\theta}$, that operates on hidden representations of $M$, where $d$ and $d^*$ denote the hidden dimension of representations in $M$ and $M^*$, respectively. This function can be the identity function, a linear or affine function learned on task-specific pairs of representations, or even more complex functions that incorporate other sources of data.

The patching operation refers to dynamically replacing the representation $\bar{\bm{h}}_{i^*}^{\ell^*}$ during the inference of $M^*$ on $T$ with $f(\bm{h}_{i}^{\ell})$. Namely, by applying $\bar{\bm{h}}_{i^*}^{\ell^*} \leftarrow f(\bm{h}_{i}^{\ell})$, we intervene on the generation process and modify the computation after layer $\ell^*$.

Overall, a Patchscope intervention applied to a representation determined by $(S, i, M, \ell)$ is defined by a quintuplet $(T, i^*, f, M^*, \ell^*)$ of an inspection prompt $T$, a target position $i^*$ in this prompt, a mapping function $f$, a target model $M^*$, and a target layer $\ell^*$ of this model. It is possible that $M$ and $M^*$ are the same model, $S$ and $T$ are the same prompt, and $f$ is the identity function $\mathbb{I}$ (i.e., $\mathbb{I}(\bm{h}) = \bm{h}$).
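Read literally, this definition maps onto a small configuration object. The sketch below is our own illustrative transcription, not an API from the paper:

```python
# A direct, illustrative transcription of the formal definition: a source
# representation is determined by (S, i, M, ℓ), and a Patchscope by the
# quintuplet (T, i*, f, M*, ℓ*). Field names are ours, not the paper's.
from dataclasses import dataclass
from typing import Callable
import torch

def identity(h: torch.Tensor) -> torch.Tensor:
    """The identity mapping f = I, i.e., I(h) = h."""
    return h

@dataclass
class PatchscopeConfig:
    # Source: (S, i, ℓ); the source model M is supplied at execution time.
    source_prompt: str    # S
    source_position: int  # i
    source_layer: int     # ℓ
    # Target: (T, i*, f, M*, ℓ*); the target model M* may equal M.
    target_prompt: str    # T
    target_position: int  # i*
    target_layer: int     # ℓ*
    mapping: Callable[[torch.Tensor], torch.Tensor] = identity  # f
```

The entity-description Patchscope from the first case study, for example, corresponds to a configuration whose target prompt is the few-shot inspection prompt, whose target layer is the first layer, and whose mapping is the identity.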

Footnotes

Equal contribution.

Equal contribution.

For example, Anthropic and OpenAI recently published work exploring how to disentangle activation vectors and think about using them to control model behavior, and Google’s Responsible Generative AI Toolkit includes an interpretability-based prompt-debugger to help identify mistakes in the prompts fed to LLMs.

To score each generation, the following prompt was used with PaLM 2 Text Unicorn as the model: How semantically similar are the following texts: <text_A> {description} </text_A> <text_B> {generation} </text_B> Both these texts try to explain the following entity: <entity> {entity} </entity> Provide an integer rating between 1 and 10. 1 refers to ‘not similar at all’, and 10 refers to ‘extremely similar’: <label>

The full datasets are available on GitHub. While some of these examples are contrived to ease research evaluations, they do demonstrate the principle that rerouting hidden representations allows the model to generate more accurate conclusions.

Note that the model can sometimes output incorrect information about the entity it is describing.

To get a more robust estimate of accuracy for patching, we average across five different source prompts for each example. A patching instance was considered correct if the correct answer appeared in any of the first 20 generated outputs of any target layer.

As we saw in the prior case study, information about the input is more readily accessible in early to mid layers, before the model shifts toward next-token prediction. This explains why the performance of this simple Patchscope can drop for later source layers. In the open questions section, we will discuss some strategies to improve attribute extraction accuracy in later layers by adjusting the Patchscope configurations.

We want to caution that our primary goal in this section is not to devise a new method for solving multi-hop queries that necessarily competes with CoT, but rather to provide a proof of concept, using CoT as a common reference.

A backward-acting Patchscope extracts the hidden representation from one token and injects it into a preceding token in the prompt, thus this is only possible when the source and inspection prompts are the same. While the exact implications of patch directionality are unknown, we believe that this technique will be useful for understanding the information flow inside the model and may aid in identifying reasoning circuits, for example.

To generate the entity descriptions, we use the same inspection prompt as seen in the entity description case study: “Syria: Country in the Middle East, Leonardo DiCaprio: American actor, Samsung: South Korean multinational major appliance and consumer electronics corporation, x”. For each multi-hop query, we patch the hidden representation at the last token location of the selected source layer in place of the ‘x’ token of the inspection prompt at the last target layer. This generates the entity description of the hidden representation – here we are looking for some description of the correct country of origin as a checkpoint to determine whether the model will be able to answer the multi-hop query.

Sometimes the remaining representation from the placeholder token in the early layers interferes with future token generations. We call this phenomenon placeholder contamination. It is only relevant if one is interested in generating more than the next immediate token, and it is more likely to happen if the target location is in the later layers.

Mechanistic interpretability work that focuses on circuits is one way of understanding information flow in a model. Prior work related to indirect object identification and factual associations are some concrete examples of how to provide precise localization guidelines for patching configurations. Conversely, we expect that investigations that use Patchscopes may result in the identification of new circuits in LLMs.

Tokenization strategies are complex and highly variable. Models are trained to work with specific tokenizers, and tokenizations may be incompatible across models.

The models we explore in this post use a deterministic embedding process to transform the token vector into the high-dimensional space of hidden representations. Most recent LLMs—including Gemma, Llama, and Mistral—use a single embedding layer for this task.

Technically speaking, transformer models don’t just operate on prompt tokens one at a time. Rather, they process prompt tokens in parallel for efficiency. We visualize them here as entering sequentially to reinforce the idea that in decoder-only models, the processing of later tokens can only be influenced by earlier tokens.

For simplicity, this post only covers case studies where the source and target model are the same. However, they can be different, as we briefly discuss in the open questions section.

Patchscopes supports applying arbitrary transformations to the hidden representation from the source location before it is injected into the target location. The examples in this post use the identity function as this transform (i.e., it leaves the representation unchanged), but complex transforms may be necessary when patching across different models, or when inspecting more complex functions.

References

Understanding intermediate layers using linear classifier probes Alain, G. and Bengio, Y., 5th International Conference on Learning Representations, Workshop Track Proceedings, 2017.

Probing Classifiers: Promises, Shortcomings, and Advances Belinkov, Y. Computational Linguistics, 48(1):207–219, 2022.

Eliciting Latent Predictions from Transformers with the Tuned Lens Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J., arXiv preprint arXiv:2303.08112, 2023.

Language Models are Few-Shot Learners Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., arXiv preprint arXiv:2005.14165, 2020.

SelfIE: Self-Interpretation of Large Language Model Embeddings Chen, H., Vondrick, C., and Mao, C., arXiv preprint arXiv:2403.10949, 2024.

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Ghandeharioun, A.*, Caciularu, A.*, Pearce, A., Dixon, L., Geva, M. ICML (to appear), 2024.

Who’s asking? User personas and the mechanics of latent misalignment. Ghandeharioun, A., Yuan, A., Guerard, M., Reif, E., Lepori, M.A., and Dixon, L., arXiv preprint arXiv:2406.12094, 2024.

What do tokens know about their characters and how do they know it? Kaushal, A., Mahowald, K. 2022. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2487–2507, Seattle, United States. Association for Computational Linguistics.

Locating and editing factual associations in GPT Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.

Introducing Meta Llama 3: The most capable openly available LLM to date. Meta, 2024.

Interpreting GPT: the logit lens. nostalgebraist, LessWrong, 2020.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Röttger, P., Kirk, H.R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D., arXiv preprint arXiv:2308.01263, 2023.

Attention in transformers, visually explained Sanderson, G., 3Blue1Brown, 2024.

But what is a GPT? Visual intro to transformers Sanderson, G., 3Blue1Brown, 2024.

BERT Rediscovers the Classical NLP Pipeline Tenney, I., Das, D., Pavlick, E. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593-4601. 2019.

Attention is all you need Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. Advances in Neural Information Processing Systems, 30, 2017.

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives Voita, E., Sennrich, R., Titov, I. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D. Advances in Neural Information Processing Systems 35, 2022.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. and Zhang, H. Advances in Neural Information Processing Systems 36, 2024.
