Extracting interpretable algorithms internally implemented by trained Transformers.
Do trained Transformers internally implement short interpretable algorithms? Can we rewrite Transformers into a format understandable to humans?
It turns out that, at least in a controlled setting of small models and simple tasks, we can. In this paper, we rewrite length-generalizing Transformers trained on algorithmic and formal-language tasks into interpretable programs in the D-RASP programming language.
D-RASP is a dialect of RASP [1], a programming language that simulates the Transformer architecture. RASP programs have previously been used to explain why models length-generalize [2, 3]: if a task can be expressed as a short program in C-RASP, another RASP dialect, a Transformer trained on it is more likely to generalize from short training sequences to longer, unseen inputs. However, it has remained unclear whether trained Transformers actually implement the programs predicted by C-RASP.
Until now, there was no way to test this, because no method existed for extracting a RASP program from a trained Transformer: any RASP program can be compiled into a Transformer [4], but not vice versa.
We present a method that closes this gap. It builds on (i) a reparameterization of the model's internal computations and (ii) circuit discovery. Unlike classical circuit discovery, our method yields universal algorithms that are not tied to specific prompt templates or input lengths, and it inherently provides interpretations for program operations.
Our results demonstrate that length-generalizing Transformers can internally implement simple interpretable algorithms and that these algorithms can be extracted automatically as readable code.
Our decompilation method consists of two key steps: (1) reparameterize the model's internal computations as an equivalent program in D-RASP, and (2) prune this program via circuit discovery until only the operations needed to match the model's behavior remain.
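The pruning step can be pictured as a greedy search. The sketch below is a hypothetical simplification (the paper's actual circuit-discovery procedure may differ, and `ops` / `match_accuracy` are illustrative placeholders): drop an operation whenever the pruned program still matches the model's outputs closely enough.

```python
def prune_program(ops, match_accuracy, threshold=0.99):
    """Greedily drop operations while match accuracy stays high.

    Hypothetical interface: `match_accuracy(candidate)` returns the
    fraction of inputs on which the candidate program's outputs match
    the original model's outputs.
    """
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        # Try removing each operation, scanning from the end so indices
        # remain valid after a removal.
        for i in range(len(ops) - 1, -1, -1):
            candidate = ops[:i] + ops[i + 1:]
            if match_accuracy(candidate) >= threshold:
                ops = candidate  # removal kept: program still matches the model
                changed = True
    return ops
```

Sweeping `threshold` trades off match accuracy against program size, which is what produces a Pareto frontier of pruned programs for each model.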
Results show that models that generalize well to input lengths unseen during training can often be decompiled, while models that do not length-generalize cannot. Moreover, when decompilation succeeds, we can usually cut the program size vastly compared to the initial D-RASP reparameterization. In contrast, for non-length-generalizing models, pruning usually has only limited success. Interestingly, then, even though all models are Transformers of similar sizes, generalizing models tend to be more interpretable.
Decompilation drastically reduces program complexity for length-generalizing models.
Pruning has limited success for non-length-generalizing models.
Match Accuracy vs. Program Size for length-generalizing (left) and non-length-generalizing (right) models. Each red cross is a model pruned with a different set of hyperparameters; stars are the original unpruned models, and lines connect different pruned versions of the same model.
We show a program extracted for finding the most frequent character in a string. In Line 1, the program uniformly aggregates all tokens in the input sequence; a1 thus holds each token's relative frequency in the context, which makes the token with the highest value in a1 the majority token. This is why, in Line 2, a1 is directly projected to the vocabulary space with a diagonal projection matrix to produce the answer.
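The two-line program can be mirrored in NumPy as a sanity check; the names below (`majority_token`, `vocab`) are illustrative, not actual D-RASP syntax:

```python
import numpy as np

def majority_token(tokens, vocab):
    # One-hot encode the sequence: shape (seq_len, vocab_size).
    one_hot = np.eye(len(vocab))[[vocab.index(t) for t in tokens]]
    # Line 1: uniform aggregation over all positions, so a1 holds each
    # token's relative frequency in the context.
    a1 = one_hot.mean(axis=0)
    # Line 2: diagonal projection to vocabulary space; the largest logit
    # belongs to the majority token.
    return vocab[int(np.argmax(a1))]

print(majority_token(list("abacab"), ["a", "b", "c"]))  # prints "a"
```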
We show a program extracted from a Transformer trained to copy a string in which each token is unique. In this case, the model implements a classic "induction head" [5]. The first select+aggregate operation retrieves the directly preceding symbol for each position; the second one retrieves whichever prior symbol was preceded by the current symbol.
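In Python terms, the two attention steps amount to the following plain-loop sketch (an analogue of the mechanism, not the extracted D-RASP code itself):

```python
def induction_predict(tokens):
    """Predict the next token at each position the way an induction head
    does: find an earlier occurrence of the current symbol and output
    the symbol that followed it. Returns None where no match exists."""
    # Step 1: shift right, so each position sees its directly preceding symbol.
    prev = [None] + tokens[:-1]
    predictions = []
    for i, current in enumerate(tokens):
        match = None
        # Step 2: attend to earlier positions whose *preceding* symbol
        # equals the current symbol; their own symbol is the prediction.
        for j in range(i):
            if prev[j] == current:
                match = tokens[j]
        predictions.append(match)
    return predictions
```

On "abcdab", the positions holding the repeated "a" and "b" correctly predict "b" and "c", continuing the copy.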
We show an extracted program for sorting a list of integers. Here, the heavy lifting is done by a select operation in Line 1, favoring the smallest keys larger than the query. A per-position operation in Line 3 then makes the aggregated histogram vector one-hot.
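Sorting with the select/aggregate vocabulary can also be sketched in NumPy. The sketch below uses a rank-based select predicate rather than the extracted program's smallest-larger-key select, so it is an illustrative analogue, not the decompiled program:

```python
import numpy as np

def sort_via_select(xs):
    x = np.array(xs)
    pos = np.arange(len(x))
    # Select: position i attends to position j when x[j] sorts before x[i]
    # (strictly smaller, or equal with an earlier position to break ties).
    select = (x[None, :] < x[:, None]) | (
        (x[None, :] == x[:, None]) & (pos[None, :] < pos[:, None])
    )
    # Aggregating a constant 1 over the selected positions yields each
    # element's rank: how many elements precede it in sorted order.
    rank = select.sum(axis=1)
    # Per-position step: scatter each element to its rank (cf. the
    # one-hot operation in Line 3).
    out = np.empty_like(x)
    out[rank] = x
    return out.tolist()
```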
We show a program for Dyck-4, the language of well-nested strings over one pair of parentheses (opening "a" and closing "b") with depth at most four (equivalently, the formal language (a(a(a(ab)*b)*b)*b)*). Uniform aggregation in Line 1 creates a histogram of BOS, opening, and closing brackets; a per-position activation in Line 2 then performs a threshold calculation: e.g., EOS is allowed only when "a" and "b" are balanced. Look at the MLP transformation explained in Line 3: the EOS logit is positive when "a" and "b" are balanced, and the "b" logit is positive only when "a" and "b" are unbalanced.
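The threshold logic can be written out directly. The sketch below (with hypothetical names) tracks the running bracket balance, which is exactly the information the histogram from Line 1 makes available:

```python
def next_allowed(prefix, max_depth=4):
    """Return which symbols may follow a Dyck-4 prefix over "a"/"b".

    The running balance (#a - #b) is the current nesting depth; it can be
    read off the histogram of opening and closing brackets.
    """
    depth = sum(1 if c == "a" else -1 for c in prefix)
    allowed = []
    if depth < max_depth:
        allowed.append("a")    # may open unless already at depth 4
    if depth > 0:
        allowed.append("b")    # "b" logit positive only when unbalanced
    if depth == 0:
        allowed.append("EOS")  # EOS logit positive only when balanced
    return allowed
```

For example, after "aaaa" the only legal continuation is "b", while after "ab" the string may either reopen with "a" or end.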
Select a task to see programs for the model trained on it. Click on different pruning runs on the Pareto frontier to see more or less sparse programs. Click on individual code lines to see both activation values and explained operations.
[1] Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking Like Transformers. ICML 2021.
[2] Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J. M., Bengio, S., and Nakkiran, P. (2024). What Algorithms can Transformers Learn? A Study in Length Generalization.
[3] Huang, X., Yang, A., Bhattamishra, S., Sarrof, Y., Krebs, A., Zhou, H., Nakkiran, P., and Hahn, M. (2025). A formal framework for understanding length generalization in transformers.
[4] Lindner, D., Kramár, J., Farquhar, S., Rahtz, M., McGrath, T., and Mikulik, V. (2023). Tracr: Compiled transformers as a laboratory for interpretability.
[5] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and induction heads.