The problem of understanding a neural network is a little bit like reverse engineering a large compiled binary of a computer program. In this analogy, the weights of the neural network are the compiled assembly instructions. At the end of the day, the weights are the fundamental thing you want to understand: how does this sequence of convolutions and matrix multiplications give rise to model behavior?
Trying to understand artificial neural networks also has a lot in common with neuroscience, which tries to understand biological neural networks. As you may know, one major endeavor in modern neuroscience is mapping the connectomes of biological neural networks: which neurons connect to which. These connections, however, will only tell neuroscientists which weights are non-zero. Getting the weights – knowing whether a connection excites or inhibits, and by how much – would be a significant further step. One imagines neuroscientists might give a great deal to have the access to weights that those of us studying artificial neural networks get for free.
And so, it’s rather surprising how little attention we actually give to looking at the weights of neural networks. There are a few exceptions to this, of course. It’s quite common for researchers to show pictures of the first layer weights in vision models.
In this article, we’re focusing on visualizing weights. But people often visualize activations, attributions, gradients, and much more. How should we think about the meaning of visualizing these different objects?
It seems to us that there are three main barriers to making sense of the weights in neural networks, which may help explain why researchers tend not to inspect them directly:
Many of the methods we’ll use to address these problems were previously explored in Building Blocks in the context of understanding activation vectors.
Interpretability methods often fail to take off because they’re hard to use. So before diving into sophisticated approaches, we wanted to offer a simple, easy-to-apply method.
In a convolutional network, the input weights for a given neuron have shape [width, height, input_channels]. Unless this is the first convolutional layer, these probably can’t be visualized directly, because input_channels is large. (If this is the first convolutional layer, visualize it as is!) However, one can use dimensionality reduction to collapse input_channels down to 3 dimensions, which can then be mapped to the red, green, and blue channels of an image. We find one-sided NMF especially effective for this.
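To make this concrete, here is a rough sketch of the reduction step (not the code from the accompanying notebooks). Since standard NMF requires non-negative inputs and weights can be negative, the sketch implements a simple one-sided variant via projected alternating least squares, constraining only the spatial factor to be non-negative; the weight tensor is a random stand-in.

```python
import numpy as np

def one_sided_nmf(A, k=3, iters=200, seed=0):
    # Factor A (positions x channels) as S @ F with S >= 0 and F unconstrained.
    # A rough projected alternating-least-squares sketch, not the authors' code.
    rng = np.random.default_rng(seed)
    S = rng.random((A.shape[0], k))
    for _ in range(iters):
        F = np.linalg.lstsq(S, A, rcond=None)[0]        # unconstrained factor
        S = np.linalg.lstsq(F.T, A.T, rcond=None)[0].T  # least squares for S...
        S = np.clip(S, 0.0, None)                       # ...projected onto S >= 0
    return S, F

# Hypothetical input weights for one neuron: [height, width, input_channels].
w = np.random.randn(5, 5, 256)
S, F = one_sided_nmf(w.reshape(-1, 256), k=3)

# The three spatial factors can be displayed as red, green, and blue channels.
rgb = S.reshape(5, 5, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-8)
```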
This visualization doesn’t tell you very much about what your weights are doing in the context of the larger model, but it does show you that they are learning nice spatial structures. This can be an easy sanity check that your neurons are learning, and a first step towards understanding your neuron’s behavior. We’ll also see later that this general approach of factoring weights can be extended into a powerful tool for studying neurons.
Despite this lack of contextualization, one-sided NMF can be a great technique for investigating multiple channels at a glance. One thing you may quickly discover using this method is that, in models with global average pooling at the end of their convolutional layers, the last few layers will have all their weights be horizontal bands.
Figure 2: Horizontally-banded weights in InceptionV1 mixed5b_5x5, for a selection of eight neurons. As in Figure 1, the red, green, and blue channels on each grid indicate the weights for each of the 3 NMF factors.
We call this phenomenon weight banding. One-sided NMF allows for quickly testing and validating hypotheses about phenomena such as weight banding.
Of course, looking at weights in a vacuum isn’t very interesting. In order to really understand what’s going on, we need to situate the weights in the broader context of the network.
Recall that the weights between two convolutional layers are a four-dimensional array of shape [relative x position, relative y position, input channels, output channels].
If we fix the input channel and the output channel, we get a 2D array we can present with traditional data visualization. Let’s assume we know which neuron we’re interested in understanding, so we have the output channel. We can then pick the input channels with the highest-magnitude weights to that output channel.
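As a minimal illustration (with a hypothetical weight tensor and channel index), one might slice out and rank those 2D arrays like this:

```python
import numpy as np

# Hypothetical weights between two conv layers:
# [relative x, relative y, input_channels, output_channels].
W = np.random.randn(5, 5, 256, 480)

out_channel = 379                       # the neuron we want to understand
w_out = W[:, :, :, out_channel]         # shape [5, 5, 256]

# Rank input channels by the total magnitude of their 2D kernels.
strength = np.abs(w_out).sum(axis=(0, 1))
top_inputs = np.argsort(-strength)[:5]

for i in top_inputs:
    kernel = w_out[:, :, i]             # a [5, 5] array we can plot directly
    print(i, round(kernel.min(), 3), round(kernel.max(), 3))
```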
But what does the input represent? What about the output?
The key trick is that techniques like feature visualization give us a way to see what the input and output neurons represent, so we can attach meaning to either end of a weight.
This approach is the weight analogue of using feature visualizations to contextualize activation vectors in Building Blocks (see the section titled “Making Sense of Hidden Layers”).
We can liken this to how, when reverse-engineering a normal compiled computer program, one would need to start assigning variable names to the values stored in registers to keep track of them. Feature visualizations are essentially automatic variable names for neurons, which are roughly analogous to those registers or variables.
Of course, neurons have multiple inputs, and it can be helpful to show the weights to several inputs at a time as a small multiple:
Figure 4: Small multiple weights for mixed3b 342.
And if we have two families of related neurons interacting, it can sometimes even be helpful to show the weights between all of them as a grid of small multiples:
Figure 5: Small multiple weights for a variety of curve detectors.
Although we most often use feature visualization to visualize neurons, we can visualize any direction (linear combination of neurons). This opens up a very wide space of possibilities for visualizing weights, of which we’ll explore a couple particularly useful ones.
Recall that the weights for a single neuron have shape [width, height, input_channels]. In the previous section we split up input_channels and visualized each [width, height] matrix. An alternative approach is to treat the weights at each spatial position as a vector over the input neurons, and to apply feature visualization to each of those vectors. You can think of this as showing what the weights at that position are collectively looking for.
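A small sketch of what that means in practice (the weight tensor is a random stand-in): each spatial position yields one direction vector over the previous layer’s neurons, which can then be handed to a feature-visualization objective.

```python
import numpy as np

# Hypothetical input weights for one neuron: [height, width, input_channels].
w = np.random.randn(5, 5, 256)

# One direction vector over the previous layer's neurons per spatial position.
directions = {
    (y, x): w[y, x, :]
    for y in range(w.shape[0])
    for x in range(w.shape[1])
}

# Each vector would then be optimized against as a direction objective in a
# feature-visualization library, producing one icon per grid cell.
```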
This visualization is the weight analogue of the “Activation Grid” visualization from Building Blocks. It can be a nice, high-density way to get an overview of what the weights for one neuron are doing. However, it will be unable to capture cases where one position responds to multiple very different things, as in a multi-faceted or polysemantic neuron.
Feature visualization can also be applied to factorizations of the weights, which we briefly discussed earlier. This is the weight analogue to the “Neuron Groups” visualization from Building Blocks.
This can be especially helpful when you have a group of neurons, like high-low frequency detectors or black and white vs color detectors, that are all mostly looking for a small number of factors. For example, a large number of high-low frequency detectors can largely be understood as combining just two factors – a high-frequency factor and a low-frequency factor – in different patterns.
Figure: The two weight factors used by high-low frequency detectors, shown with feature visualizations in terms of conv2d2.
These factors can then be decomposed into individual neurons for more detailed understanding.
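As a rough sketch of this kind of group-level factorization (shapes hypothetical, and using standard NMF on weight magnitudes as a stand-in for the one-sided variant used above):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical weights into a group of related neurons (e.g. several
# high-low frequency detectors): [height, width, input_channels, n_group].
W_group = np.abs(np.random.randn(5, 5, 256, 12))      # magnitudes, so NMF applies

h, w, c, n = W_group.shape
flat = W_group.transpose(3, 0, 1, 2).reshape(n, -1)   # one row per neuron

nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
coeffs = nmf.fit_transform(flat)                  # how much each neuron uses each factor
factors = nmf.components_.reshape(2, h, w, c)     # two shared weight templates

# Each factor's [h, w, c] template can itself be visualized, and `coeffs`
# shows how individual neurons combine the two factors.
```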
As we mentioned earlier, sometimes the meaningful weight interactions are between neurons which aren’t literally adjacent in a neural network, or where the weights aren’t directly represented in a single weight tensor. A few examples:
As a result, we often work with “expanded weights” – that is, the result of multiplying adjacent weight matrices, potentially ignoring non-linearities. We generally implement expanded weights by taking gradients through our model, ignoring or replacing all non-linear operations with the closest linear one.
These expanded weights have the following properties:
They also have one additional benefit, which is more of an implementation detail: because they’re implemented in terms of gradients, you don’t need to know how the weights are represented. For example, in TensorFlow, you don’t need to know which variable object represents the weights. This can be a significant convenience when you’re working with unfamiliar models!
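To make the gradient trick concrete, here is a minimal sketch with a toy two-layer stack (not InceptionV1), where the intermediate non-linearity is simply dropped:

```python
import torch
import torch.nn as nn

# Toy stand-ins for two adjacent conv layers of a real model.
conv_a = nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=False)
conv_b = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)

def expanded_weights(out_channel, size=5):
    """Gradient of one output channel's center activation with respect to the
    input, with the intermediate non-linearity replaced by the identity."""
    x = torch.zeros(1, 8, size, size, requires_grad=True)
    h = conv_a(x)              # a ReLU would normally follow; we drop it
    y = conv_b(h)
    center = size // 2
    y[0, out_channel, center, center].backward()
    # [8, size, size]: effective linear weights from each input channel and
    # spatial offset to the chosen output channel, two layers away.
    return x.grad[0].detach()

print(expanded_weights(out_channel=0).shape)   # torch.Size([8, 5, 5])
```

Because everything left is linear, the gradient at a zero input is exactly the product of the two convolution kernels, and the same code works no matter how the model stores its weight variables.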
Multiplying out the weights like this can sometimes help us see a simpler underlying structure. For example, mixed3b 208 is a black and white center detector. It’s built by combining a bunch of black and white vs color detectors together.
Figure 9: mixed3b 208, along with five neurons from mixed3a that contribute the strongest weights to it.
Expanding out the weights allows us to see an important aggregate effect of these connections: together, they look for the absence of color in the center one layer further back.
Figure 10: Top eighteen expanded weights from conv2d2 to mixed3b 208, organized in two rows according to weight factorization.
A particularly important use of this method – which we’ve been implicitly using in earlier examples – is to jump over “bottleneck layers.” Bottleneck layers are layers of the network which squeeze the number of channels down to a much smaller number, typically in a branch, making large spatial convolutions cheaper. The bottleneck layers of InceptionV1 are one example. Since so much information is compressed into them, these layers are often polysemantic, and it is often more helpful to jump over them and understand the connection to the wider layer before them.
Expanded weights can, of course, be misleading when non-linear structure is important. A good example of this is boundary detectors. Recall that boundary detectors usually detect both low-to-high and high-to-low frequency transitions:
Figure 11: Boundary detectors such as mixed3b 345 detect both low-to-high and high-to-low frequency transitions.
Since high-low frequency detectors are usually excited by high-frequency patterns on one side and inhibited on the other (and vice versa for low frequency), detecting both directions means that the expanded weights cancel out! As a result, the expanded weights appear to show that boundary detectors are neither excited nor inhibited by high-frequency detectors two layers back, when in fact they are both excited and inhibited by high frequency depending on the context; it’s just that those two cases cancel out.
Figure 12: Neurons two layers back (such as conv2d2 89) may have a strong influence on the high-low frequency detectors that contribute to mixed3b 345 (top), but that influence washes out when we look at the expanded weights (bottom) directly between conv2d2 89 and mixed3b 345.
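A toy numeric illustration of this cancellation, with made-up weights:

```python
import numpy as np

# A high-frequency unit `a` feeds two high-low frequency detectors with
# opposite orientations, which both excite a boundary detector `c`.
w_a_to_h = np.array([1.0, -1.0])   # a -> (h1, h2): excites one, inhibits the other
w_h_to_c = np.array([1.0,  1.0])   # (h1, h2) -> c: both excite c

expanded = w_h_to_c @ w_a_to_h     # linearized a -> c weight
print(expanded)                    # 0.0: the two paths cancel exactly, even though
                                   # `a` strongly affects `c` through each of them
```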
More sophisticated techniques for describing multi-layer interactions can help us understand cases like this. For example, one can determine what the “best case” excitation interaction between two neurons is (that is, the maximum achievable gradient between them). Or you can look at the gradient for a particular example. Or you can factor the gradient over many examples to determine major possible cases. These are all useful techniques, but we’ll leave them for a future article to discuss.
One qualitative property of expanding weights across many layers deserves mention before we end our discussion of them. Expanded weights often have smooth, “electron orbital”-like spatial structures:
Figure 13: Smooth spatial structure of some expanded weights from mixed3b 268 to conv2d1.
Although the exact structures present may vary from neuron to neuron, this example is not cherry-picked: this smoothness is typical of most multiple-layer expanded weights.
It’s not clear how to interpret this, but it’s suggestive of rich spatial structure on the scale of multiple layers.
So far, we’ve addressed the challenges of contextualization and indirect interactions. But we’ve only given a bit of attention to our third challenge: dimensionality and scale. Neural networks contain many neurons, and each one connects to many others, creating a huge number of weights. How do we pick which connections between neurons to look at?
For the purposes of this article, we’ll put the question of which neurons we want to study outside of our scope, and only discuss the problem of picking which connections to study. (We may be trying to comprehensively study a model, in which case we want to study all neurons. But we might also, for example, be trying to study neurons we’ve determined to be related to some narrower aspect of model behavior.)
Generally, we choose to look at the largest weights, as we did at the beginning of the section on contextualization. Unfortunately, there tends to be a long tail of small weights, and at some point it becomes impractical to look at them all. How much of the story is really hiding in these small weights? We don’t know, but polysemantic neurons suggest there could be a very important and subtle story hiding here! There’s some hope that sparse neural networks might make this much better by getting rid of small weights, but whether such conclusions can be drawn about non-sparse networks is presently speculative.
An alternative strategy that we’ve touched on a few times is to reduce the weights to a small number of components and then study those factors (for example, with NMF). Often, a very small number of components can explain much of the variance. In fact, sometimes a small number of factors can explain the weights of an entire set of neurons! Prominent examples of this are high-low frequency detectors (as we saw earlier) and black and white vs color detectors.
However, this approach also has downsides. Firstly, these components can be harder to understand and even polysemantic. For example, if you apply the basic version of this method to a boundary detector, one component will contain both high-to-low and low-to-high frequency detectors which will make it hard to analyze. Secondly, your factors no longer align with activation functions, which makes analysis much messier. Finally, because you will be reasoning about every neuron in a different basis, it is difficult to build a bigger picture view of the model unless you convert your components back to neurons.
This article grew out of a document that Chris Olah wrote to explain our techniques for visualizing weights.
Research. The need to visualize weights is something we encounter frequently in our research, and our techniques have been refined across many investigations of features and circuits, so it is difficult to fully separate out all contributions towards improving those techniques.
Many people “test drove” these visualization methods, and a lot of our practical knowledge of using them to study circuits came from that. For example, the curve detector examples used in Small Multiples are due to Nick Cammarata’s work investigating curve detectors. Gabe Goh performed experiments that moved Visualizing Spatial Position Weights forward. The high-low frequency detector example and NMF factors used in Visualizing Weight Factors are due to experiments performed by Ludwig Schubert, and the weight banding examples in Aside: One Simple Trick are due to experiments run by Michael Petrov.
Writing and Diagrams. Chris wrote the article and developed the designs for its original figures. Chelsea Voss ported the article to Distill, upgraded the diagrams for the new format, edited some text, and developed Figure 12, and Chris provided feedback and guidance throughout.
Code. Chris authored the TensorFlow (Lucid) notebook, and Ben Egan and Swee Kiat Lim authored the PyTorch (Captum) notebook.
We are grateful to participants of #circuits in the Distill slack for their early comments and engagement with these concepts, including Kenneth Co, Humza Iqbal, and Vincent Tjeng. We are also grateful to Daniel Filan, Humza Iqbal, Stefan Sietzen, and Vincent Tjeng for remarks on a draft.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Voss, et al., "Visualizing Weights", Distill, 2021.
BibTeX citation
@article{voss2021visualizing,
  author = {Voss, Chelsea and Cammarata, Nick and Goh, Gabriel and Petrov, Michael and Schubert, Ludwig and Egan, Ben and Lim, Swee Kiat and Olah, Chris},
  title = {Visualizing Weights},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/visualizing-weights},
  doi = {10.23915/distill.00024.007}
}