Open up any ImageNet conv net and look at the weights in the last layer. You’ll find a uniform spatial pattern to them, dramatically unlike anything we see elsewhere in the network. No individual weight is unusual, but the uniformity is so striking that when we first discovered it we thought it must be a bug. Just as different biological tissue types jump out as distinct under a microscope, the weights in this final layer jump out as distinct when visualized with NMF. We call this phenomenon weight banding.
So far, the Circuits thread has mostly focused on studying very small pieces of neural networks – individual neurons and small circuits. In contrast, weight banding is an example of what we call a “structural phenomenon,” a larger-scale pattern in the circuits and features of a neural network. Other examples of structural phenomena are the recurring symmetries we see in equivariance motifs and the specialized slices of neural networks we see in branch specialization. In the case of weight banding, we think of it as a structural phenomenon because the pattern appears at the scale of an entire layer.
In addition to describing weight banding, we’ll explore when and why it occurs. We find that there appears to be a causal link between weight banding and whether a model reduces its final convolutional activations with global average pooling or feeds them directly into fully connected layers, suggesting that weight banding is part of an algorithm for preserving information about larger-scale structure in images. Establishing causal links like this is a step towards closing the loop between practical decisions in training neural networks and the phenomena we observe inside them.
Weight banding consistently forms in the final convolutional layer of vision models with global average pooling.
In order to see the bands, we need to visualize the spatial structure of the weights, as shown below. We typically do this using NMF, as described in Visualizing Weights. For each neuron, we take the weights connecting it to the previous layer. We then use NMF to reduce the dimension corresponding to channels in the previous layer down to 3 factors, which we map to RGB color channels. Since the assignment of factors is arbitrary, we use a heuristic to make the mapping consistent across neurons. This reveals a very prominent pattern of horizontal bands.
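As a minimal sketch of this procedure (assuming the final layer’s weights are available as a single array of shape kernel height × kernel width × input channels × output channels; the names here are illustrative, we take absolute values to satisfy NMF’s non-negativity requirement, and the color-consistency heuristic is omitted):

```python
import numpy as np
from sklearn.decomposition import NMF

def visualize_banding(weights, n_factors=3):
    # weights: (kernel_h, kernel_w, in_channels, out_channels), e.g. 3x3x512x512
    kh, kw, c_in, c_out = weights.shape
    images = []
    for neuron in range(c_out):
        w = np.abs(weights[:, :, :, neuron])          # NMF requires non-negative input
        flat = w.reshape(kh * kw, c_in)               # rows: spatial positions, cols: input channels
        nmf = NMF(n_components=n_factors, init="nndsvd", max_iter=500)
        factors = nmf.fit_transform(flat)             # (kh*kw, 3)
        rgb = factors.reshape(kh, kw, n_factors)      # one factor per color channel
        images.append(rgb / (rgb.max() + 1e-9))       # normalize for display
    return images  # one small RGB image per neuron; banding appears as horizontal stripes
```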
Interestingly, AlexNet does not show this banding phenomenon. Unlike most modern vision models, AlexNet does not use global average pooling. Instead, it has a fully connected layer connected directly to its final convolutional layer, allowing it to treat different positions differently. If one looks at the weights of this fully connected layer, they vary strongly as a function of global y position.
The horizontal stripes in weight banding mean that the filters don’t care about horizontal position, but are strongly encoding relative vertical position. Our hypothesis is that weight banding is a learned way to preserve spatial information as it gets lost through various pooling operations.
In the next section, we will construct our own simplified vision network and investigate variations on its architecture in order to understand exactly which conditions are necessary to produce weight banding.
We’d like to understand which architectural decisions affect weight banding. This will involve trying out different architectures and seeing whether weight banding persists. Since we will only want to change a single architectural parameter at a time, we will need a consistent baseline to apply our modifications to. Ideally, this baseline would be as simple as possible.
We created a simplified network architecture with 6 groups of convolutions, separated by L2 pooling layers. At the end, it has a global average pooling operation that reduces the input to 512 values that are then fed to a fully connected layer with 1001 outputs.
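To make the setup concrete, here is a rough sketch of such a network written with tf.keras; the authors built their model with TF-Slim, and details such as the number of convolutions per group, the channel counts, and the 224×224 input resolution are assumptions rather than the exact architecture.

```python
import tensorflow as tf

L = tf.keras.layers

def l2_pool(x, size=2):
    # L2 pooling: square the activations, average-pool, then take the square root.
    return tf.sqrt(tf.nn.avg_pool2d(tf.square(x), size, size, "VALID") + 1e-6)

def conv_trunk(inputs):
    x = inputs
    for i, channels in enumerate([64, 128, 256, 512, 512, 512]):  # 6 groups of convolutions
        if i > 0:
            x = L.Lambda(l2_pool)(x)       # L2 pooling between groups: 224 -> 112 -> ... -> 7
        x = L.Conv2D(channels, 3, padding="same", use_bias=False)(x)
        x = L.BatchNormalization()(x)
        x = L.ReLU()(x)
    return x                               # (batch, 7, 7, 512)

def simplified_net(num_classes=1001):
    inputs = L.Input(shape=(224, 224, 3))
    x = conv_trunk(inputs)
    x = L.GlobalAveragePooling2D()(x)      # reduce 7x7x512 to 512 values
    outputs = L.Dense(num_classes)(x)      # fully connected layer with 1001 outputs
    return tf.keras.Model(inputs, outputs)
```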
This simplified network reliably produces weight banding in its last layer (and usually in the two preceding layers as well).
In the rest of this section, we’ll experiment with modifying this architecture and its training settings and seeing if weight banding is preserved.
To rule out bugs in training or some strange numerical problem, we decided to do a training run with the input rotated by 90 degrees. This sanity check yielded a very clear result: vertical banding in the resulting weights instead of horizontal banding. This is a clear indication that banding is a result of properties within the ImageNet dataset which make spatial vertical position worth encoding, not of a bug or numerical issue.
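For concreteness, a minimal sketch of this intervention, assuming a tf.data pipeline of (image, label) pairs (the actual training pipeline is not shown in the article):

```python
import tensorflow as tf

def rotate_dataset(dataset):
    """Return a copy of an (image, label) tf.data.Dataset with every image rotated 90 degrees."""
    return dataset.map(lambda image, label: (tf.image.rot90(image), label))
```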
We remove the global average pooling step in our simplified model, allowing the fully connected layer to see all spatial positions at once. This model did not exhibit weight banding, but used 49x more parameters in the fully connected layer and overfit to the training set. This is pretty strong evidence that the use of aggressive pooling after the last convolutions in common models causes weight banding. This result is also consistent with AlexNet not showing this banding phenomenon (since it also does not have global average pooling).
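Concretely, under the same assumptions as the baseline sketch above (and reusing its conv_trunk and L), this variant replaces the global average pooling with a flatten, so the fully connected layer sees every spatial position of the 7x7x512 activations:

```python
def simplified_net_no_gap(num_classes=1001):
    inputs = L.Input(shape=(224, 224, 3))
    x = conv_trunk(inputs)                 # (batch, 7, 7, 512), as in the baseline sketch
    x = L.Flatten()(x)                     # 7 * 7 * 512 = 25,088 values instead of 512
    outputs = L.Dense(num_classes)(x)      # ~49x more parameters in this layer
    return tf.keras.Model(inputs, outputs)
```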
We average out each row of the final convolutional layer, so that vertical absolute position is preserved but horizontal absolute position is not. With this change, 5a develops banding similar to the baseline model’s 5b. We found this result surprising.
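A minimal sketch of this intervention, again reusing the assumed conv_trunk and L from the baseline sketch: average only over the horizontal dimension before the fully connected layer.

```python
def simplified_net_x_pool(num_classes=1001):
    inputs = L.Input(shape=(224, 224, 3))
    x = conv_trunk(inputs)                                   # (batch, 7, 7, 512)
    x = L.Lambda(lambda t: tf.reduce_mean(t, axis=2))(x)     # average over columns -> (batch, 7, 512)
    x = L.Flatten()(x)                                       # keep the 7 vertical positions: 7 * 512 values
    outputs = L.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)
```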
We tried each of the modifications below, and found that weight banding was still present in each of these variants.

- Adding an extra convolution to 5b just before the global average pooling.
- Applying a learned spatial mask to 5b before pooling. The hope was that a mask would help each 5b neuron focus on the right parts of the 7x7 image without a convolution.
- Adding explicit spatial position channels to 5a and 5b.
- Splitting 5b into 16 7x7x32 channel groups and feeding each group its own fully connected layer. The output of the 16 fully connected layers is then concatenated into the input of the final 1001-class fully connected layer (sketched after this list).
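The following sketch illustrates the channel-group variant from the last bullet, reusing the assumed conv_trunk and L from the baseline sketch; the width of each group’s fully connected layer is our assumption, since the article does not specify it.

```python
def simplified_net_split_fc(num_classes=1001, groups=16, units_per_group=64):
    inputs = L.Input(shape=(224, 224, 3))
    x = conv_trunk(inputs)                                        # (batch, 7, 7, 512)
    pieces = L.Lambda(lambda t: tf.split(t, groups, axis=-1))(x)  # 16 tensors of shape (batch, 7, 7, 32)
    outs = [L.Dense(units_per_group)(L.Flatten()(p)) for p in pieces]  # one fully connected layer per group
    merged = L.Concatenate()(outs)                                # concatenate the 16 group outputs
    outputs = L.Dense(num_classes)(merged)                        # final 1001-class fully connected layer
    return tf.keras.Model(inputs, outputs)
```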
An interactive diagram allowing you to explore the weights for these experiments and more can be found in the appendix.
In the previous section, we observed two interventions that clearly affected weight banding: rotating the dataset by 90º and removing the global average pooling before the fully connected layer. To confirm that these effects hold beyond our simplified model, we decided to make the same interventions to three common architectures (InceptionV1, ResNet50, VGG19) and train them from scratch.
With one exception, the effects hold across all three models.
The one exception is VGG19, where removing the pooling operation before its fully connected layers did not eliminate weight banding as we expected; these weights look fairly similar to the baseline’s. VGG19’s banding does, however, clearly respond to the rotated dataset.
Once we really understand neural networks, we should be able to leverage that understanding to design more effective neural network architectures. Early papers, like Zeiler et al., used what they learned from visualizing a network’s features to guide improvements to its architecture.
It’s unclear whether weight banding is “good” or “bad.”
More generally, weight banding is an example of a large-scale structure. One of the major limitations of circuits has been how small-scale it is. We’re hopeful that larger scale structures like weight banding may help circuits form a higher-level story of neural networks.
The simplified network used to study this phenomenon was trained on ImageNet (1.2 million images) for 90 epochs. Training was done on 8 GPUs with a global batch size of 512 for the first 30 epochs and 1024 for the remaining 60 epochs. The network was built using TF-Slim. Batch norm was used on convolutional layers and fully connected layers, except for the last fully connected layer with 1001 outputs.
The following experiments were discussed in various conversations but have not been run at this time:
As with many scientific collaborations, the contributions are difficult to separate because it was a collaborative effort that we wrote together.
Research. Ludwig Schubert accidentally discovered weight banding, thinking it was a bug. Michael Petrov performed an array of systematic investigations into when it occurs and how architectural decisions affect it. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Chris Olah, and Ludwig.
Writing and Diagrams. Michael wrote and illustrated a first version of this article. Chelsea improved the text and illustrations, and thought about big picture framing. Chris helped with editing.
We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Vincent Tjeng, and David Valdman for their remarks on a first draft.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Petrov, et al., "Weight Banding", Distill, 2021.
BibTeX citation
@article{petrov2021weight,
  author = {Petrov, Michael and Voss, Chelsea and Schubert, Ludwig and Cammarata, Nick and Goh, Gabriel and Olah, Chris},
  title = {Weight Banding},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/weight-banding},
  doi = {10.23915/distill.00024.009}
}